{ "info": { "author": "Anders Gon\u00e7alves da Silva", "author_email": "andersgs@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "## Sanity check a FASTA DB\n\nFASTA DB are easy to corrupt, and we are in need of a simple tool\nto double check the DB has no unforeseen entries.\n\nThis tool will check for:\n\n1. Duplicate entries with the same ID\n2. Entries with different IDs but with overlapping sequences (i.e., 100% overlap, or one sequence completely included in another)\n3. Optionally, it will try to parse the meaningful category or label off the FASTA header (e.g., ST allele number) and ensure that distinct labels are given to completely or partially overlapping sequences.\n\nThe tool outputs:\n\n1. A log to `stderr` signalling any potential issues.\n2. A report in Markdown straight to `stdout`. You can then add it to your GitHub, or post it as a GitHub Issue, or push it through `pandoc` to transform it in to any supported format you like (e.g., PDF for your paper).\n3. Optionally, the output files produced by `CD-HIT`\n\n## What does completely and/or partially overlapping sequences mean?\n\nUnder the hood, `db-check` is running `CD-HIT` (in particular, `cd-hit-est`). Here, `db-check` runs `cd-hit-est` to identify any sequences that have 100% identity and 100% coverage (i.e., have all the same bases and are exactly the same length, and thus are identical) OR where one sequence is 100% identical to another, but it is shorter (i.e., one sequence is exactly containted within another). The first case is what we call `completely overlapping` and the second case is what we call `partially overlapping`.\n\nObviously, neither case is ideal. The case of `completely overlapping` means your DB has redundant sequences that should be removed. It can be especially problematic if you have two identical sequences but they are identified as distinct categories.\n\nThe case of `partially overlapping` sequences is less clear cut whether it is an issue or not. We have seen cases where novel MLST alleles were added to a scheme as exact subsets of already established alleles. In those case, it turned out the new alleles were caused by a break in the assembly, and thus were **NOT** novel alleles at all. However, there are cases where partially overlapping sequences might be biologically real (e.g. indicating a real deletion or insertion), and you do wish to keep that in your DB. If that is the case, additional logic will have to be applied to your BLAST filtering steps to make sure you get the right variant.\n\nWhatever your case, hopefully, this tool will help you get a quick assessment of your FASTA DB, and be aware of any internal issues before releasing it in to the wild. Or, if you are a user of a FASTA DB, and you are getting funny results, hopefully this tool can help you troubleshoot the DB and pass on the developer any issues they may have with their FASTA DB.\n\n## Dependencies\n\n- `CD-HIT`\n- `Python >=3.6`\n\nOptionally, `pandoc`.\n\n## Installation\n\n### Conda\n\n```\nconda -c bioconda cd-hit\npip3 install db-check\n```\n\n### Brew\n\n```\nbrew install cd-hit\npip3 install db-check\n```\n\n_NOTE_: I found that `brew install` on my MacOSX Mojave machine failed because the bottled version was compiled with a different set of headers. I fixed it by forcing it to install from source like so:\n\n```\nbrew uninstall cd-hit\nbrew install -s cd-hit\n```\n\n## Running\n\n```\ndb-check > report.md\n```\n\n### Running an example\n\n```\ndb-check --example\n```\n\n### Parsing categories from sequence ID\n\n#### Using `--delimiter` and `--field`:\n\nIn some cases your category is encoded in the sequence ID with a particular delimiter (e.g., `|` or `~~`), and once the ID is split along this delimiter, the field of interest might be the second (`--field 1` --- `1` because it is 0-indexed) or, perhaps, the last one (`--field -1`).\n\nFor example, let us say you have IDs coded with the following pattern: `>seq1~~catA`. In this case, you would do the following:\n\n```\ndb-check --delimiter \"~~\" --field -1 /path/to/db\n```\n\n#### Using `--regex`:\n\nIn some cases your category might be encoded in a more convoluted manner, and perhaps a `regex` expression with a capture group would work best. The above case could thus be equally parsed with:\n\n```\ndb-check --regex \".*~~(.*)\" /path/to/db\n```\n\n#### Using a `callback` function\n\n_Implementation pending_\n\n#### Command-line options\n\n```\nUsage: db-check [OPTIONS] FASTA\n\n Check a FASTA DB for potential issues.\n\nOptions:\n -d, --delimiter TEXT When parsing a category from seqid, split on this\n delimiter (use -1 for last element, -2 for second to\n last, etc.).\n -f, --field INTEGER When parsing a category from seqid using a delimiter,\n keep this field number (0-index).\n -r, --regex TEXT When parsing a category from seqid extract using this\n regex.\n -a, --author TEXT Who is running the check. (default: $USER)\n -n, --db_name TEXT Name of the Database. (default: filename)\n -t, --threads INTEGER How many threads to give CD-HIT (default: 1)\n -p, --prefix TEXT Prefix of output files from CD-HIT (default: cdhit)\n -k, --keep_files Whether to keep CD-HIT output files (default: False)\n --example Run an example set\n --version Show the version and exit.\n -h, --help Show this message and exit.\n```\n\n## Examples using `pandoc` to convert the output to other formats\n\n### Convert to HTML\n\n```\ndb-check --example | pandoc --from markdown --to html5 > report.html\n```\n\n### Convert to PDF\n\n```\ndb-check --example | pandoc --from markdown --to latex -o report.pdf\n```\n\n### Conver to plain text\n\n```\ndb-check --example | pandoc --from markdown --to plain > report.txt\n```\n\n## TODO\n\n- add conda recipe\n- add brew recipe\n- add travisCI recipe\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/andersgs/db_check", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "db-check", "package_url": "https://pypi.org/project/db-check/", "platform": "", "project_url": "https://pypi.org/project/db-check/", "project_urls": { "Homepage": "https://github.com/andersgs/db_check" }, "release_url": "https://pypi.org/project/db-check/0.1.5/", "requires_dist": [ "click", "pandas", "sh", "tabulate (>=0.8.2)", "markdown-strings" ], "requires_python": "", "summary": "Sanity check your FASTA DB before release.", "version": "0.1.5" }, "last_serial": 4597320, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "a5eb2c9a730e24a45067158a40d5875e", "sha256": "ae9fbaf8196f38717b6904d13a0e953e506b8615b93f050f869944c47f2b8a1a" }, "downloads": -1, "filename": "db_check-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "a5eb2c9a730e24a45067158a40d5875e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 11836, "upload_time": "2018-11-21T21:13:38", "url": "https://files.pythonhosted.org/packages/99/f7/18bc8080094ff0a13e1238161b81718de9d7dc374fd05a9feab844c70e7e/db_check-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "23ecea013de171bc61e7a8916ee22e31", "sha256": "d92b91b5389990b9fa3a8b2741858df7e35ca961636ad83ffa57a82a5c3b5d52" }, "downloads": -1, "filename": "db_check-0.1.0.tar.gz", "has_sig": false, "md5_digest": "23ecea013de171bc61e7a8916ee22e31", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10260, "upload_time": "2018-11-21T21:13:40", "url": "https://files.pythonhosted.org/packages/ad/46/a9bf9300afba1c77406ff90ac84babd18dc7e5436b201f53b52d65ea3c14/db_check-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "c709acbcf05a389ff183d397819d5528", "sha256": "7bc8b3e32dc9936ef13a085b0a835cb71d18244c3ed3ae5295939fe7e94976a2" }, "downloads": -1, "filename": "db_check-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "c709acbcf05a389ff183d397819d5528", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12244, "upload_time": "2018-11-21T21:53:06", "url": "https://files.pythonhosted.org/packages/c6/ae/fd80c53ccaa3aee4d024b651da6a2b67ac9182bab59603e8d368da3735d8/db_check-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c2e793f77530a5e4cd94b7414147ab54", "sha256": "7acf790dd555ae462926c4bf6dc9e6b33d3d53a611a1502508a02eb1cd98f3c3" }, "downloads": -1, "filename": "db_check-0.1.1.tar.gz", "has_sig": false, "md5_digest": "c2e793f77530a5e4cd94b7414147ab54", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12240, "upload_time": "2018-11-21T21:53:08", "url": "https://files.pythonhosted.org/packages/ba/9c/b2f8d13c316ecd4bbded7faf9294b922942e5952f11b1c3b46551498322a/db_check-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "0417f2a02dd58b38d8b61b528b16da01", "sha256": "88d76e84310151c76e0c795aa790396edc07a327f186d93423db9ae1565ec0d1" }, "downloads": -1, "filename": "db_check-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "0417f2a02dd58b38d8b61b528b16da01", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12231, "upload_time": "2018-11-21T22:14:34", "url": "https://files.pythonhosted.org/packages/f7/1c/11dc4d376697950b3bb648225e8ae6cb5b5dc39cbcd0b9ba5f95bf8b87d0/db_check-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9598a7980cb012aea169f971dc43c59d", "sha256": "9d94f03f2f2d9aec9ba82fa2b197452caa5a08fffcc3b93ed31ff2e9fcbe3ca2" }, "downloads": -1, "filename": "db_check-0.1.2.tar.gz", "has_sig": false, "md5_digest": "9598a7980cb012aea169f971dc43c59d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12215, "upload_time": "2018-11-21T22:14:36", "url": "https://files.pythonhosted.org/packages/55/90/79cf0461ce6ee6cafbab3121fcd429c4c3ae7f351a14a3870e6081c8e80e/db_check-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "e8c01d7b4bfa0fab909f6806753ef7c9", "sha256": "75dae3e9e41cb4521b3ec5793a1104e4a882db373e9fdffe5e999a2d4d2c6fd0" }, "downloads": -1, "filename": "db_check-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "e8c01d7b4bfa0fab909f6806753ef7c9", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12227, "upload_time": "2018-11-21T22:22:42", "url": "https://files.pythonhosted.org/packages/f5/8a/5aeff99f42dfc4e6300fcae79ea79f55cb2c6fa2ba53f3b97267ec68e523/db_check-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d8a86a4e69669d2d38baa63c7db5b0ae", "sha256": "24a9e0f65148fa9e622be7f23dd3f87d9bace542d78937f0113600560689a07d" }, "downloads": -1, "filename": "db_check-0.1.3.tar.gz", "has_sig": false, "md5_digest": "d8a86a4e69669d2d38baa63c7db5b0ae", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12214, "upload_time": "2018-11-21T22:22:43", "url": "https://files.pythonhosted.org/packages/fe/38/abd562ce24d12da0be684ff302d8b41b7c6414347efcd8440fa793f314cc/db_check-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "ff1f8af5c19735bde940c3a3dfae3178", "sha256": "fdf2081d01b663e2489d91356479ca339f510d609bce02cc66c1815665bd2542" }, "downloads": -1, "filename": "db_check-0.1.4-py3-none-any.whl", "has_sig": false, "md5_digest": "ff1f8af5c19735bde940c3a3dfae3178", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12641, "upload_time": "2018-11-21T22:55:49", "url": "https://files.pythonhosted.org/packages/b4/db/3911b722069c164d2976b40f1ee8362cbdd2ad1eafe8f5b9bd04ba9c41b3/db_check-0.1.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb28f4fbdaaa409db62e480af42535bf", "sha256": "c0a3f1a92b9fa67cd940fd8d9c928f6105674c8dd446a32452bf10754463e04e" }, "downloads": -1, "filename": "db_check-0.1.4.tar.gz", "has_sig": false, "md5_digest": "eb28f4fbdaaa409db62e480af42535bf", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12288, "upload_time": "2018-11-21T22:55:50", "url": "https://files.pythonhosted.org/packages/87/e5/e0d0eadc26f8a9cf77f5f0274263121a412e0dd865e1c0e0f08b92b1bb23/db_check-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "9ad241f9aff786b0ff8797d8ba192264", "sha256": "834c9489e5fa8dfb7a18aa1d1de29f59ef3267069fc2614476a7589e1bba3113" }, "downloads": -1, "filename": "db_check-0.1.5-py3-none-any.whl", "has_sig": false, "md5_digest": "9ad241f9aff786b0ff8797d8ba192264", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12755, "upload_time": "2018-12-13T23:30:45", "url": "https://files.pythonhosted.org/packages/63/4a/1dea6b198920efb8f509f2211ec656e0ee0d663378aa2daf1586ca7d08b8/db_check-0.1.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6251d70d17fdb8e50b1817bf83c3d345", "sha256": "6aafb368eebc2a4fdc3c31338ad55028017d7fe8e723f12fe6997ee9efcf6d87" }, "downloads": -1, "filename": "db_check-0.1.5.tar.gz", "has_sig": false, "md5_digest": "6251d70d17fdb8e50b1817bf83c3d345", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12399, "upload_time": "2018-12-13T23:30:47", "url": "https://files.pythonhosted.org/packages/b1/c8/78eae2915abe0031564a3cebfb3aea811eb7ad45cb0e68ac990ee724fb61/db_check-0.1.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "9ad241f9aff786b0ff8797d8ba192264", "sha256": "834c9489e5fa8dfb7a18aa1d1de29f59ef3267069fc2614476a7589e1bba3113" }, "downloads": -1, "filename": "db_check-0.1.5-py3-none-any.whl", "has_sig": false, "md5_digest": "9ad241f9aff786b0ff8797d8ba192264", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12755, "upload_time": "2018-12-13T23:30:45", "url": "https://files.pythonhosted.org/packages/63/4a/1dea6b198920efb8f509f2211ec656e0ee0d663378aa2daf1586ca7d08b8/db_check-0.1.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6251d70d17fdb8e50b1817bf83c3d345", "sha256": "6aafb368eebc2a4fdc3c31338ad55028017d7fe8e723f12fe6997ee9efcf6d87" }, "downloads": -1, "filename": "db_check-0.1.5.tar.gz", "has_sig": false, "md5_digest": "6251d70d17fdb8e50b1817bf83c3d345", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12399, "upload_time": "2018-12-13T23:30:47", "url": "https://files.pythonhosted.org/packages/b1/c8/78eae2915abe0031564a3cebfb3aea811eb7ad45cb0e68ac990ee724fb61/db_check-0.1.5.tar.gz" } ] }