{ "info": { "author": "Andrew Sanchez", "author_email": "inbox.asanchez@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Console", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering :: Bio-Informatics" ], "description": ".. image:: https://api.travis-ci.org/andrewsanchez/GenBankQC.svg?branch=master \n\n=============================================\n GenBank Quality Control\n=============================================\n\nComplete documentation lives at `genbank-qc.readthedocs.io`_. It is a work in progress.\n\nGenBankQC is an effort to address the quality control problem for public databases such as the National Center for Biotechnology Information's `GenBank`_. The goal is to offer a simple, efficient, and automated solution for assessing the quality of your genomes.\n\nNote\n----\n\n Please note that GenbankQC is currently in alpha. As a proof of concept for a specific use case, it currently has limitations that users should be aware of. If there is interest, we will address the issues to make it more convenient to use. Please see `caveats <#caveats>`__ for more details.\n\n\nFeatures\n--------\n\n- Labelling/annotation-independent quality control based on:\n\n - Simple metrics\n\n - Genome distance estimation using `MASH`_\n\n- Flag potential outliers to exclude them from polluting your pipelines\n\nThe genbankqc work-flow consists of the following steps:\n\n#. Generate statistics for each genome based on the following metrics:\n\n * Number of unknown bases\n * Number of contigs\n * Assembly size\n * Average `MASH`_ distance compared to other genomes\n\n#. Flag potential outliers based on these statistics:\n\n * Flag genomes containing more than a certain number of unknown bases.\n\n * Flag genomes outside of a range based on the median absolute deviation.\n\n * Applies to number of contigs and assembly size\n\n * Flag genomes whose `MASH`_ distance is greater than the upper end of the median absolute deviation.\n\n#. Visualize the results with a color coded tree\n\nUsage\n-----\n\n::\n\n genbankqc /path/to/genomes\n open /path/to/genomes/Escherichia_coli/qc/200_3.0_3.0_3.0/tree.svg\n\n\nInstallation\n------------\n\nIf you don't yet have a functional conda environment, please download and install `Miniconda`_.\n\n.. code::\n\n conda create -n genbankqc -c etetoolkit -c biocore pip ete3 scikit-bio\n\n source activate genbankqc\n\n pip install genbankqc\n\n\n.. _caveats:\n\nCaveats\n--------\n\nThere are some arbitrary, hard-coded limitations regarding file names. This is because the project originally began as a part of the NCBI Tool Kit (`NCBITK`_) which we use for downloading genomes from NCBI. NCBITK generates a specific directory structure and file naming scheme which GenbankQC currently expects.\n\nIf you'd like to use GenBankQC without using NCBITK, all that is required is that your file names match the python regular expression ``re.compile('.*(GCA_\\d+\\.\\d.*)(.fasta)')``. You can quickly test this by following my example at `pythex.org`_.\n\n.. _pythex.org: https://pythex.org/?regex=.*(GCA_%5Cd%2B%5C.%5Cd.*)(.fasta)&test_string=GCA_002415405.1_Acinetobacter_nosocomialis_UBA5139_Scaffold.fasta&ignorecase=0&multiline=0&dotall=0&verbose=0\n\n.. _NCBITK: https://github.com/andrewsanchez/NCBITK\n.. _GenBank: https://www.ncbi.nlm.nih.gov/genbank/\n.. _ETE Toolkit: http://etetoolkit.org/ \n.. _Miniconda: https://conda.io/miniconda.html\n.. _MASH: http://mash.readthedocs.io/en/latest/\n.. _genbank-qc.readthedocs.io: http://genbank-qc.readthedocs.io/en/latest/\n\n.. image:: https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square \n :target: https://yangsu.github.io/pull-request-tutorial/\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/andrewsanchez/genbankqc", "keywords": "NCBI bioinformatics genomics", "license": "BSD-3-Clause", "maintainer": "", "maintainer_email": "", "name": "GenBankQC", "package_url": "https://pypi.org/project/GenBankQC/", "platform": "", "project_url": "https://pypi.org/project/GenBankQC/", "project_urls": { "Homepage": "https://github.com/andrewsanchez/genbankqc" }, "release_url": "https://pypi.org/project/GenBankQC/0.2a0/", "requires_dist": null, "requires_python": "", "summary": "\"Automated quality control for GenBank genomes.\"", "version": "0.2a0" }, "last_serial": 3404455, "releases": { "0.1a0": [ { "comment_text": "", "digests": { "md5": "ef2e58193a7d0ae26dd8eaa12436f585", "sha256": "9d492b8199b62a5396006f8234cb88e89091cfc6123221e30242c2c51242a806" }, "downloads": -1, "filename": "GenBankQC-0.1a0.tar.gz", "has_sig": false, "md5_digest": "ef2e58193a7d0ae26dd8eaa12436f585", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11587, "upload_time": "2017-12-09T00:38:45", "url": "https://files.pythonhosted.org/packages/ce/35/67c554d2f973ec6b67073043e008b821fb3d2083be393947998bb75cfc2e/GenBankQC-0.1a0.tar.gz" } ], "0.2a0": [ { "comment_text": "", "digests": { "md5": "142daca61f1e4119362fa97ca6fa5f44", "sha256": "f1aad07badc81af3a0b65c14d892fa8716ee8696263e6f99a5f795dd46038180" }, "downloads": -1, "filename": "GenBankQC-0.2a0.tar.gz", "has_sig": false, "md5_digest": "142daca61f1e4119362fa97ca6fa5f44", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11777, "upload_time": "2017-12-10T00:58:14", "url": "https://files.pythonhosted.org/packages/55/b8/f48c8cb01371078f0d8099eadc90fa27d3d65696c0edb7e87860c9e514b6/GenBankQC-0.2a0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "142daca61f1e4119362fa97ca6fa5f44", "sha256": "f1aad07badc81af3a0b65c14d892fa8716ee8696263e6f99a5f795dd46038180" }, "downloads": -1, "filename": "GenBankQC-0.2a0.tar.gz", "has_sig": false, "md5_digest": "142daca61f1e4119362fa97ca6fa5f44", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11777, "upload_time": "2017-12-10T00:58:14", "url": "https://files.pythonhosted.org/packages/55/b8/f48c8cb01371078f0d8099eadc90fa27d3d65696c0edb7e87860c9e514b6/GenBankQC-0.2a0.tar.gz" } ] }