{ "info": { "author": "Camille Scott", "author_email": "camille.scott.w@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "shmlast\n=======\n\n*An improved implementation of Conditional Reciprocal Best Hits with LAST and Python*\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n|Build Status| |codecov| |JOSS| |DOI|\n\nshmlast is a reimplementation of the `Conditional Reciprocal Best\nHits `__ algorithm for finding\npotential orthologs between a transcriptome and a species-specific\nprotein database. It uses the `LAST `__ aligner\nand the pydata stack to achieve much better performance while staying in\nthe Python ecosystem.\n\nAbout\n-----\n\nConditional Reciprocal Best Hits (CRBH) was originalyl described by\n`Aubry et al.\n2014 `__\nand implemented in the\n`crb-blast `__ package. CRBH\nbuilds on the traditional Reciprocal Best Hits (RBH) method for\northology assignment by training a simple model of the e-value cutoff\nfor a particular length of sequence on an initial set of RBH's. From its\ngithub repository:\n\n::\n\n \"Reciprocal best BLAST is a very conservative way to assign orthologs. The main innovation in\n CRB-BLAST is to learn an appropriate e-value cutoff to apply to each pairwise alignment by taking\n into account the overall relatedness of the two datasets being compared. This is done by fitting a\n function to the distribution of alignment e-values over sequence lengths. The function provides the\n e-value cutoff for a sequence of given length.\"\n\nUnfortunately, the original implementation uses NCBI BLAST+ (which is\nincredibly slow), and is implemented in Ruby, which requires users to\nleave the Python-dominated bioinformatics software system. shmlast makes\nthis algorithm available to users in Python-land, while also greatly\nimproving performance by using `LAST `__ for\ninitial homology searches. Additionally, shmlast outputs both the raw\nparameters and a plot of its model for inspection.\n\nshmlast is designed for finding orthologs between *transcriptomes* and\n*protein databases*. As such, it currently does not support\nnucleotide-nucleotide or protein-protein alignments. This may be changed\nin a future version, but for now, it remains focused on that task.\n\nAlso note that RBH, and by extension CRBH, is meant for comparing\nbetween *two species*. Neither of these methods should be used for\nannotating a transcriptome with a mixed protein database (like, for\nexample, uniref90).\n\nUsage\n-----\n\nFor some transcriptome ``transcripts.fa`` and some protein database\n``pep.faa``, the basic usage is:\n\n.. code:: bash\n\n shmlast crbl -q transcripts.fa -d pep.faa \n\nshmlast can be distributed across multiple cores using the\n``--n_threads`` option.\n\n.. code:: bash\n\n shmlast crbl -q transcripts.fa -d pep.faa --n_threads 8\n\nshmlast can also distribute across nodes with the ``--sshnodefile``\nflag. This should be a list of nodes in ``$NUM_CPUS/$SSH_ADDRESS``\nformat. For example, a configuration where two nodes are available, one\nwith 4 and the other with 12 cpus available, might look like so:\n\n::\n\n 4/csm-010\n 12/ifi-002\n\nWith Portable Batch System (PBS), this information is stored in the\n``$PBS_NODEFILE`` variable, though the format varies per machine. Often\ntimes a list will be returned without the number of cores per node but\ninstead with one entry per node; this can be converted to the proper\nformat and then used with:\n\n.. code:: bash\n\n sort $PBS_NODEFILE | uniq -c | awk '{print $1\"/\"$2}' > nodefile.txt\n shmlast crbl -q transcripts.fa -d pep.faa --sshloginfile nodefile.txt\n\nIt will be up to you to figure out how your local cluster formats its\nnode-file, as it varies greatly. Unfortunately, there isn't a reliable\nway to determine this programmatically.\n\nAnother use case is to perform simple Reciprocal Best Hits; this can be\ndone with the ``rbl`` subcommand. The maximum expectation-value can also\nbe specified with ``-e``.\n\n.. code:: bash\n\n shmlast rbl -q transcripts.fa -d pep.faa --e 0.000001\n\nOutput\n------\n\nshmlast outputs a plain CSV file with the CRBH's, which by default will\nbe named ``$QUERY.x.$DATABASE.crbl.csv``. This CSV file can be easily\nparsed with Pandas like so:\n\n.. code:: python\n\n import pandas as pd\n\n crbl_df = pd.read_csv('query.x.database.crbl.csv')\n\nThe columns are:\n\n1. *E*: The e-value.\n2. *EG2*: Expected alignments per square gigabase.\n3. *E\\_scaled*: E-value rescaled for the model (see below for details).\n4. *ID*: A unique ID for the alignment.\n5. *bitscore*: The bitscore, calculated as (lambda \\* score - ln[K]) /\n ln[2].\n6. *q\\_aln\\_len*: Query alignment length.\n7. *q\\_frame*: Frame in the query translation.\n8. *q\\_len*: Length of the query sequence.\n9. *q\\_name*: Name of the query sequence.\n10. *q\\_start*: Start of query alignment. 11.\\ *q\\_strand*: Strand of\n query alignment.\n11. *s\\_aln\\_len*: Length of subject alignment.\n12. *s\\_len*: Length of subject sequence.\n13. *s\\_name*: Name of subject sequence.\n14. *s\\_start*: Start of subject alignment.\n15. *s\\_strand*: Strand of subject alignment.\n16. *score*: The alignment score.\n\nSee http://last.cbrc.jp/doc/last-evalues.html for more information on\ne-values and scores.\n\nModel Output\n^^^^^^^^^^^^\n\nshmlast also outputs its model, both in CSV format and as a plot. The\nCSV file is named ``$QUERY.x.$DATABASE.crbl.model.csv``, and has the\nfollowing columns:\n\n1. *center*: The center of the length bin.\n2. *size*: The size of the bin.\n3. *left*: The left of the bin.\n4. *right*: The right of the bin.\n5. *fit*: The scaled e-value cutoff for the bin.\n\nTo fit the model, the e-values are first scaled to a more suitable range\nusing the equation ``Es = -log10(E)``, where ``Es`` is the scaled\ne-value. e-values of 0 are set to an arbitrarily small value to allow\nfor log-scaling. The *fit* column of the model is this scaled value.\n\nThe model plot is named ``$QUERY.x.$DATABASE.crbl.model.plot.pdf`` by\ndefault.\n\nInstallation\n------------\n\nI recommend the Anaconda (or miniconda) Python distribution. To install\ninto an Anaconda environment, first get dependencies via conda:\n\n.. code:: bash\n\n conda install --file <(curl https://raw.githubusercontent.com/camillescott/shmlast/master/environment.txt)\n\nAnd then install shmlast and its PyPI dependencies with pip:\n\n.. code:: bash\n\n pip install shmlast\n\nThird-party Dependencies\n------------------------\n\nshmlast requires the LAST aligner and gnu-parallel.\n\nManually\n~~~~~~~~\n\nLAST can be installed manually into your home directory like so:\n\n.. code:: bash\n\n cd\n curl -LO http://last.cbrc.jp/last-658.zip\n unzip last-658.zip\n pushd last-658 && make && make install prefix=~ && popd\n\nAnd a recent version of gnu-parallel can be installed like so:\n\n.. code:: bash\n\n (wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash\n\nThrough a Package Manager\n~~~~~~~~~~~~~~~~~~~~~~~~~\n\nFor Ubuntu 16.04 or newer, sufficiently new versions of both are\navailable through the package manager:\n\n.. code:: bash\n\n sudo apt-get install last-align parallel\n\nFor OSX, you can get LAST through the homebrew-science channel:\n\n.. code:: bash\n\n brew tap homebrew/science\n brew install last\n\nLibrary\n-------\n\nshmlast is also a Python library. Each component of the pipeline is\nimplemented as a `pydoit `__ task and can be used in\ndoit workflows, and the implementations for calculating best hits,\nreciprocal best hits, and conditional reciprocal best hits are usable as\nPython classes. For example, the ``lastal`` task could be incorporated\ninto a doit file like so:\n\n.. code:: python\n\n from shmlast.last import lastal_task\n\n def task_lastal():\n return lastal_task('query.fna', 'db.faa', translate=True)\n\nKnown Issues\n------------\n\nThere is currently an issue with IUPAC codes in RNA. This will be fixed\nsoon.\n\nContributing\n------------\n\nSee `CONTRIBUTING.md `__ for guidelines.\n\nReferences\n----------\n\n1. Aubry S, Kelly S, K\u00fcmpers BMC, Smith-Unna RD, Hibberd JM (2014) Deep\n Evolutionary Comparison of Gene Expression Identifies Parallel\n Recruitment of Trans-Factors in Two Independent Origins of C4\n Photosynthesis. PLoS Genet 10(6): e1004365.\n doi:10.1371/journal.pgen.1004365\n\n2. O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login:\n The USENIX Magazine, February 2011:42-47.\n\n3. Kie\u0142basa, S. M., Wan, R., Sato, K., Horton, P., & Frith, M. C.\n (2011). Adaptive seeds tame genomic sequence comparison. Genome\n research, 21(3), 487-493.\n\n.. |Build Status| image:: https://travis-ci.org/camillescott/shmlast.svg?branch=master\n :target: https://travis-ci.org/camillescott/shmlast\n.. |codecov| image:: https://codecov.io/gh/camillescott/shmlast/branch/master/graph/badge.svg\n :target: https://codecov.io/gh/camillescott/shmlast\n.. |JOSS| image:: http://joss.theoj.org/papers/3cde54de7dfbcada7c0fc04f569b36c7/status.svg\n :target: http://joss.theoj.org/papers/3cde54de7dfbcada7c0fc04f569b36c7\n.. |DOI| image:: https://zenodo.org/badge/55653298.svg\n :target: https://zenodo.org/badge/latestdoi/55653298\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/camillescott/shmlast", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "shmlast", "package_url": "https://pypi.org/project/shmlast/", "platform": "", "project_url": "https://pypi.org/project/shmlast/", "project_urls": { "Homepage": "https://github.com/camillescott/shmlast" }, "release_url": "https://pypi.org/project/shmlast/1.2.1/", "requires_dist": null, "requires_python": "", "summary": "An improved implementation of Conditional Reciprocal Best Hits with LAST and Python.", "version": "1.2.1" }, "last_serial": 3652813, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "5adbdae780719e83fdaf554bed21a7c8", "sha256": "cc83c38ac5274019b1ce00e41129861799fe9dd11f2fae18d340bc766879b338" }, "downloads": -1, "filename": "shmlast-1.0.tar.gz", "has_sig": false, "md5_digest": "5adbdae780719e83fdaf554bed21a7c8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5149971, "upload_time": "2016-11-21T13:47:43", "url": "https://files.pythonhosted.org/packages/64/be/9023dbf127dc7840c5309c3f0884893f8ab23843edfd4bd2e8fd9479234b/shmlast-1.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "cd521b154eaa13fea08d2af00fd8b4cc", "sha256": "825ecf92b49929f2a20abf1d2a756fa3eb419128ba91d3bbfe05763066e3dd73" }, "downloads": -1, "filename": "shmlast-1.0.1.tar.gz", "has_sig": false, "md5_digest": "cd521b154eaa13fea08d2af00fd8b4cc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5153670, "upload_time": "2016-11-21T14:01:27", "url": "https://files.pythonhosted.org/packages/b2/01/21752968b42feefc6f8dedc1ce4874073b9bd9df813c74d14acccd4994c0/shmlast-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "c2e1c833a56a751bbb8adc7b07b1353a", "sha256": "0670f449f4df648e827a75ddcbf9c88f49383fff592f12c9f82b666127062305" }, "downloads": -1, "filename": "shmlast-1.0.2.tar.gz", "has_sig": false, "md5_digest": "c2e1c833a56a751bbb8adc7b07b1353a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5147022, "upload_time": "2016-11-21T14:14:06", "url": "https://files.pythonhosted.org/packages/1a/46/4e5bfe574fdbf2dd7a0cbb47693d5877c07e87dbcdac8f8724030e2f2e48/shmlast-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "8865f77452200169613c9b132837f4a2", "sha256": "9e418f49cff36c4b490e55f03331ef8021086e7f1145e4b4fc4875a37205eb4c" }, "downloads": -1, "filename": "shmlast-1.0.3.tar.gz", "has_sig": false, "md5_digest": "8865f77452200169613c9b132837f4a2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5147476, "upload_time": "2016-11-21T14:36:59", "url": "https://files.pythonhosted.org/packages/a8/3b/f9f396efddfa567421e620a0ee1664684e7f96931b98affe74a323e40f17/shmlast-1.0.3.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "bb708e3a6750c9c6dccf2f2b69ca354c", "sha256": "89282f633f601fbef1453271491e27b7064ba679497c2980c4d6373edcd3999b" }, "downloads": -1, "filename": "shmlast-1.1.tar.gz", "has_sig": false, "md5_digest": "bb708e3a6750c9c6dccf2f2b69ca354c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5147282, "upload_time": "2017-01-18T20:36:44", "url": "https://files.pythonhosted.org/packages/e3/70/bbf6439fdbfe0772d3cc29ff8f7d9fbc73786c0fb3a5b2f672cdde2e6f79/shmlast-1.1.tar.gz" } ], "1.2": [ { "comment_text": "", "digests": { "md5": "78ca334ad038c70a55a9a29d63200abd", "sha256": "3f2a198ca380945d8b31d9189811e94e886a0a350056b07390a61faec1f275f8" }, "downloads": -1, "filename": "shmlast-1.2.tar.gz", "has_sig": false, "md5_digest": "78ca334ad038c70a55a9a29d63200abd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5151406, "upload_time": "2017-06-23T01:53:04", "url": "https://files.pythonhosted.org/packages/e2/de/28d5bdfaf7748f859f392587bc5a8c7518c3fdfe3ea6df2c9ea8b0d3659c/shmlast-1.2.tar.gz" } ], "1.2.1": [ { "comment_text": "", "digests": { "md5": "1b2c52a12a189c461c24ffe61ed3fc73", "sha256": "b080d85b36a0ae4e97f9b78df5aba004ae25a47d2690981f295573174ac44a45" }, "downloads": -1, "filename": "shmlast-1.2.1.tar.gz", "has_sig": false, "md5_digest": "1b2c52a12a189c461c24ffe61ed3fc73", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5147494, "upload_time": "2018-03-09T00:02:25", "url": "https://files.pythonhosted.org/packages/12/43/9a80e49b753ae8663ec4d5293cc19cfa450d4e95087f7076f18467856b83/shmlast-1.2.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1b2c52a12a189c461c24ffe61ed3fc73", "sha256": "b080d85b36a0ae4e97f9b78df5aba004ae25a47d2690981f295573174ac44a45" }, "downloads": -1, "filename": "shmlast-1.2.1.tar.gz", "has_sig": false, "md5_digest": "1b2c52a12a189c461c24ffe61ed3fc73", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5147494, "upload_time": "2018-03-09T00:02:25", "url": "https://files.pythonhosted.org/packages/12/43/9a80e49b753ae8663ec4d5293cc19cfa450d4e95087f7076f18467856b83/shmlast-1.2.1.tar.gz" } ] }