{ "info": { "author": "David Koslicki", "author_email": "dmkoslicki@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Natural Language :: English", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering :: Bio-Informatics", "Topic :: Scientific/Engineering :: Mathematics" ], "description": "# CMash\nCMash is a fast and accurate way to estimate the similarity of two sets. This is a probabilisitic data analysis approach, and uses containment min hashing. Please see the [associated paper](http://www.biorxiv.org/content/early/2017/09/04/184150) for further details (and please cite if you use it):\n>Improving Min Hash via the Containment Index with applications to Metagenomic Analysis\n>David Koslicki, Hooman Zabeti\n>bioRxiv 184150; doi: https://doi.org/10.1101/184150\n\n## Installation\nThe easiest way to install this is to use [virtualenv](https://virtualenv.pypa.io/en/stable/):\n```bash\nvirtualenv CMashVE\nsource CMashVE/bin/activate\npip install -U pip\npip install CMash\n```\nYou can also just use ``pip install CMash`` if you don't want to create a virtual environment.\n\nTo get the absolute latest edition of CMash, then you can build from the Github repository via:\n```bash\nvirtualenv CMashVE\nsource CMashVE/bin/activate\npip install -U pip\ngit clone https://github.com/dkoslicki/CMash.git\ncd CMash\npip install -r requirements.txt\n```\n\nNote that the python code in this repository is python2 and python3 compatible, but the dependency ``khmer`` technically requires python3 (but ``khmer`` version ``2.1.1`` runs just fine in python2.)\nThe external dependency ``pytst`` requires python2, so I'm making this a python2 repository.\n## Usage\nThe basic paradigm is to create a reference/training database, form a sample bloom filter, and then query the database.\n\n#### Forming a reference/training database\nSay you have three reference fasta/q file: ``ref1.fa``, ``ref2.fa`` and ``ref3.fa``. In a file (here called ``FileNames.txt``), place the absolute paths pointing to the fasta/q files:\n```bash\ncat FileNames.txt\n# /abs/path/to/ref1.fa\n# /abs/path/to/ref2.fa\n# /abs/path/to/ref3.fa\n```\nThen you can create the training database via:\n```bash\nMakeDNADatabase.py FileNames.txt TrainingDatabase.h5\n```\nSee ``MakeDNADatabase.py -h`` for more options when forming a database.\n\n#### Creating a sample bloom filter\nGiven a (large) query fasta/q file ``Metagenome.fa``, you can *optionally* create a bloom filter via ``MakeNodeGraph.py Metagenome.fa .``. \nSee ``MakeNodeGraph.py -h`` for more details about this function.\n\nThis step is not strictly necessary (as the next step automatically forms a nodegraph/bloom filter if you didn't already create one). \nHowever, I've provided this script in case you want to pre-process a bunch of metagenomes.\n\n#### Query the database\nTo get containment and Jaccard index estimates of the references files in your query file ``Metagenome.fa``, use something like the following ``QueryDNADatabase.py Metagenome.fa TrainingDatabase.h5 Output.csv``.\n\nThere are a bunch of options available: ``QueryDNADatabase.py -h``. The output file is a CSV file with rows corresponding (in this case) to ``ref1.fa``, ``ref2.fa``, and ``ref3.fa`` and columns corresponding to the containment index estimate, intersection cardinality, and Jaccard index estimate.\n\n#### Other functionality\nThe module ``MinHash`` (imported in python via ``from CMash import MinHash as MH``) has a bunch more functionality, including (but not limited to!):\n1. Fast updates to the training databases (via ``help(MH.delete_from_database)``, ``help(MH.insert_to_database)``, ``help(MH.union_databases)``)\n2. Ability to form a matrix of Jaccard indexes (for comparison of all pairwise Jaccard indexes of organisms in the training database). This is useful for identifying redundances/patterns/structure in your training database: ``help(MH.form_jaccard_count_matrix)`` and ``help(MH.form_jaccard_matrix)``.\n3. Access to the k-mers that MinHash randomly selected (see the class ``CountEstimator`` and the associated ``_kmers`` data structure.)\n\nI'd encourage you to poke through the source code of ``MinHash.py`` and take a look at the scripts as well.\n\nProtein databases (and for that matter, arbitrary K-length strings) coming soon...\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dkoslicki/CMash", "keywords": "jaccard min hash containment genomics metagenomics", "license": "", "maintainer": "", "maintainer_email": "", "name": "CMash", "package_url": "https://pypi.org/project/CMash/", "platform": "", "project_url": "https://pypi.org/project/CMash/", "project_urls": { "Homepage": "https://github.com/dkoslicki/CMash" }, "release_url": "https://pypi.org/project/CMash/0.3.0/", "requires_dist": [ "khmer (>=2.1.1)", "screed", "h5py", "numpy", "blist", "argparse", "pandas", "setuptools (>=24.2.0)", "six", "scipy", "matplotlib" ], "requires_python": "", "summary": "Fast and accurate set similarity estimation via containment min hash (for genomic datasets).", "version": "0.3.0" }, "last_serial": 3274207, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "67c55808cb93274a9622ff31b5db8145", "sha256": "2ab543a8deeb078a82dee639c60a4bb6119267f0cae5399f469ac0b14a98b4c9" }, "downloads": -1, "filename": "CMash-0.1.0-py2-none-any.whl", "has_sig": false, "md5_digest": "67c55808cb93274a9622ff31b5db8145", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 78035, "upload_time": "2017-09-15T00:28:52", "url": "https://files.pythonhosted.org/packages/ed/0f/938041b5bb104b0ce42bc331ec6a21a4a9e577fd3823704c1605c8e44165/CMash-0.1.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1473f2a1e2c96115b8a0b5ba3a4968ce", "sha256": "f8a50a7abc34d8af7dad4aad9b034c87d1da668da3835004acabcb5ae9db254f" }, "downloads": -1, "filename": "CMash-0.1.0.tar.gz", "has_sig": false, "md5_digest": "1473f2a1e2c96115b8a0b5ba3a4968ce", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 73380, "upload_time": "2017-09-15T00:28:54", "url": "https://files.pythonhosted.org/packages/cc/44/4396e9ecaff4f1a2ad9852a5a20d67d373472d6013c444692afd9f9894d8/CMash-0.1.0.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "a9523954fc80f90661d8923fde082a06", "sha256": "abfcad3879da0e6528b958e86d033e1627261cdee572cfedc420a6dbce1b2c07" }, "downloads": -1, "filename": "CMash-0.2.0-py2-none-any.whl", "has_sig": false, "md5_digest": "a9523954fc80f90661d8923fde082a06", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 81561, "upload_time": "2017-09-16T17:47:19", "url": "https://files.pythonhosted.org/packages/e3/e2/ec2e3bcd99114bfe7cf8dcf192b9d31780b2fe4415f65d0c377ae000d945/CMash-0.2.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ae09691dcd8c6f92d69794de7f7cb794", "sha256": "b3bcbcaa874617f11aa8ab640346e5c7cd83d58c398c876b18ea21a3819f1099" }, "downloads": -1, "filename": "CMash-0.2.0.tar.gz", "has_sig": false, "md5_digest": "ae09691dcd8c6f92d69794de7f7cb794", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 76569, "upload_time": "2017-09-16T17:47:21", "url": "https://files.pythonhosted.org/packages/d0/77/dd609a814cc37f6cc1c659e005831662756afdb0b1daee85ce36f25a66a2/CMash-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "cd7a9d93baa2d36c92dfb1f82087cd7f", "sha256": "620628e8c43af15d8d78bfd5dd929b2fe4ecd2fe83d1cbaf8ecdc1aa17f1938a" }, "downloads": -1, "filename": "CMash-0.2.1-py2-none-any.whl", "has_sig": false, "md5_digest": "cd7a9d93baa2d36c92dfb1f82087cd7f", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 102836, "upload_time": "2017-10-22T22:20:18", "url": "https://files.pythonhosted.org/packages/e8/34/01f6b25af7607890a934ea53f267338ae682c28bd605286a23809a4b866b/CMash-0.2.1-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e973e4fa2684bdcc577dc811834cc540", "sha256": "352527579178b6a225fae9d1958f2ef7d64151812532859d7c83b1a83ce267fd" }, "downloads": -1, "filename": "CMash-0.2.1.tar.gz", "has_sig": false, "md5_digest": "e973e4fa2684bdcc577dc811834cc540", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 86735, "upload_time": "2017-10-22T22:20:19", "url": "https://files.pythonhosted.org/packages/bd/33/952077c54c992464d6c8e8cf0fe0c2866c6eddfb3139b5bbd34dbded1f54/CMash-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "13bbc4a8c6b732c36060311bf9298a56", "sha256": "7f369fd6d606eeb1065324b822bde32418e892c2e5000832d3564bad73ad6dcf" }, "downloads": -1, "filename": "CMash-0.2.2-py2-none-any.whl", "has_sig": false, "md5_digest": "13bbc4a8c6b732c36060311bf9298a56", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 102849, "upload_time": "2017-10-22T23:54:32", "url": "https://files.pythonhosted.org/packages/4b/a4/aab17bc9985e963026e53bd654e9a93b35059f265c416095844f3ea88c8d/CMash-0.2.2-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "40743d9b9bd49e2301ee4db5279c262a", "sha256": "92566c79328816c869235130969aea2351a1270ed0c1cfdf5afc3137214bcacc" }, "downloads": -1, "filename": "CMash-0.2.2.tar.gz", "has_sig": false, "md5_digest": "40743d9b9bd49e2301ee4db5279c262a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 86768, "upload_time": "2017-10-22T23:54:34", "url": "https://files.pythonhosted.org/packages/76/66/4fc103694db983fb8f209bcdad019bcbec3b5c3ef1898fa2935d2670d1e2/CMash-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "9a21111abc5c9cbf5cea70f27b7286d8", "sha256": "895445a5063abcb8af704d439ec9365575b406ded0de4a6731d79cd2ffa05d93" }, "downloads": -1, "filename": "CMash-0.2.3-py2-none-any.whl", "has_sig": false, "md5_digest": "9a21111abc5c9cbf5cea70f27b7286d8", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 103148, "upload_time": "2017-10-23T00:17:55", "url": "https://files.pythonhosted.org/packages/51/03/f0e9111207ff2594bd83b1886368b98d290fff7313863e1ff7341c63862f/CMash-0.2.3-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "947823dd9a31ec839627d854ec120d6a", "sha256": "6fff7a60f54c30715d143e7146f77ad6d509bb839b817123c1f152a365d90fe3" }, "downloads": -1, "filename": "CMash-0.2.3.tar.gz", "has_sig": false, "md5_digest": "947823dd9a31ec839627d854ec120d6a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 86830, "upload_time": "2017-10-23T00:17:57", "url": "https://files.pythonhosted.org/packages/75/f0/c2c0e367b43408c4df3d9b37e86f86b3806543235633121f683455cf972d/CMash-0.2.3.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "4e882a2ed6fef119a5286c6c8a60d2f9", "sha256": "2c4b4a089244d989bdce9df08079946c9f8526950aaf15993bf0aa5da435686c" }, "downloads": -1, "filename": "CMash-0.2.4-py2-none-any.whl", "has_sig": false, "md5_digest": "4e882a2ed6fef119a5286c6c8a60d2f9", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 104748, "upload_time": "2017-10-24T00:36:59", "url": "https://files.pythonhosted.org/packages/f3/12/375d7634d02e808193931d2664e758c955586e5e044d01365ee3b36d5a55/CMash-0.2.4-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6c204880310f9167a5b31251fb72bf5f", "sha256": "a9d21915ed6c12455d80288d5d63f68d0616a4af7fbab53f9ad378fe352d7954" }, "downloads": -1, "filename": "CMash-0.2.4.tar.gz", "has_sig": false, "md5_digest": "6c204880310f9167a5b31251fb72bf5f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 88363, "upload_time": "2017-10-24T00:37:01", "url": "https://files.pythonhosted.org/packages/eb/ab/7719db5f2ed566678312bf9a51f61df835964500ae5fe5859ead408456e6/CMash-0.2.4.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "cdc62992322974e94bd6ea563a4edc4b", "sha256": "2f8cc974807f02b8f5f32f8a977db55fb81fca7c50f31d77d464139487d7a1ee" }, "downloads": -1, "filename": "CMash-0.3.0-py2-none-any.whl", "has_sig": false, "md5_digest": "cdc62992322974e94bd6ea563a4edc4b", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 89572, "upload_time": "2017-10-24T05:43:57", "url": "https://files.pythonhosted.org/packages/94/34/a550b1f6f746083d63430e240a41e406d8b7bc7da8446a461c0666e91d69/CMash-0.3.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e5db6d4593d2635a6dfe2e34d7a25d28", "sha256": "42376167821cb5a4e6dbee4d2fd6cd35e305698c4428a49535694637f6e5e348" }, "downloads": -1, "filename": "CMash-0.3.0.tar.gz", "has_sig": false, "md5_digest": "e5db6d4593d2635a6dfe2e34d7a25d28", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 81737, "upload_time": "2017-10-24T05:44:00", "url": "https://files.pythonhosted.org/packages/a9/9d/7ed323a60aa4c5df1dc04836be77af3c2d56c5dc748479907dc47dbc3fc1/CMash-0.3.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cdc62992322974e94bd6ea563a4edc4b", "sha256": "2f8cc974807f02b8f5f32f8a977db55fb81fca7c50f31d77d464139487d7a1ee" }, "downloads": -1, "filename": "CMash-0.3.0-py2-none-any.whl", "has_sig": false, "md5_digest": "cdc62992322974e94bd6ea563a4edc4b", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 89572, "upload_time": "2017-10-24T05:43:57", "url": "https://files.pythonhosted.org/packages/94/34/a550b1f6f746083d63430e240a41e406d8b7bc7da8446a461c0666e91d69/CMash-0.3.0-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e5db6d4593d2635a6dfe2e34d7a25d28", "sha256": "42376167821cb5a4e6dbee4d2fd6cd35e305698c4428a49535694637f6e5e348" }, "downloads": -1, "filename": "CMash-0.3.0.tar.gz", "has_sig": false, "md5_digest": "e5db6d4593d2635a6dfe2e34d7a25d28", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 81737, "upload_time": "2017-10-24T05:44:00", "url": "https://files.pythonhosted.org/packages/a9/9d/7ed323a60aa4c5df1dc04836be77af3c2d56c5dc748479907dc47dbc3fc1/CMash-0.3.0.tar.gz" } ] }