{ "info": { "author": "Samuel Minot", "author_email": "sminot@fredhutch.org", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6" ], "description": "## Find Co-Abundant Groups of Genes\n\n[![Docker Repository on Quay](https://quay.io/repository/fhcrc-microbiome/find-cags/status \"Docker Repository on Quay\")](https://quay.io/repository/fhcrc-microbiome/find-cags)\n\n#### Purpose\n\nAnalyze gene abundance data from a large set of samples and calculate\nwhich sets of genes are found at a similar abundance across all samples.\nThose genes are expected to be biologically linked, such as the case of\nmetagenomic analysis via whole-genome shotgun sequences, where genes\nfrom the same genome tend to be found at a similar abundance.\n\n\n#### Code Availability\n\nThe code in this repository is provided in two different formats. There\nis a library of Python code (`ann_linkage_clustering` in PyPI) that can\nbe used to make CAGs directly from a Pandas DataFrame. There is also a\nDocker image that is intended to be run with the script `find-cags.py`.\nThe documentation below describes the end-to-end workflow that is available\nwith that Docker image and the single wrapper script. \n\n\n#### Input Data Format\n\nIt is assumed that all input data will be in JSON format (gzip optional).\nThe pertinent data for each individual sample is an abundance metric for\neach sample. The input file must contain a `list` in which each element\nis a `dict` that contains the gene ID with one `key` and the abundance\nmetric with another `key`. \n\nFor initial development we will assume that each input file is a single\n`dict`, with the results located at a single `key` within that `dict`. \nIn the future we may end up supporting more flexibility in extracting\nresults from files with different structures, but for the first pass we'll\njust go with this.\n\nTherefore the features that must be specified by the user are:\n\n * Key for the list of gene abundances within the JSON (e.g. \"results\")\n * Key for the gene_id within each element of the list (e.g. \"id\")\n * Key for the abundance metric within each element (e.g. \"depth\")\n\nHere is an example of what that might look like in JSON format:\n\n```json\n{\n \"results\": [\n {\n \"id\": \"gene_1\",\n \"depth\": 1.1\n },\n {\n \"id\": \"gene_2\",\n \"depth\": 0.2\n },\n {\n \"id\": \"gene_3\",\n \"depth\": 3000.015\n },\n ],\n \"logs\": [\n \"any other data\",\n \"that you would like\",\n \"to include in this file is just fine.\"\n ]\n}\n```\n\n**NOTE**: All abundance metric values must be >= 0\n\n\n#### Running from any DataFrame\n\nIf you have any other format of data, you can use this code to find CAGs as well.\nThe big difference is that this script does some data normalization that is very\nhelpful. For example, if you are using cosine distance, it's best to have the value\nindicating absence to be zero. So if you are using the centered log-ratio (clr)\nnormalization approach, you really need to set a standard cuttoff across all samples,\ntrim the lowest value to that, and then set that lowest value to zero. This is all\ndone automatically by `find-cags.py`, but you can absolutely use the same functions\nto make CAGs with any other input data format or normalization approach.\n\n\nYou can follow the workflow in the `find-cags.py` script, which basically follows\nthis workflow (assuming that `df` is your DataFrame of abundance data, with genes\nin rows and samples in columns):\n\n```python\nfrom multiprocessing import Pool\nfrom ann_linkage_clustering.lib import make_cags_with_ann\nfrom ann_linkage_clustering.lib import iteratively_refine_cags\nfrom ann_linkage_clustering.lib import make_nmslib_index\n\n# Maximum distance threshold (use any value)\nmax_dist=0.2\n\n# Distance metric (only 'cosine' is supported)\ndistance_metric=\"cosine\"\n\n# Multiprocessing pool (pick any number of threads, in this case `1`)\nthreads = 1\npool = Pool(threads)\n\n# Linkage type (only `average` is fully supported)\nlinkage_type = \"average\"\n\n# Make the ANN index\nindex = make_nmslib_index(df)\n\n# Make the CAGs in the first round\ncags = make_cags_with_ann(\n index,\n max_dist,\n df,\n pool,\n threads=threads,\n distance_metric=distance_metric,\n linkage_type=linkage_type\n)\n\n# Iteratively refine the CAGs (this is the part that is hardedcoded to \n# use average linkage clustering, while the step above could technically\n# use any of `complete`, `single`, `average`, etc.)\niteratively_refine_cags(\n cags,\n df.copy(),\n max_dist,\n distance_metric=distance_metric,\n linkage_type=linkage_type,\n threads=threads\n)\n```\n\nAt the end of all of that, the `cags` object is a dictionary containing\nall of the identified groups.\n\n\n#### Sample Sheet\n\nTo link individual files with sample names, the user will specify a\nsample sheet, which is a JSON file formatted as a `dict`, with sample\nnames as key and file locations as values. \n\n\n#### Data Locations\n\nAt the moment we will support data found in either the local file system \nor AWS S3.\n\n\n#### Test Dataset\n\nFor testing, I will use a set of JSONs which contain the abundance of\nindividual genes for a set of microbiome samples. That data is found in the\n`tests/` folder. There is also a JSON file indicating which sample goes\nwith which file, which is formatted as a simple dict (keys are sample names\nand values are file locations) and located in `tests/sample_sheet.json`.\n\n\n#### Normalization\n\nThe `--normalize` metric accepts three values, `clr`, `median`, and `sum`. In each case\nthe abundance metric for each gene within each sample is divided by either\nthe `median` or the `sum` of the abundance metrics for all genes within that\nsample. When calculating the `median`, only genes with non-zero abundances\nare considered. For `clr`, each value is divided by the geometric mean for the\nsample, and then the log10 is taken. All zero values are filled with the minimum\nvalue for the entire dataset (so that they are equal across samples, and not\nsensitive to sequencing depth).\n\n\n#### Approximate Nearest Neighbor\n\nThe Approximate Nearest Neighbor algorithm as implemented by \n[nmslib](https://nmslib.github.io/nmslib/index.html) is being used to create the CAGs.\nThis implementation has a high performance in an independent \n[benchmark](http://ann-benchmarks.com/).\n\n\n#### Distance Metric\n\nThe distance metric is now hard-coded to be the cosine similarity. This is limited by the\navailable functionality of ANN in `nmslib`, and therefore has been standardized to the\nother parts of the codebase as well.\n\n\n#### Refinements\n\nAfter finding CAGs, the algorithm will iteratively join CAGs that are very similar to each\nother in aggregate. This increases the fidelity of the final CAGs and mitigates some of the\nsensitivity limitations of ANN.\n\n\n#### Invocation\n\n```\nusage: find-cags.py [-h] --sample-sheet SAMPLE_SHEET --output-prefix\n OUTPUT_PREFIX --output-folder OUTPUT_FOLDER\n [--normalization NORMALIZATION] [--max-dist MAX_DIST]\n [--temp-folder TEMP_FOLDER] [--results-key RESULTS_KEY]\n [--abundance-key ABUNDANCE_KEY]\n [--gene-id-key GENE_ID_KEY] [--threads THREADS]\n [--min-samples MIN_SAMPLES] [--clr-floor CLR_FLOOR]\n [--test]\n\nFind a set of co-abundant genes\n\noptional arguments:\n -h, --help show this help message and exit\n --sample-sheet SAMPLE_SHEET\n Location for sample sheet (.json[.gz]).\n --output-prefix OUTPUT_PREFIX\n Prefix for output files.\n --output-folder OUTPUT_FOLDER\n Folder to place results. (Supported: s3://, or local\n path).\n --normalization NORMALIZATION\n Normalization factor per-sample (median, sum, or clr).\n --max-dist MAX_DIST Maximum cosine distance for clustering.\n --temp-folder TEMP_FOLDER\n Folder for temporary files.\n --results-key RESULTS_KEY\n Key identifying the list of gene abundances for each\n sample JSON.\n --abundance-key ABUNDANCE_KEY\n Key identifying the abundance value for each element\n in the results list.\n --gene-id-key GENE_ID_KEY\n Key identifying the gene ID for each element in the\n results list.\n --threads THREADS Number of threads to use.\n --min-samples MIN_SAMPLES\n Filter genes by the number of samples they are found\n in.\n --clr-floor CLR_FLOOR\n Set a minimum CLR value, 'auto' will use the largest\n minimum value.\n --test Run in testing mode and only process a subset of 2,000\n genes.\n\n ```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/FredHutch/find-cags", "keywords": "science clustering", "license": "", "maintainer": "", "maintainer_email": "", "name": "ann-linkage-clustering", "package_url": "https://pypi.org/project/ann-linkage-clustering/", "platform": "", "project_url": "https://pypi.org/project/ann-linkage-clustering/", "project_urls": { "Bug Reports": "https://github.com/fredhutch/find-cags/issues", "Homepage": "https://github.com/FredHutch/find-cags", "Source": "https://github.com/fredhutch/find-cags/" }, "release_url": "https://pypi.org/project/ann-linkage-clustering/0.11.1/", "requires_dist": [ "pandas (>=0.20.3)", "numpy (>=1.13.1)", "scipy (>=0.19.1)", "awscli", "boto3 (>=1.4.7)", "python-dateutil (==2.6.0)", "fastcluster (>=1.1.24)", "nmslib (>=1.7.2)", "scikit-learn (>=0.19.2)" ], "requires_python": "", "summary": "Linkage clustering via Approximate Nearest Neighbors", "version": "0.11.1" }, "last_serial": 4582802, "releases": { "0.11": [ { "comment_text": "", "digests": { "md5": "87474f3118a621ca3c097d15a2af5e5b", "sha256": "95bfbb1b5c83a022b1a92c53b51b351d92a958a2aa516bca2f069e1488c3f737" }, "downloads": -1, "filename": "ann_linkage_clustering-0.11-py3-none-any.whl", "has_sig": false, "md5_digest": "87474f3118a621ca3c097d15a2af5e5b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12228, "upload_time": "2018-12-06T10:05:25", "url": "https://files.pythonhosted.org/packages/19/23/b056952c899b3d6ae5399588b28e90a665ea83ccc3aaf67c3bee87c11053/ann_linkage_clustering-0.11-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "518f2981b8e9b41555606b0efd02ab9d", "sha256": "84b7f71186692b5a742ecd31cb9629f8eeb04b0cdaff2d514bfe1a7dc89259e7" }, "downloads": -1, "filename": "ann_linkage_clustering-0.11.tar.gz", "has_sig": false, "md5_digest": "518f2981b8e9b41555606b0efd02ab9d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13887, "upload_time": "2018-12-06T10:05:28", "url": "https://files.pythonhosted.org/packages/53/6c/830779e4a022fd49c04aacd56823d47e8b6e88480a06d9b68bf34781fcb5/ann_linkage_clustering-0.11.tar.gz" } ], "0.11.1": [ { "comment_text": "", "digests": { "md5": "58b5697ff89e2f3fc487d52963efcec8", "sha256": "b0716771643feca48c06db46ce5fa645a0792514d2e29d9ac6a9a8ac30f055e5" }, "downloads": -1, "filename": "ann_linkage_clustering-0.11.1-py3-none-any.whl", "has_sig": false, "md5_digest": "58b5697ff89e2f3fc487d52963efcec8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12416, "upload_time": "2018-12-10T22:08:41", "url": "https://files.pythonhosted.org/packages/a6/4c/8e7ecbe2a9c39c14894133780b3e19f20ddb21bf72debcabaf3dd54ea120/ann_linkage_clustering-0.11.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "262a9f81ea198f3d8de960ecc09a2235", "sha256": "975e50f5f99081bd938d9a255e8d75f3b97616d5c1f44ee741c42a0c344ab510" }, "downloads": -1, "filename": "ann_linkage_clustering-0.11.1.tar.gz", "has_sig": false, "md5_digest": "262a9f81ea198f3d8de960ecc09a2235", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14041, "upload_time": "2018-12-10T22:08:43", "url": "https://files.pythonhosted.org/packages/55/d6/9dd401fcb7a99cba15de4dedb1b599aa06ff5c9d64141f8cc1d7a0ed018b/ann_linkage_clustering-0.11.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "58b5697ff89e2f3fc487d52963efcec8", "sha256": "b0716771643feca48c06db46ce5fa645a0792514d2e29d9ac6a9a8ac30f055e5" }, "downloads": -1, "filename": "ann_linkage_clustering-0.11.1-py3-none-any.whl", "has_sig": false, "md5_digest": "58b5697ff89e2f3fc487d52963efcec8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 12416, "upload_time": "2018-12-10T22:08:41", "url": "https://files.pythonhosted.org/packages/a6/4c/8e7ecbe2a9c39c14894133780b3e19f20ddb21bf72debcabaf3dd54ea120/ann_linkage_clustering-0.11.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "262a9f81ea198f3d8de960ecc09a2235", "sha256": "975e50f5f99081bd938d9a255e8d75f3b97616d5c1f44ee741c42a0c344ab510" }, "downloads": -1, "filename": "ann_linkage_clustering-0.11.1.tar.gz", "has_sig": false, "md5_digest": "262a9f81ea198f3d8de960ecc09a2235", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14041, "upload_time": "2018-12-10T22:08:43", "url": "https://files.pythonhosted.org/packages/55/d6/9dd401fcb7a99cba15de4dedb1b599aa06ff5c9d64141f8cc1d7a0ed018b/ann_linkage_clustering-0.11.1.tar.gz" } ] }