{ "info": { "author": "gneiss development team", "author_email": "jamietmorton@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: BSD License", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX", "Operating System :: Unix", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3 :: Only", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Bio-Informatics", "Topic :: Software Development :: Libraries" ], "description": "[![Build Status](https://travis-ci.org/biocore/mmvec.svg?branch=master)](https://travis-ci.org/biocore/mmvec)\n\n# mmvec\nNeural networks for estimating microbe-metabolite interactions through their co-occurrence probabilities.\n\n![](https://github.com/biocore/mmvec/raw/master/img/mmvec.png \"mmvec\")\n\n# Installation\n\nMMvec can be installed via pypi as follows\n\n```\npip install mmvec\n```\n\nIf you are planning on using GPUs, be sure to `pip install tensorflow-gpu`.\n\nMMvec can also be installed via conda as follows\n\n```\nconda install mmvec -c conda-forge\n```\n\nNote that this option may not work in cluster environments; it may be worthwhile to pip install within a virtual environment. It is possible to pip install mmvec within a conda environment, including qiime2 conda environments. However, pip and conda are known to have compatibility issues, so proceed with caution.\n\n# Getting started\n\nTo get started you can run a quick example as follows. 
This will learn microbe-metabolite vectors (mmvec)\nwhich can be used to estimate microbe-metabolite conditional probabilities that are accurate up to rank.\n\n```\nmmvec paired-omics \\\n\t--otu-file data/otus.biom \\\n\t--metabolite-file data/ms.biom \\\n\t--summary-dir summary\n```\n\nWhile this is running, you can open up another session and run `tensorboard --logdir .` for diagnostics; see the FAQs below for more details.\n\nIf you investigate the summary folder, you will notice that there are a number of files deposited.\n\nSee the following url for a more complete tutorial with real datasets.\n\nhttps://github.com/knightlab-analyses/multiomic-cooccurences\n\nMore information can be found under `mmvec --help`\n\n# Qiime2 plugin\n\nIf you want to run this in a qiime environment, install this in your\nqiime2 conda environment (see qiime2 installation instructions [here](https://qiime2.org/)) and run the following\n\n```\npip install git+https://github.com/biocore/mmvec.git\nqiime dev refresh-cache\n```\n\nThis should allow your q2 environment to recognize mmvec. Before we test\nthe qiime2 plugin, run the following commands to import an example dataset\n\n```\nqiime tools import \\\n\t--input-path data/otus_nt.biom \\\n\t--output-path otus_nt.qza \\\n\t--type FeatureTable[Frequency]\n\nqiime tools import \\\n\t--input-path data/lcms_nt.biom \\\n\t--output-path lcms_nt.qza \\\n\t--type FeatureTable[Frequency]\n```\n\nThen you can run mmvec\n```\nqiime mmvec paired-omics \\\n\t--i-microbes otus_nt.qza \\\n\t--i-metabolites lcms_nt.qza \\\n\t--o-conditionals ranks.qza \\\n\t--o-conditional-biplot biplot.qza\n```\n\nIn the results, there are two files, namely `results/conditional_biplot.qza` and `results/conditionals.qza`. The conditional biplot is a biplot representation of the\nconditional probability matrix so that you can visualize these microbe-metabolite interactions in an exploratory manner. This can be directly visualized in\nEmperor as shown below. 
We also have the estimated conditional probability matrix given in `results/conditionals.qza`,\nwhich can be unzipped to yield a tab-delimited table via `unzip results/conditionals`. Each row can be ranked,\nso the top most occurring metabolites for a given microbe can be obtained by identifying the highest co-occurrence probabilities for each microbe.\n\nThese log conditional probabilities can also be viewed directly with `qiime metadata tabulate`. This can be\ncreated as follows\n\n```\nqiime metadata tabulate \\\n\t--m-input-file results/conditionals.qza \\\n\t--o-visualization conditionals-viz.qzv\n```\n\n\nThen you can run the following to generate an Emperor biplot.\n\n```\nqiime emperor biplot \\\n\t--i-biplot conditional_biplot.qza \\\n\t--m-sample-metadata-file data/metabolite-metadata.txt \\\n\t--m-feature-metadata-file data/microbe-metadata.txt \\\n\t--o-visualization emperor.qzv\n\n```\n\nThe resulting biplot should look something like the following\n\n![biplot](https://github.com/biocore/mmvec/raw/master/img/biplot.png \"Biplot\")\n\nHere, the points represent metabolites and the arrows represent microbes. Points that are close together are indicative of metabolites that\nfrequently co-occur with each other. Furthermore, arrows that have a small angle between them are indicative of microbes that co-occur with each other.\nArrows that point in the same direction as the metabolites are indicative of microbe-metabolite co-occurrences. 
In the biplot above, the red arrows\ncorrespond to Pseudomonas aeruginosa, and the red points correspond to Rhamnolipids that are likely produced by Pseudomonas aeruginosa.\n\nAnother way to examine these associations is to build heatmaps of the log\nconditional probabilities between observations, using the `heatmap` action:\n\n```\nqiime mmvec heatmap \\\n --i-ranks ranks.qza \\\n --m-microbe-metadata-file taxonomy.tsv \\\n --m-microbe-metadata-column Taxon \\\n --m-metabolite-metadata-file metabolite-metadata.txt \\\n --m-metabolite-metadata-column Compound_Source \\\n --p-level 5 \\\n --o-visualization ranks-heatmap.qzv\n```\n\nThis action generates a clustered heatmap displaying the log conditional\nprobabilities between microbes and metabolites. Larger positive log conditional\nprobabilities indicate a stronger likelihood of co-occurrence. Low and negative\nvalues indicate no relationship, not necessarily a negative correlation. Rows\n(microbial features) can be annotated according to feature metadata, as shown\nin this example; we provide a taxonomic classification file and the semicolon-\ndelimited taxonomic rank (`level`) that should be displayed in the color-coded\nmargin annotation. Set `level` to `-1` to display the full annotation\n(including of non-delimited feature metadata). Separate parameters are\navailable to annotate the x-axis (metabolites) in a similar fashion. Row and\ncolumn clustering can be adjusted using the `method` and `metric` parameters.\nThis action will generate a heatmap that looks similar to this:\n\n![heatmap](https://github.com/biocore/mmvec/raw/master/img/heatmap.png \"Heatmap\")\n\nBiplots and heatmaps give a great overview of co-occurrence associations, but\ndo not provide information about the abundances of these co-occurring features\nin each sample. 
This can be done with the `paired-heatmap` action:\n\n```\nqiime mmvec paired-heatmap \\\n --i-ranks ranks.qza \\\n --i-microbes-table otus_nt.qza \\\n --i-metabolites-table lcms_nt.qza \\\n --m-microbe-metadata-file taxonomy.tsv \\\n --m-microbe-metadata-column Taxon \\\n --p-features TACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTACTGAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTG \\\n --p-features TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTAGATAAGTCTGAAGTTAAAGGCTGTGGCTTAACCATAGTAGGCTTTGGAAACTGTTTAACTTGAGTGCAAGAGGGGAGAGTGGAATTCCATGT \\\n --p-top-k-microbes 0 \\\n --p-normalize rel_row \\\n --p-top-k-metabolites 100 \\\n --p-level 6 \\\n --o-visualization paired-heatmap-top2.qzv\n```\n\nThis action generates paired heatmaps that are aligned on the y-axis (sample\nIDs): the left panel displays the abundances of each selected microbial feature\nin each sample, and the right panel displays the abundances of the top k\nmetabolite features associated with each of these microbes in each sample.\nMicrobes can be selected automatically using the `top-k-microbes` parameter\n(which selects the microbes with the top k highest relative abundances) or they\ncan be selected by name using the `features` parameter (if using the QIIME 2\nplugin command-line interface as shown in this example, multiple features are\nselected by passing this parameter multiple times, e.g., `--p-features feature1\n--p-features feature2`; for python interfaces, pass a list of features:\n`features=[feature1, feature2]`). As with the `heatmap` action, microbial\nfeatures can be annotated by passing in `microbe-metadata` and specifying a\ntaxonomic `level` to display. 
The output looks something like this:\n\n![paired-heatmap](https://github.com/biocore/mmvec/raw/master/img/paired-heatmap.png \"Paired Heatmap\")\n\n\nMore information behind the actions and parameters can be found under `qiime mmvec --help`\n\n# FAQs\n\n**Q**: Looks like there are two different commands, a standalone script and a qiime2 interface. Which one should I use?!?\n\n**A**: It'll depend on how deep in the weeds you'll want to get. For most intents and purposes, the qiime2 interface will be more practical. There are 3 major reasons why the standalone scripts can be preferable to the qiime2 interface, namely\n\n1. Customized acceleration : If you want to bring down your runtime from a few days to a few hours, you may need to compile Tensorflow to handle hardware specific instructions (i.e. GPUs / SIMD instructions). It probably is possible to enable GPU compatibility within a conda environment with some effort, but since conda packages binaries, SIMD instructions will not work out of the box.\n\n2. Checkpoints : If you are not sure how long your analysis should run, the standalone script allows you to record checkpoints, from which you can recover your model parameters. This enables you to investigate your model while the model is training.\n\n3. More model parameters : The standalone script will return the bias parameters learned for each dataset (i.e. microbe and metabolite abundances). These are stored under the summary directory (specified by `--summary-dir`) under the name `embeddings.csv`. This file will hold the coordinates for the microbes and metabolites, along with biases. There are 4 columns in this file, namely `feature_id`, `axis`, `embed_type` and `values`. `feature_id` is the name of the feature, whether it be a microbe name or a metabolite feature id. `axis` corresponds to the name of the axis, which either corresponds to a PC axis or bias. `embed_type` denotes if the coordinate corresponds to a microbe or metabolite. 
`values` is the coordinate value for the given `axis`, `embed_type` and `feature_id`. This can be useful for accessing the raw parameters and building custom biplots / ranks visualizations - this also has the advantage of requiring much less memory to manipulate.\n\nIt is also important to note that you don't have to explicitly choose - it is very doable to run the standalone version first, then import those output files into qiime2. Importing can be done as follows\n\n```\nqiime tools import --input-path --output-path conditionals.qza --type FeatureData[Conditional]\n\nqiime tools import --input-path --output-path ordination.qza --type 'PCoAResults % (\"biplot\")'\n```\n\n**Q** : You mentioned that you can use GPUs. How can you do that??\n\n**A** : This can be done by running `pip install tensorflow-gpu` in your environment. See details [here](https://www.tensorflow.org/install/gpu).\n\nAt the moment, these capabilities are only available for the standalone CLI due to complications of installation. See the `--arm-the-gpu` option in the standalone interface.\n\n**Q** : Neural networks scare me - don't they overfit the crap out of your data?\n\n**A** : Here, we are using shallow neural networks (so only two layers). This falls under the same regime as PCA and SVD. But just as you can overfit PCA/SVD, you can also overfit mmvec, which is why we have Tensorboard enabled for diagnostics. You can visualize the `cv_rmse` to gauge if there is overfitting -- if your run is strictly decreasing, then that is a sign that you are probably not overfitting. But this is not necessarily indicative that you have reached the optimum -- you want to check to see if `logloss` has reached a plateau as shown above.\n\n**Q** : I'm confused, what is Tensorboard?\n\n**A** : Tensorboard is a diagnostic tool that runs in a web browser - note that this is only explicitly supported in the standalone version of mmvec. 
To open tensorboard, make sure you\u2019re in the mmvec environment and cd into the folder you are running the script above from. Then run:\n\n```\ntensorboard --logdir .\n```\n\nThe returned line will look something like:\n\n```\nTensorBoard 1.9.0 at http://Lisas-MacBook-Pro-2.local:6006 (Press CTRL+C to quit)\n```\nOpen the website (highlighted in red) in a browser. (Hint: if that doesn\u2019t work, try replacing the host name with localhost and keeping only the port number (here it is 6006), i.e. localhost:6006). Leave this tab alone. Now any mmvec output directories that you add to the folder that tensorboard is running in will be added to the webpage.\n\n\nIf working properly, it will look something like this\n![tensorboard](https://github.com/biocore/mmvec/raw/master/img/tensorboard.png \"Tensorboard\")\n\nFIRST graph in Tensorboard; 'Prediction accuracy'. Labelled `cv_rmse`\n\nThis is a graph of the prediction accuracy of the model; the model will try to guess the metabolite intensity values for the testing samples that were set aside in the script above, using only the microbe counts in the testing samples. Then it looks at the real values and sees how close it was.\n\nThe second graph is the `likelihood` - if your `likelihood` values have plateaued, that is a sign that you have converged and reached a local minimum.\n\nThe x-axis is the number of iterations (meaning times the model is training across the entire dataset). Every time you iterate across the training samples, you also run the test samples and the averaged results are being plotted on the y-axis.\n\n\nThe y-axis is the average number of counts off for each feature. The model is predicting the sequence counts for each feature in the samples that were set aside for testing. So in the graph above it means that, on average, the model is off by ~0.75 intensity units, which is low. 
However, this is ABSOLUTE error not relative error (unfortunately we don't know how to compute relative errors because of the sparsity in these datasets).\n\nYou can also compare multiple runs with different parameters to see which run performed the best. Useful parameters to note are `--epochs` and `--batch-size`. If you are committed to fine-tuning parameters, be sure to look at the `training-column` example to make the testing samples consistent across runs.\n\n\n**Q** : What's up with the `--training-column` argument?\n\n**A** : That is used for cross-validation if you have a specific reproducibility question that you are interested in answering. It can also make it easier to compare cross validation results across runs. If this is specified, only samples labeled \"Train\" under this column will be used for building the model and samples labeled \"Test\" will be used for cross validation. In other words the model will attempt to predict the microbe abundances for the \"Test\" samples. The resulting prediction accuracy is used to evaluate the generalizability of the model in order to determine if the model is overfitting or not. If this argument is not specified, then 10 random samples will be chosen for the test dataset. If you want to specify more random samples to allocate for cross-validation, the `num-random-test-examples` argument can be specified.\n\n\n**Q** : What sort of parameters should I focus on when picking a good model?\n\n**A** : There are 3 different parameters to focus on, `input-prior`, `output-prior` and `latent-dim`\n\nThe `--input-prior` and `--output-prior` options specify the width of the prior distribution of the coefficients, where the `--input-prior` is typically specific to microbes and the `--output-prior` is specific to metabolites.\nFor a prior of 1, this means 99% of entries in the embeddings (typically given in the `U.txt` and `V.txt` files) are within -3 and +3 (log fold change). 
The higher these priors are, the more parameters can have bigger changes, so you want to keep them relatively small. If you see overfitting (accuracy and fit increasing over iterations in tensorboard) you may consider reducing the `--input-prior` and `--output-prior` in order to reduce the parameter space.\n\nAnother parameter worth thinking about is `--latent-dim`, which controls the number of dimensions used to approximate the conditional probability matrix. This also specifies the dimensions of the microbe/metabolite embeddings `U.txt` and `V.txt`. The more dimensions this has, the more accurate the embeddings can be -- but the higher the chance of overfitting there is. The rule of thumb to follow is that, in order to fit these models, you need at least 10 times as many samples as there are latent dimensions (this is following a similar rule of thumb for fitting straight lines). So if you have 100 samples, you should definitely not have a latent dimension of more than 10. Furthermore, you can still overfit certain microbes and metabolites. For example, if you are fitting a model with those 100 samples and just 1 latent dimension, you can still easily overfit microbes and metabolites that appear in fewer than 10 samples -- so even fitting models with just 1 latent dimension will require some microbes and metabolites that appear in fewer than 10 samples to be filtered out.\n\n\n**Q** : What does a good model fit look like??\n\n**A** : Again the numbers vary greatly by dataset. But you want to see both the `logloss` and `cv_rmse` curves decaying, and plateau as close to zero as possible.\n\n**Q** : How long should I expect this program to run?\n\n**A** : Both `epochs` and `batch-size` contribute to determining how long the algorithm will run, namely\n\n**Number of iterations = `epoch #` multiplied by the ( Total # of microbial reads / `batch-size` parameter)**\n\nThis also depends on if your program will converge. 
The `learning-rate` specifies the resolution (smaller step size = finer resolution, but may take longer to converge). You will need to consult with Tensorboard to make sure that your model fit is sane. See this paper for more details on gradient descent: https://arxiv.org/abs/1609.04747\n\nIf you are running this on a CPU with 16 cores, a run that reaches convergence should take about 1 day.\nIf you have a GPU - you may be able to get this down to a few hours. However, some fine-tuning of the `batch-size` parameter may be required -- instead of having a small `batch-size` < 100, you'll want to bump up the `batch-size` to between 1000 and 10000 to fully leverage the speedups available on the GPU.\n\nCredits to Lisa Marotz ([@lisa55asil](https://github.com/lisa55asil)), Yoshiki Vazquez-Baeza ([@ElDeveloper](https://github.com/ElDeveloper)) and Julia Gauglitz ([@jgauglitz](https://github.com/jgauglitz)) for their README contributions.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "", "keywords": "", "license": "BSD-3-Clause", "maintainer": "gneiss development team", "maintainer_email": "jamietmorton@gmail.com", "name": "mmvec", "package_url": "https://pypi.org/project/mmvec/", "platform": "", "project_url": "https://pypi.org/project/mmvec/", "project_urls": null, "release_url": "https://pypi.org/project/mmvec/1.0.2/", "requires_dist": null, "requires_python": "", "summary": "Microbe-metabolite interactions through neural networks", "version": "1.0.2" }, "last_serial": 5993224, "releases": { "0.5.0": [ { "comment_text": "", "digests": { "md5": "851120bca492cc8b0b17c18e31fcc011", "sha256": "5ac92c92b7522180060d03398374864874e9939759da2b62906f47e80b2bd935" }, "downloads": -1, "filename": "mmvec-0.5.0.tar.gz", "has_sig": false, "md5_digest": "851120bca492cc8b0b17c18e31fcc011", "packagetype": "sdist", "python_version": "source", "requires_python": null, 
"size": 18001, "upload_time": "2019-08-20T19:24:51", "url": "https://files.pythonhosted.org/packages/36/e7/67f7774c701a1f07e4632e01d1fa55c5c1895b4a16653b89b49a1019f894/mmvec-0.5.0.tar.gz" } ], "0.6.0": [ { "comment_text": "", "digests": { "md5": "52dea14b21ca5ddbe33ac412c806c11f", "sha256": "b16d8e0bc7da9127e608759cbde02b074ebca6f8dffe250b0d7c94738171fcf5" }, "downloads": -1, "filename": "mmvec-0.6.0.tar.gz", "has_sig": false, "md5_digest": "52dea14b21ca5ddbe33ac412c806c11f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 26694, "upload_time": "2019-09-05T15:42:58", "url": "https://files.pythonhosted.org/packages/c8/ed/f33d1d7ffe2a77a872ccdb09bffcdc5a4207b5481e8fb828699bc33bb45e/mmvec-0.6.0.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "9f48b32a32f757cf49f7a1b995407c22", "sha256": "d1a69bcf356fa59af299e0fa43f7093ab9bc87548ec36b8bd78b88f4351e953c" }, "downloads": -1, "filename": "mmvec-1.0.0.tar.gz", "has_sig": false, "md5_digest": "9f48b32a32f757cf49f7a1b995407c22", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 31827, "upload_time": "2019-09-30T16:05:22", "url": "https://files.pythonhosted.org/packages/c3/9b/c7666728006fc37521683cac2693dd4d85fb4091b536944e1db87f55a123/mmvec-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "b1fd7b29c118274665a2b38a63f5f0e6", "sha256": "89c53229d39ecf5dec0cc05efed32f915258b46a401027a13dfd818f6257714f" }, "downloads": -1, "filename": "mmvec-1.0.1.tar.gz", "has_sig": false, "md5_digest": "b1fd7b29c118274665a2b38a63f5f0e6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 32072, "upload_time": "2019-10-17T21:48:51", "url": "https://files.pythonhosted.org/packages/a2/ee/1e55defb5e3dce26e2eec9659fc83012f5d6b0ee30ce1cb5213d40cee101/mmvec-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "63e7924b9ff7556c5d5b95720f57c620", "sha256": 
"398f75783e3ec98ae4026aeedd2bef416e0b4ac785048dd62ae6286785de865c" }, "downloads": -1, "filename": "mmvec-1.0.2.tar.gz", "has_sig": false, "md5_digest": "63e7924b9ff7556c5d5b95720f57c620", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 32171, "upload_time": "2019-10-18T01:35:43", "url": "https://files.pythonhosted.org/packages/eb/2f/a8d36245aa56c4cdf0106606838edf2d0a9e0c665d2daa35928a1f054564/mmvec-1.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "63e7924b9ff7556c5d5b95720f57c620", "sha256": "398f75783e3ec98ae4026aeedd2bef416e0b4ac785048dd62ae6286785de865c" }, "downloads": -1, "filename": "mmvec-1.0.2.tar.gz", "has_sig": false, "md5_digest": "63e7924b9ff7556c5d5b95720f57c620", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 32171, "upload_time": "2019-10-18T01:35:43", "url": "https://files.pythonhosted.org/packages/eb/2f/a8d36245aa56c4cdf0106606838edf2d0a9e0c665d2daa35928a1f054564/mmvec-1.0.2.tar.gz" } ] }