{
"info": {
"author": "Christopher Heje Gr\u00f8nbech, Maximillian Fornitz Vording",
"author_email": "christopher.heje.groenbech@regionh.dk",
"bugtrack_url": null,
"classifiers": [
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3"
],
"description": "# scVAE: Single-cell variational auto-encoders #\n\nscVAE is a command-line tool for modelling single-cell transcript counts using variational auto-encoders.\n\nscVAE was developed by Christopher Heje Gr\u00f8nbech and Maximillian Fornitz Vording, and it is being developed further by Christopher Heje Gr\u00f8nbech. The methods used by scVAE are described and examined in Gr\u00f8nbech *et al.* (2018).\n\nscVAE requires Python 3.5 or later, which can be installed in [several ways][Python-installation-guides], for example, using [Miniconda][].\n\n[Python-installation-guides]: https://realpython.com/installing-python/\n[Miniconda]: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html\n\n## Installation ##\n\nInstall scVAE using pip:\n\n\t$ python3 -m pip install scvae\n\nNote: scVAE depends on TensorFlow. If the required version of TensorFlow is not already installed, the CPU-enabled version is installed by default. If you have an Nvidia GPU, you should install the GPU-enabled version of TensorFlow beforehand, since this is significantly faster:\n\n\t$ python3 -m pip install tensorflow-gpu\n\n## Using scVAE ##\n\nIn general, scVAE is used in the following way:\n\n\t$ scvae $COMMAND $DATA_SET\n\nwhere `$COMMAND` can be `analyse` (data analysis), `train` (model training), or `evaluate` (model evaluation and analysis). `$DATA_SET` is a path to a data set file or a short name for a data set.\n\nBy default, data are placed and cached in the subfolder `data/`, models are saved in the subfolder `models/`, and analyses are saved in the subfolder `analyses/`.\n\nIn the following, the most relevant options are described. 
Use the help option to list all options for each command:\n\n\t$ scvae $COMMAND --help\n\n### Data sets ###\n\nSeveral data sets are already included in scVAE:\n\n* `Macosko-MRC`: [GSE63472][Macosko-MRC].\n* `10x-MBC`: [1.3 Million Brain Cells from E18 Mice][10x-MBC] from [10x Genomics][10x].\n\t* `10x-MBC-20k`: 20 000 sampled cells.\n* `10x-PBMC-PP`: Nine data sets of [purified PBMC populations][10x-PBMC-PP] from 10x Genomics as specified in Gr\u00f8nbech *et al.* (2018).\n* `10x-PBMC-68k`: [Fresh 68k PBMCs (Donor A)][10x-PBMC-68k] from 10x Genomics.\n* `TCGA-RSEM`: [\"transcript expression RNAseq - TOIL RSEM expected_count\"][TCGA-RSEM] data set from the [TCGA Pan-Cancer (PANCAN)][TCGA-PANCAN] cohort.\n\n[Macosko-MRC]: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63472\n[10x-MBC]: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons\n[10x]: https://www.10xgenomics.com\n[10x-PBMC-PP]: https://support.10xgenomics.com/single-cell-gene-expression/datasets/\n[10x-PBMC-68k]: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/fresh_68k_pbmc_donor_a\n[TCGA-RSEM]: https://xenabrowser.net/datapages/?dataset=tcga_expected_count&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443\n[TCGA-PANCAN]: https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443\n\nData sets will be cached in the data directory, which defaults to `data/`. This can be changed using the option `--data-directory` (or `-D`).\n\nBe aware that it might take some time to load and preprocess the data the first time for large data sets. 
Also note that to load and analyse the `10x-MBC` data set, 47 GB of memory is required (32 GB for the original data set in sparse representation and 15 GB for the reconstructed test set in dense representation).\n\nThe default model can be trained on, for example, the `10x-PBMC-PP` data set like this:\n\n\t$ scvae train 10x-PBMC-PP\n\n#### Custom data sets ####\n\nscVAE can read Loom files and read counts from TSV files without further configuration:\n\n\t$ scvae train PBMC.loom\n\nThe TSV files can be compressed using gzip. Each row should represent a cell or sample and each column a gene (for the reverse case, see below). If a header row and/or a header column is provided, they are used as gene IDs/names and/or cell/sample names, respectively.\n\nscVAE also supports the following formats (supplied using the `--format` option):\n\n* `10x`: Output format for 10x Genomics's Cell Ranger.\n* `gtex`: Format for data sets from [GTEx][].\n* `loom`: Loom format.\n* `matrix_ebf`: (gzip compressed) TSV file with cells/samples/examples as rows and genes/features as columns (examples-by-features).\n* `matrix_fbe`: (gzip compressed) TSV file with genes/features as rows and cells/samples/examples as columns (features-by-examples).\n\n[GTEx]: https://gtexportal.org/home/index.html\n\nThe last of these formats can be used to read a TSV file that is transposed relative to the default case:\n\n\t$ scvae train PBMC.tsv.gz --format matrix_fbe\n\nUsing the Loom format, included cell types and batch indices can also be imported without further configuration. 
Cell types for other formats can be imported in TSV format by instead providing a [JSON][] file with a `values` field with the filename for the read counts, a `labels` field with the filename for the cell types, and a `format` field with the format.\n\n[JSON]: https://en.wikipedia.org/wiki/JSON\n\nA JSON file for a GTEx data set would look like this:\n\n\t{\n\t\t\"values\": \"GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz\",\n\t\t\"labels\": \"GTEx_v7_Annotations_SampleAttributesDS.txt\",\n\t\t\"format\": \"gtex\"\n\t}\n\nThe GTEx data set can then be imported and modelled:\n\n\t$ scvae train gtex.json\n\n#### Withheld data ####\n\nThe data set can be split into a training, a validation, and a test set using the `--split-data-set` option:\n\n\t$ scvae train 10x-PBMC-PP --split-data-set\n\nThen, the training set is used to train the model, the validation set is used for early stopping, and the test set is used when evaluating the model.\n\n### Training a model ###\n\nThe command `train` is used to train a model on a data set:\n\n\t$ scvae train 10x-PBMC-PP\n\nBy default, a VAE model with a Poisson likelihood function, a two-dimensional latent variable, and one hidden layer of 100 units will be trained on the specified data set for 200 epochs with a learning rate of 10^-4.\n\nThe default model can be changed by using the following options:\n\n* `-m`: The model type, either `VAE` or `GMVAE`.\n* `-r`: Likelihood function (or reconstruction distribution):\n\t* `poisson`,\n\t* `negative_binomial`,\n\t* `zero_inflated_poisson`,\n\t* `zero_inflated_negative_binomial`,\n\t* `constrained_poisson`,\n\t* `bernoulli`,\n\t* `gaussian`, and\n\t* `log_normal`.\n* `-k`: The threshold for modelling low counts using discrete probabilities and high counts using a shifted likelihood function (denoted by *k*_max in Gr\u00f8nbech *et al.*, 2018). This turns the likelihood function into a corresponding piecewise categorical likelihood function.\n* `-q`: The latent prior distribution. 
For the VAE model, this can only be an isotropic Gaussian distribution (`gaussian`) or one with unit variance (`unit_variance_gaussian`). For the GMVAE model, this can either be a Gaussian-mixture model with a diagonal covariance matrix (`gaussian_mixture`) or a full covariance matrix (`full_covariance_gaussian_mixture`). Note that a full covariance matrix should only be used for simpler GMVAE models.\n* `--prior-probabilites-method`: Method for how to set the mixture coefficients for the latent prior distribution of the GMVAE model. They can be fixed to either uniform values (`uniform`) or inferred values from labelled data (`infer`), or they can be learnt by the model (`learn`).\n* `-l`: The dimension of the latent variable.\n* `-H`: The number of hidden units in each layer separated by spaces. For example, `-H 200 100` will make both the inference (encoder) and the generative (decoder) networks two-layered with the first inference layer and the last generative layer consisting of 200 hidden units and the last inference layer and the first generative layer consisting of 100 hidden units.\n* `-K`: The number of components for the GMVAE (if possible, this is inferred from labelled data, but it can be overridden using this option).\n* `-w`: The number of epochs during the start of training with a linear weight on the KL divergence (the warm-up optimisation scheme described in Gr\u00f8nbech *et al.*, 2018). 
This weight is increased linearly from 0 to 1 over this number of epochs.\n* `--batch-correction`: Perform batch correction if batch indices are available in the data set (currently only possible with Loom data sets).\n\nThe training procedure can be changed using the following options (only applicable to the `train` command):\n\n* `-e`: The number of epochs to train the model.\n* `--learning-rate`: The learning rate of the model.\n\nA GMVAE model with a negative binomial likelihood function, a 100-dimensional latent variable, and two hidden layers of 100 units each is trained for 500 epochs, using the warm-up scheme for the first 200 epochs, like this:\n\n\t$ scvae train 10x-PBMC-PP -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 -e 500\n\nTrained models are saved to the subdirectory `models/` by default. This can be changed using the option `--models-directory` (or `-M`).\n\n### Evaluating a model ###\n\nThe command `evaluate` is used to evaluate a model on a data set:\n\n\t$ scvae evaluate 10x-PBMC-PP\n\nNote that the model must already have been trained on the same data set.\n\nThe model is specified in the same way as when training the model, and the model will be evaluated at the last epoch to which it was trained. If withheld data were used, the model will also be evaluated at the early-stopping epoch and at the epoch with the optimal marginal log-likelihood lower bound (if available). A number of analyses of the models and results are conducted, and these are saved in the subdirectory `analyses/`. This can be changed using the option `--analyses-directory` (or `-A`).\n\nCells can be clustered and cell types can be predicted using the option `--prediction-method`. Currently only *k*-means clustering (`kmeans`) is supported. The GMVAE clusters cells and predicts cell types using its built-in density-based clustering by default.\n\nTo visualise the data sets or latent spaces thereof, these are decomposed using a decomposition method. By default, this method is PCA. 
This can be changed using the option `--decomposition-methods`, and as the name implies, multiple methods can be specified: PCA (`pca`), ICA (`ica`), SVD (`svd`), and *t*-SNE (`tsne`).\n\nThe GMVAE model trained in the previous section is evaluated with PCA and *t*-SNE decomposition methods like this:\n\n\t$ scvae evaluate 10x-PBMC-PP -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 --decomposition-methods pca tsne\n\n### Examples ###\n\nTo reproduce the main results from Gr\u00f8nbech *et al.* (2018), you can run the following commands:\n\n* Combined PBMC data set from 10x Genomics:\n\n\t\t$ scvae train 10x-PBMC-PP --split-data-set -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 -e 500\n\t\t$ scvae evaluate 10x-PBMC-PP --split-data-set -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 --decomposition-methods pca tsne\n\n* TCGA data set:\n\n\t\t$ scvae train TCGA-RSEM --map-features --feature-selection keep_highest_variances 5000 --split-data-set -m GMVAE -r negative_binomial -l 50 -H 1000 1000 -e 500\n\t\t$ scvae evaluate TCGA-RSEM --map-features --feature-selection keep_highest_variances 5000 --split-data-set -m GMVAE -r negative_binomial -l 50 -H 1000 1000 --decomposition-methods pca tsne\n\n## References ##\n\nChristopher Heje Gr\u00f8nbech, Maximillian Fornitz Vording, Pascal Nordgren Timshel, Casper Kaae S\u00f8nderby, Tune Hannes Pers, and Ole Winther. [scVAE: Variational auto-encoders for single-cell gene expression data][scVAE-paper]. bioRxiv, 2018.\n\n[scVAE-paper]: https://www.biorxiv.org/content/10.1101/318295v2\n\n\n# Release History\n\n## 2.0.0 (2019-05-18) ##\n\n* Complete refactor and clean-up including structuring as Python package.\n* Easier loading of custom data sets.\n* Batch correction included in models for data sets with batch indices.\n* Learnable mixture coefficients for the GMVAE model.\n* Full covariance matrix for the GMVAE model.\n* Sampling from models.\n\n## 1.0 (2018-04-25) ##\n\nInitial release.\n\n\n",
"description_content_type": "text/markdown",
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "http://github.com/chgroenbech/scvae",
"keywords": "",
"license": "Apache 2.0",
"maintainer": "",
"maintainer_email": "",
"name": "scvae",
"package_url": "https://pypi.org/project/scvae/",
"platform": "",
"project_url": "https://pypi.org/project/scvae/",
"project_urls": {
"Homepage": "http://github.com/chgroenbech/scvae"
},
"release_url": "https://pypi.org/project/scvae/2.0.0/",
"requires_dist": [
"importlib-resources (>=1.0)",
"loompy (>=2.0)",
"matplotlib (>=2.0)",
"numpy (>=1.16)",
"pandas (>=0.24)",
"pillow (>=5.4)",
"scikit-learn (>=0.20)",
"scipy (>=1.2)",
"seaborn (>=0.9)",
"tables (>=3.5)",
"tensorflow (>=1.13)",
"tensorflow-probability (>=0.6)"
],
"requires_python": ">=3.5.0",
"summary": "Model single-cell transcript counts using deep learning.",
"version": "2.0.0"
},
"last_serial": 5286347,
"releases": {
"0.0.1": [
{
"comment_text": "",
"digests": {
"md5": "ef1ffa662264246961d99b456dc47d83",
"sha256": "b860f23fa6df023d8fada7aa92f573d7b06e2e990d68912c739e39b2d1a9db42"
},
"downloads": -1,
"filename": "scvae-0.0.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ef1ffa662264246961d99b456dc47d83",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 1255,
"upload_time": "2019-02-27T11:37:19",
"url": "https://files.pythonhosted.org/packages/d5/bc/e9295e9c2c62866a9cda02c49c5fdbed08b12974ef6cbe43f62499fd4a4a/scvae-0.0.1-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "cfd53c87eb3e2ab766d93affd9d5dc63",
"sha256": "c80f99063f4603e4985d05b3d81a3d1e7c2246f29a509c0e4a20d6f4841bdc08"
},
"downloads": -1,
"filename": "scvae-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "cfd53c87eb3e2ab766d93affd9d5dc63",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 824,
"upload_time": "2019-02-27T11:37:22",
"url": "https://files.pythonhosted.org/packages/37/3c/6f0883f5c0e121a57e8be132e76149c4dc33cc86c6b1d07d97f59b7e0ef6/scvae-0.0.1.tar.gz"
}
],
"2.0.0": [
{
"comment_text": "",
"digests": {
"md5": "ae0ecb420dc22aa3f026fb1573a07030",
"sha256": "d9f14c90c0d073fd76923d330cfdecddd857f83acbcd5955e733034183d63a75"
},
"downloads": -1,
"filename": "scvae-2.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ae0ecb420dc22aa3f026fb1573a07030",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5.0",
"size": 170508,
"upload_time": "2019-05-18T18:32:51",
"url": "https://files.pythonhosted.org/packages/01/d5/3b5e84b842ec4dd4643871f47582951350dfc6d55dde5ad33ae85462f532/scvae-2.0.0-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "65252e020f9cdcd4dcaf0275c169f774",
"sha256": "13b6932b2e487f272a585caf279ca10d587b9adf6c994f8e31f20ea6eed659a1"
},
"downloads": -1,
"filename": "scvae-2.0.0.tar.gz",
"has_sig": false,
"md5_digest": "65252e020f9cdcd4dcaf0275c169f774",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5.0",
"size": 137197,
"upload_time": "2019-05-18T18:32:54",
"url": "https://files.pythonhosted.org/packages/d0/41/29d91ef7e14e50f152011dd3ec511f767750ff7a0c18801f32b8ff8aa195/scvae-2.0.0.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "ae0ecb420dc22aa3f026fb1573a07030",
"sha256": "d9f14c90c0d073fd76923d330cfdecddd857f83acbcd5955e733034183d63a75"
},
"downloads": -1,
"filename": "scvae-2.0.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "ae0ecb420dc22aa3f026fb1573a07030",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.5.0",
"size": 170508,
"upload_time": "2019-05-18T18:32:51",
"url": "https://files.pythonhosted.org/packages/01/d5/3b5e84b842ec4dd4643871f47582951350dfc6d55dde5ad33ae85462f532/scvae-2.0.0-py3-none-any.whl"
},
{
"comment_text": "",
"digests": {
"md5": "65252e020f9cdcd4dcaf0275c169f774",
"sha256": "13b6932b2e487f272a585caf279ca10d587b9adf6c994f8e31f20ea6eed659a1"
},
"downloads": -1,
"filename": "scvae-2.0.0.tar.gz",
"has_sig": false,
"md5_digest": "65252e020f9cdcd4dcaf0275c169f774",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.5.0",
"size": 137197,
"upload_time": "2019-05-18T18:32:54",
"url": "https://files.pythonhosted.org/packages/d0/41/29d91ef7e14e50f152011dd3ec511f767750ff7a0c18801f32b8ff8aa195/scvae-2.0.0.tar.gz"
}
]
}