{ "info": { "author": "S. Domanskyi , A. Szedlak, N. T Hawkins, J. Wang, G. Paternostro, C. Piermarocchi", "author_email": "s.domanskyi@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: End Users/Desktop", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: MacOS", "Operating System :: Microsoft :: Windows", "Operating System :: Unix", "Programming Language :: Python :: 3", "Topic :: Education", "Topic :: Scientific/Engineering :: Bio-Informatics", "Topic :: Utilities" ], "description": "# Digital Cell Sorter\nIdentification of hematological cell types from heterogeneous single cell RNA-seq data.\n\n[Polled Digital Cell Sorter (p-DCS): Automatic identification of hematological cell types from single cell RNA-sequencing clusters](\nhttps://doi.org/10.1186/s12859-019-2951-x \n\"Polled Digital Cell Sorter (p-DCS): Automatic identification of hematological cell types from single cell RNA-sequencing clusters\")\nSergii Domanskyi, Anthony Szedlak, Nathaniel T Hawkins, Jiayin Wang, Giovanni Paternostro & Carlo Piermarocchi,\n *BMC Bioinformatics* volume 20, Article number: 369 (**2019**)\n\n\n## Getting Started\n\nThese instructions will get you a copy of the project up and running on your machine for data analysis, development or testing purposes.\n\n### Prerequisites\n\nThe code runs in Python >= 3.7 environment. \n\nIt is highly recommended to install Anaconda.\nInstallers are available at https://www.anaconda.com/distribution/\n\nIt uses packages ```numpy```, ```pandas```, ```matplotlib```, ```scikit-learn```, ```scipy```, \n```mygene```, ```fftw```, ```pynndescent```, ```networkx```, ```python-louvain```, ```fitsne```\nand a few other standard Python packages. 
Most of these packages are installed with installation of the \nlatest release of ```DigitalCellSorter```:\n\n\tpip install DigitalCellSorter\n\nAlternatively, you can install this module directly from GitHub using:\n\n\tpip install git+https://github.com/sdomanskyi/DigitalCellSorter\n\nAlso one can create a local copy of this project for development purposes by running:\n\n\tgit clone https://github.com/sdomanskyi/DigitalCellSorter\n\nTo install ```fftw``` from the ```conda-forge``` channel add ```conda-forge``` to your channels.\nOnce the conda-forge channel has been enabled, ```fftw``` can be installed as follows:\n\n\n\tconda config --add channels conda-forge\n\tconda install fftw\n\n### Loading the package\n\nIn your script import the package:\n\n\timport DigitalCellSorter\n\nImport class ```DigitalCellSorter``` and create its instance. Here, for simplicity, we use Default parameter values:\n\n\tfrom DigitalCellSorter import DigitalCellSorter as DigitalCellSorterSubmodule\n\tDCS = DigitalCellSorterSubmodule.DigitalCellSorter()\n\n
During initialization the following parameters can be specified:

\n\n```dataName```: name used in output files, Default ''\n\n```geneListFileName```: marker cell type list name, Default None\n\n```mitochondrialGenes```: list of mitochondrial genes for quality control routine, Default None\n\n```sigmaOverMeanSigma```: threshold to consider a gene constant, Default 0.3\n\n```nClusters```: number of clusters, Default 5\n\n```nComponentsPCA```: number of PCA components, Default 100\n\n```nSamplesDistribution```: number of random samples to generate, Default 10000\n\n```saveDir```: directory for output files, Default is current directory\n\n```makeMarkerSubplots```: whether to make subplots on markers, Default True\n\n```makePlots```: whether to make all major plots, Default True\n\n```votingScheme```: voting scheme to use instead of the built-in, Default None\n\n```availableCPUsCount```: number of CPUs available, Default os.cpu_count()\n\n```zScoreCutoff```: z-score cutoff when calculating Z_mc, Default 0.3\n\n```clusterName```: parameter used in subclustering, Default None\n\n```doQualityControl```: whether to remove low quality cells, Default True\n\n```doBatchCorrection```: whether to correct data for batches, Default False\n\n

\n\nThese and other parameters can be modified after initialization using, e.g.:\n\n\tDCS.toggleMakeStackedBarplot = False\n\n\n\n### Gene Expression Data Format\n\nThe input gene expression data is expected in one of the following formats:\n\n1. Spreadsheet of comma-separated values ```csv``` containing a condensed matrix of the form ```('cell', 'gene', 'expr')```. \nIf there are batches in the data, the matrix has to be of the form ```('batch', 'cell', 'gene', 'expr')```. The column order can be arbitrary.\n\n
Examples:

\n\n| cell | gene | expr |\n|------|------|------|\n| C1 | G1 | 3 |\n| C1 | G2 | 2 |\n| C1 | G3 | 1 |\n| C2 | G1 | 1 |\n| C2 | G4 | 5 |\n| ... | ... | ... |\n\nor:\n\n| batch | cell | gene | expr |\n|--------|------|------|------|\n| batch0 | C1 | G1 | 3 |\n| batch0 | C1 | G2 | 2 |\n| batch0 | C1 | G3 | 1 |\n| batch1 | C2 | G1 | 1 |\n| batch1 | C2 | G4 | 5 |\n| ... | ... | ... | ... |\n\n

\n\n\n2. Spreadsheet of comma-separated values ```csv``` where rows are genes and columns are cells with gene expression counts.\nIf there are batches in the data, the first row of the spreadsheet should be ```'batch'``` and the second ```'cell'```.\n\n
Examples:

\n\n| cell | C1 | C2 | C3 | C4 |\n|-------|--------|--------|--------|--------|\n| G1 | | 3 | 1 | 7 |\n| G2 | 2 | 2 | | 2 |\n| G3 | 3 | 1 | | 5 |\n| G4 | 10 | | 5 | 4 |\n| ... | ... | ... | ... | ... |\n\nor:\n\n| batch | batch0 | batch0 | batch1 | batch1 |\n|-------|--------|--------|--------|--------|\n| cell | C1 | C2 | C3 | C4 |\n| G1 | | 3 | 1 | 7 |\n| G2 | 2 | 2 | | 2 |\n| G3 | 3 | 1 | | 5 |\n| G4 | 10 | | 5 | 4 |\n| ... | ... | ... | ... | ... |\n\n

\n\n3. ```Pandas DataFrame``` where ```axis 0``` is genes and ```axis 1``` is cells.\nIf there are batches in the data, then the index of ```axis 1``` should have two levels, e.g. ```('batch', 'cell')```, \nwith the first level indicating the patient, batch or experiment where that cell was sequenced, and the\nsecond level containing cell barcodes for identification.\n\n
Examples:

\n\n df = pd.DataFrame(data=[[2,np.nan],[3,8],[3,5],[np.nan,1]], \n index=['G1','G2','G3','G4'], \n columns=pd.MultiIndex.from_arrays([['batch0','batch1'],['C1','C2']], names=['batch', 'cell'])) \n\n\n

\n\n4. ```Pandas Series ``` where the index should have two levels, e.g. ```('cell', 'gene')```. If there are batches in the data,\nthe first level should indicate the patient, batch or experiment where that cell was sequenced, the second level should contain cell barcodes for \nidentification, and the third level gene names.\n\n
Examples:

\n\n se = pd.Series(data=[1,8,3,5,5], \n index=pd.MultiIndex.from_arrays([['batch0','batch0','batch1','batch1','batch1'],\n ['C1','C1','C1','C2','C2'],\n ['G1','G2','G3','G1','G4']], names=['batch', 'cell', 'gene']))\n\n\n

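The formats above describe the same information in different layouts. As a minimal sketch, assuming only ```pandas``` and not using any function of this package, the condensed table of format 1 can be reshaped into the Series of format 4 and the DataFrame of format 3. The variable names are illustrative only.

```python
import pandas as pd

# Condensed form, as in input type 1: one row per (cell, gene) pair.
condensed = pd.DataFrame({
    'cell': ['C1', 'C1', 'C1', 'C2', 'C2'],
    'gene': ['G1', 'G2', 'G3', 'G1', 'G4'],
    'expr': [3, 2, 1, 1, 5],
})

# Series indexed by (cell, gene), as in input type 4.
se = condensed.set_index(['cell', 'gene'])['expr']

# Matrix with genes on axis 0 and cells on axis 1, as in input type 3.
# Pairs that are absent from the condensed table become NaN.
df = se.unstack('cell')

print(df)
```

Pairs missing from the condensed table show up as missing values in the matrix, which matches the empty cells in the spreadsheet examples of format 2.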
\n\nAny of the data types outlined above needs to be prepared/validated with the function ```prepare()```. \nLet us demonstrate this on the input of type 1:\n\n\tdf_expr = DCS.prepare(data='data/testData/data.tsv', \n\t\t\t\tgenes='data/testData/genes.tsv', \n\t\t\t\tcells='data/testData/barcodes.tsv',\n\t\t\t\tbatches=None)\n\n### Other Data\n\n```markersDCS.xlsx```: An Excel workbook with marker data. Rows are markers and columns are cell types. \n'1' means that the gene is a marker for that cell type, and '0' otherwise.\nThis gene marker file included in the package is used by Default. \nIf you use your own file it has to be prepared in the same format (including tab names, etc.).\n\n```Human.MitoCarta2.0.csv```: A ```csv``` spreadsheet with human mitochondrial genes, created as part of the work \n[MitoCarta2.0: an updated inventory of mammalian mitochondrial proteins](https://doi.org/10.1093/nar/gkv1003 \"MitoCarta2.0\")\nSarah E. Calvo, Karl R. Clauser, Vamsi K. Mootha, *Nucleic Acids Research*, Volume 44, Issue D1, 4 January 2016, Pages D1251\u2013D1257.\n\n\n## Functionality\n\nThe main class for cell sorting functions and producing output images is ```DigitalCellSorter```.\n\n
The class includes tools for:

\n\n 1. **Pre-processing** of single cell mRNA sequencing data (gene expression data)\n     1. Cleaning: filling in missing values, removing all-zero genes and cells, converting gene index to a desired convention, etc.\n     2. Normalizing: rescaling all cells expression, log-transforming, etc.\n\n 2. **Quality control**\n 3. **Batch effects correction**\n 4. **Cells anomaly score evaluation**\n 5. **Dimensionality reduction**\n 6. **Clustering** (Hierarchical, K-Means, knn-graph-based, etc.)\n 7. **Annotating cell types**\n 8. **Visualization**\n     1. t-SNE layout plot\n     2. Quality Control histogram plot\n     3. Marker expression t-SNE subplot\n     4. Marker-centroids expression plot\n     5. Voting results matrix plot\n     6. Cell types stacked barplot\n     7. Anomaly scores plot\n     8. Histogram null distribution plot\n     9. New markers plot\n     10. Sankey diagram (a.k.a. river plot)\n\n 9. **Post-processing** functions, e.g. extract cells of interest, find significantly expressed genes, \nplot marker expression of the cells of interest, etc.\n\n

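To make the cleaning and normalizing steps listed above more concrete, here is a rough sketch written with plain ```numpy``` and ```pandas```. It is not the code used inside ```DigitalCellSorter```: rescaling to the median cell total and the ```log2(x + 1)``` transform are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Toy counts matrix: genes on axis 0, cells on axis 1; NaN marks missing values.
counts = pd.DataFrame(
    [[2, np.nan], [3, 8], [0, 0], [np.nan, 1]],
    index=['G1', 'G2', 'G3', 'G4'],
    columns=['C1', 'C2'],
)

# Cleaning: fill missing values with zero, then drop all-zero genes and cells.
clean = counts.fillna(0.0)
clean = clean.loc[(clean != 0).any(axis=1), (clean != 0).any(axis=0)]

# Normalizing: rescale every cell to the same total (the median cell total,
# an arbitrary choice for this sketch), then log-transform.
totals = clean.sum(axis=0)
normalized = np.log2(clean / totals * totals.median() + 1.0)

print(normalized)
```

After rescaling, every cell has the same total expression, so differences in sequencing depth do not dominate later steps such as dimensionality reduction and clustering.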
\n\n\nThe ```process()``` function will produce all necessary files for post-analysis of the data. \n\n
The visualization tools include:

\n\n- ```makeMarkerExpressionPlot()```: a heatmap that shows all markers and their expression levels in the clusters, \nin addition this figure contains relative (%) and absolute (cell counts) cluster sizes\n\n

\n\t\n

\n\n- ```getIndividualGeneExpressionPlot()```: t-SNE layout colored by individual gene's expression\n\n

\n\t\n\t\n

\n\n- ```makeVotingResultsMatrixPlot()```: z-scores of the voting results for each input cell type and each cluster, \nin addition this figure contains relative (%) and absolute (cell counts) cluster sizes\n\n

\n \n

\n\n- ```makeHistogramNullDistributionPlot()```: null distribution for each cluster and each cell type illustrating \nthe \"machinery\" of the Digital Cell Sorter\n\n

\n\t\n

\n\n- ```makeQualityControlHistogramPlot()```: Quality control histogram plots\n\n

\n\t\n\t\n\t\n

\n\n- ```makeTSNEplot()```: t-SNE layouts colored by the number of unique genes expressed, \nthe number of counts measured, and the fraction of mitochondrial genes.\n\n

\n\t\n\t\n\t\n
\n\nThe effect of batch correction, demonstrated by combining BM1, BM2 and BM3 and processing the data jointly without (left) and with (right) the batch correction option:\n\n

\n\t\n\t\n

\n\n- ```makeStackedBarplot()```: plot with fractions of various cell types\n\n

\n\t\n\t\n

\n\n\n- ```makeSankeyDiagram()```: river plot to compare various results \n\n[(see interactive HTML version, download it and open in a browser)](https://github.com/sdomanskyi/DigitalCellSorter/blob/master/docs/examples/Sankey_example.html \"Sankey interactive diagram\")\n\n

\n\t\n

\n\n- ```getAnomalyScoresPlot()```: plot with anomaly scores per cell\n\n

\n\t\n

\n\nCalculate and plot anomaly scores for an arbitrary cell type or cluster:\n\n

\n\t\n\t\n\t\n

\n\n\n- ```makePlotOfNewMarkers()```: genes significantly expressed in the annotated cell types\n\n

\n\t\n

\n\n

\n\n\n## Demo\n\n### Usage\n\nIn these instructions we have already created an instance of the ```DigitalCellSorter``` class (see section **Loading the package**).\nThe function ```process()``` takes as an input parameter a pandas DataFrame validated by the function ```prepare()```:\n\n\tDCS.process(df_expr) \n\nWe have made an example execution file ```demo.py``` that shows how to use ```DigitalCellSorter```.\n\nIn the demo, the folder ```data``` is intentionally left empty. The reader can download the file ```ica_bone_marrow_h5.h5``` \nfrom https://preview.data.humancellatlas.org/ (Raw Counts Matrix - Bone Marrow) and place it in the folder ```data```. \nThe file is ~485 MB and contains all 378000 cells from 8 bone marrow donors (BM1-BM8). \nIn our example, the data of BM1 is prepared by the \nfunction ```PrepareDataOnePatient()``` in the module ```ReadPrepareDataHCApreviewDataset```.\nLoad this function, and call it to create a ```BM1.h5``` file (HDF file of input type 3) in the ```data``` folder:\n\n\timport os\n\tfrom DigitalCellSorter import ReadPrepareDataHCApreviewDataset as HCAtools\n\tHCAtools.PrepareDataOnePatient(os.path.join('data', 'ica_bone_marrow_h5.h5'), 'BM1', os.path.join('data', ''))\n\nLet's modify some of the ```DCS``` attributes:\n\n\tDCS.dataName = 'BM1'\n\tDCS.saveDir = os.path.join('output', DCS.dataName, '')\n\tDCS.geneListFileName = os.path.join('geneLists', 'CIBERSORT.xlsx')\n\tDCS.nClusters = 20\n\nNow we are ready to ```load``` the data, ```validate``` it and ```process```:\n\n\timport pandas as pd\n\tdf_expr = pd.read_hdf(os.path.join('data', 'BM1.h5'), key='BM1', mode='r')\n\n\tdf_expr = DCS.prepare(df_expr)\n\n\tDCS.process(df_expr)\n\nFurther analysis can be done on cell types of interest, e.g. 
here 'T cell' and 'B cell'.\nLet's create a new instance of DigitalCellSorter to run \"sub-analysis\" with it:\n\n DCSsub = DigitalCellSorterSubmodule.DigitalCellSorter(dataName=DCS.dataName, \n nClusters=10, \n doQualityControl=False)\n\nHere it was important to disable Quality control, because the low quality cells have already been identified with ```DCS```.\nAlso ```dataName``` parameter points to the location processed with ```DCS```. \nNow modify a few other attributes and process cell type 'T cell':\n\n DCSsub.subclusteringName = 'T cell'\n DCSsub.saveDir = os.path.join('output', DCS.dataName, 'subclustering T cell', '')\n DCSsub.geneListFileName = os.path.join('geneLists', 'CIBERSORT_T_SUB.xlsx')\n\n DCSsub.process(df_expr[DCS.getCells(celltype='T cell')])\n\nThis way the t-SNE layout with annotated clusters (left) of T cell sub-types and the corresponding voting matrix (right) \nare generated by the function ```process()```:\n\n

\n\t\n\t\n

\n\nWe can reuse the ```DCSsub``` to analyze cell type 'B cell':\n\n DCSsub.subclusteringName = 'B cell'\n DCSsub.saveDir = os.path.join('output', DCS.dataName, 'subclustering B cell', '')\n DCSsub.geneListFileName = os.path.join('geneLists', 'CIBERSORT_B_SUB.xlsx')\n\n DCSsub.process(df_expr[DCS.getCells(celltype='B cell')])\n\n\nFor a complete script see:\n\n\tpython demo.py\n\n### Output\n\nAll output files are saved in the ```output``` directory. If you specify another directory, the results will be generated there.\nIf you do not provide a directory, the results will appear in the directory from which the script was executed.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/sdomanskyi/DigitalCellSorter/archive/1.2.1.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/sdomanskyi/DigitalCellSorter", "keywords": "single cell RNA sequencing,cell type identification,biomarkers", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "DigitalCellSorter", "package_url": "https://pypi.org/project/DigitalCellSorter/", "platform": "", "project_url": "https://pypi.org/project/DigitalCellSorter/", "project_urls": { "Download": "https://github.com/sdomanskyi/DigitalCellSorter/archive/1.2.1.tar.gz", "Homepage": "https://github.com/sdomanskyi/DigitalCellSorter" }, "release_url": "https://pypi.org/project/DigitalCellSorter/1.2.2/", "requires_dist": [ "numpy (>=1.16.4)", "pandas (>=0.24.2)", "tables (>=3.5.2)", "scipy (>=1.3.0)", "matplotlib (>=3.1.0)", "scikit-learn (>=0.21.2)", "plotly (>=4.1.1)", "mygene (>=3.1.0)", "pynndescent (>=0.3.3)", "networkx (>=2.3)", "python-louvain (>=0.13)", "fitsne (>=1.0.1) ; platform_system == \"Linux\" or platform_system == \"Darwin\"" ], "requires_python": "", "summary": "Toolkit for analysis and identification of hematological cell types from heterogeneous single cell RNA-seq data", "version": "1.2.2" }, 
"last_serial": 5919001, "releases": { "1.2.1": [ { "comment_text": "", "digests": { "md5": "05cbdc0e74cd27b5b154de9a4ac458ba", "sha256": "e91b968e8083a997d62003c710c75f99f2371c5eee96abd48b5ffc45db3af495" }, "downloads": -1, "filename": "DigitalCellSorter-1.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "05cbdc0e74cd27b5b154de9a4ac458ba", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6149910, "upload_time": "2019-10-02T15:17:43", "url": "https://files.pythonhosted.org/packages/fb/96/dd6f02d1e398630d46026e5462fb4d87bf8873680efb7683a72ed2387b6a/DigitalCellSorter-1.2.1-py3-none-any.whl" } ], "1.2.2": [ { "comment_text": "", "digests": { "md5": "c2430d6342af71cc118d06761f60c6db", "sha256": "e65b605476773343e25059ea074b8f8f4f7a3c932e0e885734b6bc67e9a3e18c" }, "downloads": -1, "filename": "DigitalCellSorter-1.2.2-py3-none-any.whl", "has_sig": false, "md5_digest": "c2430d6342af71cc118d06761f60c6db", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6155532, "upload_time": "2019-10-02T16:43:42", "url": "https://files.pythonhosted.org/packages/a2/85/5f4d1b9be17f9d297a6a157adce66f06ae440b25eca61fbd286b50fe7188/DigitalCellSorter-1.2.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0a62345d7fa1121b22853adc6d93d858", "sha256": "7b70e2d4438dba5c42d96b1f522f1196849cb689be4b413ad669e45886219f27" }, "downloads": -1, "filename": "DigitalCellSorter-1.2.2.tar.gz", "has_sig": false, "md5_digest": "0a62345d7fa1121b22853adc6d93d858", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6110178, "upload_time": "2019-10-02T16:54:23", "url": "https://files.pythonhosted.org/packages/51/46/ce13a06945c3eaede9423cadbcb6ec1ca25dfc2f058fc70eeae5dec5678d/DigitalCellSorter-1.2.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c2430d6342af71cc118d06761f60c6db", "sha256": "e65b605476773343e25059ea074b8f8f4f7a3c932e0e885734b6bc67e9a3e18c" }, 
"downloads": -1, "filename": "DigitalCellSorter-1.2.2-py3-none-any.whl", "has_sig": false, "md5_digest": "c2430d6342af71cc118d06761f60c6db", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6155532, "upload_time": "2019-10-02T16:43:42", "url": "https://files.pythonhosted.org/packages/a2/85/5f4d1b9be17f9d297a6a157adce66f06ae440b25eca61fbd286b50fe7188/DigitalCellSorter-1.2.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0a62345d7fa1121b22853adc6d93d858", "sha256": "7b70e2d4438dba5c42d96b1f522f1196849cb689be4b413ad669e45886219f27" }, "downloads": -1, "filename": "DigitalCellSorter-1.2.2.tar.gz", "has_sig": false, "md5_digest": "0a62345d7fa1121b22853adc6d93d858", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6110178, "upload_time": "2019-10-02T16:54:23", "url": "https://files.pythonhosted.org/packages/51/46/ce13a06945c3eaede9423cadbcb6ec1ca25dfc2f058fc70eeae5dec5678d/DigitalCellSorter-1.2.2.tar.gz" } ] }