{
"info": {
"author": "S. Domanskyi , A. Szedlak, N. T Hawkins, J. Wang, G. Paternostro, C. Piermarocchi",
"author_email": "s.domanskyi@gmail.com",
"bugtrack_url": null,
"classifiers": [
"Intended Audience :: Developers",
"Intended Audience :: Education",
"Intended Audience :: End Users/Desktop",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: MIT License",
"Operating System :: MacOS",
"Operating System :: Microsoft :: Windows",
"Operating System :: Unix",
"Programming Language :: Python :: 3",
"Topic :: Education",
"Topic :: Scientific/Engineering :: Bio-Informatics",
"Topic :: Utilities"
],
"description": "# Digital Cell Sorter\nIdentification of hematological cell types from heterogeneous single cell RNA-seq data.\n\n[Polled Digital Cell Sorter (p-DCS): Automatic identification of hematological cell types from single cell RNA-sequencing clusters](\nhttps://doi.org/10.1186/s12859-019-2951-x \n\"Polled Digital Cell Sorter (p-DCS): Automatic identification of hematological cell types from single cell RNA-sequencing clusters\")\nSergii Domanskyi, Anthony Szedlak, Nathaniel T Hawkins, Jiayin Wang, Giovanni Paternostro & Carlo Piermarocchi,\n *BMC Bioinformatics* volume 20, Article number: 369 (**2019**)\n\n\n## Getting Started\n\nThese instructions will get you a copy of the project up and running on your machine for data analysis, development or testing purposes.\n\n### Prerequisites\n\nThe code runs in Python >= 3.7 environment. \n\nIt is highly recommended to install Anaconda.\nInstallers are available at https://www.anaconda.com/distribution/\n\nIt uses packages ```numpy```, ```pandas```, ```matplotlib```, ```scikit-learn```, ```scipy```, \n```mygene```, ```fftw```, ```pynndescent```, ```networkx```, ```python-louvain```, ```fitsne```\nand a few other standard Python packages. 
Most of these packages are installed along with the \nlatest release of ```DigitalCellSorter```:\n\n\tpip install DigitalCellSorter\n\nAlternatively, you can install this module directly from GitHub using:\n\n\tpip install git+https://github.com/sdomanskyi/DigitalCellSorter\n\nYou can also create a local copy of this project for development purposes by running:\n\n\tgit clone https://github.com/sdomanskyi/DigitalCellSorter\n\nTo install ```fftw``` from the ```conda-forge``` channel, add ```conda-forge``` to your channels.\nOnce the ```conda-forge``` channel has been enabled, ```fftw``` can be installed as follows:\n\n\tconda config --add channels conda-forge\n\tconda install fftw\n\n### Loading the package\n\nIn your script, import the package:\n\n\timport DigitalCellSorter\n\nImport the class ```DigitalCellSorter``` and create an instance of it. Here, for simplicity, we use the default parameter values:\n\n\tfrom DigitalCellSorter import DigitalCellSorter as DigitalCellSorterSubmodule\n\tDCS = DigitalCellSorterSubmodule.DigitalCellSorter()\n\n
During the initialization the following parameters can be specified:\n\n```dataName```: name used in output files, Default ''\n\n```geneListFileName```: name of the marker cell type list, Default None\n\n```mitochondrialGenes```: list of mitochondrial genes for the quality control routine, Default None\n\n```sigmaOverMeanSigma```: threshold to consider a gene constant, Default 0.3\n\n```nClusters```: number of clusters, Default 5\n\n```nComponentsPCA```: number of PCA components, Default 100\n\n```nSamplesDistribution```: number of random samples to generate, Default 10000\n\n```saveDir```: directory for output files, Default is current directory\n\n```makeMarkerSubplots```: whether to make subplots on markers, Default True\n\n```makePlots```: whether to make all major plots, Default True\n\n```votingScheme```: voting scheme to use instead of the built-in, Default None\n\n```availableCPUsCount```: number of CPUs available, Default os.cpu_count()\n\n```zScoreCutoff```: z-score cutoff when calculating Z_mc, Default 0.3\n\n```clusterName```: parameter used in subclustering, Default None\n\n```doQualityControl```: whether to remove low quality cells, Default True\n\n```doBatchCorrection```: whether to correct data for batches, Default False\n\n
Examples:\n\n| cell | gene | expr |\n|------|------|------|\n| C1 | G1 | 3 |\n| C1 | G2 | 2 |\n| C1 | G3 | 1 |\n| C2 | G1 | 1 |\n| C2 | G4 | 5 |\n| ... | ... | ... |\n\nor:\n\n| batch | cell | gene | expr |\n|--------|------|------|------|\n| batch0 | C1 | G1 | 3 |\n| batch0 | C1 | G2 | 2 |\n| batch0 | C1 | G3 | 1 |\n| batch1 | C2 | G1 | 1 |\n| batch1 | C2 | G4 | 5 |\n| ... | ... | ... | ... |\n\n
Examples:\n\n| cell | C1 | C2 | C3 | C4 |\n|-------|--------|--------|--------|--------|\n| G1 | | 3 | 1 | 7 |\n| G2 | 2 | 2 | | 2 |\n| G3 | 3 | 1 | | 5 |\n| G4 | 10 | | 5 | 4 |\n| ... | ... | ... | ... | ... |\n\nor:\n\n| batch | batch0 | batch0 | batch1 | batch1 |\n|-------|--------|--------|--------|--------|\n| cell | C1 | C2 | C3 | C4 |\n| G1 | | 3 | 1 | 7 |\n| G2 | 2 | 2 | | 2 |\n| G3 | 3 | 1 | | 5 |\n| G4 | 10 | | 5 | 4 |\n| ... | ... | ... | ... | ... |\n\n
Examples:\n\n\timport numpy as np\n\timport pandas as pd\n\tdf = pd.DataFrame(data=[[2,np.nan],[3,8],[3,5],[np.nan,1]],\n\t                  index=['G1','G2','G3','G4'],\n\t                  columns=pd.MultiIndex.from_arrays([['batch0','batch1'],['C1','C2']], names=['batch', 'cell']))\n\n
Examples:\n\n\timport pandas as pd\n\tse = pd.Series(data=[1,8,3,5,5],\n\t               index=pd.MultiIndex.from_arrays([['batch0','batch0','batch1','batch1','batch1'],\n\t                                                ['C1','C1','C1','C2','C2'],\n\t                                                ['G1','G2','G3','G1','G4']], names=['batch', 'cell', 'gene']))\n\n
The class includes tools for:\n\n1. **Pre-processing** of single cell mRNA sequencing data (gene expression data)\n    1. Cleaning: filling in missing values, removing all-zero genes and cells, converting the gene index to a desired convention, etc.\n    2. Normalizing: rescaling the expression of all cells, log-transforming, etc.\n2. **Quality control**\n3. **Batch effects correction**\n4. **Cells anomaly score evaluation**\n5. **Dimensionality reduction**\n6. **Clustering** (Hierarchical, K-Means, knn-graph-based, etc.)\n7. **Annotating cell types**\n8. **Visualization**\n    1. t-SNE layout plot\n    2. Quality Control histogram plot\n    3. Marker expression t-SNE subplot\n    4. Marker-centroids expression plot\n    5. Voting results matrix plot\n    6. Cell types stacked barplot\n    7. Anomaly scores plot\n    8. Histogram null distribution plot\n    9. New markers plot\n    10. Sankey diagram (a.k.a. river plot)\n9. **Post-processing** functions, e.g. extract cells of interest, find significantly expressed genes, \nplot marker expression of the cells of interest, etc.\n\n
The visualization tools include:\n\n- ```makeMarkerExpressionPlot()```: a heatmap that shows all markers and their expression levels in the clusters; \nthe figure also shows the relative (%) and absolute (cell counts) cluster sizes\n
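
The two kinds of pandas input shown in the examples above, a condensed ```pandas.Series``` indexed by batch, cell and gene, and a gene-by-cell ```pandas.DataFrame```, carry the same information. As a minimal sketch that uses only ```pandas``` (it does not call any ```DigitalCellSorter``` function, and the variable names are illustrative), a condensed series can be reshaped into the matrix layout with ```Series.unstack```:

```python
import pandas as pd

# Condensed layout: one expression value per (batch, cell, gene) triple,
# using the same values as the Series example above.
se = pd.Series(
    data=[1, 8, 3, 5, 5],
    index=pd.MultiIndex.from_arrays(
        [['batch0', 'batch0', 'batch1', 'batch1', 'batch1'],
         ['C1', 'C1', 'C1', 'C2', 'C2'],
         ['G1', 'G2', 'G3', 'G1', 'G4']],
        names=['batch', 'cell', 'gene']))

# Matrix layout: genes become rows, (batch, cell) pairs become columns.
df = se.unstack(level=['batch', 'cell'])

print(df)
```

Gene and cell combinations that are absent from the condensed series appear as missing values (```NaN```) in the resulting matrix.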