{ "info": { "author": "Gregory Giecold", "author_email": "ggiecold@jimmy.harvard.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: End Users/Desktop", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX", "Programming Language :: Python :: 2.7", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Bio-Informatics", "Topic :: Scientific/Engineering :: Mathematics", "Topic :: Scientific/Engineering :: Visualization" ], "description": "ECLAIR\n======\n\nRobust and scalable inference of cell lineages from gene expression\ndata.\n\nECLAIR achieves a higher level of confidence in the estimated lineages\nthrough the use of approximation algorithms for consensus clustering and\nby combining the information from an ensemble of minimum spanning trees\nso as to come up with an improved, aggregated lineage tree.\n\nIn addition, the present package features several customized algorithms\nfor assessing the similarity between weighted graphs or unrooted trees\nand for estimating the reproducibility of each edge in a given tree.\n\nHow ECLAIR graphs and trees are generated\n-----------------------------------------\n\nECLAIR stands for Ensemble Clustering for Lineage Analysis, Inference\nand Robustness. It proceeds as follow:\n\n- Choose among affinity propagation, hierarchical or k-means clustering\n and DBSCAN (cf. our ``DBSCAN_multiplex`` and ``Concurrent_AP``\n packages for streamlined and scalable implementations of DBSCAN and\n affinity propagation clustering) for how to group cells from\n subsamples of your dataset.\n\n- Such a subsample is obtained by density-based downsampling (as\n implemented in our ``Density_Sampling`` software posted on the Python\n Package Index), either by aiming for an overall number of datapoints\n to extract from the dataset or by specifiying a target percentile of\n the distribution built from local densities around each datapoint.\n\n- ECLAIR then goes about performing several rounds of downsampling and\n clustering on such subsamples, for as many iterations as specified by\n the user. After each run of clustering a given subsample, the\n datapoints that were left over by the downsampling procedure are\n upsampled by associating them to the closest centroid in\n high-dimensional feature space.\n\n- For each such run, build a minimum spanning tree. This minimum\n spanning tree is obtained from a matrix of L2 pairwise similarities\n between the centroids associated to each cluster.\n\n- The next step obtains a consensus clustering from this ensemble of\n partitions of the whole dataset. Three heuristic methods are\n considered for this purpose: CSPA, HGPA and MCLA, all of them based\n on graph or hypergraph partitioning (cf. the documentation of our\n ``Cluster_Ensembles`` package for more information).\n\n- Once a consensus clustering has been reached, we build a graph from\n the consensus clusters and from the information associated with the\n ensemble of partitions from which those consensus clusters have just\n been computed. The edge weights of this graph are calculated as the\n mean of the following distribution: for each of the 2-uple consisting\n of one datapoint from consensus cluster ``a`` and another datapoint\n from consensus cluster ``b``, scan over the ensemble of partitions\n and keep track of the distance separating those two samples across\n each partition comprising the cluster ensemble.\n\n- We then obtain a minimum spanning tree from this graph, for\n convenience of visualization as well as for later comparison with a\n few other methods that purport to provide estimates of cell lineages \n (including the popular SPADE method, whose reproducibility issues spurred \n the development of ECLAIR. A module from the present package is indeed dedicated to\n illustrating the superior statistical performance of ECLAIR).\n\nStatistical performance of ECLAIR\n---------------------------------\n\nTo compare two lineage trees, one has to take into account their edge\nconnections but also the sample contents of their nodes, since the\nvariation associated to subsampling results in different clusters of\nsamples. Although there are many papers on graph matching and graph\ncomparison, we are not aware of any previously published method that\ntakes into account the node differences. We therefore developed\ncustomized statistical tests suitable for comparing lineage trees.\n\n- The first score we developped aims to compare the overall similarity\n between two lineage trees, ``T_1`` and ``T_2``. For each tree, we\n evaluate the path length between every pair of cells in the\n population, based on the edge connectivity. The correlation between\n the two sets of path length values is used as a score to compare the\n overall similarity of ``T_1`` and ``T_2``. For a moderately large\n dataset of 500,000 samples, this would naively translate into more\n than 100 billion pairs of distances along ``T_1``\\ and along ``T_2``.\n The details of the much more efficient algorithm we developped for\n that purpose is available from the docstrings of our package; the\n gist of this algorithm is to first build a contingency table\n recording the overlap in the number of samples between pairs of\n ``T_1`` nodes versus pairs of ``T_2`` nodes.\n\n- Second, we define ``D_ij``\\ as an edge-specific measures of\n statistical dispersion to evaluate the robustness of each edge within\n a given lineage tree , denoted ``T*``. Specifically, for each edge\n ``E_ij`` connecting a pair of clusters ``C_i*`` and ``C_j*``, we\n define the dispersion ``D_ij`` associated with ``E_ij`` as the\n standard deviation of the the distribution of path lengths\n ``L^a(x,y)``, where ``x`` and ``y`` are selected from ``C_i*``\\ and\n ``C_j*`` respectively, and ``a`` is summed over the partitions and\n minimum spanning trees from the ensemble out of which ``T*`` was\n constructed in the first place. This distribution is the same as the\n distribution of path lengths whose mean was used to assign a weight\n to edge ``E_ij`` of the graph from which the ECLAIR tree was inferred\n in the first place.\n\n- The afore-mentioned measure of statistical dispersion is computed\n solely in terms of the partitions and trees making up an ensemble\n from which a consensus clustering and an ECLAIR tree are then\n extracted. We also compare this measure with another measure of\n statistical dispersion, obtained by independently generating 50\n different ECLAIR trees in a procedure reminiscent of the bootstrap.\n One such tree is singled out as a reference tree. For each edge of\n this reference tree, we keep track across the rest of the 49 ECLAIR tree \n how spread out are the pairs of cells comprising the two nodes of this \n reference edge.\n\nOur ECLAIR package features a module entirely devoted to computing\nthrough suitable data structures and algorithms such statistical\nmeasures and a few more tests on pairs of ECLAIR trees.\n\nInstallation\n------------\n\nECLAIR is written in Python 2.7. It has been tested on Fedora Linux and\non Ubuntu and should be supported by any other member of the UNIX-like\nfamily of operating systems.\n\nInstall ECLAIR by sending a request to the Python Package Index (PyPI)\nas follows:\n\n- start a terminal;\n\n- enter ``pip install ECLAIR``.\n\nAny missing or out-of-date dependency should be automatically resolved.\nApart from the Python Standard Library, those include:\n\n- ``Cluster_Ensembles`` (version 1.16 or later)\n \n- ``Concurrent_AP`` (version 1.4 or later)\n\n- ``DBSCAN_multiplex`` (version 1.5 or ulterior)\n\n- ``Density_Sampling`` (1.1 or subsequent version)\n\n- ``igraph``\n\n- ``matplotlib`` (version 1.4.3 at least)\n\n- ``munkres``\n\n- ``numpy`` (1.9.0 or ulterior version)\n\n- ``scipy`` (0.16 or later version)\n\n- ``sklearn``\n\n- ``setuptools``\n \n- ``tables``\n\nPlease note that as part of the installation of this package, some code\nwritten in C that is part of the ``Cluster_Ensembles`` package will be\nautomatically compiled, under the hood and according to the\nspecifications of your machine. For this process to go seamlessly, you\nhave however to ensure availability of CMake and GNU make on your\noperating system. ``Cluster_Ensembles`` also requires the 32-bit version\nof the GNU C library. Please refer to the ``Cluster_Ensembles``\ndocumentation for more information on how to meet those few requirements\ndepending on Linux distribution.\n\nUsage\n-----\n\nTo subject a dataset to an ECLAIR analysis:\n\n- start a terminal;\n\n- enter ``python -m ECLAIR.Build_instance [options] [file_name]``, where ``file_name`` denotes the path to the data about to be processed. \n\nIt is generally recommended to leave the ``options`` and ``file_name`` field empty, which will trigger an interface asking the user to provide the path to the dataset to be processed and some guidance on the choice of parameters for the ECLAIR analysis at hand. Each row of the dataset accessed via the path ``file_name`` must correspond to a sample, whose features must be on display in a tab-separated format. A folder will be created in your current working directory, containing information on your ECLAIR tree and the underlying weighted graph (such as its adjacency matrix and confidence coefficients for each edge) along with a PDF figure illustrating a force-directed representation of the inferred lineage tree.\n\nTo launch a full-fledged statistical performance analysis of ECLAIR and see how it consistenly performs better than SPADE, a popular method for estimating cell lineages, proceed as follows:\n\n- at the Shell command-line interface or graphical user interface, type in ``python -m ECLAIR.Statistical_performance``.\n\nThe eponymous folder ``ECLAIR_performance`` will be created in your current directory, recording on the fly the results of various statistical tests and comparisons of ECLAIR graphs and trees, as well as of SPADE trees.\n\nIn the current version, the statistical performance of ECLAIR is only evaluated for a fairly large (by the current standards of computational biology) flow cytometry dataset of half-a-million samples and 8 features, as well as on a qPCR dataset of mouse bone marrow samples. It shouldn't be difficult for anyone competent in Python to quickly peruse through the source code of ECLAIR and bring about a few of the changes required to submit his/her own data to a similar statistical analysis (those changes mostly pertain to domain-specific knowledge and to the format of your dataset). ECLAIR has been designed so as to accommodate arbitrarily large datasets (this is achieved through the use HDF5 data structures, most notably).\n\nUpon sending the ``ECLAIR_performance`` command, several \"experiments\"\nwill be performed, including the comparisons of pairs of ECLAIR graphs\nor trees and pairs of SPADE trees generated on the same dataset. The\ncomparison of ECLAIR instances and of SPADE instances generated on\nnon-overlapping datasets and evaluated on a separate test set calls for\ndetailed explanations.\n\nWe are splitting a dataset into three equally-sized, non-overlapping\nparts, ``S1``, ``S2`` and ``S3``. We train an ECLAIR tree (``Ecl_1``)\nand a SPADE tree on ``S1`` (``Spd_1``). We then train another ECLAIR\ntree (``Ecl_2``) and yet another SPADE tree (``Spd_2``) on the set\n``S2``.\n\nThe training procedure for ``Ecl_1`` involves 50 runs of downsampling\nand clustering of the samples within ``S1``. The downsampling ratio is\nset at 50%. Therefore, ``Ecl_1`` is an aggregation of 50 trees, all\ngenerated from ``S1`` alone.\n\nIn order to compare ``Ecl_1`` with ``Ecl_2``, the cells in ``S3`` are\nmapped to the clusters/nodes in ``Ecl_1`` and in ``Ecl_2`` to which they\nare nearest in the high-dimensional gene expression space.\n\nIdem when it comes to comparing ``Spd_1`` and ``Spd_2``.\n\nThe procedure outlined above is repeated 10 times. We end up with two\nlists of 30 correlation coefficients telling us about the similarity of\nas many pairs of ECLAIR or SPADE trees. Indeed, while things have been\nexposed as involving only the evaluation of ``Ecl_1`` and ``Ecl_2`` on\n``S3`` using as a test set, one can also generate an ECLAIR tree using\nS3 as a training set. This allows the additional comparisons of\n``Ecl_1`` with ``Ecl_3`` and of ``Ecl_2`` with ``Ecl_3``.\n\nIt also bears pointing out we are using the same test set (``S3``) for\nassessing the similarity of pairs of ECLAIR trees (``Ecl_1`` vs.\n``Ecl_2``) as for evaluating the similitude of pairs of SPADE trees\n(``Spd_1`` vs. ``Spd_2``).\n\nReferences\n----------\n\n- Giecold, G., Marco, E., Trippa, L. and Yuan, G.-C., \"Robust Inference\n of Cell Lineages\". ArXiv preprint [q-bio.QM, stat.AP, stat.CO, stat.ML]: \n http://arxiv.org/abs/1601.02748\n- Strehl, A. and Ghosh, J., \"Cluster Ensembles - A Knowledge Reuse\n Framework for Combining Multiple Partitions\". In: Journal of Machine\n Learning Research, 3, pp. 583-617. 2002\n- Conte, D., Foggia, P., Sansone, C. and Vento, M., \"Thirty Years of\n Graph Matching in Pattern Recognition\". In: International Journal of\n Pattern Recognition and Artificial Intelligence, 18, 3, pp. 265-298.\n 2004", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/GGiecold/ECLAIR", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/GGiecold/ECLAIR", "keywords": "aggregation bio-informatics clustering computational-biology consensus consensus-clustering cytometry data-mining ensemble ensemble-clustering gene gene-expression graph graph-matching machine-learning matching pattern-recognition qPCR tree tree-matching unsupervised-learning", "license": "MIT License", "maintainer": null, "maintainer_email": null, "name": "ECLAIR", "package_url": "https://pypi.org/project/ECLAIR/", "platform": "Any", "project_url": "https://pypi.org/project/ECLAIR/", "project_urls": { "Download": "https://github.com/GGiecold/ECLAIR", "Homepage": "https://github.com/GGiecold/ECLAIR" }, "release_url": "https://pypi.org/project/ECLAIR/1.18/", "requires_dist": null, "requires_python": null, "summary": "Robust inference of cell lineages from gene expression data via consensus clustering and the aggregation of ensembles of minimum spanning trees.", "version": "1.18" }, "last_serial": 2031859, "releases": { "1.11": [ { "comment_text": "", "digests": { "md5": "60773df08bcf4cec4bbe879f0b773b36", "sha256": "32138e3debe37d5b3e88382f36c1a6a6ef711001ca35b7d6b06dcddd52f28ed2" }, "downloads": -1, "filename": "ECLAIR-1.11.tar.gz", "has_sig": false, "md5_digest": "60773df08bcf4cec4bbe879f0b773b36", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24507654, "upload_time": "2016-01-01T02:51:59", "url": "https://files.pythonhosted.org/packages/9c/ce/01a04e6648fd0d298e6873ee87fd3daca4141b57538849cd1a35b497b3d9/ECLAIR-1.11.tar.gz" } ], "1.13": [ { "comment_text": "", "digests": { "md5": "eb80d6e25a0a289f7b8cf23dc69fbf70", "sha256": "2ec403b847bf4d7f306714bca2e4b2d8fe47e545c96f81a452467384abe5b2bc" }, "downloads": -1, "filename": "ECLAIR-1.13.tar.gz", "has_sig": false, "md5_digest": "eb80d6e25a0a289f7b8cf23dc69fbf70", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24507755, "upload_time": "2016-01-12T18:24:27", "url": "https://files.pythonhosted.org/packages/bb/ac/c2fe6ef88bb966fedc2755d0c18e0d9458bf3663f7e7c239884b9861cb93/ECLAIR-1.13.tar.gz" } ], "1.14": [ { "comment_text": "", "digests": { "md5": "89f2be38b169a0b85d314bcbbb773898", "sha256": "7e5face343ab13656d833f157e33dde2530e2ef5d85150701d8f42450ac07d53" }, "downloads": -1, "filename": "ECLAIR-1.14.tar.gz", "has_sig": false, "md5_digest": "89f2be38b169a0b85d314bcbbb773898", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24507812, "upload_time": "2016-01-12T20:07:15", "url": "https://files.pythonhosted.org/packages/f7/de/b6434831c9ecc6aadd671e1959ed1013ce9d9abbf8e21acccf88d8b2f769/ECLAIR-1.14.tar.gz" } ], "1.15": [ { "comment_text": "", "digests": { "md5": "5e269cb28d81d8ba24a50aeb5d23467c", "sha256": "f677da5d8525ef0e1dcc090bf91cb90a570438f8445df441c577f10c6cebe4b1" }, "downloads": -1, "filename": "ECLAIR-1.15.tar.gz", "has_sig": false, "md5_digest": "5e269cb28d81d8ba24a50aeb5d23467c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24508038, "upload_time": "2016-01-15T15:46:30", "url": "https://files.pythonhosted.org/packages/5f/49/961f9b00bdbd11de8dce939f3145f66ae7b322d1801b797e4ffa49a0ac7b/ECLAIR-1.15.tar.gz" } ], "1.16": [ { "comment_text": "", "digests": { "md5": "f99c373622a9c6c5d22bca5621711be8", "sha256": "49ea17f0812d24d8e7b833c8b0f52455f6a509943497ebd3be71e97e267cb321" }, "downloads": -1, "filename": "ECLAIR-1.16.tar.gz", "has_sig": false, "md5_digest": "f99c373622a9c6c5d22bca5621711be8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24508294, "upload_time": "2016-03-23T16:41:32", "url": "https://files.pythonhosted.org/packages/09/2f/4daf0f4cd358c8689c5528944d1e1819502a050eaaa549e9fbff37c77aa6/ECLAIR-1.16.tar.gz" } ], "1.17": [ { "comment_text": "", "digests": { "md5": "f34b94859de7040b538d25709cb02c26", "sha256": "afe4ed5b29a86bb4e1a50d0d2d7a523a0aff1e132ff981edc0387368e766e812" }, "downloads": -1, "filename": "ECLAIR-1.17.tar.gz", "has_sig": false, "md5_digest": "f34b94859de7040b538d25709cb02c26", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24508288, "upload_time": "2016-03-23T17:13:51", "url": "https://files.pythonhosted.org/packages/9a/00/cef5f6036f73c4fad8e9c35e981be73ed8eb37a1285717021409ce6b2e86/ECLAIR-1.17.tar.gz" } ], "1.18": [ { "comment_text": "", "digests": { "md5": "68c93bd47a0bcc8ae91a0951b914513d", "sha256": "cea7ff8430427fe6aad2d21f9686c51238d4e18958f08a10585e39fc0a967479" }, "downloads": -1, "filename": "ECLAIR-1.18.tar.gz", "has_sig": false, "md5_digest": "68c93bd47a0bcc8ae91a0951b914513d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24508300, "upload_time": "2016-03-28T15:00:54", "url": "https://files.pythonhosted.org/packages/4e/fb/faab6d38092034a8f8ed97fdd582f260a49d3e93806c8e336d8392036e5d/ECLAIR-1.18.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "68c93bd47a0bcc8ae91a0951b914513d", "sha256": "cea7ff8430427fe6aad2d21f9686c51238d4e18958f08a10585e39fc0a967479" }, "downloads": -1, "filename": "ECLAIR-1.18.tar.gz", "has_sig": false, "md5_digest": "68c93bd47a0bcc8ae91a0951b914513d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24508300, "upload_time": "2016-03-28T15:00:54", "url": "https://files.pythonhosted.org/packages/4e/fb/faab6d38092034a8f8ed97fdd582f260a49d3e93806c8e336d8392036e5d/ECLAIR-1.18.tar.gz" } ] }