{ "info": { "author": "Milton Pividori", "author_email": "miltondp@uchicago.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# clustermatch\n_Title:_ Clustermatch: discovering hidden relations in highly-diverse kinds of qualitative and quantitative data without standardization \n_Authors:_ Milton Pividori, Andres Cernadas, Luis de Haro, Fernando Carrari, Georgina Stegmayer and Diego H. Milone\n\nsinc(i) (Research institute for signals, systems and computational intelligence) - http://sinc.unl.edu.ar\n\n\\* Corresponding author: mpividori@sinc.unl.edu.ar\n\n## Description\n\nClustermatch is an efficient clustering method for processing highly diverse\ndata. It can handle very different data types (such as numerical and\ncategorical), in the presence of linear or non-linear relationships, also with\nnoise, and without the need of any previous pre-processing. The article\ndescribing the method has been sent for publication.\n\nIf you want to quickly test Clustermatch, you can access an online web-demo from\n[here](http://sinc.unl.edu.ar/web-demo/clustermatch/).\n\nMirrors:\n\n * Github: https://github.com/sinc-lab/clustermatch\n * Bitbucket (Mercurial): https://bitbucket.org/sinc-lab/clustermatch\n\n## Installation\nYou can easily install Clustermatch with pip by running:\n\n```bash\n$ pip install clustermatch\n```\n\nThis will install a command line utility (run `clustermatch -h` for usage instructions)\nthat it is considered alpha and still under development. Follow the instructions\nbelow if you want to create your own environment and use the Python API to run\nClustermatch.\n\nClustermatch works with Python 3.6 (it should work with version 3.5 too). You\nalso need a C compiler (like GCC) to install `minepy` and run the simulations,\nalthough it's not necessary to use Clustermatch. In Ubuntu you can install GCC\nby running:\n\n```bash\n$ sudo apt-get install build-essential\n```\n\nThe recommended way to install the Python environment needed is using the\n[Anaconda](https://anaconda.org/)/[Miniconda](https://conda.io/miniconda.html)\ndistribution. Once conda is installed, move to the folder where Clustermatch\nwas unpacked and follow these steps:\n\n```bash\n$ conda env create -n cm -f environment.yml\n$ conda activate cm\n```\n\nThis will create a conda environment named `cm`. The last step activates the\nenvironment. You can run the test suite to make sure everything works in your\nsystem:\n\n```bash\n$ python -m unittest discover .\n......................................................................\n\nRan 92 tests in 47.056s\n\nOK\n```\n\nKeep in mind that if you want to fully reproduce the results in the manuscript,\nthen you need to install the full environment using the file\n`environment_full.yml`, which has additional dependencies. The one we used\nbefore (`environment.yml`) has the minimum set of packages needed to run\nClustermatch.\n\n\n## Reproducing results\n\nYou can reproduce one of the manuscripts results by running an experiment using\nan artificial dataset with several linear and non-linear transformations and\nsee how the method behave (replace `{CLUSTERMATCH_FOLDER}` with the path\nof the Clustermatch folder):\n\n```bash\n$ export PYTHONPATH={CLUSTERMATCH_FOLDER}\n$ cd {CLUSTERMATCH_FOLDER}/experiments\n$ python main.py --data-transf transform_rows_nonlinear03 --noise-perc 45 --n-jobs 4 --n-reps 1 --n-features 50\nRunning now:\n{\n \"clustering_algorithm\": \"spectral\",\n \"clustering_metric\": \"ari\",\n \"data_generator\": \"Blobs (data_seed_mode=False). n_features=50, n_samples=1000, centers=3, cluster_std=0.10, center_box=(-1.0, 1.0)\",\n \"data_noise\": {\n \"magnitude\": 0.0,\n \"percentage_measures\": 0.0,\n \"percentage_objects\": 0.45\n },\n \"data_transform\": \"Nonlinear row transformation 03. 10 simulated data sources; Functions: x^4, log, exp2, 100, log1p, x^5, 10000, log10, 0.0001, log2\",\n \"k_final\": null,\n \"n_reps\": 1\n}\n```\n\nThe arguments to the `main.py` scripts are: the data transformation function\n(`--data-transf transform_rows_nonlinear03`), the noise percentage (`--noise-perc 45`), the number of\ncores used (`--n-jobs 4`) and the number of repetitions (`--n-reps 1`). We are using just `1`\nrepetition and 50 features (`--n-features 50`) so as to speed up the\nexperiment. If you want to fully run this experiment as it was done in the\nmanuscript (Figure 3), use this command (for all noise levels):\n\n```bash\npython main.py --data-transf transform_rows_nonlinear03 --noise-perc 45 --n-jobs 4 --n-reps 20\n```\n\nOnce finished, you will find the output in directory\n`results_transform_rows_nonlinear03_0.45/{TIMESTAMP}/`:\n\n```bash\n$ cat results_transform_rows_nonlinear03_0.45/20180829_161133/output000.txt\n\n[...]\n\nmethod ('metric', 'mean') ('metric', 'std') ('time', 'mean')\n---------------- -------------------- ------------------- ------------------\n00. Clustermatch 1.00 nan 31.56\n01. SC-Pearson 0.11 nan 0.33\n02. SC-Spearman 0.29 nan 0.67\n03. SC-DC 0.45 nan 37.19\n04. SC-MIC 0.88 nan 45.73\n```\n\n## Usage\n\nIf you installed the command line utility (`clustermatch`), you can run it like this:\n\n```bash\n$ cd {CLUSTERMATCH_FOLDER}\n$ clustermatch -i experiments/tomato/data/real_sample.xlsx -k 3 -o partition.xls\n```\n\nThe file `partition.xls` will contain the partition of the data (`real_sample.xlsx`).\nCheck out the help (`clustermatch -h`) for more options.\n\nYou can also try the method by loading a sample of the tomato dataset used in\nthe manuscript. For that, follow these instructions:\n\n```bash\n$ cd {CLUSTERMATCH_FOLDER}\n$ ipython\n```\n```python\nIn [1]: from utils.data import merge_sources\nIn [2]: from clustermatch.cluster import calculate_simmatrix, get_partition_spectral\nIn [3]: data_files = ['experiments/tomato/data/real_sample.xlsx']\nIn [4]: merged_sources, feature_names, sources_names = merge_sources(data_files)\nIn [5]: cm_sim_matrix = calculate_simmatrix(merged_sources, n_jobs=4)\nIn [6]: partition = get_partition_spectral(cm_sim_matrix, 3)\n```\n\nThe variable `partition` will have the clustering solution for the number of\nclusters specified (`3` in this case). You can specify multiple input data\nfiles by filling the list `data_files`.\n\nClustermatch is able to process different data types (numerical, ordinal or\ncategorical) with no previous preprocessing required. The current\nimplementation considers a variable as categorical if it contains text. The\nrest, numerical and ordinal, are processed in a similar way, so you are\nresponsible for coding your ordinal varibles appropriately (for example,\n`low`, `normal` and `high` could be coded as 0, 1, 2; otherwise, if left as text,\nwill be considered as categorical).\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/sinc-lab/clustermatch", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "clustermatch", "package_url": "https://pypi.org/project/clustermatch/", "platform": "", "project_url": "https://pypi.org/project/clustermatch/", "project_urls": { "Homepage": "https://github.com/sinc-lab/clustermatch" }, "release_url": "https://pypi.org/project/clustermatch/0.1.4a3/", "requires_dist": [ "numpy", "scipy", "pandas", "joblib", "scikit-learn", "xlrd", "xlwt", "openpyxl" ], "requires_python": ">=3", "summary": "Efficient clustering method for processing highly diverse data", "version": "0.1.4a3" }, "last_serial": 4439466, "releases": { "0.1.4a1": [ { "comment_text": "", "digests": { "md5": "5cf2a882a1cf69f2b5c5ee7d5036a943", "sha256": "a5cb70eeb19e820616e653931d5f04180af32e14399eb11a261a39b693196242" }, "downloads": -1, "filename": "clustermatch-0.1.4a1-py3-none-any.whl", "has_sig": false, "md5_digest": "5cf2a882a1cf69f2b5c5ee7d5036a943", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 17324, "upload_time": "2018-11-01T04:23:30", "url": "https://files.pythonhosted.org/packages/c4/87/14a3e1b43d953e029b09639869283055cecae5e0f9f8be7fd571bee5712a/clustermatch-0.1.4a1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "dcfe2e6cff4ecd899580657d6e9f4a4d", "sha256": "e7d396690eb3e7abce58c4680c69c7b0cc45ce65cd55a60e5fd250a5bd221221" }, "downloads": -1, "filename": "clustermatch-0.1.4a1.tar.gz", "has_sig": false, "md5_digest": "dcfe2e6cff4ecd899580657d6e9f4a4d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 16656, "upload_time": "2018-11-01T04:23:31", "url": "https://files.pythonhosted.org/packages/8a/1b/80190dcd81e2cc2c4717e7a0f7a28148c52b48d9c404e550decb3f6cb65b/clustermatch-0.1.4a1.tar.gz" } ], "0.1.4a2": [ { "comment_text": "", "digests": { "md5": "04a3d66985531c84d0c133d37bf25f4a", "sha256": "69cf82ba065952ad89c8aa00648fc31e43ccf3c41211b777f1f301bb6aba5008" }, "downloads": -1, "filename": "clustermatch-0.1.4a2-py3-none-any.whl", "has_sig": false, "md5_digest": "04a3d66985531c84d0c133d37bf25f4a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 17321, "upload_time": "2018-11-01T04:51:37", "url": "https://files.pythonhosted.org/packages/4b/73/df20c9f3d0138edccb1d6e8e06a8fbab39a211640c23faab007d9eee2145/clustermatch-0.1.4a2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "85777ac175675c031cee12d86c7d6c77", "sha256": "904783f5a0fad7fe46b5b75976ddcf6733afdb2b2c6830f2bb972bd4f551da38" }, "downloads": -1, "filename": "clustermatch-0.1.4a2.tar.gz", "has_sig": false, "md5_digest": "85777ac175675c031cee12d86c7d6c77", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 16663, "upload_time": "2018-11-01T04:51:38", "url": "https://files.pythonhosted.org/packages/3c/44/ec8d0ce4ce0fc38170dc15e195fab95a298ce67ce283ed91eeb4026d7333/clustermatch-0.1.4a2.tar.gz" } ], "0.1.4a3": [ { "comment_text": "", "digests": { "md5": "1adc22ba8142bc2b21a32d1fbe1e3fee", "sha256": "cf434deef2457f60776c5f07a491f2146919c8ffeed5963f500d24932bfc488a" }, "downloads": -1, "filename": "clustermatch-0.1.4a3-py3-none-any.whl", "has_sig": false, "md5_digest": "1adc22ba8142bc2b21a32d1fbe1e3fee", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 17537, "upload_time": "2018-11-01T05:08:24", "url": "https://files.pythonhosted.org/packages/4a/9e/4f6fecf38ad50c70f5a2d5d1e5f0abb3603aebe92e12bda3b208daf70d97/clustermatch-0.1.4a3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "bc0d8e73061b22410173c061f5ae377f", "sha256": "fa0a3bccda5e85b0472a2613cddd56b557419151cbe497ca6fbfce83e6c52867" }, "downloads": -1, "filename": "clustermatch-0.1.4a3.tar.gz", "has_sig": false, "md5_digest": "bc0d8e73061b22410173c061f5ae377f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 17155, "upload_time": "2018-11-01T05:08:25", "url": "https://files.pythonhosted.org/packages/6e/0c/cfa9e292c1bfe41a9a7ba6c85291217b91146f48a4b586e11a51f61ceff7/clustermatch-0.1.4a3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1adc22ba8142bc2b21a32d1fbe1e3fee", "sha256": "cf434deef2457f60776c5f07a491f2146919c8ffeed5963f500d24932bfc488a" }, "downloads": -1, "filename": "clustermatch-0.1.4a3-py3-none-any.whl", "has_sig": false, "md5_digest": "1adc22ba8142bc2b21a32d1fbe1e3fee", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 17537, "upload_time": "2018-11-01T05:08:24", "url": "https://files.pythonhosted.org/packages/4a/9e/4f6fecf38ad50c70f5a2d5d1e5f0abb3603aebe92e12bda3b208daf70d97/clustermatch-0.1.4a3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "bc0d8e73061b22410173c061f5ae377f", "sha256": "fa0a3bccda5e85b0472a2613cddd56b557419151cbe497ca6fbfce83e6c52867" }, "downloads": -1, "filename": "clustermatch-0.1.4a3.tar.gz", "has_sig": false, "md5_digest": "bc0d8e73061b22410173c061f5ae377f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 17155, "upload_time": "2018-11-01T05:08:25", "url": "https://files.pythonhosted.org/packages/6e/0c/cfa9e292c1bfe41a9a7ba6c85291217b91146f48a4b586e11a51f61ceff7/clustermatch-0.1.4a3.tar.gz" } ] }