{ "info": { "author": "Fritz Obermeyer", "author_email": "fritz.obermeyer@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": ".. figure:: doc/cartoon.png\n :alt: A Bayesian latent tree model\n\n A Bayesian latent tree model\n\nTreeCat\n=======\n\n|Build Status| |Latest Version| |DOI|\n\nIntended Use\n------------\n\nTreeCat is an inference engine for machine learning and Bayesian\ninference. TreeCat is appropriate for analyzing medium-sized tabular\ndata with categorical and ordinal values, possibly with missing\nobservations.\n\n+--------------------------+--------------------------+\n| | TreeCat supports |\n+==========================+==========================+\n| Feature Types | categorical, ordinal |\n+--------------------------+--------------------------+\n| # Rows (n) | 1000-100K |\n+--------------------------+--------------------------+\n| # Features (p) | 10-1000 |\n+--------------------------+--------------------------+\n| # Cells (n \u00d7 p) | 10K-10M |\n+--------------------------+--------------------------+\n| # Categories | 2-10ish |\n+--------------------------+--------------------------+\n| Max Ordinal | 10ish |\n+--------------------------+--------------------------+\n| Missing obervations? | yes |\n+--------------------------+--------------------------+\n| Repeated observations? | yes |\n+--------------------------+--------------------------+\n| Sparse data? | no, use something else |\n+--------------------------+--------------------------+\n| Unsupervised | yes |\n+--------------------------+--------------------------+\n| Semisupervised | yes |\n+--------------------------+--------------------------+\n| Supervised | no, use something else |\n+--------------------------+--------------------------+\n\nInstalling\n----------\n\nIf you already have `Numba `__ installed, you\nshould be able to simply\n\n.. code:: sh\n\n pip install pytreecat\n\nIf you're new to Numba, we recommend installing it using\n`miniconda `__ or\n`Anaconda `__.\n\nIf you want to install TreeCat for development, then clone the source\ncode and create a new conda env\n\n.. code:: sh\n\n git clone git@github.com:posterior/treecat\n cd treecat\n conda env create -f environment.3.yml\n source activate treecat3\n pip install -e .\n\nQuick Start\n-----------\n\n1. Format your data as a\n ```data.csv`` `__ file with a header\n row. It's fine to include extra columns that won't be used.\n\n Contents of ```data.csv`` `__:\n\n +-------------+------------+----------+----------+\n | title | genre | decade | rating |\n +=============+============+==========+==========+\n | vertigo | thriller | 1950s | 5 |\n +-------------+------------+----------+----------+\n | up | family | 2000s | 3 |\n +-------------+------------+----------+----------+\n | desk set | comedy | 1950s | 4 |\n +-------------+------------+----------+----------+\n | santapaws | family | 2010s | |\n +-------------+------------+----------+----------+\n | ... | ... | ... | ... |\n +-------------+------------+----------+----------+\n\n2. Generate two schema files\n ```types.csv`` `__ and\n ```values.csv`` `__ using TreeCat's\n ``guess-schema`` command:\n\n .. code:: sh\n\n $ treecat guess-schema data.csv types.csv values.csv\n\n Contents of ```types.csv`` `__:\n\n +----------+---------------+---------+----------+--------------+\n | name | type | total | unique | singletons |\n +==========+===============+=========+==========+==============+\n | title | | 11 | 11 | 11 |\n +----------+---------------+---------+----------+--------------+\n | genre | categorical | 11 | 7 | 4 |\n +----------+---------------+---------+----------+--------------+\n | decade | categorical | 11 | 6 | 3 |\n +----------+---------------+---------+----------+--------------+\n | rating | ordinal | 10 | 5 | 2 |\n +----------+---------------+---------+----------+--------------+\n\n Contents of ```values.csv`` `__:\n\n +----------+-----------+---------+\n | name | value | count |\n +==========+===========+=========+\n | genre | drama | 3 |\n +----------+-----------+---------+\n | genre | family | 2 |\n +----------+-----------+---------+\n | genre | fantasy | 2 |\n +----------+-----------+---------+\n | decade | 1950s | 3 |\n +----------+-----------+---------+\n | ... | ... | ... |\n +----------+-----------+---------+\n\n You can manually fix any incorrectly guessed feature types, or\n add/remove feature values. TreeCat ignores any feature with an empty\n type field.\n\n3. Import your csv files into treecat's internal format. We'll call our\n dataset ``dataset.pkz`` (a gzipped pickle file).\n\n .. code:: sh\n\n $ treecat import-data data.csv types.csv values.csv '' dataset.pkz\n\n (the empty argument '' is an optional structural prior that we\n ignore).\n\n4. Train an ensemble model on your dataset. This typically takes\n ~15minutes for a 1M cell dataset.\n\n .. code:: sh\n\n $ treecat train dataset.pkz model.pkz\n\n5. Load your trained model into a server\n\n .. code:: python\n\n from treecat.serving import serve_model\n server = serve_model('dataset.pkz', 'model.pkz')\n\n6. Run queries against the server. For example we can compute\n expecations\n\n .. code:: python\n\n samples = server.sample(100, evidence={'genre': 'drama'})\n print(np.mean([s['rating'] for s in samples]))\n\n or explore feature structure through the latent correlation matrix\n\n .. code:: python\n\n print(server.latent_correlation())\n\nTuning Hyperparameters\n----------------------\n\nTreeCat requires tuning of two parameters: ``learning_init_epochs``\n(like the number of iterations) and ``model_num_clusters`` (the number\nof latent classes above each feature). The easiest way to tune these is\nto do grid search using the ``treecat.validate`` module with a csv file\nof example parameters.\n\nContents of ```tuning.csv`` `__:\n\n+------------------------+--------------------------+\n| model\\_num\\_clusters | learning\\_init\\_epochs |\n+========================+==========================+\n| 2 | 2 |\n+------------------------+--------------------------+\n| 2 | 3 |\n+------------------------+--------------------------+\n| 4 | 2 |\n+------------------------+--------------------------+\n| ... | ... |\n+------------------------+--------------------------+\n\n.. code:: sh\n\n # This reads parameters from tuning.csv and dumps results to tuning.pkz\n $ treecat.validate tune-csv dataset.pkz tuning.csv tuning.pkz\n\nThe ``tune-csv`` command prints its results, but if you want to seem\nthem later, you can\n\n.. code:: sh\n\n $ treecat.format cat tuning.pkz\n\nThe Server Interface\n--------------------\n\nTreeCat's\n`server `__\ninterface supports primitives for Bayesian inference and tools to\ninspect latent structure:\n\n- ``server.sample(N, evidence=None)`` draws ``N`` samples from the\n joint posterior distribution over observable data, optionally\n conditioned on ``evidence``.\n\n- ``server.logprob(rows, evidence=None)`` computes posterior log\n probability of ``data``, optionally conditioned on ``evidence``.\n\n- ``server.median(evidence)`` computes L1-loss-minimizing estimates,\n conditioned on ``evidence``.\n\n- ``server.observed_perplexity()`` computes the\n `perplexity `__ (a soft\n measure of cardinality) of each observed feature.\n\n- ``server.latent_perplexity()`` computes the perplexity of the latent\n class behind each observed feature.\n\n- ``server.latent_correlation()`` computes the latent-latent\n correlation between each pair of latent variables.\n\n- ``server.estimate_tree()`` computes a maximum a posteriori estimate\n of the latent tree structure.\n\n- ``server.sample_tree(N)`` draws ``N`` samples from posterior\n distribution over the latent tree structures.\n\nThe Model\n---------\n\nTreeCat's generative model is closest to Zhang and Poon's Latent Tree\nAnalysis [1], with the notable difference that TreeCat fixes exactly one\nlatent node per observed node. TreeCat is historically a descendent of\nMansinghka et al.'s CrossCat, a model in which latent nodes (\"views\" or\n\"kinds\") are completely independent. TreeCat addresses the same kind of\nhigh-dimensional categorical distribution that Dunson and Xing's\nmixture-of-product-multinomial models [3] addresses.\n\nLet ``V`` be a set of vertices (one vertex per feature). Let ``C[v]`` be\nthe dimension of the ``v``\\ th feature. Let ``N`` be the number of\ndatapoints. Let ``K[n,v]`` be the number of observations of feature\n``v`` in row ``n`` (e.g. 1 for a categorical variable, 0 for missing\ndata, or ``k`` for an ordinal value with minimum 0 and maximum ``k``).\n\nTreeCat is the following generative model:\n\n.. code:: python\n\n E ~ UniformSpanningTree(V) # An undirected tree.\n for v in V:\n Pv[v] ~ Dirichlet(size = [M], alpha = 1/2)\n for (u,v) in E:\n Pe[u,v] ~ Dirichlet(size = [M,M], alpha = 1/(2*M))\n assume(Pv[u] == sum(Pe[u,v], axis = 1))\n assume(Pv[v] == sum(Pe[u,v], axis = 0))\n for v in V:\n for i in 1:M:\n Q[v,i] ~ Dirichlet(size = [C[v]])\n for n in 1:N:\n for v in V:\n X[n,v] ~ Categorical(Pv[v])\n for (u,v) in E:\n (X[n,u],X[n,v]) ~ Categorical(Pe[u,v])\n for v in V:\n Z[n,v] ~ Multinomial(Q[v,X[n,v]], count = K[n,v])\n\nwhere we've avoided adding an arbitrary root to the tree, and instead\npresented the model as a manifold with overlapping variables and\nconstraints.\n\nThe Inference Algorithm\n-----------------------\n\nThis package implements fully Bayesian MCMC inference using\nsubsample-annealed collapsed Gibbs sampling. There are two pieces of\nlatent state that are sampled:\n\n- Latent class assignments for each row for each vertex (feature).\n These are sampled by single-site collapsed Gibbs sampler with a\n linear subsample-annealing schedule.\n\n- The latent tree structure is sampled by randomly removing an edge and\n replacing it. Since removing an edge splits the graph into two\n connected components, the only replacement locations that are\n feasible are those that re-connect the graph.\n\nThe single-site Gibbs sampler uses dynamic programming to simultaneously\nsample the complete latent assignment vector for each row. A dynamic\nprogramming program is created each time the tree structure changes.\nThis program is interpreted by various virtual machines for different\npurposes (training the model, sampling from the posterior, computing log\nprobability of the posterior). The virtual machine for training is\njit-compiled using numba.\n\nReferences\n----------\n\n1. Nevin L. Zhang, Leonard K. M. Poon (2016) `Latent Tree\n Analysis `__\n2. Vikash Mansinghka, Patrick Shafto, Eric Jonas, Cap Petschulat, Max\n Gasner, Joshua B. Tenenbaum (2015) `CrossCat: A Fully Bayesian\n Nonparametric Method for Analyzing Heterogeneous, High Dimensional\n Data `__\n3. David B. Dunson, Chuanhua Xing (2012) `Nonparametric Bayes Modeling\n of Multivariate Categorical\n Data `__\n\nLicense\n-------\n\nCopyright (c) 2017 Fritz Obermeyer. TreeCat is licensed under the\n`Apache 2.0 License `__.\n\n.. |Build Status| image:: https://travis-ci.org/posterior/treecat.svg?branch=master\n :target: https://travis-ci.org/posterior/treecat\n.. |Latest Version| image:: https://badge.fury.io/py/pytreecat.svg\n :target: https://pypi.python.org/pypi/pytreecat\n.. |DOI| image:: https://zenodo.org/badge/93913649.svg\n :target: https://zenodo.org/badge/latestdoi/93913649\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/posterior/treecat", "keywords": "", "license": "Apache License 2.0", "maintainer": "", "maintainer_email": "", "name": "pytreecat", "package_url": "https://pypi.org/project/pytreecat/", "platform": "", "project_url": "https://pypi.org/project/pytreecat/", "project_urls": { "Homepage": "https://github.com/posterior/treecat" }, "release_url": "https://pypi.org/project/pytreecat/0.1.9/", "requires_dist": null, "requires_python": "", "summary": "A Bayesian latent tree model of multivariate multinomial data", "version": "0.1.9" }, "last_serial": 3097040, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "ceec0ed5d5a26176cd7b6f376c5a3322", "sha256": "cc439d0cb58e65150291e250896b64dfaaf46b779a993a19e33eaaee0b090d0d" }, "downloads": -1, "filename": "pytreecat-0.1.2.tar.gz", "has_sig": false, "md5_digest": "ceec0ed5d5a26176cd7b6f376c5a3322", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23259, "upload_time": "2017-07-07T01:41:33", "url": "https://files.pythonhosted.org/packages/dc/66/e37c23d89671eccf7d0ec087352c0f13c1d873f5efe7ccabce313563546c/pytreecat-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "85099773606451a063ae0073935146a9", "sha256": "d9bbb82c1206186c46b8cb96c89acb01cb7f86350099c3f8d45d81a382c473e5" }, "downloads": -1, "filename": "pytreecat-0.1.3.tar.gz", "has_sig": false, "md5_digest": "85099773606451a063ae0073935146a9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24179, "upload_time": "2017-07-07T04:40:45", "url": "https://files.pythonhosted.org/packages/52/c9/2895cbf07ea97838e396e2daf504b6de667073cacc73da8dd7b83e1f677f/pytreecat-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "d07385a4b8a49ca62a9206b5a38db48e", "sha256": "c7ec6797b717c6721a2035e90eab3dd249bc0bba65150d0deafdbf1a85f8f79c" }, "downloads": -1, "filename": "pytreecat-0.1.4.tar.gz", "has_sig": false, "md5_digest": "d07385a4b8a49ca62a9206b5a38db48e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27554, "upload_time": "2017-07-08T05:22:06", "url": "https://files.pythonhosted.org/packages/d8/ff/36989220ef154a1d1a6972223985f9a7d03b828046f8a96963fb577b32c0/pytreecat-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "4e2e3cd421974562d04d3b75bdee5112", "sha256": "3e9a1a0e6eb8dcea8a45f20c1b11d222782923e0dc6703d34f11ff24407c1db9" }, "downloads": -1, "filename": "pytreecat-0.1.5.tar.gz", "has_sig": false, "md5_digest": "4e2e3cd421974562d04d3b75bdee5112", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 60739, "upload_time": "2017-07-09T19:21:04", "url": "https://files.pythonhosted.org/packages/01/78/e0267f32ccfbc35b581e94f539b8d54459f53bbfe17ee8f3ecae9f27528d/pytreecat-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "64afac3acaff9c3ae8ed4387258356a1", "sha256": "d10ab72d5edacea544a5540c85eaa8657d3a33b2a563e781ca62a2e599b362ef" }, "downloads": -1, "filename": "pytreecat-0.1.6.tar.gz", "has_sig": false, "md5_digest": "64afac3acaff9c3ae8ed4387258356a1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37225, "upload_time": "2017-07-15T03:29:21", "url": "https://files.pythonhosted.org/packages/98/27/ae6a9d5ececf5ddf97720759fd7d1a6e253a843cc0b14a8a576167ca7296/pytreecat-0.1.6.tar.gz" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "8aec87219b420abcad09bfa92cf7a873", "sha256": "4f111417dfb077c9561ccc4268278b2355c02eb538a357b153136a19e9492056" }, "downloads": -1, "filename": "pytreecat-0.1.7.tar.gz", "has_sig": false, "md5_digest": "8aec87219b420abcad09bfa92cf7a873", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 84309, "upload_time": "2017-07-20T02:49:07", "url": "https://files.pythonhosted.org/packages/da/f4/0af85607941b32afadaae49b7473e323efb5933efcabf3be1a88123a5ee1/pytreecat-0.1.7.tar.gz" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "719374a903c37f7eb9bc8817a41bc10c", "sha256": "ec69a4507b12f4bbe13deeeab9419bebb5c797a7860d7be9c9c7eccac6088788" }, "downloads": -1, "filename": "pytreecat-0.1.8.tar.gz", "has_sig": false, "md5_digest": "719374a903c37f7eb9bc8817a41bc10c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 119386, "upload_time": "2017-07-20T03:17:25", "url": "https://files.pythonhosted.org/packages/af/73/842330e8a972a1594f1e88d4ddb20890d907e1467cbce1d49a961bfc7ce0/pytreecat-0.1.8.tar.gz" } ], "0.1.9": [ { "comment_text": "", "digests": { "md5": "7fe670e26d2f8abc1020d846bae530de", "sha256": "8af953e7fcb344c1bf0a5487bd8cfb496edd0a2dcb73284f14eb37ff8bf1d77f" }, "downloads": -1, "filename": "pytreecat-0.1.9.tar.gz", "has_sig": false, "md5_digest": "7fe670e26d2f8abc1020d846bae530de", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 136756, "upload_time": "2017-08-15T03:08:37", "url": "https://files.pythonhosted.org/packages/e6/77/09351408b74446ae6a2fda10619cc23880936067716dab918ce5c5a8dfcd/pytreecat-0.1.9.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7fe670e26d2f8abc1020d846bae530de", "sha256": "8af953e7fcb344c1bf0a5487bd8cfb496edd0a2dcb73284f14eb37ff8bf1d77f" }, "downloads": -1, "filename": "pytreecat-0.1.9.tar.gz", "has_sig": false, "md5_digest": "7fe670e26d2f8abc1020d846bae530de", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 136756, "upload_time": "2017-08-15T03:08:37", "url": "https://files.pythonhosted.org/packages/e6/77/09351408b74446ae6a2fda10619cc23880936067716dab918ce5c5a8dfcd/pytreecat-0.1.9.tar.gz" } ] }