{ "info": { "author": "Adrien Guille, Pavel Soriano", "author_email": "adrien.guille@univ-lyon2.fr", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Science/Research", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Topic :: Scientific/Engineering", "Topic :: Text Processing" ], "description": "TOM\n===\n\nTOM (TOpic Modeling) is a Python 3 library for topic modeling and\nbrowsing, licensed under the MIT license. Its objective is to allow for\nan efficient analysis of a text corpus from start to finish, via the\ndiscovery of latent topics. To this end, TOM features functions for\npreparing and vectorizing a text corpus. It also offers a common\ninterface for two topic models (namely LDA using either variational\ninference or Gibbs sampling, and NMF using alternating least-square with\na projected gradient method), and implements three state-of-the-art\nmethods for estimating the optimal number of topics to model a corpus.\nWhat is more, TOM constructs an interactive Web-based browser that makes\nit easy to explore a topic model and the related corpus.\n\nFeatures\n--------\n\nVector space modeling\n~~~~~~~~~~~~~~~~~~~~~\n\n- Feature selection based on word frequency\n- Weighting\n\n - tf\n - tf-idf\n\nTopic modeling\n~~~~~~~~~~~~~~\n\n- Latent Dirichlet Allocation\n\n - Standard variational Bayesian inference (Latent Dirichlet\n Allocation. Blei et al, 2003)\n - Online variational Bayesian inference (Online learning for Latent\n Dirichlet Allocation. Hoffman et al, 2010)\n - Collapsed Gibbs sampling (Finding scientific topics. Griffiths &\n Steyvers, 2004)\n\n- Non-negative Matrix Factorization (NMF)\n\n - Alternating least-square with a projected gradient method\n (Projected gradient methods for non-negative matrix factorization.\n Lin, 2007)\n\nEstimating the optimal number of topics\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n- Stability analysis (How many topics? Stability analysis for topic\n models. Greene et al, 2014)\n- Spectral analysis (On finding the natural number of topics with\n latent dirichlet allocation: Some observations. Arun et al, 2010)\n- Consensus-based analysis (Metagenes and molecular pattern discovery\n using matrix factorization. Brunet et al, 2004)\n\nInstallation\n------------\n\nWe recommend you to install Anaconda (https://www.continuum.io) which\nwill automatically install most of the required dependencies (i.e.\npandas, numpy, scipy, scikit-learn, matplotlib, flask). You should then\ninstall the lda module (pip install lda). Eventually, clone or download\nthis repo and run the following command:\n\n::\n\n python setup.py install\n\nOr, install it directly from PyPi:\n\n::\n\n pip install tom_lib\n\nUsage\n-----\n\nWe provide two sample programs, topic\\_model.py (which shows you how to\nload and prepare a corpus, estimate the optimal number of topics, infer\nthe topic model and then manipulate it) and topic\\_model\\_browser.py\n(which shows you how to generate a topic model browser to explore a\ncorpus), to help you get started using TOM.\n\nLoad and prepare a textual corpus\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe following code snippet shows how to load a corpus of French\ndocuments and vectorize them using tf-idf with unigrams.\n\n::\n\n corpus = Corpus(source_file_path='input/raw_corpus.csv',\n language='french', \n vectorization='tfidf', \n n_gram=1,\n max_relative_frequency=0.8, \n min_absolute_frequency=4)\n print('corpus size:', corpus.size)\n print('vocabulary size:', len(corpus.vocabulary))\n print('Vector representation of document 0:\\n', corpus.vector_for_document(0))\n\nInstantiate a topic model and infer topics\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIt is possible to instantiate a NMF or LDA object then infer topics.\n\nNMF:\n\n::\n\n topic_model = NonNegativeMatrixFactorization(corpus)\n topic_model.infer_topics(num_topics=15)\n\nLDA (using either the standard variational Bayesian inference or Gibbs\nsampling):\n\n::\n\n topic_model = LatentDirichletAllocation(corpus)\n topic_model.infer_topics(num_topics=15, algorithm='variational')\n\n::\n\n topic_model = LatentDirichletAllocation(corpus)\n topic_model.infer_topics(num_topics=15, algorithm='gibbs')\n\nInstantiate a topic model and estimate the optimal number of topics\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nHere we instantiate a NMF object, then generate plots with the three\nmetrics for estimating the optimal number of topics.\n\n::\n\n topic_model = NonNegativeMatrixFactorization(corpus)\n viz = Visualization(topic_model)\n viz.plot_greene_metric(min_num_topics=5, \n max_num_topics=50, \n tao=10, step=1, \n top_n_words=10)\n viz.plot_arun_metric(min_num_topics=5, \n max_num_topics=50, \n iterations=10)\n viz.plot_brunet_metric(min_num_topics=5, \n max_num_topics=50,\n iterations=10)\n\nSave/load a topic model\n~~~~~~~~~~~~~~~~~~~~~~~\n\nTo allow reusing previously learned topics models, TOM can save them on\ndisk, as shown below.\n\n::\n\n utils.save_topic_model(topic_model, 'output/NMF_15topics.tom')\n topic_model = utils.load_topic_model('output/NMF_15topics.tom')\n\nPrint information about a topic model\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThis code excerpt illustrates how one can manipulate a topic model, e.g.\nget the topic distribution for a document or the word distribution for a\ntopic.\n\n::\n\n print('\\nTopics:')\n topic_model.print_topics(num_words=10)\n print('\\nTopic distribution for document 0:',\n topic_model.topic_distribution_for_document(0))\n print('\\nMost likely topic for document 0:',\n topic_model.most_likely_topic_for_document(0))\n print('\\nFrequency of topics:',\n topic_model.topics_frequency())\n print('\\nTop 10 most relevant words for topic 2:',\n topic_model.top_words(2, 10))\n\nTopic model browser: screenshots\n--------------------------------\n\nTopic cloud\n~~~~~~~~~~~\n\n|image0| ### Topic details |image1| ### Document details |image2|\n\n.. |image0| image:: http://mediamining.univ-lyon2.fr/people/guille/tom_resources/topic_cloud.jpg\n.. |image1| image:: http://mediamining.univ-lyon2.fr/people/guille/tom_resources/topic_details.jpg\n.. |image2| image:: http://mediamining.univ-lyon2.fr/people/guille/tom_resources/document_details.jpg", "description_content_type": null, "docs_url": null, "download_url": "http://pypi.python.org/packages/source/t/tom_lib/tom_lib-0.2.2.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://mediamining.univ-lyon2.fr/people/guille/tom.php", "keywords": null, "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "tom_lib", "package_url": "https://pypi.org/project/tom_lib/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/tom_lib/", "project_urls": { "Download": "http://pypi.python.org/packages/source/t/tom_lib/tom_lib-0.2.2.tar.gz", "Homepage": "http://mediamining.univ-lyon2.fr/people/guille/tom.php" }, "release_url": "https://pypi.org/project/tom_lib/0.2.2/", "requires_dist": null, "requires_python": null, "summary": "A library for topic modeling and browsing", "version": "0.2.2" }, "last_serial": 2185219, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "30df7e7b5911835089d665d003b0b435", "sha256": "c8bb67a437b7a18740b4b647d3da7a2062cb53cb69751d59aaa6479019cdf86c" }, "downloads": -1, "filename": "tom_lib-0.1.2.tar.gz", "has_sig": false, "md5_digest": "30df7e7b5911835089d665d003b0b435", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6510757, "upload_time": "2016-04-15T23:28:17", "url": "https://files.pythonhosted.org/packages/d6/4b/d3040e1ee423ffc04b0f58bdbe335f9a44a8095b11b5a9e2c7c837327d27/tom_lib-0.1.2.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "b0a9e473d35b6af559705c17d5f9069e", "sha256": "db0b7c2b48ed5b9b146c81e208e222706d06b74a836c411a87063f32a4b9cb0e" }, "downloads": -1, "filename": "tom_lib-0.2.0.tar.gz", "has_sig": false, "md5_digest": "b0a9e473d35b6af559705c17d5f9069e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 821153, "upload_time": "2016-06-23T22:35:52", "url": "https://files.pythonhosted.org/packages/ef/ee/14cec017ac1a6a3b6dcebf98196dd6a0bc07c92e7a356e74583cd629d7c4/tom_lib-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "1a11c87bd3f939f5c85524d975febe80", "sha256": "2d075520d19e87fcaab6a8889b88f291d6f84715ea049638962b1ddafff314e2" }, "downloads": -1, "filename": "tom_lib-0.2.1.tar.gz", "has_sig": false, "md5_digest": "1a11c87bd3f939f5c85524d975febe80", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 819806, "upload_time": "2016-06-24T12:55:37", "url": "https://files.pythonhosted.org/packages/53/dd/2d45db2bba7460cf76da261fd493cb4b3b6d31be3ae7c0424eba19e15c34/tom_lib-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "b1e189a856f9298a0efde649df2d4a87", "sha256": "46b0b542f0b241e8aead1470fd38965449784268f4a71663c7b01eff0094e41b" }, "downloads": -1, "filename": "tom_lib-0.2.2.tar.gz", "has_sig": false, "md5_digest": "b1e189a856f9298a0efde649df2d4a87", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 819546, "upload_time": "2016-06-24T13:42:41", "url": "https://files.pythonhosted.org/packages/e4/50/045dce3e4f2b77611bd07d9635626ea7acfdafd81d81c64c271b84456ee9/tom_lib-0.2.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b1e189a856f9298a0efde649df2d4a87", "sha256": "46b0b542f0b241e8aead1470fd38965449784268f4a71663c7b01eff0094e41b" }, "downloads": -1, "filename": "tom_lib-0.2.2.tar.gz", "has_sig": false, "md5_digest": "b1e189a856f9298a0efde649df2d4a87", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 819546, "upload_time": "2016-06-24T13:42:41", "url": "https://files.pythonhosted.org/packages/e4/50/045dce3e4f2b77611bd07d9635626ea7acfdafd81d81c64c271b84456ee9/tom_lib-0.2.2.tar.gz" } ] }