{ "info": { "author": "Vikash Singh (vi3k6i5)", "author_email": "vikash.duliajan@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Information Technology", "Intended Audience :: Science/Research", "License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)", "Operating System :: MacOS", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX", "Operating System :: Unix", "Programming Language :: C", "Programming Language :: Cython", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "GuidedLDA: Guided Topic modeling with latent Dirichlet allocation\n====================================================\n\n.. image:: https://readthedocs.org/projects/guidedlda/badge/?version=latest\n :target: http://guidedlda.readthedocs.io/en/latest/?badge=latest\n :alt: Documentation Status\n\n.. image:: https://badge.fury.io/py/guidedlda.svg\n :target: https://badge.fury.io/py/guidedlda\n :alt: Package version\n\n\n``GuidedLDA`` OR ``SeededLDA`` implements latent Dirichlet allocation (LDA) using collapsed Gibbs sampling. ``GuidedLDA`` can be guided by setting some seed words per topic. Which will make the topics converge in that direction.\n\nYou can read more about guidedlda in `the documentation `_.\n\nInstallation\n------------\n\n::\n\n pip install guidedlda\n\nIf pip install does not work, then try the next step:\n\n::\n\n https://github.com/vi3k6i5/GuidedLDA\n cd GuidedLDA\n sh build_dist.sh\n python setup.py sdist\n pip install -e .\n\nIf the above step also does not work, please raise an `issue `_ with details of your workstation's OS version, Python version, architecture etc. and I will try my best to fix it ASAP.\n\nGetting started\n---------------\n\n``guidedlda.GuidedLDA`` implements latent Dirichlet allocation (LDA). The interface follows\nconventions found in scikit-learn_.\n\n`Example Code `_.\n\n\nThe following demonstrates how to inspect a model of a subset of the NYT\nnews dataset. The input below, ``X``, is a document-term matrix (sparse matrices\nare accepted).\n\n.. code-block:: python\n\n >>> import numpy as np\n >>> import guidedlda\n \n >>> X = guidedlda.datasets.load_data(guidedlda.datasets.NYT)\n >>> vocab = guidedlda.datasets.load_vocab(guidedlda.datasets.NYT)\n >>> word2id = dict((v, idx) for idx, v in enumerate(vocab))\n \n >>> X.shape\n (8447, 3012)\n \n >>> X.sum()\n 1221626\n >>> # Normal LDA without seeding\n >>> model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)\n >>> model.fit(X)\n INFO:guidedlda:n_documents: 8447\n INFO:guidedlda:vocab_size: 3012\n INFO:guidedlda:n_words: 1221626\n INFO:guidedlda:n_topics: 5\n INFO:guidedlda:n_iter: 100\n WARNING:guidedlda:all zero column in document-term matrix found\n INFO:guidedlda:<0> log likelihood: -11489265\n INFO:guidedlda:<20> log likelihood: -9844667\n INFO:guidedlda:<40> log likelihood: -9694223\n INFO:guidedlda:<60> log likelihood: -9642506\n INFO:guidedlda:<80> log likelihood: -9617962\n INFO:guidedlda:<99> log likelihood: -9604031\n \n >>> topic_word = model.topic_word_\n >>> n_top_words = 8\n >>> for i, topic_dist in enumerate(topic_word):\n >>> topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]\n >>> print('Topic {}: {}'.format(i, ' '.join(topic_words)))\n Topic 0: company percent market business plan pay price increase\n Topic 1: game play team win player season second start\n Topic 2: life child write man school woman father family\n Topic 3: place open small house music turn large play\n Topic 4: official state government political states issue leader case\n \n >>> # Guided LDA with seed topics.\n >>> seed_topic_list = [['game', 'team', 'win', 'player', 'season', 'second', 'victory'],\n >>> ['percent', 'company', 'market', 'price', 'sell', 'business', 'stock', 'share'],\n >>> ['music', 'write', 'art', 'book', 'world', 'film'],\n >>> ['political', 'government', 'leader', 'official', 'state', 'country', 'american','case', 'law', 'police', 'charge', 'officer', 'kill', 'arrest', 'lawyer']]\n \n >>> model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)\n \n >>> seed_topics = {}\n >>> for t_id, st in enumerate(seed_topic_list):\n >>> for word in st:\n >>> seed_topics[word2id[word]] = t_id\n \n >>> model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)\n INFO:guidedlda:n_documents: 8447\n INFO:guidedlda:vocab_size: 3012\n INFO:guidedlda:n_words: 1221626\n INFO:guidedlda:n_topics: 5\n INFO:guidedlda:n_iter: 100\n WARNING:guidedlda:all zero column in document-term matrix found\n INFO:guidedlda:<0> log likelihood: -11486362\n INFO:guidedlda:<20> log likelihood: -9767277\n INFO:guidedlda:<40> log likelihood: -9663718\n INFO:guidedlda:<60> log likelihood: -9624150\n INFO:guidedlda:<80> log likelihood: -9601684\n INFO:guidedlda:<99> log likelihood: -9587803\n \n \n >>> n_top_words = 10\n >>> topic_word = model.topic_word_\n >>> for i, topic_dist in enumerate(topic_word):\n >>> topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]\n >>> print('Topic {}: {}'.format(i, ' '.join(topic_words)))\n Topic 0: game play team win season player second point start victory\n Topic 1: company percent market price business sell executive pay plan sale\n Topic 2: play life man music place write turn woman old book\n Topic 3: official government state political leader states issue case member country\n Topic 4: school child city program problem student state study family group\n\nThe document-topic distributions should be retrived as: ``doc_topic = model.transform(X)``.\n\n.. code-block:: python\n\n >>> doc_topic = model.transform(X)\n >>> for i in range(9):\n >>> print(\"top topic: {} Document: {}\".format(doc_topic[i].argmax(), \n ', '.join(np.array(vocab)[list(reversed(X[i,:].argsort()))[0:5]])))\n top topic: 4 Document: plant, increase, food, increasingly, animal\n top topic: 3 Document: explain, life, country, citizen, nation\n top topic: 2 Document: thing, solve, problem, machine, carry\n top topic: 2 Document: company, authority, opera, artistic, director\n top topic: 3 Document: partner, lawyer, attorney, client, indict\n top topic: 2 Document: roll, place, soon, treat, rating\n top topic: 3 Document: city, drug, program, commission, report\n top topic: 1 Document: company, comic, series, case, executive\n top topic: 3 Document: son, scene, charge, episode, attack\n\nSave the model for production or for running later:\n\n.. code-block:: python\n\n >>> from six.moves import cPickle as pickle\n >>> # Uncomment next step if you want to lighten the model object\n >>> # This step will delete some matrices inside the model.\n >>> # you will be able to use model.transform(X) the same way as earlier.\n >>> # you wont be able to use model.fit_transform(X_new)\n >>> # model.purge_extra_matrices()\n >>> with open('guidedlda_model.pickle', 'wb') as file_handle:\n >>> pickle.dump(model, file_handle)\n >>> # load the model for prediction\n >>> with open('guidedlda_model.pickle', 'rb') as file_handle:\n >>> model = pickle.load(file_handle)\n >>> doc_topic = model.transform(X)\n\n\nRequirements\n------------\n\nPython 2.7 or Python 3.3+ is required. The following packages are required\n\n- numpy_\n- pbr_\n\nCaveat\n------\n\n``guidedlda`` aims for Guiding LDA. More often then not the topics we get from a LDA model are not to our satisfaction. GuidedLDA can give the topics a nudge in the direction we want it to converge. We have production trained it for half a million documents (We have a big machine). We have run predictions on millions and manually checked topics for thousands (we are satisfied with the results).\n\nIf you are working with a very large corpus you may wish to use more sophisticated topic models such as those implemented in hca_ and MALLET_. hca_ is written entirely in C and MALLET_ is written in Java. Unlike ``guidedlda``, hca_ can use more than one processor at a time. Both MALLET_ and hca_ implement topic models known to be more robust than standard latent Dirichlet allocation.\n\nNotes\n-----\n\nLatent Dirichlet allocation is described in `Blei et al. (2003)`_ and `Pritchard\net al. (2000)`_. Inference using collapsed Gibbs sampling is described in\n`Griffiths and Steyvers (2004)`_. And Guided LDA is described in `Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012)`_\n\n\nImportant links\n---------------\n\n- Documentation: http://guidedlda.readthedocs.org\n- Source code: https://github.com/vi3k6i5/guidedlda/\n- Issue tracker: https://github.com/vi3k6i5/guidedlda/issues\n\nOther implementations\n---------------------\n- scikit-learn_'s `LatentDirichletAllocation `_ (uses online variational inference)\n- `gensim `_ (uses online variational inference)\n\nCredits\n-------\nI would like to thank the creators of `LDA project `_. I used the code from that LDA project as base to implement GuidedLDA on top of it.\n\nThanks to : `Allen Riddell `_ and `Tim Hopper `_. :)\n\nLicense\n-------\n\n``guidedlda`` is licensed under Version 2.0 of the Mozilla Public License.\n\n.. _Python: http://www.python.org/\n.. _scikit-learn: http://scikit-learn.org\n.. _hca: http://www.mloss.org/software/view/527/\n.. _MALLET: http://mallet.cs.umass.edu/\n.. _numpy: http://www.numpy.org/\n.. _pbr: https://pypi.python.org/pypi/pbr\n.. _Cython: http://cython.org\n.. _Blei et al. (2003): http://jmlr.org/papers/v3/blei03a.html\n.. _Pritchard et al. (2000): http://www.genetics.org/content/155/2/945.full\n.. _Griffiths and Steyvers (2004): http://www.pnas.org/content/101/suppl_1/5228.abstract\n.. _Jagadeesh Jagarlamudi, Hal Daume III and Raghavendra Udupa (2012): http://www.umiacs.umd.edu/~jags/pdfs/GuidedLDA.pdf", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "guidedlda", "package_url": "https://pypi.org/project/guidedlda/", "platform": "", "project_url": "https://pypi.org/project/guidedlda/", "project_urls": null, "release_url": "https://pypi.org/project/guidedlda/2.0.0.dev22/", "requires_dist": null, "requires_python": "", "summary": "Topic modeling with Guided latent Dirichlet allocation", "version": "2.0.0.dev22" }, "last_serial": 3283717, "releases": { "1.6.0.dev2": [ { "comment_text": "", "digests": { "md5": "7bee161c95880dad699bdd2e26e2cc8a", "sha256": "192a03fbdf2ec0a9b7c0f15688bafcbd14f7f56ddfb985160c3685315c335d5c" }, "downloads": -1, "filename": "guidedlda-1.6.0.dev2.tar.gz", "has_sig": false, "md5_digest": "7bee161c95880dad699bdd2e26e2cc8a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2082292, "upload_time": "2017-10-04T20:10:33", "url": "https://files.pythonhosted.org/packages/76/9e/59f9c6eb5f0a33e6ffc5ca0f6d414a8314e31834a31a564760006b273e8f/guidedlda-1.6.0.dev2.tar.gz" } ], "1.7.0.dev9": [ { "comment_text": "", "digests": { "md5": "bac38d68747a57aa92501fd2d7982fb2", "sha256": "15f2d0a19a03c8d47c3a78674ee0f132916eb9a2bd4b8130f2884ce4cc3d7de0" }, "downloads": -1, "filename": "guidedlda-1.7.0.dev9.tar.gz", "has_sig": false, "md5_digest": "bac38d68747a57aa92501fd2d7982fb2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2083157, "upload_time": "2017-10-05T16:58:08", "url": "https://files.pythonhosted.org/packages/88/9c/051852aa83d1bd10c68978d0cd01716bc4df702b3fa068c61c62f2e86743/guidedlda-1.7.0.dev9.tar.gz" } ], "1.8.0.dev9": [ { "comment_text": "", "digests": { "md5": "f3865bb69986b9af8c65eda1bd4eaa63", "sha256": "62c353470359a1afb52a4fa60d1b86887d5d55c4daf64a509b484a8e77a6a844" }, "downloads": -1, "filename": "guidedlda-1.8.0.dev9.tar.gz", "has_sig": false, "md5_digest": "f3865bb69986b9af8c65eda1bd4eaa63", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2083179, "upload_time": "2017-10-05T17:06:04", "url": "https://files.pythonhosted.org/packages/62/8d/df11c7da1227f15cbe83549c1051f10e7e73cdc815a4e76b52b2d6dd0378/guidedlda-1.8.0.dev9.tar.gz" } ], "1.9.0.dev9": [ { "comment_text": "", "digests": { "md5": "25585cb1ef3247b749319d7e2d9d1201", "sha256": "36f2739639c35014ea2c481ce696133bb4fd69d8edb579d39eeaf1cf88278227" }, "downloads": -1, "filename": "guidedlda-1.9.0.dev9.tar.gz", "has_sig": false, "md5_digest": "25585cb1ef3247b749319d7e2d9d1201", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2083186, "upload_time": "2017-10-05T17:10:57", "url": "https://files.pythonhosted.org/packages/11/8a/7552649d65b5f63cf43e4a227126c93888105cdfae768d6c5e1b0a1937de/guidedlda-1.9.0.dev9.tar.gz" } ], "2.0.0.dev22": [ { "comment_text": "", "digests": { "md5": "6b199ec5a1f3a1561543a3a1c6f0f640", "sha256": "0918b5102ec9a47f2109e6c07d95e06c3c63a8acd73ffb57538280e69ebe1c5c" }, "downloads": -1, "filename": "guidedlda-2.0.0.dev22.tar.gz", "has_sig": false, "md5_digest": "6b199ec5a1f3a1561543a3a1c6f0f640", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2093241, "upload_time": "2017-10-27T13:25:13", "url": "https://files.pythonhosted.org/packages/f8/ee/6d6e2b3525388399e12a4482554c7529a5fcf5e99c50a60abaa02894b8bf/guidedlda-2.0.0.dev22.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "6b199ec5a1f3a1561543a3a1c6f0f640", "sha256": "0918b5102ec9a47f2109e6c07d95e06c3c63a8acd73ffb57538280e69ebe1c5c" }, "downloads": -1, "filename": "guidedlda-2.0.0.dev22.tar.gz", "has_sig": false, "md5_digest": "6b199ec5a1f3a1561543a3a1c6f0f640", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2093241, "upload_time": "2017-10-27T13:25:13", "url": "https://files.pythonhosted.org/packages/f8/ee/6d6e2b3525388399e12a4482554c7529a5fcf5e99c50a60abaa02894b8bf/guidedlda-2.0.0.dev22.tar.gz" } ] }