{ "info": { "author": "Victor Davis", "author_email": "vadsql@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# Legomena\n\nTool for exploring types, tokens, and n-legomena relationships in text. Based on [Davis 2019](https://arxiv.org/pdf/1901.00521.pdf) [1] research paper.\n\n[![Testing Status](https://github.com/VictorDavis/legomena/workflows/tests/badge.svg)](https://github.com/VictorDavis/legomena/actions)\n\n## Installation\n\n```\npip install legomena\n```\n\n## Data Sources\n\nThis package may be driven by any data source, but the author has tested two: the [Natural Language ToolKit](https://www.nltk.org/) and the [Standard Project Gutenberg Corpus](https://arxiv.org/abs/1812.08092). The former being the gold standard of python NLP applications, but having a rather measly 18-book gutenberg corpus. The latter containing the full 55,000+ book gutenberg corpus, already tokenized and counted. NOTE: The overlap of the two datasets do _not_ agree in their exact type/token counts, their methodology differing, but this package takes type/token counts as raw data and is therefore methodology-agnostic.\n\n```\n# moby dick from NLTK\nimport nltk\nnltk.download(\"gutenberg\")\nfrom nltk.corpus import gutenberg\nwords = gutenberg.words(\"melville-moby_dick.txt\")\ncorpus = Corpus(words)\nassert corpus.M, corpus.N == (260819, 19317)\n\n# moby dick from SPGC\n# NOTE: download and unzip https://zenodo.org/record/2422561/files/SPGC-counts-2018-07-18.zip into DATA_FOLDER\nimport pandas as pd\nfname = \"%s/SPGC-counts-2018-07-18/PG2701_counts.txt\" % DATA_FOLDER\nwith open(fname) as f:\n df = pd.read_csv(f, delimiter=\"\\t\", header=None, names=[\"word\", \"freq\"])\n f.close()\nwfd = {str(row.word): int(row.freq) for row in df.itertuples()}\ncorpus = Corpus(wfd)\nassert corpus.M, corpus.N == (210258, 16402)\n```\n\n## Basic Usage:\n\nDemo notebooks may be found [here](https://github.com/VictorDavis/legomena/tree/master/notebooks). Unit tests may be found [here](https://github.com/VictorDavis/legomena/blob/master/test_legomena.py).\n\n```\n# basic properties\ncorpus.tokens # list of tokens\ncorpus.types # list of types\ncorpus.fdist # word frequency distribution dataframe\ncorpus.WFD # alias for corpus.fdist\ncorpus.M # number of tokens\ncorpus.N # number of types\ncorpus.k # n-legomena vector\ncorpus.k[n] # n-legomena count (n=1 -> number of hapaxes)\ncorpus.hapax # list of hapax legomena, alias for corpus.nlegomena(1)\ncorpus.dis # list of dis legomena, alias for corpus.nlegomena(2)\ncorpus.tris # list of tris legomena, alias for corpus.nlegomena(3)\ncorpus.tetrakis # list of tetrakis legomena, alias for corpus.nlegomena(4)\ncorpus.pentakis # list of pentakis legomena, alias for corpus.nlegomena(5)\n\n# advanced properties\ncorpus.options # tuple of optional settings\ncorpus.resolution # number of samples to take to calculate TTR curve\ncorpus.dimension # n-legomena vector length to pre-compute (max 6)\ncorpus.seed # random number seed for sampling TTR data\ncorpus.TTR # type-token ratio dataframe\n\n# basic functions\ncorpus.nlegomena(n:int) # list of types occurring exactly n times\ncorpus.sample(m:int) # samples m tokens from corpus *without replacement*\ncorpus.sample(x:float) # samples proportion x of corpus *without replacement*\n```\n\n## Type-Token Models\n\nThere are a variety of models in the literature predicting number of types as a function of tokens, the most well-known being [Heap's Law](https://en.wikipedia.org/wiki/Heaps%27_law). Here are a few implemented, overlaid by the `Corpus` class.\n\n```\n# three models\nmodel = HeapsModel() # Heap's Law\nmodel = InfSeriesModel(corpus) # Infinite Series Model [1]\nmodel = LogModel() # Logarithmic Model [1]\n\n# model fitting\nm_tokens = corpus.TTR.m_tokens\nn_types = corpus.TTR.n_types\nmodel.fit(m_tokens, n_types)\npredictions = model.fit_predict(m_tokens, n_types)\n\n# model parameters\nmodel.params\n\n# model predictions\npredictions = model.predict(m_tokens)\n\n# log model only\ndim = corpus.dimension\npredicted_k = model.predict_k(m_tokens, dim)\n```\n\n## Demo App\n\nCheck out the [demo app](http://legomena.herokuapp.com/) to explore the type-token and n-legomena counts of a few Project Gutenberg books.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/VictorDavis/legomena", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "legomena", "package_url": "https://pypi.org/project/legomena/", "platform": "", "project_url": "https://pypi.org/project/legomena/", "project_urls": { "Homepage": "https://github.com/VictorDavis/legomena" }, "release_url": "https://pypi.org/project/legomena/1.2.0/", "requires_dist": [ "mpmath", "nltk", "numpy", "pandas", "scipy" ], "requires_python": ">=3.5", "summary": "Tool for exploring types, tokens, and n-legomena relationships in text.", "version": "1.2.0", "yanked": false, "yanked_reason": null }, "last_serial": 6010214, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "fdd0463abdf7dccc7bc1060ff41a8d1b", "sha256": "7a1212e3961740c1746e37b2fd3ce066e659c1872d1ece520127090d9e42c0ab" }, "downloads": -1, "filename": "legomena-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "fdd0463abdf7dccc7bc1060ff41a8d1b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 9081, "upload_time": "2019-10-16T02:16:12", "upload_time_iso_8601": "2019-10-16T02:16:12.268110Z", "url": "https://files.pythonhosted.org/packages/20/3a/e4008338954eb3bc6df5f45286a5c4f93a7e41552703f170b25ece59929a/legomena-1.0.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "14923eac0e7682f8ba17012f4c206e6f", "sha256": "17be23b6756f44fa87e93b3d46fedff9c1f8ef692a68df373b210dcece2386a3" }, "downloads": -1, "filename": "legomena-1.0.0.tar.gz", "has_sig": false, "md5_digest": "14923eac0e7682f8ba17012f4c206e6f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 7983, "upload_time": "2019-10-16T02:16:19", "upload_time_iso_8601": "2019-10-16T02:16:19.869054Z", "url": "https://files.pythonhosted.org/packages/88/91/a4132d380c4db909f4c0dd10011f3a1f0a48b39a4110a1bfe309815ebf3f/legomena-1.0.0.tar.gz", "yanked": false, "yanked_reason": null } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "b10e6e1bd78c35567e89909b37cac97b", "sha256": "b3429029352baaeb0f35b2e892ca0935d21a53bbaa4b679108cbc515070b5e08" }, "downloads": -1, "filename": "legomena-1.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "b10e6e1bd78c35567e89909b37cac97b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 8335, "upload_time": "2019-10-18T23:48:35", "upload_time_iso_8601": "2019-10-18T23:48:35.878775Z", "url": "https://files.pythonhosted.org/packages/03/01/c069b52734d6ed2d63b060d40b7adbd126178cc58515ee74bd8d96af75ce/legomena-1.1.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "9a3f5be983931665813edddb072fb0fd", "sha256": "c7f61b6342edddc6d1dbfb39d05b4193c6e7619844cd788130df0f9e5a3a3815" }, "downloads": -1, "filename": "legomena-1.1.0.tar.gz", "has_sig": false, "md5_digest": "9a3f5be983931665813edddb072fb0fd", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 7315, "upload_time": "2019-10-18T23:48:38", "upload_time_iso_8601": "2019-10-18T23:48:38.807375Z", "url": "https://files.pythonhosted.org/packages/30/27/74a844bc9ad5a14fa85814af01d043f0b022914f021b9c1770421b37c75d/legomena-1.1.0.tar.gz", "yanked": false, "yanked_reason": null } ], "1.2.0": [ { "comment_text": "", "digests": { "md5": "c62b08ab9118a606671ef021d9761b37", "sha256": "f315ca5678d30013673280b72784d048149e8015cac4ba0520c1d8d7fa1305ef" }, "downloads": -1, "filename": "legomena-1.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "c62b08ab9118a606671ef021d9761b37", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 8931, "upload_time": "2019-10-22T02:42:24", "upload_time_iso_8601": "2019-10-22T02:42:24.384958Z", "url": "https://files.pythonhosted.org/packages/6d/de/fa5c1f390561ec50ed652f9e3aae26786bfe85b744fe3184781d577d9327/legomena-1.2.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "3584bf4265df4bfaa9860c7445178bab", "sha256": "d52e020cbe2c7aa21f0f05cea0c67f9ac4c1b12e9760bca45a12583d33db5abf" }, "downloads": -1, "filename": "legomena-1.2.0.tar.gz", "has_sig": false, "md5_digest": "3584bf4265df4bfaa9860c7445178bab", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 7894, "upload_time": "2019-10-22T02:42:26", "upload_time_iso_8601": "2019-10-22T02:42:26.250781Z", "url": "https://files.pythonhosted.org/packages/f5/b3/9b410e2439a34f7eda7607fe4c7aa08eb1d7d231533eb500c38804df2a02/legomena-1.2.0.tar.gz", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c62b08ab9118a606671ef021d9761b37", "sha256": "f315ca5678d30013673280b72784d048149e8015cac4ba0520c1d8d7fa1305ef" }, "downloads": -1, "filename": "legomena-1.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "c62b08ab9118a606671ef021d9761b37", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 8931, "upload_time": "2019-10-22T02:42:24", "upload_time_iso_8601": "2019-10-22T02:42:24.384958Z", "url": "https://files.pythonhosted.org/packages/6d/de/fa5c1f390561ec50ed652f9e3aae26786bfe85b744fe3184781d577d9327/legomena-1.2.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "3584bf4265df4bfaa9860c7445178bab", "sha256": "d52e020cbe2c7aa21f0f05cea0c67f9ac4c1b12e9760bca45a12583d33db5abf" }, "downloads": -1, "filename": "legomena-1.2.0.tar.gz", "has_sig": false, "md5_digest": "3584bf4265df4bfaa9860c7445178bab", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 7894, "upload_time": "2019-10-22T02:42:26", "upload_time_iso_8601": "2019-10-22T02:42:26.250781Z", "url": "https://files.pythonhosted.org/packages/f5/b3/9b410e2439a34f7eda7607fe4c7aa08eb1d7d231533eb500c38804df2a02/legomena-1.2.0.tar.gz", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }