{ "info": { "author": "Kyle Gorman", "author_email": "kylebgorman@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Console", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Text Processing :: Linguistic" ], "description": "\ud83d\uddfd CityLex: a free multisource English lexical database\n======================================================\n\n[![PyPI\nversion](https://badge.fury.io/py/citylex.svg)](https://pypi.org/project/citylex)\n[![Supported Python\nversions](https://img.shields.io/pypi/pyversions/citylex.svg)](https://pypi.org/project/citylex)\n[![CircleCI](https://circleci.com/gh/kylebgorman/citylex/tree/master.svg?style=svg)](https://circleci.com/gh/kylebgorman/citylex/tree/master)\n\nCityLex is an English lexical database intended to replace or enhance databases\nlike [CELEX](https://catalog.ldc.upenn.edu/LDC96L14). It combines data from up\nto seven unique sources, including frequency norms, morphological analyses, and\npronunciations. Since these have varying license conditions (some are\nproprietary, others restrict redistribution), we do not provide the database as\nis. Rather, the user must generate a personal copy by executing a Python script,\nenabling whatever sources they wish to use.\n\nBuilding your own CityLex\n-------------------------\n\nTo install CityLex, execute\n\n```bash\npip install citylex\n```\n\nTo see the available data sources and options, execute `citylex --help`.\n\nTo generate the lexicon, execute `citylex` with at least one source enabled\nusing command-line flags. As most of the data is downloaded from online\nsources, an internet connection is normally required. 
The process takes roughly\nfour minutes with all sources enabled; much of the time is spent downloading\nlarge files.\n\nTo generate a lexicon with all the sources that don't require manual downloads,\nexecute\n\n```bash\ncitylex --cmudict \\\n --elp \\\n --subtlex_uk \\\n --subtlex_us \\\n --udlexicons \\\n --unimorph \\\n --wikipron\n```\n\nFile formats\n------------\n\nTwo files are produced. The first, by default `citylex.tsv`, is a standard\nwide-format \"tab separated values\" (TSV) file, of the sort that can be read into\nExcel or R. Some fields (particularly pronunciations and morphological analyses)\ncan have multiple entries per wordform. In this case, they are separated using\nthe `^` character.\n\nAdvanced users may wish to make use of the second file, by default\n`citylex.textproto`, a\n[text-format](https://developers.google.com/protocol-buffers/docs/reference/python/google.protobuf.text_format-module)\n[protocol buffer](https://developers.google.com/protocol-buffers/) which\nprovides a better representation of the repeated fields. To parse this file in\nPython, use the following snippet:\n\n```python\nfrom google.protobuf import text_format\n\nimport citylex_pb2\n\nlexicon = citylex_pb2.Lexicon()\nwith open(\"citylex.textproto\", \"r\") as source: \n text_format.ParseLines(source, lexicon)\n```\n\nThis will parse the text-format data and populate `lexicon`. One can then\niterate over `lexicon.entry` like a Python dictionary.\n\nNon-redistributable data sources\n--------------------------------\n\nNot all CityLex data can be obtained automatically from online sources. If you\nwish to enable CELEX features, follow the instructions below.\n\nThis proprietary resource must be obtained from the [Linguistic Data\nConsortium](https://catalog.ldc.upenn.edu/LDC96L14) as `LDC96L14.tgz`. The file\nshould be decompressed using\n\n```bash\ntar -xzf LDC96L14.tgz\n```\n\nThis will produce a directory named `celex2`. 
To enable CELEX2 features, use\n`--celex` and pass the local path of this directory as an argument to\n`--celex_path`.\n\nFor more information\n--------------------\n\n- `citylex.proto` for the protocol buffer data structure\n- `citylex.bib` for references to the data sources used\n\nFor contributors\n----------------\n\nTo regenerate `citylex_pb2.py` you will need to install the [Protocol Buffers\nC++ runtime](https://github.com/protocolbuffers/protobuf) for your platform,\nmaking sure the version number (e.g., the one returned by `protoc --version`)\nmatches that of `protobuf` in `requirements.txt`. Then, run\n`protoc --python_out=. citylex.proto`.\n\nLicense\n-------\n\nThe CityLex codebase is distributed under the Apache 2.0 license. Please see\n[`License.txt`](LICENSE.txt) for details.\n\nAll other data sources bear the original licenses chosen by their creators;\nsee `citylex --help` for more information.\n\nAuthor\n------\n\nCityLex was created by [Kyle Gorman](http://wellformedness.com).", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kylebgorman/citylex", "keywords": "computational linguistics,morphology,natural language processing,phonology,phonetics,speech,language", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "citylex", "package_url": "https://pypi.org/project/citylex/", "platform": "", "project_url": "https://pypi.org/project/citylex/", "project_urls": { "Homepage": "https://github.com/kylebgorman/citylex" }, "release_url": "https://pypi.org/project/citylex/0.1.2/", "requires_dist": null, "requires_python": ">=3.6", "summary": "Builds a multisource English lexicon", "version": "0.1.2" }, "last_serial": 5972471, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "49f8e4f73076c81b1016d774f06ed925", "sha256": 
"8e7c9883585a5911849e1b231248df0839db184f4670be2abb7b853918fb9db0" }, "downloads": -1, "filename": "citylex-0.1.1.tar.gz", "has_sig": false, "md5_digest": "49f8e4f73076c81b1016d774f06ed925", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 14409, "upload_time": "2019-09-27T12:28:27", "url": "https://files.pythonhosted.org/packages/b0/bf/f3e635c8835e445ba698ddcccdfa322d6597a6cbcc3ff712b1a3a4691ca3/citylex-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "5952be3b53c345f931c3e4626970b497", "sha256": "b0d9c09d63efc9a2ee4f54e48af98a6c312454a858efcf4d92802574117a7d88" }, "downloads": -1, "filename": "citylex-0.1.2.tar.gz", "has_sig": false, "md5_digest": "5952be3b53c345f931c3e4626970b497", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 14530, "upload_time": "2019-10-14T17:25:02", "url": "https://files.pythonhosted.org/packages/5d/4a/7c5eca88341e29855bc2f19086aeba555e23dd4f8711fbe10e1e7f2cd9aa/citylex-0.1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5952be3b53c345f931c3e4626970b497", "sha256": "b0d9c09d63efc9a2ee4f54e48af98a6c312454a858efcf4d92802574117a7d88" }, "downloads": -1, "filename": "citylex-0.1.2.tar.gz", "has_sig": false, "md5_digest": "5952be3b53c345f931c3e4626970b497", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 14530, "upload_time": "2019-10-14T17:25:02", "url": "https://files.pythonhosted.org/packages/5d/4a/7c5eca88341e29855bc2f19086aeba555e23dd4f8711fbe10e1e7f2cd9aa/citylex-0.1.2.tar.gz" } ] }