{ "info": { "author": "Tatu Ylonen", "author_email": "ylo@clausal.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Natural Language :: Finnish", "Operating System :: OS Independent", "Operating System :: POSIX :: Linux", "Programming Language :: Python", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3 :: Only", "Topic :: Text Processing", "Topic :: Text Processing :: Linguistic" ], "description": "# Wiktextract\n\nThis is a utility and Python package for extracting data from Wiktionary.\n\n## Overview\n\nThis is a Python package and tool for extracting information from\nWiktionary data dumps. It reads the\n``enwiktionary--pages-articles.xml.bz2`` file (or corresponding\nfiles from other wiktionaries) and returns Python dictionaries\ncontaining most of the information in Wiktionary.\n\nThis tool extracts glosses, parts-of-speech, declension/conjugation\ninformation when available, translations for all languages when\navailable, pronunciations (including audio file links), qualifiers\nincluding usage notes, word forms, links between words including\nhypernyms, hyponyms, holonyms, meronyms, related words, derived terms,\ncompounds, alternative forms, etc. Links to Wikipedia pages, Wikidata\nidentifiers, and other such data are also extracted when available.\nFor many classes of words, a word sense is annotated with specific\ninformation such as what word it is a form of, what is the RGB value\nof the color it represents, what is the numeric value of a number,\nwhat SI unit it represents, etc.\n\nThe tool is capable of extracting information for any language.\nHowever, so far it has mostly been tested with English and Finnish,\nand to some extent German and Spanish. 
Changes to extract information\nfor any additional languages are likely to be small. Basic\ninformation extraction most likely works out of the box for any\nlanguage.\n\nThis utility will be useful for many natural language processing,\nsemantic parsing, machine translation, and language generation\napplications both in research and industry.\n\nThe tool can be used to extract machine translation dictionaries,\nlanguage understanding dictionaries, semantically annotated\ndictionaries, and morphological dictionaries with\ndeclension/conjugation information (where this information is\navailable for the target language). Dozens of languages have\nextensive vocabulary in ``enwiktionary``, and several thousand\nlanguages have partial coverage.\n\nThe ``wiktwords`` script makes extracting the information for use by\nother tools trivial without writing a single line of code. It\nextracts the information specified by command options for languages\nspecified on the command line, and writes the extracted data to a file\nor standard output in JSON format for processing by other tools.\n\nAs far as we know, this is the most comprehensive tool available for\nextracting information from Wiktionary as of November 2018.\n\n## Getting started\n\n### Installing\n\nTo install ``wiktextract``, use ``pip`` (or ``pip3``, as appropriate),\nor clone the repository and install from the source:\n\n```\ngit clone https://github.com/tatuylonen/wiktextract.git\ncd wiktextract\npython3 setup.py install\n```\n\nThis will install the ``wiktextract`` package and the ``wiktwords`` script.\n\nNote that this software has currently only been tested with Python 3.\nBack-porting to Python 2.7 should not be difficult; it just hasn't been\ntested yet. 
Please report back if you test and make this work with\nPython 2.\n\n### Running tests\n\nThis package includes tests written using the ``unittest`` framework.\nThey can be run using, for example, ``nose``, which can be installed\nusing ``pip3 install nose``.\n\nTo run the tests, just use the following command in the top-level directory:\n```\nnosetests\n```\n\n## Using the command-line tool\n\nThe ``wiktwords`` script is the easiest way to extract data from\nWiktionary. Just download the data dump file from\n[dumps.wikimedia.org](https://dumps.wikimedia.org/enwiktionary/) and\nrun the script. The correct dump file has the name\n``enwiktionary--pages-articles.xml.bz2``.\n\nThe command-line tool may be invoked as follows:\n\n```\nwiktwords data/enwiktionary-latest-pages-articles.xml.bz2 --out wikt.words --language English --all\n```\n\nThe following command-line options are supported:\n\n* --out FILE: specifies the name of the file to write (specifying \"-\" as the file writes to stdout)\n* --language LANGUAGE: extracts the given language (this option may be specified multiple times; by default, English and Translingual words are extracted)\n* --list-languages: prints a list of supported language names\n* --all: causes all data to be captured for the selected languages\n* --translations: causes translations to be captured\n* --pronunciation: causes pronunciation information to be captured\n* --linkages: causes linkages (hypernyms etc.) 
to be captured\n* --compounds: causes compound words using each word to be captured\n* --redirects: causes redirects to be extracted\n* --statistics: prints useful statistics at the end\n* --pages-dir DIR: save all wiktionary pages under this directory (mostly for debugging)\n* --help: displays help text\n\nExtracting all of English Wiktionary may take about an hour, depending\non the speed of your system.\n\n## Calling the library\n\nThe library can be called as follows:\n\n```\nimport wiktextract\n\nctx = wiktextract.parse_wiktionary(\n path, word_cb,\n capture_cb=None,\n languages=[\"English\", \"Translingual\"],\n translations=False,\n pronunciations=False,\n redirects=False)\n```\n\nThe ``parse_wiktionary`` call will call ``word_cb(data)`` for words\nand redirects found in the Wiktionary dump. ``data`` is information\nabout a single word and part-of-speech as a dictionary (multiple\nsenses of the same part-of-speech are combined into the same\ndictionary). It may also be a redirect (indicated by presence of a\n\"redirect\" key in the dictionary). It is in the same format as the\nJSON-formatted dictionaries returned by the ``wiktwords`` tool. The\nformat is described below.\n\n``capture_cb(title, text)`` is called for every page before extracting any\nwords from it. It should return True if the page should be analyzed, and\nFalse if the page should be ignored. It can also be used to write certain\npages to disk or capture certain pages for different analyses (e.g., extracting\nhierarchies, classes, thesauri, or topic-specific word lists). If this\ncallback is None, all pages are analyzed.\n\n``languages`` should be a list, tuple, or set of language names to\ncapture. It defaults to ``[\"English\", \"Translingual\"]``.\n\n``translations`` can be set to True to capture translation\ninformation for words. Translation information seems to be most\nwidely available for the English language, which has translations into\nother languages. 
The translation information increases the size and\nloading time of the captured data substantially, so this is disabled\nby default.\n\n``pronunciations`` should be set to True to capture pronunciation\ninformation for words. Typically, this includes IPA transcriptions\nand any audio files included in the word entries, along with other\ninformation. However, the type and amount of pronunciation\ninformation varies widely between languages. This is disabled by\ndefault since many applications won't need the information.\n\n``linkages`` should be set to True to capture linkages between words, such as\nhypernyms, antonyms, synonyms, etc.\n\n``compounds`` should be set to True to capture compound words containing\nthe word.\n\n``redirects`` should be set to True to capture redirects. Redirects\nare not associated with any specific language and thus requesting them\nreturns them for words in all languages.\n\n## Format of extracted redirects\n\nSome pages in Wiktionary are redirects. For these, ``word_cb`` will\nbe called with data in a special format. In this case, the dictionary\nwill have the key ``redirect``, which will contain the name of the\nword the entry redirects to. The key ``word`` contains the word/term\nthat contains the redirect. Redirect entries do not have ``pos`` or\nany of the other fields. Redirects also are not associated with any\nlanguage, so all redirects are always returned regardless of the captured\nlanguages (if extracting redirects has been requested).\n\n## Format of the extracted word entries\n\nInformation returned for each word is a dictionary. The dictionary has the\nfollowing keys (others may also be present or added later):\n\n* ``word``: the word form\n* pos: part-of-speech, such as \"noun\", \"verb\", \"adj\", \"adv\", \"pron\", \"determiner\", \"prep\" (preposition), \"postp\" (postposition), and many others. 
The complete list of possible values returned by the package can be found in ``wiktextract.PARTS_OF_SPEECH``.\n* ``senses``: word senses for this word/part-of-speech (see below)\n* ``conjugation``: conjugation/declension entries found for the word\n* ``heads``: part-of-speech specific head tags for the word. Useful for, e.g., obtaining comparatives, superlatives, and other inflection information for many languages. Each value is a dictionary, basically containing the arguments of the corresponding template in Wiktionary, with the template name under \"template_name\".\n* ``hyphenation``: list of hyphenations for the word when available. Each hyphenation is a sequence of syllables.\n* ``pinyin``: for Chinese words, the romanized transliteration, when available\n* ``synonyms``: synonym linkages for the word (see below)\n* ``antonyms``: antonym linkages for the word (see below)\n* ``hypernyms``: hypernym linkages for the word (see below)\n* ``holonyms``: linkages indicating being part of something (see below) (not systematically encoded)\n* ``meronyms``: linkages indicating having a part (see below) (fairly rare)\n* ``derived``: derived word linkages for the word (see below)\n* ``related``: related word linkages for the word (see below)\n* ``pronunciations``: contains pronunciation information when collected (see below)\n* ``translations``: contains translation information when collected (see below)\n\n### Word senses\n\nEach part-of-speech may have multiple glosses under the ``senses`` key. Each\nsense is a dictionary that may contain the following keys (among others, and more may be added in the future):\n\n* ``glosses``: list of gloss strings for the word sense (usually only one). This has been cleaned, and should be straightforward text with no tagging.\n* ``nonglosses``: list of gloss-like strings but that are not traditional glossary entries describing the word's meaning\n* ``tags``: list of qualifiers and tags for the gloss. 
This is a list of strings, and may include words such as \"archaic\", \"colloquial\", \"present\", \"plural\", \"person\", \"organism\", \"british\", \"chemistry\", \"given name\", \"surname\", \"female\", and many others (new words may appear arbitrarily). Some effort has been put into trying to canonicalize various sources and styles of annotation into a consistent set of tags, but it is impossible to do an exact job at this.\n* ``senseid``: list of identifiers collected for the sense. Some entries have a Wikidata identifier (Q) here; others may have other identifiers. Currently sense ids are not very widely annotated in Wiktionary.\n* ``wikipedia``: link to wikipedia page from the word sense/gloss\n* ``topics``: topic categories specified for the sense (these may also be in \"tags\")\n* ``taxon``: links to taxonomical data\n* ``categories``: Category links specified for the page\n* ``color``: specification of RGB color values (hex or CSS color name)\n* ``value``: value represented by the word (e.g., for numerals)\n* ``unit``: information about units of measurement, particularly SI units, tagged to the word\n* ``alt_of``: list of words of which this sense is an alternative form or abbreviation\n* ``inflection_of``: list of words that this sense is an inflection of\n* ``conjugation``: list of templates indicating conjugation/declension (list of dictionaries containing the arguments of the Wiktionary template, with template name under \"template_name\")\n\n### Linkages to other words\n\nLinkages (``synonyms``, ``antonyms``, ``hypernyms``, ``derived words``, ``holonyms``, ``meronyms``, ``derived``, ``related``) are lists of dictionaries, where each dictionary can contain the following keys, among others:\n\n* ``word``: the word this links to (string). If this starts with \"Thesaurus:\", then this entry is a link to a thesaurus page in Wiktionary. 
If this starts with \"Category:\", then this refers to a category page in Wiktionary.\n* ``sense``: text identifying the word sense or context. This may also be a title from a table where the links are shown (e.g., \"Derived terms of name that are not hyponyms\").\n* ``tags``: qualifiers specified for the sense (e.g., field of study, region, dialect, style). This is a list of strings.\n\n### Pronunciation\n\nPronunciation information is stored under the ``pronunciations`` key. It is a\nlist of dictionaries, each of which may contain the following keys,\namong others:\n\n* ``audios``: list of audio files referenced as a list of ``(languagecode, filename, description)``\n* ``ipa``: pronunciation specifications in IPA format as tuples (lang, ipatext)\n* ``special_ipa``: special IPA-like specifications (sometimes macros calling code in Wiktionary), as list of dictionaries\n* ``enpr``: pronunciations in English pronunciation format as list of strings\n* ``homophones``: list of homophones for the word\n* ``accent``: accent markers associated with the pronunciation specification, such as dialect, country, etc. Common values for English include, e.g., \"RP\" (for Received Pronunciation), \"US\", \"UK\", \"GA\" (General American), etc. A list (in the form of code) can be found [here](https://en.wiktionary.org/wiki/Module:accent_qualifier/data).\n* ``tags``: other labels or context information attached to the sense (free form)\n* ``sense``: optional sense specifier, e.g., \"noun\", \"verb\", \"anatomy sense\" (a string)\n\n### Translations\n\nTranslations, when captured, are stored under the ``translations`` key\nin the word's data. 
They are stored in a list of dictionaries, where\neach dictionary has the following keys (and possibly others):\n\n* ``lang``: the language that the translation is for (Wiktionary's 2 or 3-letter language code)\n* ``word``: the translation in the specified language\n* ``sense``: optional sense for which the translation is (this is a free-text string, and may not match any gloss exactly)\n* ``alt``: an optional alternative form of the translation (used when the translation is not a lemma form/page name; in those cases, ``word`` is the page name and ``alt`` contains the actual translated form)\n* ``roman``: an optional romanization of the translation\n* ``script``: optional name of the script that the translation is in\n* ``tags``: optional list of [gender/number specifiers](https://en.wiktionary.org/wiki/Module:gender_and_number)\n\n## Related packages\n\nThe [wiktfinnish](https://github.com/tatuylonen/wiktfinnish) package\ncan be used to interpret Finnish noun declensions and verb\nconjugations and for generating Finnish inflected word forms.\n\n## Known issues\n\n* Some information that is global for a page, such as category links for the page, may only be included in the last part-of-speech defined on the page or even the last language defined on the page. This should be fixed.\n\nThis software is still quite new and should still be considered a beta\nversion.\n\n## Dependencies\n\nThis package depends on the following other packages:\n\n* [lxml](https://lxml.de)\n* [wikitextparser](https://pypi.org/project/WikiTextParser/)\n\n## Contributing\n\nThe official repository of this project is on\n[github](https://github.com/tatuylonen/wiktextract).\n\nPlease email to ylo at clausal.com if you wish to contribute or have\npatches or suggestions.\n\n## License\n\nCopyright (c) 2018 [Tatu Ylonen](https://ylonen.org). This package is\nfree for both commercial and non-commercial use. It is licensed under\nthe MIT license. 
See the file\n[LICENSE](https://github.com/tatuylonen/wiktextract/blob/master/LICENSE)\nfor details.\n\nCredit and linking to the project's website and/or citing any future\npapers on the project would be highly appreciated.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/tatuylonen/wiktextract", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://ylonen.org", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "wiktextract", "package_url": "https://pypi.org/project/wiktextract/", "platform": "", "project_url": "https://pypi.org/project/wiktextract/", "project_urls": { "Download": "https://github.com/tatuylonen/wiktextract", "Homepage": "https://ylonen.org" }, "release_url": "https://pypi.org/project/wiktextract/0.2.0/", "requires_dist": [ "lxml", "wikitextparser" ], "requires_python": "", "summary": "Wiktionary dump file parser and multilingual data extractor", "version": "0.2.0" }, "last_serial": 4596598, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "951eaf2547d516da413615ce6e2c82ad", "sha256": "0efc1e790d84de3cff14725a7f0fa671a5d019b9ad2a7646318ef9a65e3eeb17" }, "downloads": -1, "filename": "wiktextract-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "951eaf2547d516da413615ce6e2c82ad", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 50203, "upload_time": "2018-10-30T12:27:26", "url": "https://files.pythonhosted.org/packages/1d/19/ee246d2029b719b2117a9a4369d4589b2b420d26a167fe895b82e2c3d1fd/wiktextract-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2f60cc73b0f1a29eac3a2f07573ec88d", "sha256": "ea5c618485aafd5932ca011b516c57ca225048a0fd258a567ce3899a2d2f152c" }, "downloads": -1, "filename": "wiktextract-0.1.2.tar.gz", "has_sig": false, "md5_digest": "2f60cc73b0f1a29eac3a2f07573ec88d", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 54362, "upload_time": "2018-10-30T12:27:27", "url": "https://files.pythonhosted.org/packages/57/df/5124de3f3f13e19405afeae8a29a0e5e9213e6d7972955e16c76e4d41829/wiktextract-0.1.2.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "93a7ce8b1c17d6648ca074369422edb8", "sha256": "f5901c8cf9204f691a0161437d3e593f28e0d18dd9d10c2fe10b6979ccd82f24" }, "downloads": -1, "filename": "wiktextract-0.1.5-py3-none-any.whl", "has_sig": false, "md5_digest": "93a7ce8b1c17d6648ca074369422edb8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 53025, "upload_time": "2018-11-01T23:05:31", "url": "https://files.pythonhosted.org/packages/3c/ce/1c175870c089b1cc6fb029504df28b0147831ae4cc92b78ab590dbf513a2/wiktextract-0.1.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "29a8de61b73fbadeccba9823647aa4d3", "sha256": "ca388794ec2a51cb772733caafcada0c05259bade36ff693ac1e5338b75ffffd" }, "downloads": -1, "filename": "wiktextract-0.1.5.tar.gz", "has_sig": false, "md5_digest": "29a8de61b73fbadeccba9823647aa4d3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 57924, "upload_time": "2018-11-01T23:05:32", "url": "https://files.pythonhosted.org/packages/b6/2b/f7cbf1aa9a8cc71989a55d4d637a22948d40a67637f12917d58a52c118c1/wiktextract-0.1.5.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "c38b7b70e1a96ff02f959d36253f75c0", "sha256": "670c929d6af7e2ea5c1696b49f37e9cea8973e0baa1ecb92088bdb3a2c9a7a8d" }, "downloads": -1, "filename": "wiktextract-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "c38b7b70e1a96ff02f959d36253f75c0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 54201, "upload_time": "2018-12-13T19:55:49", "url": "https://files.pythonhosted.org/packages/7f/d3/df57c0eec94287e07cedfb659decc761ef4b91667eda4cd52690db5021fe/wiktextract-0.2.0-py3-none-any.whl" }, { "comment_text": "", "digests": { 
"md5": "04097185f09b68a5281b2c2f43c70500", "sha256": "f2c7b41d6a972eec02de591a91f8ec68908a57d0bd5589d5f8a87ed8b826d20b" }, "downloads": -1, "filename": "wiktextract-0.2.0.tar.gz", "has_sig": false, "md5_digest": "04097185f09b68a5281b2c2f43c70500", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58351, "upload_time": "2018-12-13T19:55:51", "url": "https://files.pythonhosted.org/packages/5d/2a/d0cc3b280430b6abec6f854c267d5c8911de74b149466f832fa8fe3d1ab6/wiktextract-0.2.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c38b7b70e1a96ff02f959d36253f75c0", "sha256": "670c929d6af7e2ea5c1696b49f37e9cea8973e0baa1ecb92088bdb3a2c9a7a8d" }, "downloads": -1, "filename": "wiktextract-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "c38b7b70e1a96ff02f959d36253f75c0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 54201, "upload_time": "2018-12-13T19:55:49", "url": "https://files.pythonhosted.org/packages/7f/d3/df57c0eec94287e07cedfb659decc761ef4b91667eda4cd52690db5021fe/wiktextract-0.2.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "04097185f09b68a5281b2c2f43c70500", "sha256": "f2c7b41d6a972eec02de591a91f8ec68908a57d0bd5589d5f8a87ed8b826d20b" }, "downloads": -1, "filename": "wiktextract-0.2.0.tar.gz", "has_sig": false, "md5_digest": "04097185f09b68a5281b2c2f43c70500", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58351, "upload_time": "2018-12-13T19:55:51", "url": "https://files.pythonhosted.org/packages/5d/2a/d0cc3b280430b6abec6f854c267d5c8911de74b149466f832fa8fe3d1ab6/wiktextract-0.2.0.tar.gz" } ] }