{ "info": { "author": "Gregor Weichbrodt", "author_email": "gregorweichbrodt@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Natural Language :: German", "Operating System :: OS Independent", "Programming Language :: Python :: 3.7", "Topic :: Text Processing :: Markup :: XML" ], "description": "# wiktionary_de_parser\n`wiktionary_de_parser` is a Python module to extract data from German Wiktionary XML files. It allows you to add your own extraction methods.\n\n## Requirements\n- Python 3.7 (other 3.x versions might work but are untested)\n\n## Features\n- comes with preset extraction methods for:\n  - flexion tables, genus, IPA, language, lemma, part of speech (basic), syllables, raw Wikitext\n- allows you to add your own extraction methods (pass them as an argument)\n- data values are normalized and cleaned of obsolete Wikitext markup\n- yields per section, not per page (a word can have multiple meanings, which is why some Wiktionary pages have multiple 'sections')\n\n## Usage\n1. Install via `pip3 install wiktionary_de_parser`.\n2. 
Import `wiktionary_de_parser` like this:\n\n```python\nfrom bz2file import BZ2File\nfrom wiktionary_de_parser import Parser\n\nbzfile_path = 'C:/Users/Gregor/Downloads/dewiktionary-latest-pages-articles-multistream.xml.bz2'\nbz = BZ2File(bzfile_path)\n\nfor record in Parser(bz):\n    # skip sections that are not German entries\n    if 'language' not in record or record['language'] != 'Deutsch':\n        continue\n    # do stuff with 'record'\n```\nNote: this example uses [BZ2File](https://pypi.org/project/bz2file/) to read a compressed Wiktionary dump file.\nWiktionary dump files are available from [dumps.wikimedia.org](https://dumps.wikimedia.org/dewiktionary/).\n\n### Adding new extraction methods\nAll extraction methods must return a `dict` and accept the following arguments:\n- `title` (_string_): The title of the current Wiktionary page\n- `text` (_string_): The [Wikitext](https://en.wikipedia.org/wiki/Wiki#Editing) of the current word entry/section\n- `current_record` (_dict_): A dictionary with all values of the current iteration (e.g. `current_record['language']`)\n\n```python\n# Create a new extraction method\ndef my_method(title, text, current_record):\n    # do stuff; the text length is only a placeholder value\n    my_data = len(text)\n    return {'my_field': my_data}\n\n# Pass a list with all extraction methods to the class constructor:\nfor record in Parser(bz, custom_methods=[my_method]):\n    print(record['my_field'])\n```\n\n## Sample data\n```python\n{'flexion': {'Akkusativ Plural': 'Trittbrettfahrer',\n 'Akkusativ Singular': 'Trittbrettfahrer',\n 'Dativ Plural': 'Trittbrettfahrern',\n 'Dativ Singular': 'Trittbrettfahrer',\n 'Genitiv Plural': 'Trittbrettfahrer',\n 'Genitiv Singular': 'Trittbrettfahrers',\n 'Genus': 'm',\n 'Nominativ Plural': 'Trittbrettfahrer',\n 'Nominativ Singular': 'Trittbrettfahrer'},\n 'inflected': False,\n 'ipa': ['\u02c8t\u0281\u026atb\u0281\u025bt\u02ccfa\u02d0\u0281\u0250'],\n 'language': 'Deutsch',\n 'lemma': 'Trittbrettfahrer',\n 'pos': {'Substantiv': []},\n 'syllables': ['Tritt', 'brett', 'fah', 'rer'],\n 'title': 'Trittbrettfahrer',\n 'wikitext': '=== 
{{Wortart|Substantiv|Deutsch}}, {{m}} ===\\n'\n '\\n'\n '{{Deutsch Substantiv \u00dcbersicht\\n'\n '|Genus=m\\n'\n '|Nominativ Singular=Trittbrettfahrer\\n'\n '|Nominativ Plural=Trittbrettfahrer\\n'\n '|Genitiv Singular=Trittbrettfahrers\\n'\n '|Genitiv Plural=Trittbrettfahrer\\n'\n '|Dativ Singular=Trittbrettfahrer\\n'\n '|Dativ Plural=Trittbrettfahrern\\n'\n '|Akkusativ Singular=Trittbrettfahrer\\n'\n '|Akkusativ Plural=Trittbrettfahrer\\n'\n '}}\\n'\n '\\n'\n '{{Worttrennung}}\\n'\n ':Tritt\u00b7brett\u00b7fah\u00b7rer, {{Pl.}} Tritt\u00b7brett\u00b7fah\u00b7rer\\n'\n '\\n'\n '{{Aussprache}}\\n'\n ':{{IPA}} {{Lautschrift|\u02c8t\u0281\u026atb\u0281\u025bt\u02ccfa\u02d0\u0281\u0250}}\\n'\n ':{{H\u00f6rbeispiele}} {{Audio|}}\\n'\n '\\n'\n '{{Bedeutungen}}\\n'\n ':[1] Person, die ohne [[Anstrengung]] an Vorteilen teilhaben '\n 'will\\n'\n '\\n'\n '{{Herkunft}}\\n'\n ':[[Determinativkompositum]] aus den Substantiven '\n \"''[[Trittbrett]]'' und ''[[Fahrer]]''\\n\"\n '\\n'\n '{{Weibliche Wortformen}}\\n'\n ':[1] [[Trittbrettfahrerin]]\\n'\n '\\n'\n '{{Beispiele}}\\n'\n ':[1] \u201eBleibt schlie\u00dflich noch das Problem der '\n \"''Trittbrettfahrer,'' die sich ohne Versicherung aus \"\n 'Nachl\u00e4ssigkeit in das soziale Netz abgleiten '\n 'lassen.\u201c{{Internetquelle|url=http://books.google.se/books?id=VjLq84xNpfMC&pg=PA446&dq=trittbrettfahrer&hl=de&sa=X&ei=8AztU4aVJYq_ygOd1oKIDA&ved=0CEEQ6AEwBjgK#v=onepage&q=trittbrettfahrer&f=false|titel=\u00d6ffentliche '\n 'Finanzen in der Demokratie: Eine Einf\u00fchrung, Charles B. 
'\n 'Blankart|zugriff=2014-08-14}}\\n'\n '\\n'\n '{{Wortbildungen}}\\n'\n ':[1] [[Trittbrettfahrer-Problem]]\\n'\n '\\n'\n '==== {{\u00dcbersetzungen}} ====\\n'\n '{{\u00dc-Tabelle|\u00dc-links=\\n'\n '*{{en}}: [1] {{\u00dc|en|free rider}}\\n'\n '*{{fi}}: [1] {{\u00dc|fi|siipeilij\u00e4}}, {{\u00dc|fi|vapaamatkustaja}}\\n'\n '*{{fr}}: [1] {{\u00dc|fr|profiteur}}\\n'\n '|\u00dc-rechts=\\n'\n '*{{it}}: [1] {{\u00dc|it|scroccone}} {{m}}\\n'\n '*{{es}}: [1] {{\u00dc|es|}}\\n'\n '}}\\n'\n '\\n'\n '{{Referenzen}}\\n'\n ':[1] {{Wikipedia|Trittbrettfahrer}}\\n'\n ':[*] {{Ref-DWDS|Trittbrettfahrer}}\\n'\n ':[*] {{Ref-Canoo|Trittbrettfahrer}}\\n'\n ':[1] {{Ref-UniLeipzig|Trittbrettfahrer}}\\n'\n ':[1] {{Ref-FreeDictionary|Trittbrettfahrer}}\\n'\n '\\n'\n '{{Quellen}}'}\n```\n\n## Vendor packages\n- [lxml](https://lxml.de)\n- [pyphen](https://pyphen.org)\n\n## License\n[MIT](https://github.com/gambolputty/wiktionary_de_parser/blob/master/LICENSE.md) \u00a9 Gregor Weichbrodt\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gambolputty/wiktionary_de_parser", "keywords": "wiktionary xml parser data-extraction german nlp", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "wiktionary-de-parser", "package_url": "https://pypi.org/project/wiktionary-de-parser/", "platform": "", "project_url": "https://pypi.org/project/wiktionary-de-parser/", "project_urls": { "Bug Reports": "https://github.com/gambolputty/wiktionary_de_parser/issues", "Homepage": "https://github.com/gambolputty/wiktionary_de_parser", "Source": "https://github.com/gambolputty/wiktionary_de_parser" }, "release_url": "https://pypi.org/project/wiktionary-de-parser/0.7.7/", "requires_dist": [ "lxml", "pyphen" ], "requires_python": ">=3.7", "summary": "Extracts data from German Wiktionary dump files. 
Allows you to add your own extraction methods \ud83d\ude80", "version": "0.7.7" }, "last_serial": 5539983, "releases": { "0.7.1": [ { "comment_text": "", "digests": { "md5": "7f9c1939f04137851c4ee7dae75f9e32", "sha256": "829fa737f38660d2ae39610e2e3e5fef2693f567ca2dda598ceff26b815e4a15" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.1-py3-none-any.whl", "has_sig": false, "md5_digest": "7f9c1939f04137851c4ee7dae75f9e32", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 12002, "upload_time": "2019-05-27T01:19:03", "url": "https://files.pythonhosted.org/packages/19/fa/0ee025622d469813f6353b387b8f90ba972f58f0561d6fcc1c52beaf5559/wiktionary_de_parser-0.7.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f568fe3a093c0b0b1b033547affe9839", "sha256": "caccc647221973a9d86ab46624387ceb66e24d7521fd9acd6478f6baed5a27e9" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.1.tar.gz", "has_sig": false, "md5_digest": "f568fe3a093c0b0b1b033547affe9839", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 11966, "upload_time": "2019-05-27T01:19:04", "url": "https://files.pythonhosted.org/packages/8c/44/400f6aae2ba77ada441c0c23eaf4343fe4b24765feeda051c56ee314803d/wiktionary_de_parser-0.7.1.tar.gz" } ], "0.7.2": [ { "comment_text": "", "digests": { "md5": "3f5064c8fe4da2ae653023267928484d", "sha256": "3adfeb4246b5ba0a24d28acd50e66511a59e9c4dcd86a5049b78b01850dba161" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.2-py3-none-any.whl", "has_sig": false, "md5_digest": "3f5064c8fe4da2ae653023267928484d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 12153, "upload_time": "2019-05-29T21:43:54", "url": "https://files.pythonhosted.org/packages/cf/e1/6b66e55c8f9ac8ee15197696a090c11ff7badf94c0d76758a0c91a40d958/wiktionary_de_parser-0.7.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": 
"3a1209cb2585b090f0e8bf9fcefcdf5e", "sha256": "f4bf76c17f7eb659992dc18926805f9b9e356a5034dc0fdc46c5ddc464f3c265" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.2.tar.gz", "has_sig": false, "md5_digest": "3a1209cb2585b090f0e8bf9fcefcdf5e", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 12326, "upload_time": "2019-05-29T21:43:55", "url": "https://files.pythonhosted.org/packages/fa/11/cf2817847e23dfd160136656ec105878e99ad42c1316652bc542bd589bad/wiktionary_de_parser-0.7.2.tar.gz" } ], "0.7.3": [ { "comment_text": "", "digests": { "md5": "9ca301715acd81e7519a6a961237c18e", "sha256": "78b6ac6e57da94f4748f220d0732e36675fdf80412f28f36c82587da64f39ed1" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.3-py3-none-any.whl", "has_sig": false, "md5_digest": "9ca301715acd81e7519a6a961237c18e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 12168, "upload_time": "2019-05-29T22:49:22", "url": "https://files.pythonhosted.org/packages/1f/aa/75b4db5487213457acb5032da4678c1f4ac2fa69b7ee58a9bdfb138c8388/wiktionary_de_parser-0.7.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b09241294e790bc0fff38bffa81f95e9", "sha256": "a170c3c9afe7f5b7cdfab897524906701f596517865e40b4ff0ac7d7db685fb8" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.3.tar.gz", "has_sig": false, "md5_digest": "b09241294e790bc0fff38bffa81f95e9", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 12371, "upload_time": "2019-05-29T22:49:24", "url": "https://files.pythonhosted.org/packages/eb/ec/b9924285cf1b41a57a97c96bfe0c86d3b521b50bbff6c8e5347be0bfa42c/wiktionary_de_parser-0.7.3.tar.gz" } ], "0.7.4": [ { "comment_text": "", "digests": { "md5": "84ae5cdb1a7cbac638402df8c044bcfc", "sha256": "61309afcc244ca877314bd40d450eadfadc69a493171a7a32740dbc8bd57d64c" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.4-py3-none-any.whl", "has_sig": false, 
"md5_digest": "84ae5cdb1a7cbac638402df8c044bcfc", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 13097, "upload_time": "2019-07-13T14:19:20", "url": "https://files.pythonhosted.org/packages/1a/4f/66f313e870d3d6588f890f9b75c55a331b17e9e52beae8d98f0430d53814/wiktionary_de_parser-0.7.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "83fcd6d041e19fbb6bf8343b7885306f", "sha256": "e04c8fe4a08c6a79ba2a1aff7163e05f34561f8ca23f3b8d0d3e178174e541be" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.4.tar.gz", "has_sig": false, "md5_digest": "83fcd6d041e19fbb6bf8343b7885306f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 13439, "upload_time": "2019-07-13T14:19:22", "url": "https://files.pythonhosted.org/packages/47/f0/9d12ac2e5003416b1250e87d6493a0810157ca917a1ded8e097573d91364/wiktionary_de_parser-0.7.4.tar.gz" } ], "0.7.5": [ { "comment_text": "", "digests": { "md5": "41385b6a2d1f84054c36d0f12a894fa5", "sha256": "53f87c41e5c2ef5ce60df54b114b6e19d6968d9a0e50d47f95489fded8c71224" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.5-py3-none-any.whl", "has_sig": false, "md5_digest": "41385b6a2d1f84054c36d0f12a894fa5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 13762, "upload_time": "2019-07-13T16:14:21", "url": "https://files.pythonhosted.org/packages/00/08/b8d7f772150e913e6eccc4afdce36827d33f26bffca14d62147a0f9f8407/wiktionary_de_parser-0.7.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b8cbdec8d3d838a869ca052a6091357c", "sha256": "601cce46944b01cd5300a19d37f80ea23156ca0662844453b5848ff4d22be529" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.5.tar.gz", "has_sig": false, "md5_digest": "b8cbdec8d3d838a869ca052a6091357c", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 14130, "upload_time": "2019-07-13T16:14:22", "url": 
"https://files.pythonhosted.org/packages/c1/72/b4d2ed544ca0cb7dd536d7f5159184ad79a52db9e47c1483fa49adee471e/wiktionary_de_parser-0.7.5.tar.gz" } ], "0.7.6": [ { "comment_text": "", "digests": { "md5": "0e4c7bce6f5e18d19955e3f4f10a67b1", "sha256": "1968e922197014eb2deffb517a20606a140cb5dc88ac64f242201df50bf4c9da" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.6-py3-none-any.whl", "has_sig": false, "md5_digest": "0e4c7bce6f5e18d19955e3f4f10a67b1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 13844, "upload_time": "2019-07-13T18:05:00", "url": "https://files.pythonhosted.org/packages/3b/d3/91a418ab9ab4660194b0a73d485aa46e2d7ae428e043374e8dabb02d9e12/wiktionary_de_parser-0.7.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "590d5fbb3a5d1ae18c20d6e3835cc520", "sha256": "644719c4ac55832f95edc363caec14a47dbda394659b82f36787898df7b4352b" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.6.tar.gz", "has_sig": false, "md5_digest": "590d5fbb3a5d1ae18c20d6e3835cc520", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 14240, "upload_time": "2019-07-13T18:05:02", "url": "https://files.pythonhosted.org/packages/02/79/b6dfa817088d8757d6dfd3a8e82710bcf2a795ef78b5a869974e0fc6f38f/wiktionary_de_parser-0.7.6.tar.gz" } ], "0.7.7": [ { "comment_text": "", "digests": { "md5": "d40dfd42a213db1015a68ad48dbcbb44", "sha256": "6f9f72a887211ec18e36544588b25b6a384a4d6bb327221db7c72708265cdf7a" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.7-py3-none-any.whl", "has_sig": false, "md5_digest": "d40dfd42a213db1015a68ad48dbcbb44", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 13839, "upload_time": "2019-07-16T11:26:46", "url": "https://files.pythonhosted.org/packages/cb/4b/e85125dff7eca579088e597244233c89dcd143ce84e1b4e3b9c0d583c8be/wiktionary_de_parser-0.7.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": 
"cf4a79615ae9565bf741493d9a1fdff5", "sha256": "3c2f138138e2b184eab4950ef873260a16d27bc513c65d41360804a7a3220a72" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.7.tar.gz", "has_sig": false, "md5_digest": "cf4a79615ae9565bf741493d9a1fdff5", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 14277, "upload_time": "2019-07-16T11:26:47", "url": "https://files.pythonhosted.org/packages/82/11/83e7c3987215a075851f9cdbf6ec232b340b219eb3f8c3f7c025a4f34b21/wiktionary_de_parser-0.7.7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d40dfd42a213db1015a68ad48dbcbb44", "sha256": "6f9f72a887211ec18e36544588b25b6a384a4d6bb327221db7c72708265cdf7a" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.7-py3-none-any.whl", "has_sig": false, "md5_digest": "d40dfd42a213db1015a68ad48dbcbb44", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.7", "size": 13839, "upload_time": "2019-07-16T11:26:46", "url": "https://files.pythonhosted.org/packages/cb/4b/e85125dff7eca579088e597244233c89dcd143ce84e1b4e3b9c0d583c8be/wiktionary_de_parser-0.7.7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cf4a79615ae9565bf741493d9a1fdff5", "sha256": "3c2f138138e2b184eab4950ef873260a16d27bc513c65d41360804a7a3220a72" }, "downloads": -1, "filename": "wiktionary_de_parser-0.7.7.tar.gz", "has_sig": false, "md5_digest": "cf4a79615ae9565bf741493d9a1fdff5", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 14277, "upload_time": "2019-07-16T11:26:47", "url": "https://files.pythonhosted.org/packages/82/11/83e7c3987215a075851f9cdbf6ec232b340b219eb3f8c3f7c025a4f34b21/wiktionary_de_parser-0.7.7.tar.gz" } ] }