{ "info": { "author": "Markus Konrad", "author_email": "markus.konrad@wzb.eu", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Utilities" ], "description": "# GermaLemma\n\nJanuary 2019, Markus Konrad / [Berlin Social Science Center](https://www.wzb.eu/en)\n\n## A lemmatizer for German language text\n\nGermalemma lemmatizes Part-of-Speech-tagged German language words. To do so, it combines a large lemma dictionary (an excerpt of the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html)), functions from the CLiPS \"Pattern\" package, and an algorithm to split composita.\n\n## Installation\n\n### Easy option: Installing from PyPI via `pip`\n\nYou can install the package from [PyPI](https://pypi.org/project/germalemma/) via `pip`:\n\n```\npip install -U germalemma\n```\n\n### Downloading and installing from source\n\nIn order to use GermaLemma, you will need to install some additional packages (see *Requirements* section below) and then download the [TIGER corpus from the University of Stuttgart](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html). 
You will need to use the CONLL09 format, *not* the XML format.\nThe corpus is free to use for non-commercial purposes (see [License Agreement](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/license/htmlicense.html)).\n\nThen, you should convert the corpus into pickle format for faster loading by executing *germalemma.py* and passing the path to the corpus file in CONLL09 format:\n\n```\npython germalemma.py tiger_release_[...].conll09\n```\n\nThis will place a `lemmata.pickle` file in the `data` directory, from which it is then loaded automatically.\n\n## Part-of-Speech (POS) Tagging\n\nYou will need to apply [Part-of-Speech (POS) tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging) to your text before you can lemmatize its words. See [this blog post](https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/) on how to do that.\n\n## Usage\n\nYou have set up GermaLemma to use the TIGER corpus (as explained above). You have tokenized your text (e.g. with NLTK). You have POS-tagged your tokens. Now you can use GermaLemma:\n\n```python\nfrom germalemma import GermaLemma\n\nlemmatizer = GermaLemma()\n\n# passing the word and the POS tag (\"N\" for noun)\nlemma = lemmatizer.find_lemma('Feinstaubbelastungen', 'N')\nprint(lemma)\n# -> lemma is \"Feinstaubbelastung\"\n```\n\n## Valid POS tags\n\nYou can pass POS tags from the [STTS tagset](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html), but only four groups of POS tags can be processed:\n\n* 'N...' (nouns)\n* 'V...' (verbs)\n* 'ADJ...' (adjectives)\n* 'ADV...' 
(adverbs)\n\n**All other POS tags will result in a `ValueError` so you should wrap the call to `find_lemma` in a *try-except block*.**\n\n## Accuracy\n\nGermaLemma's accuracy was evaluated using a sample of 696 POS tagged and manually lemmatized words from a sample of paragraphs from proceedings of the European Parliament, Goethe's \"Werther\", Kafka's \"Verwandlung\" and a news article from the website of the WZB (see samples in folder \"eval_texts\").\n\n**Under the assumption that the POS tag is correct** (only those words were selected), GermaLemma finds the correct lemma in 99.43% of the cases. For comparison, *Pattern* achieved 95.11% for the same sample.\n\n## Requirements\n\n* Python 3.x (Python 2 is not supported any more!)\n* required package [*Pyphen*](http://pyphen.org/)\n* optional package [*Pattern*](http://www.clips.ua.ac.be/pattern) (This package is optional but highly recommended as it boosts the lemmatizer's accuracy.)\n\n## License\n\nApache License 2.0. See *LICENSE* file.\n\nThe TIGER corpus is **not** part of this repository and has to be downloaded separately under separate license conditions.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/WZBSocialScienceCenter/germalemma", "keywords": "text lemmatization normalization textmining textanalysis mining preprocessing", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "germalemma", "package_url": "https://pypi.org/project/germalemma/", "platform": "", "project_url": "https://pypi.org/project/germalemma/", "project_urls": { "Homepage": "https://github.com/WZBSocialScienceCenter/germalemma", "Source": "https://github.com/WZBSocialScienceCenter/germalemma", "Tracker": "https://github.com/WZBSocialScienceCenter/germalemma/issues" }, "release_url": "https://pypi.org/project/germalemma/0.1.1/", "requires_dist": [ "Pattern (>=3.6)", 
"Pyphen (>=0.9.5)" ], "requires_python": ">=3.4", "summary": "A lemmatizer for German language text.", "version": "0.1.1" }, "last_serial": 4759459, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "65aa473976fefa70d9af8377b4fb89ae", "sha256": "46d5dd5ad9c82d8790c94e166be7e5ba49a70835072458f4bd96991774981a37" }, "downloads": -1, "filename": "germalemma-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "65aa473976fefa70d9af8377b4fb89ae", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.4", "size": 2249676, "upload_time": "2019-01-30T11:47:51", "url": "https://files.pythonhosted.org/packages/72/8d/c25a41774eb0102d6913f4e268fe75d1797097326435807ab730134d2c91/germalemma-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4d64b41b26472f5023c1571fe5389388", "sha256": "4ead9b29c31fc46f304479396854cf348caf5898da68c06366c9d8c8fb25dd51" }, "downloads": -1, "filename": "germalemma-0.1.0.tar.gz", "has_sig": false, "md5_digest": "4d64b41b26472f5023c1571fe5389388", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.4", "size": 2249744, "upload_time": "2019-01-30T11:47:55", "url": "https://files.pythonhosted.org/packages/b4/eb/1f96b4f785497b55e44adf3e0b9de42952185652212566d6b451c1f0edd2/germalemma-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "11a3b3c9018504ef57c884c8a1e8aac3", "sha256": "b376209b09eba5657feb47a8d4e90bdff5bec21dda96c2634f46b612dc5a5424" }, "downloads": -1, "filename": "germalemma-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "11a3b3c9018504ef57c884c8a1e8aac3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.4", "size": 2249721, "upload_time": "2019-01-30T11:59:47", "url": "https://files.pythonhosted.org/packages/ff/f9/9fb28336e480b0e3744a8633813f9e1bc3f49a4eb3d7f6ad23e923a5a5b4/germalemma-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "51ab3910c534a6b00d9ee74b9301bf22", 
"sha256": "e3a745ec169004a3d8b537b7db2391887fa98e6ecf014536d8b96fe12b08e206" }, "downloads": -1, "filename": "germalemma-0.1.1.tar.gz", "has_sig": false, "md5_digest": "51ab3910c534a6b00d9ee74b9301bf22", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.4", "size": 2249870, "upload_time": "2019-01-30T11:59:50", "url": "https://files.pythonhosted.org/packages/93/07/f8329dbd98a961b10d22f479be81a46774e12d0eff859a33f98f65100bdc/germalemma-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "11a3b3c9018504ef57c884c8a1e8aac3", "sha256": "b376209b09eba5657feb47a8d4e90bdff5bec21dda96c2634f46b612dc5a5424" }, "downloads": -1, "filename": "germalemma-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "11a3b3c9018504ef57c884c8a1e8aac3", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.4", "size": 2249721, "upload_time": "2019-01-30T11:59:47", "url": "https://files.pythonhosted.org/packages/ff/f9/9fb28336e480b0e3744a8633813f9e1bc3f49a4eb3d7f6ad23e923a5a5b4/germalemma-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "51ab3910c534a6b00d9ee74b9301bf22", "sha256": "e3a745ec169004a3d8b537b7db2391887fa98e6ecf014536d8b96fe12b08e206" }, "downloads": -1, "filename": "germalemma-0.1.1.tar.gz", "has_sig": false, "md5_digest": "51ab3910c534a6b00d9ee74b9301bf22", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.4", "size": 2249870, "upload_time": "2019-01-30T11:59:50", "url": "https://files.pythonhosted.org/packages/93/07/f8329dbd98a961b10d22f479be81a46774e12d0eff859a33f98f65100bdc/germalemma-0.1.1.tar.gz" } ] }