{ "info": { "author": "Maciej Gawinecki", "author_email": "mgawinecki@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: Information Technology", "Intended Audience :: Science/Research", "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3 :: Only", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Scientific/Engineering :: Human Machine Interfaces", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing", "Topic :: Text Processing :: General", "Topic :: Text Processing :: Linguistic" ], "description": "Stempel Stemmer\n===============\n\n.. image:: https://badge.fury.io/py/pystempel.svg\n :target: https://badge.fury.io/py/pystempel\n\nPython port of Stempel, an algorithmic stemmer for Polish language, originally written in Java.\n\nThe original stemmer has been implemented as part of `Egothor Project`_, taken virtually unchanged to\n`Stempel Stemmer Java library`_ by Andrzej Bia\u0142ecki and next included as part of `Apache Lucene`_,\na free and open-source search engine library. It is also used by `Elastic Search`_ search engine.\n\n.. _Egothor Project: https://www.egothor.org/product/egothor2/\n.. _Stempel Stemmer Java library: http://www.getopt.org/stempel/index.html\n.. _Apache Lucene: https://lucene.apache.org/core/3_1_0/api/contrib-stempel/index.html\n.. _Elastic Search: https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-stempel.html\n\nThis package includes also high-quality stemming tables for Polish: original one pretrained by\nAndrzej Bia\u0142ecki on 20,000 training sets, and new one, pretrained on 259,080 training sets\nfrom Polimorf dictionary by me.\n\nThe port does not include code for compiling stemming tables.\n\n.. _sjp.pl: https://sjp.pl/slownik/en/\n\n\nHow to use\n----------\n\nInstall in your local environment:\n\n.. code:: console\n\n pip install pystempel\n\nUse in your code:\n\n.. code:: python\n\n >>> from stempel import StempelStemmer\n\nChoose original (called default) version of a stemmer:\n\n.. code:: python\n\n >>> stemmer = StempelStemmer.default()\n\nor a version with new stemming table pretrained on training sets from Polimorf dictionary:\n\n.. code:: python\n\n >>> stemmer = StempelStemmer.polimorf()\n\nStem:\n\n.. code:: python\n\n >>> for word in ['ksi\u0105\u017cka', 'ksi\u0105\u017cki', 'ksi\u0105\u017ckami', 'ksi\u0105\u017ckowa', 'ksi\u0105\u017ckowymi']:\n ... print(stemmer.stem(word))\n ...\n ksi\u0105\u017cek\n ksi\u0105\u017cek\n ksi\u0105\u017cek\n ksi\u0105\u017ckowy\n ksi\u0105\u017ckowy\n\n\nChoosing stemming table\n-----------------------\n\nPerformance between original (default) and new stemming table (Polimorf-based) varies significantly.\nThe stemmer for the default stemming table is *understemming*, i.e., for multiple forms of the\nsame lemma provides different stems more often (63%) than when using Polimorf-based stemming table\n(13%). However, the file footprint of the latter is bigger (2.2MB vs 0.3MB). Also loading takes\nlonger (7.5 seconds vs. 1.3 seconds), though this happens only once, when a stemmer is created. Also,\nfor original stemming table, the stemmer stems slightly faster: ~60000 vs ~51000 words per second.\nSee `Evaluation Jupyter Notebook`_ for the detailed evaluation results.\n\n.. _Evaluation Jupyter Notebook: http://htmlpreview.github.io/?https://github.com/dzieciou/pystempel/blob/master/Evaluation.html\n\nNote also, that the licensing schema of both stemming tables differs, and hence licensing of\ndata generated with each one. See \"Licensing\" section for the details.\n\n\n\nChoosing between port and wrapper\n---------------------------------\n\nIf you work on an NLP project in Python you can choose between Python port and Python wrapper.\nPython port is what pystempel tries to achieve: translation from Java implementation to Python.\nPython wrapper is what I used in `tests`_: Python functions to call the original Java implementation of\nstemmer. You can find more about wrappers and ports in `Stackoverflow comparision post`_. Here, I\ncompare both approaches to help you decide:\n\n* **Same accuracy**. I have verified Python port by comparing its output\n with output of original Java implementation for 331224 words from Free Polish dictionary\n (`sjp.pl`_) and for 100% of words it returns same output.\n* **Similar performance**. For mentioned dataset both stemmer versions achieved comparable performance.\n Python port completed stemming in 4.4 seconds, while Python wrapper -- in 5 seconds (Intel Core\n i5-6000 3.30 GHz, 16GB RAM, Windows 10, OpenJDK)\n* **Different setup**. Python wrapper requires additionally installation of Cython and pyjnius.\n Python wrapper will make also `debugging harder`_ (switching between two programming languages).\n\n.. _Stackoverflow comparision post: https://stackoverflow.com/questions/10113218/how-to-decide-when-to-wrap-port-write-from-scratch\n.. _debugging harder: https://stackoverflow.com/questions/6970359/find-an-efficient-way-to-integrate-different-language-libraries-into-one-project\n.. _tests: tests/\n\nDevelopment setup\n-----------------\n\nTo setup environment for development you will need `Anaconda`_ installed.\n\n.. _Anaconda: https://anaconda.org/\n\n.. code:: console\n\n conda env create --file environment.yml\n conda activate pystempel-env\n\nTo run tests:\n\n.. code:: console\n\n curl https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-stempel/8.1.1/lucene-analyzers-stempel-8.1.1.jar > stempel-8.1.1.jar\n python -m pytest ./\n\nTo run benchmark:\n\n.. code:: console\n\n set PYTHONPATH=%PYTHONPATH%;%cd%\n python tests\\test_benchmark.py\n\nLicensing\n---------\n\n* **Code**: Most of the code is covered by `Egothor`_ Open Source License, an Apache-style license.\n The rest of the code is covered by the `Apache License 2.0`_. This should be clear from a preamble\n of each file.\n\n* **Data**:\n\n * The original pretrained stemming table is covered by `Apache License 2.0`_.\n\n * The new pretrained stemming table is covered by `2-Clause BSD License`_, similarly to the\n `Polimorf dictionary` it has been derived from. The copyright owner of both the stemming table\n and the dictionary is `Institute of Computer Science at Polish Academy of Science`_ (IPI PAN).\n\n * Polish dictionary used by the unit tests comes from `sjp.pl`_ and is covered by\n `Apache License 2.0`_ as well.\n\n.. _Egothor: https://www.egothor.org/product/egothor2/\n.. _Apache License 2.0: https://www.apache.org/licenses/LICENSE-2.0\n.. _Polimorf dictionary: dicts/\n.. _2-Clause BSD License: data/polimorf/LICENSE.txt\n.. _Institute of Computer Science at Polish Academy of Science: https://ipipan.waw.pl/en/\n\n\n\nAlternatives\n------------\n\n* `Estem`_ is Erlang wrapper (not port) for Stempel stemmer.\n* `pl_stemmer`_ is a Python stemmer based on Porter's Algorithm.\n* `polish-stem`_ is a Python stemmer using Finite State Transducers.\n\n\n.. _Estem: https://github.com/arcusfelis/estem\n.. _pl_stemmer: https://github.com/Tutanchamon/pl_stemmer\n.. _polish-stem: https://github.com/eugeniashurko/polish-stem\n\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dzieciou/pystempel", "keywords": "NLP,natural language processing,computational linguistics,stemming,linguistics,language,natural language,text analytics", "license": "See documentation", "maintainer": "", "maintainer_email": "", "name": "pystempel", "package_url": "https://pypi.org/project/pystempel/", "platform": "", "project_url": "https://pypi.org/project/pystempel/", "project_urls": { "Homepage": "https://github.com/dzieciou/pystempel", "Source": "https://github.com/dzieciou/pystempel" }, "release_url": "https://pypi.org/project/pystempel/1.1.0/", "requires_dist": [ "sortedcontainers", "tqdm" ], "requires_python": ">=3.5", "summary": "Polish stemmer.", "version": "1.1.0" }, "last_serial": 5976697, "releases": { "1.0.1": [ { "comment_text": "", "digests": { "md5": "d6651cd0a228cc017d225420be85b71e", "sha256": "787199b74ef4e9353aa538c66f5237286d27eca2c111ae7e36b04c90f463b428" }, "downloads": -1, "filename": "pystempel-1.0.1.tar.gz", "has_sig": false, "md5_digest": "d6651cd0a228cc017d225420be85b71e", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 428933, "upload_time": "2019-07-23T16:31:10", "url": "https://files.pythonhosted.org/packages/be/cd/fe9ab8760ff402b923841c5843f742052666508a5b84c4fc6a2e0b67dfdd/pystempel-1.0.1.tar.gz" } ], "1.0.post1": [ { "comment_text": "", "digests": { "md5": "833c7f26959a125d4a7a3489455e57fe", "sha256": "17f572d3b75459f7b5e2d007d1816e09e2493c137b8b94a9a196662c9936f7de" }, "downloads": -1, "filename": "pystempel-1.0.post1.tar.gz", "has_sig": false, "md5_digest": "833c7f26959a125d4a7a3489455e57fe", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 428953, "upload_time": "2019-07-23T16:27:48", "url": "https://files.pythonhosted.org/packages/7a/e6/8595b6b2706e06e01be7540eb88ff59852a55571ac9290ff1a4ac807a552/pystempel-1.0.post1.tar.gz" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "a8882f56156266ec3b3dc8b6cdb5bff5", "sha256": "d5817b890b7221913bee5a3a73772d46bf994c5524b2d35be8a55428dfeabf89" }, "downloads": -1, "filename": "pystempel-1.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "a8882f56156266ec3b3dc8b6cdb5bff5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 2715960, "upload_time": "2019-10-15T12:34:17", "url": "https://files.pythonhosted.org/packages/0a/c5/292b9423cfd103e4a224ff76f81633cc19a476160eb077b27b4bb98cfd75/pystempel-1.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e23c71686a6cf2c1ed4257fe509a9576", "sha256": "98b30e9f702647c6788361b266b2df46c72a2c9ab899a8412fb028fbcc2046fa" }, "downloads": -1, "filename": "pystempel-1.1.0.tar.gz", "has_sig": false, "md5_digest": "e23c71686a6cf2c1ed4257fe509a9576", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 14647, "upload_time": "2019-10-15T12:34:20", "url": "https://files.pythonhosted.org/packages/cb/1a/ef339caef849b3c543211274d89fb218aee42c30b3c2e39eed0ddb330c44/pystempel-1.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a8882f56156266ec3b3dc8b6cdb5bff5", "sha256": "d5817b890b7221913bee5a3a73772d46bf994c5524b2d35be8a55428dfeabf89" }, "downloads": -1, "filename": "pystempel-1.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "a8882f56156266ec3b3dc8b6cdb5bff5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 2715960, "upload_time": "2019-10-15T12:34:17", "url": "https://files.pythonhosted.org/packages/0a/c5/292b9423cfd103e4a224ff76f81633cc19a476160eb077b27b4bb98cfd75/pystempel-1.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e23c71686a6cf2c1ed4257fe509a9576", "sha256": "98b30e9f702647c6788361b266b2df46c72a2c9ab899a8412fb028fbcc2046fa" }, "downloads": -1, "filename": "pystempel-1.1.0.tar.gz", "has_sig": false, "md5_digest": "e23c71686a6cf2c1ed4257fe509a9576", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 14647, "upload_time": "2019-10-15T12:34:20", "url": "https://files.pythonhosted.org/packages/cb/1a/ef339caef849b3c543211274d89fb218aee42c30b3c2e39eed0ddb330c44/pystempel-1.1.0.tar.gz" } ] }