{ "info": { "author": "Vitaliy Grabovets", "author_email": "v.grabovets@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "Multilingual Rapid Automatic Keyword Extraction (RAKE)\n======================================================\n\nFeatures\n--------\n- Automatic keyword extraction from text written in any language\n- No need to know language of text beforehand\n- No need to have list of stopwords\n- 26 languages are currently available, for the rest - stopwords are generated from provided text\n- Just configure rake, plug in text and get keywords (see implementation details)\n\nInstallation\n------------\n.. code-block:: bash\n\n pip install multi-rake\n\nExamples\n--------\nEnglish text, we don't specify explicitly language nor list of stopwords (built-in list is used).\n\n.. code-block:: python\n\n text_en = (\n 'Compatibility of systems of linear constraints over the set of '\n 'natural numbers. Criteria of compatibility of a system of linear '\n 'Diophantine equations, strict inequations, and nonstrict inequations '\n 'are considered. Upper bounds for components of a minimal set of '\n 'solutions and algorithms of construction of minimal generating sets '\n 'of solutions for all types of systems are given. These criteria and '\n 'the corresponding algorithms for constructing a minimal supporting '\n 'set of solutions can be used in solving all the considered types of '\n 'systems and systems of mixed types.'\n )\n\n rake = Rake()\n\n keywords = rake.apply(text_en)\n\n print(keywords[:10])\n\n # ('minimal generating sets', 8.666666666666666),\n # ('linear diophantine equations', 8.5),\n # ('minimal supporting set', 7.666666666666666),\n # ('minimal set', 4.666666666666666),\n # ('linear constraints', 4.5),\n # ('natural numbers', 4.0),\n # ('strict inequations', 4.0),\n # ('nonstrict inequations', 4.0),\n # ('upper bounds', 4.0),\n # ('mixed types', 3.666666666666667),\n\n\nText written in Esperanto (article about `liberalism `_).\nThere is no list of stopwords for this language, they will be generated from provided text.\n\n:code:`text` consists of three first paragraphs of introduction. :code:`text_for_stopwords` - all other text.\n\n.. code-block:: python\n\n text = (\n 'Liberalismo estas politika filozofio a\u016d mondrigardo konstruita en '\n 'ideoj de libereco kaj egaleco. Liberaluloj apogas lar\u011dan aron de '\n 'vidpunktoj depende de sia kompreno de tiuj principoj, sed \u011denerale '\n 'ili apogas ideojn kiel ekzemple liberaj kaj justaj elektoj, '\n 'civitanrajtoj, gazetara libereco, religia libereco, libera komerco, '\n 'kaj privata posedrajto. Liberalismo unue i\u011dis klara politika movado '\n 'dum la Klerismo, kiam \u011di i\u011dis populara inter filozofoj kaj '\n 'ekonomikistoj en la okcidenta mondo. Liberalismo malaprobis heredajn '\n 'privilegiojn, \u015dtatan religion, absolutan monarkion kaj la Didevena '\n 'Rajto de Re\u011doj. La filozofo John Locke de la 17-a jarcento ofte '\n 'estas meritigita pro fondado de liberalismo kiel klara filozofia '\n 'tradicio. Locke argumentis ke \u0109iu homo havas naturon rekte al vivo, '\n 'libereco kaj posedrajto kaj la\u016d la socia '\n 'kontrakto, registaroj ne rajtas malobservi tiujn rajtojn. '\n 'Liberaluloj kontra\u016dbatalis tradician konservativismon kaj ser\u0109is '\n 'anstata\u016digi absolutismon en registaroj per reprezenta demokratio kaj '\n 'la jura hegemonio.'\n )\n\n rake = Rake(max_words_unknown_lang=3)\n\n keywords = rake.apply(text, text_for_stopwords=other_text)\n\n print(keywords)\n\n # ('ser\u0109is anstata\u016digi absolutismon', 9.0) # sought to replace absolutism\n # ('filozofo john locke', 8.5), # philosopher John Locke\n # ('locke argumentis', 4.5) # Locke argues\n # ('justaj elektoj', 4.0), # fair elections\n # ('libera komerco', 4.0), # free trade\n # ('okcidenta mondo', 4.0), # western world\n # ('\u015dtatan religion', 4.0), # state religion\n # ('absolutan monarkion', 4.0), # absolute monarchy\n # ('didevena rajto', 4.0), # Dominican Rights\n # ('socia kontrakto', 4.0), # social contract\n # ('jura hegemonio', 4.0), # legal hegemony\n # ('mondrigardo konstruita', 4.0) # worldview built\n # ('vidpunktoj depende', 4.0), # views based\n # ('sia kompreno', 4.0), # their understanding\n # ('tiuj principoj', 4.0), # these principles\n # ('gazetara libereco', 3.5), # freedom of press\n # ('religia libereco', 3.5), # religious freedom\n # ('privata posedrajto', 3.5), # private property\n # ('libereco', 1.5), # liberty\n # ('posedrajto', 1.5)] # property\n\nSo, we are able to get decent result without explicit set of stopwords.\n\nUsage\n-----\nInitialize rake object\n\n.. code-block:: python\n\n from multi_rake import Rake\n\n rake = Rake(\n min_chars=3,\n max_words=3,\n min_freq=1,\n language_code=None, # 'en'\n stopwords=None, # {'and', 'of'}\n lang_detect_threshold=50,\n max_words_unknown_lang=2,\n generated_stopwords_percentile=80,\n generated_stopwords_max_len=3,\n generated_stopwords_min_freq=2,\n )\n\n**min_chars** - word is selected to be part of keyword if its length is >= min_chars. *Default 3*\n\n**max_words** - maximum number of words in phrase considered to be a keyword. *Default 3*\n\n**min_freq** - minimum number of occurences of a phrase to be considered a keyword. *Default 1*\n\n**language_code** - provide language code as string to use built-in set of stopwords. See list of available languages. If language is not specified algorithm will try to determine language with `cld2 `_ and use corresponding set of built-in stopwords. *Default None*\n\n**stopwords** - provide own collection of stopwords (preferably as set, lowercased). Overrides :code:`language_code` if it was specified. *Default None*\n\nKeep :code:`language_code` and :code:`stopwords` as :code:`None` and stopwords will be generated from provided text.\n\n**lang_detect_threshold** - threshold for probability of detected language in `cld2 `_ (0-100). *Default 50*\n\n**max_words_unknown_lang** - the same as :code:`max_words` but will be used if language is unknown and stopwords are generated from provided text. Usually the best result is obtained when specifically crafted set of stopwords is used, in case of its absence and usage of generated stopwords resulting keywords may not be as pretty and it may be good idea, for example, to produce 2-word keywords for unknown languages and 3-word keywords for languages with predefined sets of stopwords. *Default 2*\n\n**generated_stopwords_percentile** - to generate stopwords we create distribution of every word in text by frequency. Words above this percentile (0 - 100) will be considered candidates to become stopwords. *Default 80*\n\n**generated_stopwords_max_len** - maximum character length of generated stopwords. *Default 3*\n\n**generated_stopwords_min_freq** - minimum frequency of generated stopwords in the distribution. *Default 2*\n\n|\n\nApply rake object to text.\n\n.. code-block:: python\n\n keywords = rake.apply(\n text,\n text_for_stopwords=None,\n )\n\n**text** - string containing text from which keywords should be generated.\n\n**text_for_stopwords** - string containing text which will be used for stopwords generation alongside :code:`text`. For example, you have article with introduction and several subsections. You know that for your purposes keywords from introduction will suffice, you don't know language of text nor you have list of stopwords. So stopwords can be generated from text itself and the more text you have, the better. Than you may specify :code:`text=introduction, text_for_stopwords=rest_of_your_text`.\n\nImplementation Details\n----------------------\nRAKE algorithm works as described in Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons\n\nThis implementation is different from others by its multilingual support.\nBasically you may provide text without knowing its language (it should be written with cyrillic or latin alphabets),\nwithout explicit list of stopwords and get decent result.\nThough the best result is achieved with thoroughly constructed list of stopwords.\n\nWhat is happening under the hood:\n\n1) if stopwords are specified, then they will be used\n2) if language is specified, then built-in stopwords for this language will be used, if there are no built-in stopwords --> 4\n3) if language is not specified, then `cld2 `_ will try to determine language --> 2\n4) stopwords are generated from :code:`text` and :code:`text_for_stopwords`\n\nWe generate stopwords by creating frequency distribution of words in text and filtering them with parameters :code:`generated_stopwords_percentile`, :code:`generated_stopwords_max_len`, :code:`generated_stopwords_min_freq`. We won't be able to generate them perfectly but it is rather easy to find articles and prepositions, because usually they consist of 3-4 characters and appear frequently. These stopwords, coupled with punctuation delimiters, enable us to get decent results for languages we don't understand.\n\nList of Currently Available Languages\n-------------------------------------\nDuring RAKE initialization only language code should be used.\n\n- bg - Bulgarian\n- cs - Czech\n- da - Danish\n- de - German\n- el - Greek\n- en - English\n- es - Spanish\n- fi - Finnish\n- fr - French\n- ga - Irish\n- hr - Croatian\n- hu - Hungarian\n- id - Indonesian\n- it - Italian\n- lt - Lithuanian\n- lv - latvian\n- nl - Dutch\n- no - Norwegian\n- pl - Polish\n- pt - Portuguese\n- ro - Romanian\n- ru - Russian\n- sk - Slovak\n- sv - Swedish\n- tr - Turkish\n- uk - Ukrainian\n\nDevelopment\n----------------------------\nRepository has configured linter, tests and coverage.\n\nCreate new virtual environment in order to use it.\n\n.. code-block:: bash\n\n virtualenv env\n source env/bin/activate\n\n make install-dev # install dependencies\n\n make lint # run linter\n\n make test # run tests and coverage\n\nReferences\n----------\nRAKE algorithm: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons\n\nAs a basis RAKE implementation by `fabianvf `_ was used.\n\nStopwords: `trec-kba `_, `Ranks NL `_\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/vgrabovets/multi_rake", "keywords": "nlp,keywords,rake", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "multi-rake", "package_url": "https://pypi.org/project/multi-rake/", "platform": "", "project_url": "https://pypi.org/project/multi-rake/", "project_urls": { "Homepage": "https://github.com/vgrabovets/multi_rake" }, "release_url": "https://pypi.org/project/multi-rake/0.0.1/", "requires_dist": [ "cld2-cffi (>=0.1.4)", "numpy (>=1.14.4)", "pyrsistent (>=0.14.2)", "regex (>=2018.6.6)" ], "requires_python": ">=3.5.0", "summary": "Multilingual Rapid Keyword Extraction (RAKE)", "version": "0.0.1" }, "last_serial": 3981648, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "d6dd46e628de6ac611d7723dbca757b2", "sha256": "4d6d551a4e73db8b89af0f45a9613789a48275f905003838bbe1d2e528ea8d01" }, "downloads": -1, "filename": "multi_rake-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "d6dd46e628de6ac611d7723dbca757b2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5.0", "size": 29955, "upload_time": "2018-06-20T14:21:49", "url": "https://files.pythonhosted.org/packages/5f/26/8d1fd22c5e1bf65936bae9f8201df2fa8cefe6fa0b28f471384e8101b298/multi_rake-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cf31736ea27e75386b6c714d62702d57", "sha256": "885ab2fd795cb96015aa552f22b35b8ae9065ae2d2d5c1af42a1bb32e081bec4" }, "downloads": -1, "filename": "multi_rake-0.0.1.tar.gz", "has_sig": false, "md5_digest": "cf31736ea27e75386b6c714d62702d57", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5.0", "size": 34135, "upload_time": "2018-06-20T14:21:50", "url": "https://files.pythonhosted.org/packages/bb/85/a5bcad1857e6826d64d5ff9741ab100a8aea710e5d9fe5d91424544c1b93/multi_rake-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d6dd46e628de6ac611d7723dbca757b2", "sha256": "4d6d551a4e73db8b89af0f45a9613789a48275f905003838bbe1d2e528ea8d01" }, "downloads": -1, "filename": "multi_rake-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "d6dd46e628de6ac611d7723dbca757b2", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5.0", "size": 29955, "upload_time": "2018-06-20T14:21:49", "url": "https://files.pythonhosted.org/packages/5f/26/8d1fd22c5e1bf65936bae9f8201df2fa8cefe6fa0b28f471384e8101b298/multi_rake-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cf31736ea27e75386b6c714d62702d57", "sha256": "885ab2fd795cb96015aa552f22b35b8ae9065ae2d2d5c1af42a1bb32e081bec4" }, "downloads": -1, "filename": "multi_rake-0.0.1.tar.gz", "has_sig": false, "md5_digest": "cf31736ea27e75386b6c714d62702d57", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5.0", "size": 34135, "upload_time": "2018-06-20T14:21:50", "url": "https://files.pythonhosted.org/packages/bb/85/a5bcad1857e6826d64d5ff9741ab100a8aea710e5d9fe5d91424544c1b93/multi_rake-0.0.1.tar.gz" } ] }