{ "info": { "author": "Pavel Naydenov", "author_email": "naydenov.p.v@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Natural Language :: Russian", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Description \nPython library `ruword_frequency` returns frequency (ipm - items per million) of russian words, case insensitive.\nIt based on huge collection of russian documents and prepared word frequency sources. Full list:\n- [Wikipedia dump, russian segment](https://dumps.wikimedia.org/ruwiki/latest)\n- [Flibusta dump](https://rutracker.org/forum/viewtopic.php?t=5385741), more then 200 Gb of texts\n- [Pyhlyi's library](https://rutracker.org/forum/viewtopic.php?t=1874223)\n- [\u041d\u043e\u0432\u044b\u0439 \u0447\u0430\u0441\u0442\u043e\u0442\u043d\u044b\u0439 \u0441\u043b\u043e\u0432\u0430\u0440\u044c \u0440\u0443\u0441\u0441\u043a\u043e\u0439 \u043b\u0435\u043a\u0441\u0438\u043a\u0438](http://dict.ruslang.ru/freq.php)\n- [\u0421\u043b\u043e\u0432\u0430\u0440\u044c \u0440\u0443\u0441\u0441\u043a\u043e\u0439 \u043b\u0438\u0442\u0435\u0440\u0430\u0442\u0443\u0440\u044b](http://www.artint.ru/projects/frqlist.php) from http://speakrus.ru/dict/index.htm\n- [\u0427\u0430\u0441\u0442\u043e\u0442\u043d\u044b\u0439 \u0441\u043b\u043e\u0432\u0430\u0440\u044c \u041c\u0430\u0440\u043a\u0430 \u0444\u043e\u043d \u0425\u0430\u0433\u0435\u043d\u0430](http://speakrus.ru/dict/index.htm) see [description](http://speakrus.ru/dict/hagen_freq_desc.txt)\n\nWord's ipm from all enumerated sources was extracted and mean values used. \nFull index contains more them 7 billions word forms including mistakes from raw data sources (unfortunately).\n\n# Requirements:\n- Python 3\n- Word index occupies near 50 Mb on hard disk and will be downloaded first time you invoke `frequency.load()` method\n\n# Installation\n```\n# TODO\n```\n\n# Usage\n```\nfrom ruword_frequency import Frequency\nfreq = Frequency()\nfreq.load()\n\nfreq.ipm('\u043f\u0440\u0438\u0432\u0435\u0442')\n>>> 53.51823806762695\n\nfreq.ipm('\u043d\u0435\u0442\u0442\u0430\u043a\u043e\u0433\u043e\u0441\u043b\u043e\u0432\u0430')\n>>> 0.0\n\n# get max ipm value. For weights normalization, for example\nfreq.max_ipm()\n>>> 42329.2890625\n\n# get list of most used words with ipm more then 10000\nfor w in freq.iterate_words(10000):\n print(w)\n```\n\nFor other useful methods see [marisa-trie](https://marisa-trie.readthedocs.io/en/latest/tutorial.html) documentations.\nTree index available as `freq.tree`\n\n# Rebuild tree by yourself\n```\nfrom ruword_frequency.source_reader import SourceReader\nreader = SourceReader()\n\n# increase socket timeout, sometimes helpful for huge file downloading:\nimport socket\nsocket.setdefaulttimeout(60)\n\nreader.download_all_sources()\ntree = reader.build_tree_from_dictionaries()\nreader.save_tree(tree)\n\n# use it \nfreq = Frequency()\nfreq.ipm('\u043f\u0440\u0438\u0432\u0435\u0442')\n```\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Somewater/ruword_frequency", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "ruword-frequency", "package_url": "https://pypi.org/project/ruword-frequency/", "platform": "", "project_url": "https://pypi.org/project/ruword-frequency/", "project_urls": { "Homepage": "https://github.com/Somewater/ruword_frequency" }, "release_url": "https://pypi.org/project/ruword-frequency/0.0.1/", "requires_dist": [ "marisa-trie" ], "requires_python": "", "summary": "Library returns word frequence (ipm) by almost all russian words", "version": "0.0.1" }, "last_serial": 4275198, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "01dad46a07065763c36bfbcd42bb73a5", "sha256": "f62af42acdce21d0d86fb98f69ab7380c792352fd962a9b0c66874919e2a3826" }, "downloads": -1, "filename": "ruword_frequency-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "01dad46a07065763c36bfbcd42bb73a5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4793, "upload_time": "2018-09-15T16:08:55", "url": "https://files.pythonhosted.org/packages/cc/e2/a429f2b183c1f6ecfba96062f01a440f3bc00d960fc8ef8cf63d72f2b669/ruword_frequency-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "82fb859da816b5a7c4f4f48a92d0d6c3", "sha256": "b5e8861142d5a51a10bb631e894936fcd88c097838c7401bc4629649ff034f43" }, "downloads": -1, "filename": "ruword_frequency-0.0.1.tar.gz", "has_sig": false, "md5_digest": "82fb859da816b5a7c4f4f48a92d0d6c3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4007, "upload_time": "2018-09-15T16:08:56", "url": "https://files.pythonhosted.org/packages/d5/96/6d52257dcef1522dea45efa9a05ff9e5e235f2141da9f3f26f5cf6023029/ruword_frequency-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "01dad46a07065763c36bfbcd42bb73a5", "sha256": "f62af42acdce21d0d86fb98f69ab7380c792352fd962a9b0c66874919e2a3826" }, "downloads": -1, "filename": "ruword_frequency-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "01dad46a07065763c36bfbcd42bb73a5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4793, "upload_time": "2018-09-15T16:08:55", "url": "https://files.pythonhosted.org/packages/cc/e2/a429f2b183c1f6ecfba96062f01a440f3bc00d960fc8ef8cf63d72f2b669/ruword_frequency-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "82fb859da816b5a7c4f4f48a92d0d6c3", "sha256": "b5e8861142d5a51a10bb631e894936fcd88c097838c7401bc4629649ff034f43" }, "downloads": -1, "filename": "ruword_frequency-0.0.1.tar.gz", "has_sig": false, "md5_digest": "82fb859da816b5a7c4f4f48a92d0d6c3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4007, "upload_time": "2018-09-15T16:08:56", "url": "https://files.pythonhosted.org/packages/d5/96/6d52257dcef1522dea45efa9a05ff9e5e235f2141da9f3f26f5cf6023029/ruword_frequency-0.0.1.tar.gz" } ] }