{ "info": { "author": "Mikhail Korobov", "author_email": "kmike84@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: Russian", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Linguistic" ], "description": "=================\nopencorpora-tools\n=================\n\nThis package provides Python interface to http://opencorpora.org/\n\nInstall\n=======\n\n::\n\n pip install opencorpora-tools\n\nUsage\n=====\n\nObtaining corpora\n-----------------\n\nOpencorpora-tools works with XML from http://opencorpora.org/.\n\nYou can download and unpack the XML manually (from 'Downloads' page) or\njust use the provided command-line util::\n\n $ opencorpora download\n\nRun ``opencorpora download --help`` for more options.\n\nUsing corpora\n-------------\n\nopencorpora-tools package provides two entirely different APIs for working\nwith OpenCorpora XML files:\n\n* ``opencorpora.load`` - loads XML in memory using lxml library and uses lxml\n custom Element classes to provide a nice API. Note that the full\n OpenCorpora XML corpus can take up to 10GB RAM.\n* ``opencorpora.CorpusReader`` - it is slower and likely less convenient,\n but it allows to avoid loading the whole XML in memory. It also doesn't\n depend on lxml.\n\nopencorpora.load API\n--------------------\n\nFirst, load corpus to memory:\n\n >>> import opencorpora\n >>> corpus = opencorpora.load('annot.opcorpora.xml')\n >>> corpus\n \n >>> corpus.revision\n '4213997'\n >>> corpus.version\n '0.12'\n\nAccess documents:\n\n >>> len(corpus.docs)\n 3489\n >>> corpus.docs[42]\n \n >>> corpus[42] # it is the same as corpus.docs[42]\n \n\nSentences, paragraphs and tokens can be accessed directly:\n\n >>> corpus.sentences[0]\n \n >>> corpus.paragraphs[0]\n \n >>> len(corpus.tokens)\n 1740169\n\nWork with Doc objects:\n\n >>> doc = corpus[42]\n >>> doc\n \n >>> doc.id\n '44'\n >>> doc.name\n '18801 \u0425\u0438\u0442\u0440\u043e\u0441\u0442\u044c \u0434\u0443\u0445\u0430'\n >>> doc.source\n '\u0425\u0438\u0442\u0440\u043e\u0441\u0442\u044c \u0434\u0443\u0445\u0430 \u041f\u043e\u0447\u0435\u043c\u0443 \u043a\u043d\u044f\u0437\u044c \u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440 \u043a\u0440\u0435\u0441\u0442\u0438\u043b \u0420\u0443\u0441\u044c\\n\\n28 \u0438\u044e\u043b\u044f \u043f\u0440\u0430\u0432\u043e\u0441\u043b\u0430\u0432\u043d\u0430\u044f ...'\n >>> doc.tags\n ['\u0410\u0432\u0442\u043e\u0440:\u041e\u043b\u0435\u0433 \u0414\u0430\u0432\u044b\u0434\u043e\u0432', '\u0413\u043e\u0434:2010', '\u0414\u0430\u0442\u0430:28/07', 'url:http://www.chaskor.ru/article/28_iyulyahitrost_duha_18801', '\u0422\u0435\u043c\u0430:\u0427\u0430\u0441\u041a\u043e\u0440:\u041e\u0431\u0449\u0435\u0441\u0442\u0432\u043e']\n >>> doc.paragraphs\n [, ...]\n >>> doc.sentences\n [,\n ,\n ...\n ]\n >>> doc.tokens\n [, , ...]\n >>> doc[0] # the same as doc.tokens[0]\n \n\nParagraph objects:\n\n >>> para = doc.paragraphs[0]\n >>> para.id\n '1176'\n >>> para.source\n '\u0425\u0438\u0442\u0440\u043e\u0441\u0442\u044c \u0434\u0443\u0445\u0430 \u041f\u043e\u0447\u0435\u043c\u0443 \u043a\u043d\u044f\u0437\u044c \u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440 \u043a\u0440\u0435\u0441\u0442\u0438\u043b \u0420\u0443\u0441\u044c'\n >>> para.sentences\n [, ]\n >>> para.tokens\n [, , ...]\n >>> para[0] # the same as para.tokens[0]\n \n\nSentence objects:\n\n >>> sent = doc.sentences[6]\n >>> sent\n \n >>> sent.id\n '3439'\n >>> sent.source\n '\u0423 \u043a\u043d\u044f\u0437\u044f \u0421\u0432\u044f\u0442\u043e\u0441\u043b\u0430\u0432\u0430 \u0418\u0433\u043e\u0440\u0435\u0432\u0438\u0447\u0430 \u0431\u044b\u043b\u043e \u0442\u0440\u0438 \u0441\u044b\u043d\u0430: \u042f\u0440\u043e\u043f\u043e\u043b\u043a, \u041e\u043b\u0435\u0433 \u0438 \u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440.'\n >>> sent.tokens\n [, , , ...]\n >>> sent[1] # the same as sent.tokens[1]\n \n\nToken objects:\n\n >>> token = sent[1]\n >>> token\n \n >>> token.id\n '64913'\n >>> token.source\n '\u043a\u043d\u044f\u0437\u044f'\n >>> token.parses\n [,\n ]\n >>> token.lemma # lemma of a first parse, the same as token.parses[0].lemma\n '\u043a\u043d\u044f\u0437\u044c'\n >>> token.grammemes # the same as token.parses[0].grammemes\n ['NOUN', 'anim', 'masc', 'sing', 'gent']\n >>> token.parse # the same as token.parses[0]\n \n\nCorpus, Doc, Paragraph, Sentence, Token and Parse are custom etree Element\nsubclasses. You're not limited to the API described above - e.g. it is possible\nto process corpus using XPath expressions, and lxml will return\nthese custom Element classes in results if a tree is loaded using\n``opencorpora.load``.\n\nopencorpora.CorpusReader API\n----------------------------\n\nInitialize::\n\n >>> import opencorpora\n >>> corpus = opencorpora.CorpusReader('annot.opcorpora.xml')\n\nGet table of contents::\n\n >>> corpus.catalog()\n [('1', '\"\u0427\u0430\u0441\u0442\u043d\u044b\u0439 \u043a\u043e\u0440\u0440\u0435\u0441\u043f\u043e\u043d\u0434\u0435\u043d\u0442\"'),\n ('2', '00021 \u0428\u043a\u043e\u043b\u0430 \u0437\u043b\u043e\u0441\u043b\u043e\u0432\u0438\u044f'),\n ('3', '00022 \u041f\u043e\u0441\u043b\u0435\u0434\u043d\u0435\u0435 \u0432\u043e\u0441\u0441\u0442\u0430\u043d\u0438\u0435 \u0432 \u0421\u0435\u0443\u043b\u0435'),\n ('4', '00023 \u0417\u0430 \u043a\u043e\u0442\u0430 - \u043e\u0442\u0432\u0435\u0442\u0438\u0448\u044c!'),\n ...\n\nWork with documents::\n\n >>> seoul_words = corpus.words('3')\n >>> seoul_words\n ['\u00ab', '\u041f\u043e\u0441\u043b\u0435\u0434\u043d\u0435\u0435', '\u0432\u043e\u0441\u0441\u0442\u0430\u043d\u0438\u0435', '\u00bb', '\u0432', '\u0421\u0435\u0443\u043b\u0435', ...\n\n >>> corpus.documents(categories='\u0422\u0435\u043c\u0430:\u0427\u0430\u0441\u041a\u043e\u0440:\u041a\u043d\u0438\u0433\u0438*')\n [Document: 21759 2001-2010-\u0439: \u043a\u043d\u0438\u0433\u0438, \u043a\u043e\u0442\u043e\u0440\u044b\u0435 \u043f\u043e\u0442\u0440\u044f\u0441\u0430\u043b\u0438,\n Document: 12824 86 \u0441\u043d\u043e\u0432, \u0432\u044b\u0437\u0432\u0430\u043d\u043d\u044b\u0445 \u043f\u043e\u043b\u0451\u0442\u043e\u043c \u043f\u0447\u0435\u043b\u044b \u0432\u043e\u043a\u0440\u0443\u0433 \u0433\u0440\u0430\u043d\u0430\u0442\u0430 \u0437\u0430 \u0441\u0435\u043a\u0443\u043d\u0434\u0443 \u0434\u043e \u043f\u0440\u043e\u0431\u0443\u0436\u0434\u0435\u043d\u0438\u044f,\n Document: 10930 \u0410 \u0431\u043e\u0439\u0442\u0435\u0441\u044c \u0435\u0434\u0438\u043d\u0441\u0442\u0432\u0435\u043d\u043d\u043e \u0442\u043e\u043b\u044c\u043a\u043e \u0442\u043e\u0433\u043e, \u043a\u0442\u043e \u0441\u043a\u0430\u0436\u0435\u0442: \u00ab\u042f \u0437\u043d\u0430\u044e, \u043a\u0430\u043a \u043d\u0430\u0434\u043e!\u00bb,\n ...\n\n``opencorpora.Corpora`` is modelled after NLTK's CorpusReader interface;\nconsult with http://nltk.googlecode.com/svn/trunk/doc/book/ch02.html to\nget an idea how to work with the API. It it not exactly the same,\nbut should be very similar.\n\nCurrently CorposReader doesn't provide a way to access original OpenCorpora\nids of paragraphs/sentences/tokens.\n\n\nPerformance\n===========\n\nOpenCorpora XML is huge (>250MB) so building full DOM tree requires\na lot of memory (several GB, it coud be 10GB+ for full corpus).\nIf RAM is not an issue ``opencorpora.load`` should be faster and\nmore convenient; otherwise ``opencorpora.CorpusReader`` should work better.\n\n``opencorpora.CorpusReader`` handles it this way:\n\n1. ``corpus.get_document(doc_id)`` or ``corpus.documents(doc_ids)``\n don't load the original XML to memory and don't parse the whole XML.\n They use precomputed offset information to slice the XML instead.\n The offset information is computed on first access and\n saved to \".~\" file.\n\n Consider document loading O(1) regarding full XML size.\n Individual documents are not huge so they and loaded and parsed as usual.\n\n2. There are iterator methods for all corpora API (``corpus.iter_words``, etc).\n\n\nDevelopment\n===========\n\nDevelopment happens at github: https://github.com/kmike/opencorpora-tools\nIssue tracker: https://github.com/kmike/opencorpora-tools/issues.\nFeel free to submit ideas, bugs or pull requests.\n\nRunning tests\n-------------\n\nMake sure `tox `_ is installed and run\n\n::\n\n $ tox\n\nfrom the source checkout. Tests should pass under Python 2.7 and 3.3+.\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kmike/opencorpora-tools/", "keywords": "", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "opencorpora-tools", "package_url": "https://pypi.org/project/opencorpora-tools/", "platform": "", "project_url": "https://pypi.org/project/opencorpora-tools/", "project_urls": { "Homepage": "https://github.com/kmike/opencorpora-tools/" }, "release_url": "https://pypi.org/project/opencorpora-tools/0.5/", "requires_dist": null, "requires_python": "", "summary": "opencorpora.org python interface", "version": "0.5" }, "last_serial": 2485367, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "a957101171e3bcd715008f54d42da546", "sha256": "0d626885ffd16e61badca990ac33c5a27a4f225ea712d4cb6cdbd249c8813080" }, "downloads": -1, "filename": "opencorpora-tools-0.1.tar.gz", "has_sig": false, "md5_digest": "a957101171e3bcd715008f54d42da546", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7295, "upload_time": "2012-04-18T13:20:14", "url": "https://files.pythonhosted.org/packages/24/8c/7ffeea2c675554d3f5c1002f3d87a714823eb5c534f49f8b0d1b8483d322/opencorpora-tools-0.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "05f9ed031cb2237d055b7439617b5715", "sha256": "f4c8e6abc0195ece3b3eca030e8b485cb59f55c20f64d677679d79621ad25fb8" }, "downloads": -1, "filename": "opencorpora-tools-0.2.tar.gz", "has_sig": false, "md5_digest": "05f9ed031cb2237d055b7439617b5715", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7913, "upload_time": "2012-07-05T21:27:28", "url": "https://files.pythonhosted.org/packages/dd/9f/7cfbc36598f6e8deffce48b3d0140835152380e8c3d13f63dbe38be61398/opencorpora-tools-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "4ad8b8fe8ba171d7bdba3ba7c2aec67d", "sha256": "3733fe6c2751f38cb5d9f1b5d67c99ffae6507d6ca1f89940fab1b5addcbe98a" }, "downloads": -1, "filename": "opencorpora-tools-0.3.tar.gz", "has_sig": false, "md5_digest": "4ad8b8fe8ba171d7bdba3ba7c2aec67d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7600, "upload_time": "2013-04-03T20:44:04", "url": "https://files.pythonhosted.org/packages/ed/62/3032d7744360ca0ae6f3395de77c159a641e2a6d4f86db32036a35ee6872/opencorpora-tools-0.3.tar.gz" } ], "0.4": [ { "comment_text": "", "digests": { "md5": "c6046ca2e66a059f78dc90fce2049403", "sha256": "1e213b390479388db5cd1d3a5a9a71214f0516f75d24bddcbea0e7d2df817695" }, "downloads": -1, "filename": "opencorpora-tools-0.4.tar.gz", "has_sig": false, "md5_digest": "c6046ca2e66a059f78dc90fce2049403", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7767, "upload_time": "2013-10-06T09:35:39", "url": "https://files.pythonhosted.org/packages/71/cc/a5e19c932c23029910db3d9c1c9efe02a6f0c832cdf8c6a688797d6312f5/opencorpora-tools-0.4.tar.gz" } ], "0.4.2": [ { "comment_text": "", "digests": { "md5": "2e0c9fc0c63fa952fc8566f45c878979", "sha256": "5ac6cc9edd226fdb36975e545e484ba52a26828a2018b299f33deb8564e40382" }, "downloads": -1, "filename": "opencorpora-tools-0.4.2.tar.gz", "has_sig": false, "md5_digest": "2e0c9fc0c63fa952fc8566f45c878979", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7825, "upload_time": "2013-10-13T20:59:36", "url": "https://files.pythonhosted.org/packages/60/2e/f57693ab2cb614974c2ec42c3a57de1cdb3bf59bbf6acb3bfeb38c9b041e/opencorpora-tools-0.4.2.tar.gz" } ], "0.4.3": [ { "comment_text": "", "digests": { "md5": "dbd1e5cf29af6c5523a4edd6560067f3", "sha256": "b88b7403fd7915aeebe51f3d8674f7df75ccfc790b8a7f34ef861d0b55b1f522" }, "downloads": -1, "filename": "opencorpora-tools-0.4.3.tar.gz", "has_sig": false, "md5_digest": "dbd1e5cf29af6c5523a4edd6560067f3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7911, "upload_time": "2013-11-04T16:47:38", "url": "https://files.pythonhosted.org/packages/96/b3/d6c06faad39ff84e9c5a0bf4cb5c880339d068288ed0a11e2e38b3361daa/opencorpora-tools-0.4.3.tar.gz" } ], "0.4.4": [ { "comment_text": "", "digests": { "md5": "85ef298f3040e41377c1b2a7ee080cf2", "sha256": "e84b683d30f666d331e6ef278df5084ab01d1df2ab540ca3900a33a95b357224" }, "downloads": -1, "filename": "opencorpora-tools-0.4.4.tar.gz", "has_sig": false, "md5_digest": "85ef298f3040e41377c1b2a7ee080cf2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7936, "upload_time": "2013-11-14T15:02:55", "url": "https://files.pythonhosted.org/packages/2a/62/592f3a3bc755da6162472db56efe609716bd1f7659fe0297b42bf06cb5ff/opencorpora-tools-0.4.4.tar.gz" } ], "0.5": [ { "comment_text": "", "digests": { "md5": "142b6eda1c573b430635b5b805c08f6a", "sha256": "868b5ee2972578ae5d5d659f28706bea6c3ae0204154cc09e9e0c76ac04d525c" }, "downloads": -1, "filename": "opencorpora_tools-0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "142b6eda1c573b430635b5b805c08f6a", "packagetype": "bdist_wheel", "python_version": "3.4", "requires_python": null, "size": 16315, "upload_time": "2016-11-27T20:37:16", "url": "https://files.pythonhosted.org/packages/0f/70/eb266b415754f22423af378aa66e6f618710c2e4d0f7646183cb02909a5b/opencorpora_tools-0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a804390ee1861949789d767536241005", "sha256": "aac8879a641fd165fc852d48c4f4cdc2881662d42fab3cd217d910ffdc8da880" }, "downloads": -1, "filename": "opencorpora-tools-0.5.tar.gz", "has_sig": false, "md5_digest": "a804390ee1861949789d767536241005", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11245, "upload_time": "2016-11-27T20:37:06", "url": "https://files.pythonhosted.org/packages/28/af/daf1b4abf722970aff1ca9108f90b5509355fb8da383cedad0bc40921668/opencorpora-tools-0.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "142b6eda1c573b430635b5b805c08f6a", "sha256": "868b5ee2972578ae5d5d659f28706bea6c3ae0204154cc09e9e0c76ac04d525c" }, "downloads": -1, "filename": "opencorpora_tools-0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "142b6eda1c573b430635b5b805c08f6a", "packagetype": "bdist_wheel", "python_version": "3.4", "requires_python": null, "size": 16315, "upload_time": "2016-11-27T20:37:16", "url": "https://files.pythonhosted.org/packages/0f/70/eb266b415754f22423af378aa66e6f618710c2e4d0f7646183cb02909a5b/opencorpora_tools-0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a804390ee1861949789d767536241005", "sha256": "aac8879a641fd165fc852d48c4f4cdc2881662d42fab3cd217d910ffdc8da880" }, "downloads": -1, "filename": "opencorpora-tools-0.5.tar.gz", "has_sig": false, "md5_digest": "a804390ee1861949789d767536241005", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11245, "upload_time": "2016-11-27T20:37:06", "url": "https://files.pythonhosted.org/packages/28/af/daf1b4abf722970aff1ca9108f90b5509355fb8da383cedad0bc40921668/opencorpora-tools-0.5.tar.gz" } ] }