{ "info": { "author": "Peter Bedn\u00e1r", "author_email": "peter.bednar@tuke.sk", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3 :: Only", "Programming Language :: Python :: Implementation :: CPython", "Topic :: Text Processing :: Linguistic", "Topic :: Utilities" ], "description": "# conllutils\n\n**CoNLL-Utils** is a Python library for processing of [CoNLL-U](https://universaldependencies.org) treebanks. It\nprovides mutable Python types for the representation of tokens, sentences and dependency trees. Additionally, the \nsentences can be indexed into the compact numerical form with the data stored in the [NumPy](https://numpy.org)\narrays, which can be directly processed by the _machine learning_ algorithms.\n\nThe library also provides a flexible _pipeline_ API for the treebank pre-processing, which allows you to:\n* parse and write data from/to CoNLL-U files,\n* filter or transform sentences, tokens or token's fields using the arbitrary Python function,\n* filter only the syntactic words (i.e. without empty or multiword tokens),\n* filter only sentences, which can be represented as the (non)projective dependency trees,\n* extract only [Universal dependency relations](https://universaldependencies.org/u/dep/index.html) without \n the language-specific extensions for DEPREL and DEPS fields,\n* generate concatenated UPOS and FEATS field,\n* extract the text of the sentences reconstructed from the tokens,\n* replace the field's values matched by the regular expressions, or replace the missing values,\n* create unlimited data stream, randomly shuffle data and form batches of instances\n* ... and more.\n\n### Installation\n\nThe library supports Python 3.6 and later.\n\n#### pip\n\nThe CoNLL-Utils is available on [PyPi](https://pypi.python.org/pypi) and can be installed via `pip`. To install simply\nrun:\n```\npip install conllutils\n```\n\nTo upgrade the previous installation to the newest release, use:\n```\npip install conllutils -U\n```\n\n#### From source\n\nAlternatively, you can also install library from this git repository, which will give you more flexibility and allows\nyou to start contributing to the CoNLL-Utils code. For this option, run:\n```\ngit clone https://github.com/peterbednar/conllutils.git\ncd conllutils\npip install -e .\n```\n\n### Getting started with CoNLL-Utils\n\n#### Preparing pipeline for sentence pre-processing\n\nAt first, we will create a _pipeline_ for pre-processing of sentences, which will:\n1. Filter only syntactic words (i.e. skip the empty and multiword tokens).\n2. Extract only Universal dependency relations without the language-specific extensions.\n3. Generate a concatenated UPOS & FEATS field.\n4. Transform FORM values to lowercase.\n5. Replace numbers expressions in FORM field with the constant value.\n\n```python\nfrom conllutils import pipe\n\nNUM_REGEX = r'[0-9]+|[0-9]+\\.[0-9]+|[0-9]+[0-9,]+'\nNUM_FORM = '__number__'\n\np = pipe()\np.only_words()\np.only_universal_deprel()\np.upos_feats()\np.lowercase('form')\np.replace('form', NUM_REGEX, NUM_FORM)\n```\n\n#### Indexing sentences for machine learning\n\nNext, we will transform sentences into the training instances. This operation requires two iterations over the training\ndata. In the first pass, we will create an index mapping the string values to the integers for FORM, UPOS & FEATS and\nDEPREL fields. In the second pass, we will pre-process and index training data and collect them in the list.\n\n```python\ntrain_file = 'en_ewt-ud-train.conllu'\nindexed_fields = {'form', 'upos_feats', 'deprel'}\n\nindex = pipe().read_conllu(train_file).pipe(p).create_index(fields=indexed_fields)\ntrain_data = pipe().read_conllu(train_file).pipe(p).to_instance(index).collect()\n```\n\n#### Iterating over batches of training instances\n\nNow we can use the data for the training of machine learning models. Next pipeline will stream 10 000 of instances in a\ncycle from the list, randomly shuffle the order and form the batches of 100 training instances per batch.\n\n```python\ntotal_size = 10000\nbatch_size = 100\n\nfor batch in pipe(train_data).stream(total_size).shuffle().batch(batch_size):\n # Update your model for the next batch of instances.\n instance = batch[0]\n # Instance values are indexed and stored in the NumPy arrays.\n length = instance['form'].shape[0]\n pass\n```\n\nAlternatively, whole data doesn't have to be loaded into the memory, and you can stream instances directly from the\nfile.\n\n### Documentation\n\nThe reference documentation is available on [this site](https://peterbednar.github.io/conllutils/html/conllutils/index.html).\n\n### LICENSE\n\nCoNLL-Utils is released under the MIT License. See the [LICENSE](https://github.com/peterbednar/conllutils/blob/master/LICENSE)\nfile for more details.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/peterbednar/conllutils", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "conllutils", "package_url": "https://pypi.org/project/conllutils/", "platform": "", "project_url": "https://pypi.org/project/conllutils/", "project_urls": { "Homepage": "https://github.com/peterbednar/conllutils" }, "release_url": "https://pypi.org/project/conllutils/1.1.4/", "requires_dist": null, "requires_python": ">=3.6", "summary": "A library for processing of CoNLL-U treebanks.", "version": "1.1.4", "yanked": false, "yanked_reason": null }, "last_serial": 7285002, "releases": { "0.9.0": [ { "comment_text": "", "digests": { "md5": "24cac1b54ef843cb9c6d714c0773179d", "sha256": "d566c27e78398f57cc2ddebe81f113577a2cff70f6476c9afab8148bdd0572c2" }, "downloads": -1, "filename": "CoNLLUtils-0.9.0.tar.gz", "has_sig": false, "md5_digest": "24cac1b54ef843cb9c6d714c0773179d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4619, "upload_time": "2019-09-04T21:22:48", "upload_time_iso_8601": "2019-09-04T21:22:48.144917Z", "url": "https://files.pythonhosted.org/packages/ba/e6/754ea78a8c5d5b928128c2069781f84f92a5a1ab110df282260cbaf238da/CoNLLUtils-0.9.0.tar.gz", "yanked": false, "yanked_reason": null } ], "0.9.1": [ { "comment_text": "", "digests": { "md5": "2ea79a55963e36482132a8e02ad00dcc", "sha256": "3e315d994635da58ba5641aec2e4031779a4c37ad2c8e28169335473bdb9e044" }, "downloads": -1, "filename": "CoNLLUtils-0.9.1.tar.gz", "has_sig": false, "md5_digest": "2ea79a55963e36482132a8e02ad00dcc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4724, "upload_time": "2019-09-18T12:33:44", "upload_time_iso_8601": "2019-09-18T12:33:44.158781Z", "url": "https://files.pythonhosted.org/packages/43/88/7ead39e441acb6bc4ac8cc30d6f33d1864d2a7acde1db9d0c22856fe8ec3/CoNLLUtils-0.9.1.tar.gz", "yanked": false, "yanked_reason": null } ], "0.9.2": [ { "comment_text": "", "digests": { "md5": "abe810b129ca11433102c20f9ab14510", "sha256": "be0993f7b54ec2c54c744c58dc7cad17ed02d0d7d4c1a80b6621ab21dd9fde76" }, "downloads": -1, "filename": "CoNLLUtils-0.9.2.tar.gz", "has_sig": false, "md5_digest": "abe810b129ca11433102c20f9ab14510", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4786, "upload_time": "2019-10-23T11:51:50", "upload_time_iso_8601": "2019-10-23T11:51:50.284628Z", "url": "https://files.pythonhosted.org/packages/e6/31/128478ce10546e54b29a599ce5de6894fb4122fe72bb4d9fdfcc54cccd0a/CoNLLUtils-0.9.2.tar.gz", "yanked": false, "yanked_reason": null } ], "0.9.3": [ { "comment_text": "", "digests": { "md5": "4c4484fc9539a43087f5e3ba7ca49cc1", "sha256": "a66cd85ec604c922e30973338f91b92d359b5022fc3b2f82b8f143289a1ae35e" }, "downloads": -1, "filename": "CoNLLUtils-0.9.3.tar.gz", "has_sig": false, "md5_digest": "4c4484fc9539a43087f5e3ba7ca49cc1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7079, "upload_time": "2020-03-30T20:22:41", "upload_time_iso_8601": "2020-03-30T20:22:41.269318Z", "url": "https://files.pythonhosted.org/packages/1d/ba/1a82d2d760f2c95c5744423ddf77898ad1f993b8b4eae5dbb6192550c821/CoNLLUtils-0.9.3.tar.gz", "yanked": false, "yanked_reason": null } ], "0.9.4": [ { "comment_text": "", "digests": { "md5": "9abef02e1a6012ef4f9d6d64a3010ee0", "sha256": "becaeb650344090cf96564d06bba0829a2ba4bb572919201ef4b8b46ec604b5a" }, "downloads": -1, "filename": "CoNLLUtils-0.9.4-py3-none-any.whl", "has_sig": false, "md5_digest": "9abef02e1a6012ef4f9d6d64a3010ee0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 14894, "upload_time": "2020-04-13T19:26:36", "upload_time_iso_8601": "2020-04-13T19:26:36.074445Z", "url": "https://files.pythonhosted.org/packages/35/a2/1b5dd7fe004216d7cd4d882b58c752f5ef13b1796e4d17fd30d3b69242d8/CoNLLUtils-0.9.4-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "3c6bdaa5bc3892bc6ac3cb5a7b5f3d49", "sha256": "518f6b9dd294c481d1bf41145f5dc46ccd148a62f389fc725f324d22f57ffcb0" }, "downloads": -1, "filename": "CoNLLUtils-0.9.4.tar.gz", "has_sig": false, "md5_digest": "3c6bdaa5bc3892bc6ac3cb5a7b5f3d49", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14324, "upload_time": "2020-04-13T19:26:39", "upload_time_iso_8601": "2020-04-13T19:26:39.463345Z", "url": "https://files.pythonhosted.org/packages/bf/66/354032599616ad4af25435fbcc89c8d2c5b1d1b9da68b4c88003beba21be/CoNLLUtils-0.9.4.tar.gz", "yanked": false, "yanked_reason": null } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "6229fb8e28e5138338358e3ffbad69bc", "sha256": "bb013a1e004903e183a608ae67a22c2f96af85d3e32841aebdede838a20f4870" }, "downloads": -1, "filename": "conllutils-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "6229fb8e28e5138338358e3ffbad69bc", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16805, "upload_time": "2020-04-15T09:52:58", "upload_time_iso_8601": "2020-04-15T09:52:58.257553Z", "url": "https://files.pythonhosted.org/packages/eb/c0/c8f2102a746aa39596b49af1947888214bb5cd7c9c0315197224bf2e4038/conllutils-1.0.0-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "bf4a8f5069ea545d719bc91f19776b84", "sha256": "4030b3b1ef4cda0377fedff1b758e16d227f961e280d023e9fdb5ab10b0f3ae7" }, "downloads": -1, "filename": "conllutils-1.0.0.tar.gz", "has_sig": false, "md5_digest": "bf4a8f5069ea545d719bc91f19776b84", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16590, "upload_time": "2020-04-15T09:53:03", "upload_time_iso_8601": "2020-04-15T09:53:03.246780Z", "url": "https://files.pythonhosted.org/packages/51/d8/5466e316692699f133c575a8f8885cfa68e4f01f03868598eb1ba14205a8/conllutils-1.0.0.tar.gz", "yanked": false, "yanked_reason": null } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "20e928a103c5ac042d3ecfcc9afe1c9a", "sha256": "33df9ccd9c95d7e86c72c88d6b0bde2417e84ab2f6a3217711086896682cfaa5" }, "downloads": -1, "filename": "conllutils-1.1.0.tar.gz", "has_sig": false, "md5_digest": "20e928a103c5ac042d3ecfcc9afe1c9a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15945, "upload_time": "2020-04-21T16:20:39", "upload_time_iso_8601": "2020-04-21T16:20:39.208118Z", "url": "https://files.pythonhosted.org/packages/49/bc/41fc14fa9fdad4d21fccbc7b662e2dd15c14b7a0edb9d4521cf330163644/conllutils-1.1.0.tar.gz", "yanked": false, "yanked_reason": null } ], "1.1.1": [ { "comment_text": "", "digests": { "md5": "dba9ede028b19fa356dbb1d299e9a88d", "sha256": "72e72aaaa237af485bcc7057e59139cd5d7be1f2b22fbecb3db27444bae487e1" }, "downloads": -1, "filename": "conllutils-1.1.1.tar.gz", "has_sig": false, "md5_digest": "dba9ede028b19fa356dbb1d299e9a88d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 16776, "upload_time": "2020-04-23T11:34:15", "upload_time_iso_8601": "2020-04-23T11:34:15.355592Z", "url": "https://files.pythonhosted.org/packages/8b/61/157c5bd278033e0d9baee87e43825626debc31db13c59bebc58fb80a5419/conllutils-1.1.1.tar.gz", "yanked": false, "yanked_reason": null } ], "1.1.2": [ { "comment_text": "", "digests": { "md5": "069183f3251065d6798eab553c59ec42", "sha256": "48c9a46d22043bf2fd4261750a898e81de597522b4be3de0f7112bd3ccb9f654" }, "downloads": -1, "filename": "conllutils-1.1.2.tar.gz", "has_sig": false, "md5_digest": "069183f3251065d6798eab553c59ec42", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 19554, "upload_time": "2020-04-25T18:51:29", "upload_time_iso_8601": "2020-04-25T18:51:29.898494Z", "url": "https://files.pythonhosted.org/packages/88/16/94f11c36aaf1c2799b37743c7eeb5bc6fa226d394c3fa841c4adc1353d44/conllutils-1.1.2.tar.gz", "yanked": false, "yanked_reason": null } ], "1.1.3": [ { "comment_text": "", "digests": { "md5": "9cd2733d11eb06c68795ef63f936b8b7", "sha256": "daa58b1d40537b79001b89f94f049abbcdfc84762ddd1c11c0962bdd8f561fac" }, "downloads": -1, "filename": "conllutils-1.1.3.tar.gz", "has_sig": false, "md5_digest": "9cd2733d11eb06c68795ef63f936b8b7", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 18260, "upload_time": "2020-04-29T12:51:33", "upload_time_iso_8601": "2020-04-29T12:51:33.909092Z", "url": "https://files.pythonhosted.org/packages/ca/f4/1eb2979d195cd4442fbd06e6ec596e69b9b90242432fe007bac74a347c4f/conllutils-1.1.3.tar.gz", "yanked": false, "yanked_reason": null } ], "1.1.4": [ { "comment_text": "", "digests": { "md5": "304237702c3e672bdca874bddbf072e1", "sha256": "c044b0602b522026a70ff05416976222e1778659fb44a2858dacca21369aebef" }, "downloads": -1, "filename": "conllutils-1.1.4.tar.gz", "has_sig": false, "md5_digest": "304237702c3e672bdca874bddbf072e1", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 18879, "upload_time": "2020-05-20T10:15:31", "upload_time_iso_8601": "2020-05-20T10:15:31.581052Z", "url": "https://files.pythonhosted.org/packages/20/33/508eabad4f84f65971f3eca651478e4c378db1f44070d76a20c0eacf347d/conllutils-1.1.4.tar.gz", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "304237702c3e672bdca874bddbf072e1", "sha256": "c044b0602b522026a70ff05416976222e1778659fb44a2858dacca21369aebef" }, "downloads": -1, "filename": "conllutils-1.1.4.tar.gz", "has_sig": false, "md5_digest": "304237702c3e672bdca874bddbf072e1", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 18879, "upload_time": "2020-05-20T10:15:31", "upload_time_iso_8601": "2020-05-20T10:15:31.581052Z", "url": "https://files.pythonhosted.org/packages/20/33/508eabad4f84f65971f3eca651478e4c378db1f44070d76a20c0eacf347d/conllutils-1.1.4.tar.gz", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }