{ "info": { "author": "Martin van Harmelen", "author_email": "Martin@vanharmelen.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3.6" ], "description": "# mmax2conll\nScript to convert coreference data in MMAX format to [CoNLL][] format or raw text files.\n\nSee `CoNLL-specification.md` and `MMAX-specification.md` for extensive descriptions of the CoNLL and MMAX formats.\n\n\n## Usage\n\n### `mmax2conll.py`\nBecause the COREA corpus saves its sentence information in the `*_words.xml` files\nbut the [SoNaR-1][] corpus saves this separately in `*_sentence_level.xml` files,\nspecifying a sentences file is optional.\n\nTo automatically find all (sub)folders that contain a `Basedata` and `Markables` folder as direct children and convert all data in those folders, run:\n\n```sh\npython -m mmax2conll path/to/config.yml path/to/output_dir -d path/to/some/folder [-d path/to/another/folder ...]\n```\n\nTo only convert one pair (or triple) of files, run:\n```sh\npython -m mmax2conll path/to/config.yml path/to/output.conll path/to/some_words.xml path/to/a_coref_level.xml [path/to/a_sentence_level.xml]\n```\n\n### `mmax2raw.py`\nTo automatically find all (sub)folders that contain a `Basedata` and `Markables` folder as direct children and convert all data in those folders, run:\n\n```sh\nmmax2raw.py path/to/output_dir -d path/to/some/folder [-d path/to/another/folder ...]\n```\n\nTo only convert one file, run:\n```sh\nmmax2raw.py path/to/output.txt path/to/some_words.xml\n```\n\n\n## Columns of CoNLL output\nThese scripts were first used to convert data from the [COREA][] (Ch.7 p.115 -- 128) dataset to CoNLL and\nCOREA does not contain the following information:\n\n - POS tags\n - constituency tree\n - predicates\n - speaker/author information\n - named entities\n\nTherefore **these scripts strictly do not output data in CoNLL format**.\nThe following values and place-holders are used.\n\n| Column | Description | Value | Conform CoNLL specification?\n| ---: | --- | --- | ---\n| 1 | Document ID | file path without extension | Yes\n| 2 | Part number | `0` or as extracted from `.alpsent` from MMAX `*_words.xml` files \\[1\\] | Yes\n| 3 | Word number | `.alppos` or `.pos` from MMAX `*_words.xml` files | Yes\n| 4 | Word itself | content of `` tags from MMAX `*_words.xml` files | Yes\n| 5 | POS | `[POS]` | No\n| 6 | Parse bit | `*` | No\n| 7 | Predicate lemma | `-` | Yes\n| 8 | Predicate Frameset ID | `-` | Yes\n| 9 | Word sense | `-` | Yes\n| 10 | Speaker/Author | `UNKNOWN` | ???\n| 11 | Named Entities | `*` | Yes\n| - | Predicate Arguments | None: column(s) left out entirely | Yes, conform example in CoNLL 2012\n| 12 | Coreference | extracted from MMAX `*_coref_level.xml` files (ISSUE! \\[2\\]) | Yes\n\n\\[1\\]:\n The part numbers of DCOI start at 1, where the part numbers in a CoNLL file start at 0.\n To keep the origin of the data clear this 1-based part number is not changed,\n but instead an empty part 0 is added to those files.\n\n\\[2\\]:\n The reference spans are not closed in the correct order if they end at the same word. The following is an example of output from `mmax2conll`:\n\n (10\n -\n (52|(55\n 52)\n -\n 10)|55)|(133)\n\n While pedantically correct would be:\n\n (10\n -\n (55|(52\n 52)\n -\n (133)|55)|10)\n\n\n# Issues\n\n - [ ] Skipping a whole file one any error is too wasteful\n - [ ] 'on_missing' config key is not validated before use\n - [ ] `basedata_dir` and `markables_dir` should not be configuration keys\n - [ ] Too many methods in `main.py` are marked as class methods\n\n# References\nChristoph M\u00fcller and Michael Strube.
\nMulti-Level Annotation in MMAX
\nIn Proceedings of the 4th SIGDIAL Workshop. 2003.
\nURL http://www.speech.cs.cmu.edu/sigdial2003/proceedings/07_LONG_strube_paper.pdf\n\n\nIris Hendrickx, Gosse Bouma, Walter Daelemans and V\u00e9ronique Hoste.
\nCOREA: Coreference Resolution for Extracting Answers for Dutch
\nEssential Speech and Language Technology for Dutch, Ch.7, p.115 -- 128. 2013.
\nEditors: Peter Spyns, Jan Odijk
\nhttps://link.springer.com/book/10.1007/978-3-642-30910-6
\n10.1007/978-3-642-30910-6
\n\nSoNaR: https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus\n\n\n[COREA]: https://link.springer.com/book/10.1007/978-3-642-30910-6\n[CoNLL]: http://conll.cemantix.org/2012/data.html\n[SoNaR-1]: https://ivdnt.org/downloads/taalmaterialen/tstc-sonar-corpus", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/cltl/FormatConversions/tree/master/mmax2conll", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "mmax2conll", "package_url": "https://pypi.org/project/mmax2conll/", "platform": "", "project_url": "https://pypi.org/project/mmax2conll/", "project_urls": { "Homepage": "https://github.com/cltl/FormatConversions/tree/master/mmax2conll" }, "release_url": "https://pypi.org/project/mmax2conll/1.0.1/", "requires_dist": null, "requires_python": "", "summary": "Script to convert data in MMAX format to CoNLL format", "version": "1.0.1" }, "last_serial": 5144661, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "39b62f977e857b52ff67528fbf64b9ba", "sha256": "899f43dbd2bf5bbba3f1f91bbdb964c382c4843a8d729bb2e7fc5c9974a1474d" }, "downloads": -1, "filename": "mmax2conll-1.0.0.zip", "has_sig": false, "md5_digest": "39b62f977e857b52ff67528fbf64b9ba", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30710, "upload_time": "2018-07-12T10:06:33", "url": "https://files.pythonhosted.org/packages/10/1c/4151a19041793023352644ecdb3e4a2801e9413c5cc3cfdb9cb76c7c2d4f/mmax2conll-1.0.0.zip" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "66d554e429dec57766a322cea8e737cc", "sha256": "b32fac96b3c987c92fbaaca5da2f653215f4c85b0d617cbffbc7e06e9aa285c5" }, "downloads": -1, "filename": "mmax2conll-1.0.1.zip", "has_sig": false, "md5_digest": "66d554e429dec57766a322cea8e737cc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30711, "upload_time": "2019-04-15T13:15:05", "url": "https://files.pythonhosted.org/packages/69/6b/fe50ffa021bb6376de67ffa40fe3435f0d2fbc3de2d2aea80806f32d29fa/mmax2conll-1.0.1.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "66d554e429dec57766a322cea8e737cc", "sha256": "b32fac96b3c987c92fbaaca5da2f653215f4c85b0d617cbffbc7e06e9aa285c5" }, "downloads": -1, "filename": "mmax2conll-1.0.1.zip", "has_sig": false, "md5_digest": "66d554e429dec57766a322cea8e737cc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30711, "upload_time": "2019-04-15T13:15:05", "url": "https://files.pythonhosted.org/packages/69/6b/fe50ffa021bb6376de67ffa40fe3435f0d2fbc3de2d2aea80806f32d29fa/mmax2conll-1.0.1.zip" } ] }