{ "info": { "author": "Maarten van Gompel, Merijn Beeksma, Florian Kunneman", "author_email": "proycon@anaproy.nl", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: POSIX", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Text Processing :: Linguistic" ], "description": "# CLIN 2018 Shared Task: Spelling Correction\n\n## Introduction\n\nThis repository harbors the scripts for handling the data that is part of the CLIN28 shared task on spelling correction.\n\nAutomatic spell checking and correction has been subject of research for decades. Although state of the art spell checkers perform reasonably well for everyday-life applications, reaching high accuracy remains to be a challenging task. This shared task focuses on the detection and correction of spelling errors in Dutch Wikipedia texts. Wikipedia articles aim to be standard-Dutch texts, which may contain jargon. In particular, this task addresses the detection and correction of the types of spelling errors listed in the next section.\n\nNote the following:\n* Submitted spelling correctors will be evaluated for detection and correction of these \u2013 and only these \u2013 types of errors.\n* The spelling errors do not have to be categorized into the categories that are listed below \u2013 only detected and corrected.\n* In case of officially accepted spelling variation or doubt about the correct spelling, all correct variants are accepted.\n* The corrections are evaluated in accordance with the Woordenlijst Nederlandse Taal (http://woordenlijst.org/) and the Leidraad (http://woordenlijst.org/leidraad).\n\n## Errors to detect and correct\n\n* **real-word confusions** (``confusion``), word is confused with a near neighbor (confusion with non-native spelling, homophony, grammatical errors, et cetera):\n * ik wordt \u2192 ik word\n * stijl \u2192 steil\n * hobbies \u2192 hobby\u2019s\n * me \u2192 mijn\n * als \u2192 dan\n* **split errors** (``spliterror``), compound words which are incorrectly separated:\n * beleids medewerker \u2192 beleidsmedewerker\n * lang durig \u2192 langdurig\n* **runon errors** (``runonerror``), incorrect concatenation of words:\n * etcetera \u2192 et cetera\n * zeidat \u2192 zei dat\n* **missing words** (``missingword``), sentence is ungrammatical due to missing elements:\n * samen met vrouw die \u2192 samen met de vrouw die\n* **redundant words** (``redundantword``), sentence is ungrammatical due to redundant elements:\n * door doordat \u2192 doordat\n* **missing punctuation** (``missingpunctuation``), missing diacritical symbols and hyphenation marks (other cases of missing punctuation are excluded from the task):\n * een en ander \u2192 \u00e9\u00e9n en ander\n * financiele \u2192 financi\u00eble\n * autoongeluk \u2192 auto-ongeluk\n* **redundant punctuation** (``redundantpunctuation``), redundant diacritical symbols and hyphenation marks (other cases of redundant punctuation are excluded from the task):\n * co-assistent \u2192 coassistent\n* **capitalisation errors** (``capitalizationerror``), incorrect use of capital letters:\n * Joodse \u2192 joodse\n * Minister van Onderwijs \u2192 minister van Onderwijs\n * amstelveen \u2192 Amstelveen\n* **archaic spelling** (``archaic``), outdated spelling:\n *\taktie \u2192 actie\n * paardebloem \u2192 paardenbloem\n* **non-word errors** (``nonworderror``), words that do not exist in Dutch:\n * voek \u2192 boek\n * assrtief \u2192 assertief\n\nIn parentheses are the class IDs for the error categories, this is how they should always be referred to in the data,\nthis is in line with [this FoLiA Set Definition](https://github.com/proycon/folia/blob/master/setdefinitions/spellingcorrection.foliaset.xml).\n\n## Data\n\nWe initially deliver three annotated documents for validation purposes. A validation set consisting of 50 Wikipedia articles will follow before the end of October. The documents may contain zero, one, or more spelling errors. The validation set contains all of the spelling error categories listed below. In December, a full test set will be published in the same format.\n\n### Data format\n\nWe deliver the trial set, the test set, and eventually the gold-standard reference in two formats: [FoLiA\nXML](https://proycon.github.io/folia) and a JSON format. This JSON representation is automatically derived from the\nFoLiA documents and acts as a simplified format for this task to make it more accessible and not place an unnecessarily\nhigh burden on document parsing. It can act as input to your system as it contains all vital information, however, it is\nnot as rich as the original FoLiA document.\n\nLikewise, your system may output either FoLiA XML or our JSON format. In either case, it is important to ensure your\noutput is valid by using the validator tools we provide. For FoLiA use the ``foliavalidator`` tool (part of\nhttps://github.com/proycon/folia), for JSON, use the validator provided in this repository.\n\nAll data will be delivered to you in tokenised form. Tokenisation has been conducted using\n[ucto](https://languagemachines.github.io/ucto). You're expected to adhere to this tokenisation, the data formats have\nspecial facilities for merges, splits, insertions and deletions of tokens, as may naturally arise in spelling\ncorrection.\n\n#### JSON\n\nFamiliarity with JSON is assumed; we will merely state the specifics of our representation. At the root level, we have\n``words`` and ``corrections``. Words contains a list of all words/tokens along with their ID and some other information.\nCorrections contains a list of corrections on those words, this will be provided for the trial data and for the\ngold-standard release after the task's end. For the test data, it will be an empty list which your system is expected to\nfill. Consider the following example:\n\n```json\n{\n \"words\": [\n { \"text\": \"Dit\", \"id\": \"word.1\", \"space\": true, \"in\": \"sentence.1\" },\n { \"text\": \"is\", \"id\": \"word.2\", \"space\": true, \"in\": \"sentence.1\" },\n { \"text\": \"een\", \"id\": \"word.3\", \"space\": true, \"in\": \"sentence.1\" },\n { \"text\": \"vooorbeeld\", \"id\": \"word.4\", \"space\": false, \"in\": \"sentence.1\" },\n { \"text\": \".\", \"id\": \"word.5\", \"space\": true, \"in\": \"sentence.1\" }\n ],\n \"corrections\": [\n { \"class\": \"nonworderror\", \"span\": [\"word.4\"], \"text\": \"voorbeeld\" }\n ]\n}\n```\n\nThis example shows one correction.\n\n**Word Specification**:\n* ``text`` - The text of the word/token, a string\n* ``id`` - The ID of the word/token (string). This is used to refer back to the token. Note that although the ID often has implicit numbering indicating ordering, this is **NOT** guaranteed. The order of the words should be derived from the order they appear in the ``words`` list only. IDs are case sensitive!\n* ``space`` - A boolean indicating whether the word/token is followed by a space. This can be used to reconstruct the\n text prior to tokenisation.\n* ``in`` - This refers to the ID of the structural element in which the word occurs, almost always a sentence. Sentence\n breaks can be detected by changes in this value. For more structural information, you'll need the original FoLiA\n documents.\n\n**Correction specification:**\n* ``span`` - A list of word IDs to which this correction applies.\n* ``text`` - The text of the correction, i.e. the new word(s). This text may be an empty string in case of a deletion\n (e.g. redundant word/punctuation), or may consist of multiple space separated words in case of a run-on\n error (for example *naarhuis* -> *naar huis*).\n* ``after`` (instead of ``span``) - Should be used instead of ``span`` in cases of an insertion (insertion of a new word/token where\n previously none existed). The value is a string and is the ID of the word **after which** the correction is to be\n inserted.\n* ``confidence`` (optional) - A floating number between 0.0 and 1.0 indicating the confidence in this correction. (If not explicitly mentioned, 1.0 is assumed)\n* ``class`` (optional) - The type of the error; i.e. one of the classes defined in [our set definition](https://github.com/proycon/folia/blob/master/setdefinitions/spellingcorrection.foliaset.xml) (use the IDs, not the labels!). Your system does **not** need to output this, it merely serves as extra information in the gold standard.\n\nNote that all JSON for this task should be UTF-8 encoded.\n\n#### FoLiA\n\nThe JSON option is the simpler and sufficient option for this task. But if you want to leverage the full\ninformation available in the input document, you can fall back on using the original FoLiA input.\n\nThe FoLiA format is extensively documented; consult the [FoLiA website](https://proycon.github.io/folia), we\nparticularly refer to section 2.10.8 on corrections. Python\nusers may benefit from using our Python FoLiA library, part of [pynlpl](https://github.com/proycon/pynlpl) and\ndocumented [here](http://pynlpl.readthedocs.io/en/latest/folia.html).\n\nThe FoLiA documents may also act as a source for further linguistic enrichment using FoLiA-aware tools such as\n[frog](https://languagemachines.github.io/frog).\n\n#### Notes\n\nSome things to keep in mind:\n* A correction should span one or more words, span should be specified in the right order and may only be consecutive (use multiple corrections in case of gaps).\n* A correction should span the minimum amount of words/tokens, do not include leading/trailing tokens that are not corrected; E.g for *een en ander* \u2192 *\u00e9\u00e9n en ander*, correct only *een* \u2192 *\u00e9\u00e9n*.\n\n## Evaluation\n\nDetection and correction of spelling errors in the test documents are evaluated separately (although correction\nnecessarily implies detection as well), in the following way:\n\n* Matching the detected errors to the marked errors in the gold standard annotation:\n * Precision will be measured as the proportion of correctly detected errors by the corrector in all test documents. (i.e. irregardless of the actual correction)\n * Recall will be measured as the proportion of the marked errors in the gold standard annotations of all test documents that were detected by the corrector.\n * The F-score is the harmonic mean of the Precision and Recall.\n* Matching the proposed corrections of detected errors by the spelling corrector to the corrections made in the gold standard annotations:\n * Precision will be measured as the proportion of correctly corrected errors by the corrector in all test documents.\n * Recall will be measured as the proportion of the marked errors in the gold standard annotations of all test documents that were corrected well by the corrector.\n * The F-score is the harmonic mean of the Precision and Recall.\n\nThe spelling corrector is not requested to predict the class of the spelling error. These correction classes are marked in order to describe the quality of submissions in more detail and will appear in the evaluation output.\n\nTo evaluate, run ``clin28-evaluate --ref reference.json --out youroutput.json``, see the next section.\n\n## Tools and Installation\n\nWe provide the following tools for validation, evaluation and conversion:\n\n * ``clin28-validator`` - Validates whether the JSON is valid. (**Use this on your system output prior to submission!**)\n * ``clin28-evaluate`` - Implements the evaluation as specified above: evaluates a JSON output file against a JSON reference file\n * ``clin28-folia2json`` - Converts FoLiA XML to the JSON format for this task\n\nThe tools are written in Python (3.4 or above!) and available from the Python Package Index, download and install with a\nsimple:\n\n pip3 install clin28tools\n\nFor global installation, provided you have administrative rights, you can prepend ``sudo``, but we recommend using a Python virtual environment instead.\n\nAlternatively, you can install the tools after having cloned this git repository:\n\n python3 setup.py install\n\n## Good Software development\n\nFor this task we would like to encourage practice of good and sustainable academic software development. We therefore\nencourage participants to make their software for this task publicly available as open source. We will judge the\nsoftware for this task using the [CLARIAH Software Quality Survey](http://softwarequality.clariah.nl/) and the highest scoring system (i.e. not necessarily\nthe most performant system!) will be awarded an honourable mention.\n\n## Important dates\n\n* 4 December 2017: test data online\n* 12 January 2017: deadline for submission of source code and output\n* 19 January 2018: feedback to submissions\n* 26 January 2018: presenting the results at the CLIN conference\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/LanguageMachines/CLIN28_ST_spelling_correction", "keywords": "nlp computational_linguistics spelling_correction", "license": "GPL", "maintainer": "", "maintainer_email": "", "name": "clin28tools", "package_url": "https://pypi.org/project/clin28tools/", "platform": "", "project_url": "https://pypi.org/project/clin28tools/", "project_urls": { "Homepage": "https://github.com/LanguageMachines/CLIN28_ST_spelling_correction" }, "release_url": "https://pypi.org/project/clin28tools/0.3.1/", "requires_dist": null, "requires_python": "", "summary": "Scripts for the CLIN28 Shared Task on spelling correction", "version": "0.3.1" }, "last_serial": 3504633, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "48e0a1635fcb3156de9e902cf8e3cb8e", "sha256": "5d35549e117dc1406654bdbdf5a4ebfbb133e27e690ba2ee46a546449f3b88ed" }, "downloads": -1, "filename": "clin28tools-0.1.tar.gz", "has_sig": false, "md5_digest": "48e0a1635fcb3156de9e902cf8e3cb8e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13889, "upload_time": "2017-11-23T15:50:13", "url": "https://files.pythonhosted.org/packages/e2/7d/b81570cbc527ab0bc834db20c506c2c71a5daaaefd54ec388a1ae4ccc879/clin28tools-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "7d4955b0113b0c0325901acbadf20d93", "sha256": "fdb400e31067fa781239a462f3c78d9f8cd543a586c259afc4fda3c4442c1146" }, "downloads": -1, "filename": "clin28tools-0.1.1.tar.gz", "has_sig": false, "md5_digest": "7d4955b0113b0c0325901acbadf20d93", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14207, "upload_time": "2017-11-28T21:54:56", "url": "https://files.pythonhosted.org/packages/ca/0b/08230593fa936eeb8fefb7e28e26086ce783427d13a5ba3b09f27dd56177/clin28tools-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "bcd82e97b2b4848a76b4407c91d544b2", "sha256": "a1a9ada0482e9d8ee2de502e1659bc810d0a363b64f7495f304842e25a60a0e2" }, "downloads": -1, "filename": "clin28tools-0.1.2.tar.gz", "has_sig": false, "md5_digest": "bcd82e97b2b4848a76b4407c91d544b2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14197, "upload_time": "2017-12-14T13:35:46", "url": "https://files.pythonhosted.org/packages/37/e1/f39adaa99188307902629ce10871053e30b868059e5c58d57a5621db9839/clin28tools-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "80c592368bdc1a08a5f4777110321ac3", "sha256": "ab468ba0b3224c8981aba1c7f13cb902f2245492752a8d01bc8ce265f2f6acf6" }, "downloads": -1, "filename": "clin28tools-0.1.3.tar.gz", "has_sig": false, "md5_digest": "80c592368bdc1a08a5f4777110321ac3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14195, "upload_time": "2017-12-21T11:07:45", "url": "https://files.pythonhosted.org/packages/16/23/a663a395dd35d4e45d662d723694baad18fb3722404da7786d4358828033/clin28tools-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "609e28e1d3a74168bd0600e9e670539f", "sha256": "63778711402afa738f6737e56137cfafedf0a6a90f64d2570eb030a5d2259bfb" }, "downloads": -1, "filename": "clin28tools-0.1.4.tar.gz", "has_sig": false, "md5_digest": "609e28e1d3a74168bd0600e9e670539f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14224, "upload_time": "2018-01-15T14:53:04", "url": "https://files.pythonhosted.org/packages/5d/5f/107b664eec30e8160cfd59e1e3523fa273e04b1ad5036fe65fdd9b9bcdd8/clin28tools-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "c5299bcf763a768ebfa68f3223ae4a34", "sha256": "6d0f95355246577cac36fb85889d602c9206bfde37d3ded3b196a0675c271a93" }, "downloads": -1, "filename": "clin28tools-0.1.5.tar.gz", "has_sig": false, "md5_digest": "c5299bcf763a768ebfa68f3223ae4a34", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14221, "upload_time": "2018-01-15T15:04:20", "url": "https://files.pythonhosted.org/packages/6c/0c/ba0b19a9f704b1b60f0cac44f627d5f73148ec0b7c0b3a390d50e8b383fe/clin28tools-0.1.5.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "9271c94e656f9a4c94127c3b2582e66f", "sha256": "f39dcffb00bec1efc9eb812e6e4ba81f9664dd3c4c1ffa04357c356a60e7efe2" }, "downloads": -1, "filename": "clin28tools-0.2.tar.gz", "has_sig": false, "md5_digest": "9271c94e656f9a4c94127c3b2582e66f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15003, "upload_time": "2018-01-18T09:11:47", "url": "https://files.pythonhosted.org/packages/96/b0/8d4be1d70298eefdd9197ee328004e5d7d01af042c41dfded7d1e5174ca3/clin28tools-0.2.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "0e9cea203b627787d0e9e92d8b12a6d9", "sha256": "94ac2714f097e290ae3ed724e1da9f3c048ce3caf6188643aec0333f8dd8177c" }, "downloads": -1, "filename": "clin28tools-0.3.0.tar.gz", "has_sig": false, "md5_digest": "0e9cea203b627787d0e9e92d8b12a6d9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14957, "upload_time": "2018-01-19T13:32:21", "url": "https://files.pythonhosted.org/packages/6b/8f/f0d6baf2d55d43a45dc53d7d1daa45df6b87ecbf88483054c1b0d6c2e79a/clin28tools-0.3.0.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "1c89b4f149666bb5c8da7a9e7e67d938", "sha256": "de97acb6425b6f9d2ef0347bcf987c5ee1c0c7ac7f14ed51186069c1f521b35a" }, "downloads": -1, "filename": "clin28tools-0.3.1.tar.gz", "has_sig": false, "md5_digest": "1c89b4f149666bb5c8da7a9e7e67d938", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15194, "upload_time": "2018-01-19T14:22:26", "url": "https://files.pythonhosted.org/packages/78/c4/83cd9a56742ec40c3c79f7fb5a4c4bd98fe6775799a60d9e60856a8f94b7/clin28tools-0.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1c89b4f149666bb5c8da7a9e7e67d938", "sha256": "de97acb6425b6f9d2ef0347bcf987c5ee1c0c7ac7f14ed51186069c1f521b35a" }, "downloads": -1, "filename": "clin28tools-0.3.1.tar.gz", "has_sig": false, "md5_digest": "1c89b4f149666bb5c8da7a9e7e67d938", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15194, "upload_time": "2018-01-19T14:22:26", "url": "https://files.pythonhosted.org/packages/78/c4/83cd9a56742ec40c3c79f7fb5a4c4bd98fe6775799a60d9e60856a8f94b7/clin28tools-0.3.1.tar.gz" } ] }