{ "info": { "author": "Mika H\u00e4m\u00e4l\u00e4inen, Dept. of Digital Humanities, University of Helsinki", "author_email": "mika.hamalainen@helsinki.fi", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Programming Language :: Python :: 3", "Topic :: Text Processing :: Linguistic" ], "description": "# NATAS\n\nThis library provides methods for processing historical English corpora, especially for studying neologisms. The first functionalities to be released relate to normalization of historical spelling and OCR post-correction. This library is maintained by [Mika H\u00e4m\u00e4l\u00e4inen](https://mikakalevi.com).\n\n**NOTE: The normalization methods depend on Spacy, which takes some time to load. If you want to speed this up, you can change the Spacy model in use.**\n\n## Installation\n\nNote: It is highly recommended to use a virtual environment because of the strict version requirements for dependencies. The library has been tested with Python 3.6.\n\n    pip3 --no-cache-dir install pip==18.1\n    pip3 install natas --process-dependency-links\n    python3 -m natas.download\n    python3 -m spacy download en_core_web_md\n\n## Historical normalization\n\nGiven a list of non-modern spelling variants, the tool can produce an ordered list of candidate normalizations. The candidates are ranked by the prediction score of the NMT model.\n\n    import natas\n    natas.normalize_words([\"seacreat\", \"wi\u00fee\"])\n    >> [['secret', 'secrete'], ['with', 'withe', 'wide', 'white', 'way']]\n\nPossible keyword arguments and their defaults are *n_best=10*, *dictionary=None*, *all_candidates=True* and *correct_spelling_cache=True*.
\n- *n_best* sets the number of candidates the NMT will output\n- *dictionary* sets a custom dictionary to be used to filter the NMT output (see more in the next section)\n- *all_candidates*, if False, the method will return only the topmost normalization candidate (this improves the speed of the method)\n- *correct_spelling_cache*, used only when checking whether a candidate word is correctly spelled. Set this to False if you are testing with multiple *dictionaries*.\n\n## OCR post correction\n\nYou can use our pretrained model for OCR post correction by doing the following\n\n    import natas\n    natas.ocr_correct_words([\"paft\", \"friendlhip\"])\n    >> [['past', 'pall', 'part', 'part'], ['friendship']]\n\nThis will return a list of possible correction candidates in the order of probability according to the NMT model. The same parameters can be used as for historical text normalization.\n\n### Training your own OCR error correction model\n\nYou can extract the parallel data for the OCR model if you have access to a word embeddings model trained on your OCR data, a list of known correctly spelled words and a vocabulary of the language.\n\n    from natas import ocr_builder\n    from natas.normalize import wiktionary\n    from gensim.models import Word2Vec\n\n    model = Word2Vec.load(\"/path/to/your_model.w2v\")\n    seed_words = set([\"logic\", \"logical\"]) # list of correctly spelled words you want to find matching OCR errors for\n    dictionary = wiktionary # lemmas of the English Wiktionary; change this if working with any other language\n    lemmatize = True # uses Spacy with the English model; use natas.set_spacy(nlp) for other models and languages\n\n    results = ocr_builder.extract_parallel(seed_words, model, dictionary=dictionary, lemmatize=lemmatize)\n    >> {\"logic\": {\n    \"fyle\": 5,\n    \"ityle\": 5,\n    \"lofophy\": 5,\n    \"logick\": 1\n    },\n    \"logical\": {\n    \"lofophy\": 5,\n    \"matical\": 3,\n    \"phical\": 3,\n    \"praaical\": 4,\n    \"pracical\": 4,\n    \"pratical\": 4\n
    }}\n\nThe code results in a dictionary of correctly spelled English words (from *seed_words*) mapped to semantically similar non-correctly spelled words (not in *dictionary*). Each non-correct word has a [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) calculated with the correctly spelled word. In our paper, we used 3 as the maximum edit distance.\n\nUse the dictionary to make parallel data files for OpenNMT on a character level. This means splitting the words into letters, such as *l o g i c k* -> *l o g i c*. See [their documentation on how to train the model](https://github.com/OpenNMT/OpenNMT-py).\n\n## Check if a word is correctly spelled\n\nYou can easily check whether a word is correctly spelled\n\n    import natas\n    natas.is_correctly_spelled(\"cat\")\n    natas.is_correctly_spelled(\"ca7\")\n    >> True\n    >> False\n\nThis will compare the word with Wiktionary lemmas with and without Spacy lemmatization. The normalization method depends on this step. By default, *natas* uses Spacy's *en_core_web_md* model. To change this model, do the following\n\n    import natas, spacy\n    nlp = spacy.load('en')\n    natas.set_spacy(nlp)\n\nIf you want to replace the Wiktionary dictionary with another one, it can be passed as a keyword argument. Use a *set* instead of a *list* for faster look-up. Notice that the models operate on lowercased words.\n\n    import natas\n    my_dictionary = set([\"hat\", \"rat\"])\n    natas.is_correctly_spelled(\"cat\", dictionary=my_dictionary)\n    natas.normalize_words([\"ratte\"], dictionary=my_dictionary)\n\n\nBy default, caching is enabled. 
If you want to use the method with multiple different parameters, you will need to set *cache=False*.\n\n    import natas\n    natas.is_correctly_spelled(\"cat\") # The word is looked up and the result cached\n    natas.is_correctly_spelled(\"cat\") # The result will be served from the cache\n    natas.is_correctly_spelled(\"cat\", cache=False) # The word will be looked up again\n\n# Cite\n\nIf you use the library, please cite one of the following publications depending on whether you used it for normalization or OCR correction.\n\n## Normalization\n\nMika H\u00e4m\u00e4l\u00e4inen, Tanja S\u00e4ily, Jack Rueter, J\u00f6rg Tiedemann, and Eetu M\u00e4kel\u00e4. 2019. [Revisiting NMT for Normalization of Early English Letters](https://www.aclweb.org/anthology/papers/W/W19/W19-2509/). In *Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature*.\n\n## OCR correction\n\nMika H\u00e4m\u00e4l\u00e4inen and Simon Hengchen. 2019. [From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction](https://helda.helsinki.fi//bitstream/handle/10138/305149/SN_Mika_Simon_5_.pdf?sequence=1). 
In *Proceedings of Recent Advances in Natural Language Processing*.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/mikahama/natas", "keywords": "historical English,spelling normalization,OCR error correction", "license": "Apache License, Version 2.0", "maintainer": "", "maintainer_email": "", "name": "natas", "package_url": "https://pypi.org/project/natas/", "platform": "", "project_url": "https://pypi.org/project/natas/", "project_urls": { "Bug Reports": "https://github.com/mikahama/natas/issues", "Developer": "https://mikakalevi.com/", "Homepage": "https://github.com/mikahama/natas" }, "release_url": "https://pypi.org/project/natas/1.0.4/", "requires_dist": [ "configargparse", "distance", "torch (==1.0.0)", "torchtext (>=0.4.0@https://mikakalevi.com/downloads/text-master.zip#egg=torchtext-0.4.0)", "spacy (>=2.1.4)", "mikatools (>=0.0.6)", "OpenNMT-py (>=0.8.2@https://github.com/OpenNMT/OpenNMT-py/archive/0.8.2.zip#egg=OpenNMT-py-0.8.2)" ], "requires_python": "", "summary": "Python library for processing historical English", "version": "1.0.4" }, "last_serial": 5862398, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "13b06dca35daa6d5eacc71b00b768b01", "sha256": "3650df79637d46a1dff6922f070f616ed6ff4d3feeb06e2a21fa1d97aed3360f" }, "downloads": -1, "filename": "natas-1.0.0.tar.gz", "has_sig": false, "md5_digest": "13b06dca35daa6d5eacc71b00b768b01", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 35282533, "upload_time": "2019-05-24T14:10:34", "url": "https://files.pythonhosted.org/packages/f9/bb/a984ae27ba1875c90d5f9ae3e2f01a7d4bf1a674c1a54ce9683db03ca871/natas-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "1558164380416d1488afcbab0360abfd", "sha256": "2301f823c18db45c6b1de4eb5bf8b5e63b770736a26d800dc7b2eae3f6584fd9" }, "downloads": -1, 
"filename": "natas-1.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1558164380416d1488afcbab0360abfd", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174668, "upload_time": "2019-07-24T10:39:13", "url": "https://files.pythonhosted.org/packages/08/47/c0abc852d3c244fdeadaca81bec1ceba4eef26d398200a6357949f781958/natas-1.0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2414a76451dba41831a032b6fa016cde", "sha256": "121049c59abef0a7321c0e57c9da76bed0f1f5d213c09d567167d44fc7c23ca9" }, "downloads": -1, "filename": "natas-1.0.1.tar.gz", "has_sig": false, "md5_digest": "2414a76451dba41831a032b6fa016cde", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157867, "upload_time": "2019-07-24T10:39:31", "url": "https://files.pythonhosted.org/packages/28/16/bcc0d3594ff4b8bcd425ae75973282d5c25deb9e23f3f11b1e9fcf7a103d/natas-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "2a8721201bf279c01f16cd50cc44fef3", "sha256": "21851023edc10b67d648b14b701210b55eac5e40f94056413f16c7967d61fcd6" }, "downloads": -1, "filename": "natas-1.0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "2a8721201bf279c01f16cd50cc44fef3", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174759, "upload_time": "2019-09-20T08:18:25", "url": "https://files.pythonhosted.org/packages/e9/f1/6a90f6077b3bac3e958de8e6c90227d079b3e1192e11e9d66172ade4eefe/natas-1.0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2d2c9c0f120f08a15c27e7a82f2c69df", "sha256": "3638d911b4aaa74c33743c951c8a7abcd1f220efb39e2abb5ddf9aa476387dc2" }, "downloads": -1, "filename": "natas-1.0.2.tar.gz", "has_sig": false, "md5_digest": "2d2c9c0f120f08a15c27e7a82f2c69df", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157986, "upload_time": "2019-09-20T08:18:39", "url": 
"https://files.pythonhosted.org/packages/02/73/6b916a7478ae2c24f74f02a2de036b20f6e9c4cd3469c636b107c4a263bd/natas-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "64febe6ea0acfcff674f94834ae326fe", "sha256": "1bce0fba55bc90fc193b2318dac922f88cb9ad686c0abb741dbd08a37a6eb3b1" }, "downloads": -1, "filename": "natas-1.0.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "64febe6ea0acfcff674f94834ae326fe", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174769, "upload_time": "2019-09-20T13:35:23", "url": "https://files.pythonhosted.org/packages/39/aa/d7fa92d082232ac617c50655376faadfe66dec367c79ddfa14186b5e5520/natas-1.0.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "38b42d859ca7109916bf38090a820d79", "sha256": "9da167e81791dd1ecfb8ef2962f6b00f1c62c36d80cd6780990ba31f033bb6b8" }, "downloads": -1, "filename": "natas-1.0.3.tar.gz", "has_sig": false, "md5_digest": "38b42d859ca7109916bf38090a820d79", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2158018, "upload_time": "2019-09-20T13:35:33", "url": "https://files.pythonhosted.org/packages/75/70/0d50900a3393802b4e1ced563bd94b54717a73fcc4b2b3dfe827f6add0bb/natas-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "b187dbdc6df53ca735ccd464e1852645", "sha256": "d7eb2c805fffab453b991e51739b8730b9742bbf73077ed6bea70c66001f8bee" }, "downloads": -1, "filename": "natas-1.0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b187dbdc6df53ca735ccd464e1852645", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174635, "upload_time": "2019-09-20T13:47:33", "url": "https://files.pythonhosted.org/packages/9b/a3/67f9760778eea69aa6498c24c588c4d98099575c372ceee95d881437df2f/natas-1.0.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6ebdfae1efb82f2a7c375172e2319428", "sha256": 
"2eaf7f704d162ff1418ee1b43b05e3a2c4fb95da3cf02104964a2594843b5c6f" }, "downloads": -1, "filename": "natas-1.0.4.tar.gz", "has_sig": false, "md5_digest": "6ebdfae1efb82f2a7c375172e2319428", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157897, "upload_time": "2019-09-20T13:47:45", "url": "https://files.pythonhosted.org/packages/89/b6/e6c7673de5f5074c7e772cf43e12bedd0e25b4baa684ab215c167caef4c3/natas-1.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b187dbdc6df53ca735ccd464e1852645", "sha256": "d7eb2c805fffab453b991e51739b8730b9742bbf73077ed6bea70c66001f8bee" }, "downloads": -1, "filename": "natas-1.0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b187dbdc6df53ca735ccd464e1852645", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174635, "upload_time": "2019-09-20T13:47:33", "url": "https://files.pythonhosted.org/packages/9b/a3/67f9760778eea69aa6498c24c588c4d98099575c372ceee95d881437df2f/natas-1.0.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6ebdfae1efb82f2a7c375172e2319428", "sha256": "2eaf7f704d162ff1418ee1b43b05e3a2c4fb95da3cf02104964a2594843b5c6f" }, "downloads": -1, "filename": "natas-1.0.4.tar.gz", "has_sig": false, "md5_digest": "6ebdfae1efb82f2a7c375172e2319428", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157897, "upload_time": "2019-09-20T13:47:45", "url": "https://files.pythonhosted.org/packages/89/b6/e6c7673de5f5074c7e772cf43e12bedd0e25b4baa684ab215c167caef4c3/natas-1.0.4.tar.gz" } ] }