{ "info": { "author": "Mika H\u00e4m\u00e4l\u00e4inen, Dept. of Digital Humanities, University of Helsinki", "author_email": "mika.hamalainen@helsinki.fi", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Programming Language :: Python :: 3", "Topic :: Text Processing :: Linguistic" ], "description": "# NATAS\n\nThis library provides methods for processing historical English corpora, especially for studying neologisms. The first functionalities to be released relate to normalization of historical spelling and OCR post-correction. This library is maintained by [Mika H\u00e4m\u00e4l\u00e4inen](https://mikakalevi.com).\n\n**NOTE: The normalization methods depend on Spacy, which takes some time to load. If you want to speed this up, you can change the Spacy model in use.**\n\n## Installation\n\nNote: It is highly recommended to use a virtual environment because of the strict version requirements for dependencies. The library has been tested with Python 3.6.\n\n    pip3 --no-cache-dir install pip==18.1\n    pip3 install natas --process-dependency-links\n    python3 -m natas.download\n    python3 -m spacy download en_core_web_md\n\n## Historical normalization\n\nGiven a list of non-modern spelling variants, the tool can produce an ordered list of candidate normalizations. The candidates are ranked by the prediction score of the NMT model.\n\n    import natas\n    natas.normalize_words([\"seacreat\", \"wi\u00fee\"])\n    >> [['secret', 'secrete'], ['with', 'withe', 'wide', 'white', 'way']]\n\nPossible keyword arguments and their defaults are *n_best=10*, *dictionary=None*, *all_candidates=True* and *correct_spelling_cache=True*.
\n- *n_best* sets the number of candidates the NMT will output\n- *dictionary* sets a custom dictionary to be used to filter the NMT output (see more in the next section)\n- *all_candidates*, if False, the method will return only the topmost normalization candidate (this improves the speed of the method)\n- *correct_spelling_cache*, used only when checking whether a candidate word is correctly spelled. Set this to False if you are testing with multiple *dictionaries*.\n\n## OCR post correction\n\nYou can use our pretrained model for OCR post correction by doing the following\n\n    import natas\n    natas.ocr_correct_words([\"paft\", \"friendlhip\"])\n    >> [['past', 'pall', 'part', 'part'], ['friendship']]\n\nThis will return a list of possible correction candidates in the order of probability according to the NMT model. The same parameters can be used as for historical text normalization.\n\n### Training your own OCR error correction model\n\nYou can extract the parallel data for the OCR model if you have access to a word embeddings model trained on your OCR data, a list of known correctly spelled words and a vocabulary of the language.\n\n    from natas import ocr_builder\n    from natas.normalize import wiktionary\n    from gensim.models import Word2Vec\n\n    model = Word2Vec.load(\"/path/to/your_model.w2v\")\n    seed_words = set([\"logic\", \"logical\"]) # list of correctly spelled words you want to find matching OCR errors for\n    dictionary = wiktionary # lemmas of the English Wiktionary; change this if working with any other language\n    lemmatize = True # uses Spacy with the English model; use natas.set_spacy(nlp) for other models and languages\n\n    results = ocr_builder.extract_parallel(seed_words, model, dictionary=dictionary, lemmatize=lemmatize)\n    >> {\"logic\": {\n    \"fyle\": 5,\n    \"ityle\": 5,\n    \"lofophy\": 5,\n    \"logick\": 1\n    },\n    \"logical\": {\n    \"lofophy\": 5,\n    \"matical\": 3,\n    \"phical\": 3,\n    \"praaical\": 4,\n    \"pracical\": 4,\n    \"pratical\": 4\n
    }}\n\nThe code results in a dictionary of correctly spelled English words (from *seed_words*) mapped to semantically similar non-correctly spelled words (not in *dictionary*). Each non-correct word has a [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) calculated with the correctly spelled word. In our paper, we used 3 as the maximum edit distance.\n\nUse the dictionary to make parallel data files for OpenNMT on a character level. This means splitting the words into letters, such as *l o g i c k* -> *l o g i c*. See [their documentation on how to train the model](https://github.com/OpenNMT/OpenNMT-py).\n\n## Check if a word is correctly spelled\n\nYou can easily check whether a word is correctly spelled\n\n    import natas\n    natas.is_correctly_spelled(\"cat\")\n    natas.is_correctly_spelled(\"ca7\")\n    >> True\n    >> False\n\nThis will compare the word with Wiktionary lemmas with and without Spacy lemmatization. The normalization method depends on this step. By default, *natas* uses Spacy's *en_core_web_md* model. To change this model, do the following\n\n    import natas, spacy\n    nlp = spacy.load('en')\n    natas.set_spacy(nlp)\n\nIf you want to replace the Wiktionary dictionary with another one, it can be passed as a keyword argument. Use a *set* instead of a *list* for faster look-up. Notice that the models operate on lowercased words.\n\n    import natas\n    my_dictionary = set([\"hat\", \"rat\"])\n    natas.is_correctly_spelled(\"cat\", dictionary=my_dictionary)\n    natas.normalize_words([\"ratte\"], dictionary=my_dictionary)\n\n\nBy default, caching is enabled. 
If you want to use the method with multiple different parameters, you will need to set *cache=False*.\n\n    import natas\n    natas.is_correctly_spelled(\"cat\") # The word is looked up and the result cached\n    natas.is_correctly_spelled(\"cat\") # The result will be served from the cache\n    natas.is_correctly_spelled(\"cat\", cache=False) # The word will be looked up again\n\n# Cite\n\nIf you use the library, please cite one of the following publications depending on whether you used it for normalization or OCR correction.\n\n## Normalization\n\nMika H\u00e4m\u00e4l\u00e4inen, Tanja S\u00e4ily, Jack Rueter, J\u00f6rg Tiedemann, and Eetu M\u00e4kel\u00e4. 2019. [Revisiting NMT for Normalization of Early English Letters](https://www.aclweb.org/anthology/papers/W/W19/W19-2509/). In *Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature*.\n\n## OCR correction\n\nMika H\u00e4m\u00e4l\u00e4inen and Simon Hengchen. 2019. [From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction](https://helda.helsinki.fi//bitstream/handle/10138/305149/SN_Mika_Simon_5_.pdf?sequence=1). 
In *Proceedings of Recent Advances in Natural Language Processing*.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/mikahama/natas", "keywords": "historical English,spelling normalization,OCR error correction", "license": "Apache License, Version 2.0", "maintainer": "", "maintainer_email": "", "name": "natas", "package_url": "https://pypi.org/project/natas/", "platform": "", "project_url": "https://pypi.org/project/natas/", "project_urls": { "Bug Reports": "https://github.com/mikahama/natas/issues", "Developer": "https://mikakalevi.com/", "Homepage": "https://github.com/mikahama/natas" }, "release_url": "https://pypi.org/project/natas/1.0.4/", "requires_dist": [ "configargparse", "distance", "torch (==1.0.0)", "torchtext (>=0.4.0@https://mikakalevi.com/downloads/text-master.zip#egg=torchtext-0.4.0)", "spacy (>=2.1.4)", "mikatools (>=0.0.6)", "OpenNMT-py (>=0.8.2@https://github.com/OpenNMT/OpenNMT-py/archive/0.8.2.zip#egg=OpenNMT-py-0.8.2)" ], "requires_python": "", "summary": "Python library for processing historical English", "version": "1.0.4" }, "last_serial": 5862398, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "13b06dca35daa6d5eacc71b00b768b01", "sha256": "3650df79637d46a1dff6922f070f616ed6ff4d3feeb06e2a21fa1d97aed3360f" }, "downloads": -1, "filename": "natas-1.0.0.tar.gz", "has_sig": false, "md5_digest": "13b06dca35daa6d5eacc71b00b768b01", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 35282533, "upload_time": "2019-05-24T14:10:34", "url": "https://files.pythonhosted.org/packages/f9/bb/a984ae27ba1875c90d5f9ae3e2f01a7d4bf1a674c1a54ce9683db03ca871/natas-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "1558164380416d1488afcbab0360abfd", "sha256": "2301f823c18db45c6b1de4eb5bf8b5e63b770736a26d800dc7b2eae3f6584fd9" }, "downloads": -1, 
"filename": "natas-1.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "1558164380416d1488afcbab0360abfd", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174668, "upload_time": "2019-07-24T10:39:13", "url": "https://files.pythonhosted.org/packages/08/47/c0abc852d3c244fdeadaca81bec1ceba4eef26d398200a6357949f781958/natas-1.0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2414a76451dba41831a032b6fa016cde", "sha256": "121049c59abef0a7321c0e57c9da76bed0f1f5d213c09d567167d44fc7c23ca9" }, "downloads": -1, "filename": "natas-1.0.1.tar.gz", "has_sig": false, "md5_digest": "2414a76451dba41831a032b6fa016cde", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157867, "upload_time": "2019-07-24T10:39:31", "url": "https://files.pythonhosted.org/packages/28/16/bcc0d3594ff4b8bcd425ae75973282d5c25deb9e23f3f11b1e9fcf7a103d/natas-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "2a8721201bf279c01f16cd50cc44fef3", "sha256": "21851023edc10b67d648b14b701210b55eac5e40f94056413f16c7967d61fcd6" }, "downloads": -1, "filename": "natas-1.0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "2a8721201bf279c01f16cd50cc44fef3", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174759, "upload_time": "2019-09-20T08:18:25", "url": "https://files.pythonhosted.org/packages/e9/f1/6a90f6077b3bac3e958de8e6c90227d079b3e1192e11e9d66172ade4eefe/natas-1.0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2d2c9c0f120f08a15c27e7a82f2c69df", "sha256": "3638d911b4aaa74c33743c951c8a7abcd1f220efb39e2abb5ddf9aa476387dc2" }, "downloads": -1, "filename": "natas-1.0.2.tar.gz", "has_sig": false, "md5_digest": "2d2c9c0f120f08a15c27e7a82f2c69df", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157986, "upload_time": "2019-09-20T08:18:39", "url": 
"https://files.pythonhosted.org/packages/02/73/6b916a7478ae2c24f74f02a2de036b20f6e9c4cd3469c636b107c4a263bd/natas-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "64febe6ea0acfcff674f94834ae326fe", "sha256": "1bce0fba55bc90fc193b2318dac922f88cb9ad686c0abb741dbd08a37a6eb3b1" }, "downloads": -1, "filename": "natas-1.0.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "64febe6ea0acfcff674f94834ae326fe", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174769, "upload_time": "2019-09-20T13:35:23", "url": "https://files.pythonhosted.org/packages/39/aa/d7fa92d082232ac617c50655376faadfe66dec367c79ddfa14186b5e5520/natas-1.0.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "38b42d859ca7109916bf38090a820d79", "sha256": "9da167e81791dd1ecfb8ef2962f6b00f1c62c36d80cd6780990ba31f033bb6b8" }, "downloads": -1, "filename": "natas-1.0.3.tar.gz", "has_sig": false, "md5_digest": "38b42d859ca7109916bf38090a820d79", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2158018, "upload_time": "2019-09-20T13:35:33", "url": "https://files.pythonhosted.org/packages/75/70/0d50900a3393802b4e1ced563bd94b54717a73fcc4b2b3dfe827f6add0bb/natas-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "b187dbdc6df53ca735ccd464e1852645", "sha256": "d7eb2c805fffab453b991e51739b8730b9742bbf73077ed6bea70c66001f8bee" }, "downloads": -1, "filename": "natas-1.0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b187dbdc6df53ca735ccd464e1852645", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174635, "upload_time": "2019-09-20T13:47:33", "url": "https://files.pythonhosted.org/packages/9b/a3/67f9760778eea69aa6498c24c588c4d98099575c372ceee95d881437df2f/natas-1.0.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6ebdfae1efb82f2a7c375172e2319428", "sha256": 
"2eaf7f704d162ff1418ee1b43b05e3a2c4fb95da3cf02104964a2594843b5c6f" }, "downloads": -1, "filename": "natas-1.0.4.tar.gz", "has_sig": false, "md5_digest": "6ebdfae1efb82f2a7c375172e2319428", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157897, "upload_time": "2019-09-20T13:47:45", "url": "https://files.pythonhosted.org/packages/89/b6/e6c7673de5f5074c7e772cf43e12bedd0e25b4baa684ab215c167caef4c3/natas-1.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b187dbdc6df53ca735ccd464e1852645", "sha256": "d7eb2c805fffab453b991e51739b8730b9742bbf73077ed6bea70c66001f8bee" }, "downloads": -1, "filename": "natas-1.0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b187dbdc6df53ca735ccd464e1852645", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 2174635, "upload_time": "2019-09-20T13:47:33", "url": "https://files.pythonhosted.org/packages/9b/a3/67f9760778eea69aa6498c24c588c4d98099575c372ceee95d881437df2f/natas-1.0.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6ebdfae1efb82f2a7c375172e2319428", "sha256": "2eaf7f704d162ff1418ee1b43b05e3a2c4fb95da3cf02104964a2594843b5c6f" }, "downloads": -1, "filename": "natas-1.0.4.tar.gz", "has_sig": false, "md5_digest": "6ebdfae1efb82f2a7c375172e2319428", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2157897, "upload_time": "2019-09-20T13:47:45", "url": "https://files.pythonhosted.org/packages/89/b6/e6c7673de5f5074c7e772cf43e12bedd0e25b4baa684ab215c167caef4c3/natas-1.0.4.tar.gz" } ] }