{ "info": { "author": "Antoine Amarilli", "author_email": "a3nm@a3nm.net", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python :: 3" ], "description": "haspirater -- a toolkit to detect initial aspirated 'h' in French words\nCopyright (C) 2011-2019 by Antoine Amarilli\nRepository URL: https://gitlab.com/a3nm/haspirater\n\n== 0. License (MIT license) ==\n\nPermission is hereby granted, free of charge, to any person obtaining a\ncopy of this software and associated documentation files (the\n\"Software\"), to deal in the Software without restriction, including\nwithout limitation the rights to use, copy, modify, merge, publish,\ndistribute, sublicense, and/or sell copies of the Software, and to\npermit persons to whom the Software is furnished to do so, subject to\nthe following conditions:\n\nThe above copyright notice and this permission notice shall be included\nin all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS\nOR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF\nMERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.\nIN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY\nCLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,\nTORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE\nSOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n\n== 1. Features ==\n\nhaspirater is a tool to detect if a French word starts with an aspirated\n'h' or not. It is not based on a list of words but on a trie trained\nfrom a corpus, which ensures that it should do a reasonable job for\nunseen words which are similar to known ones, without carrying a big\nexceptions list. The json trie used is less than 10 Kio, and the lookup\nscript is 40 lines of Python.\n\n== 2. Usage ==\n\nIf you just want to use the included training data, you can either run\nhaspirater.py, giving one word per line in stdin and getting the\nannotation on stdout, or you can import it in a Python file and call\nhaspirater.lookup(word) which returns True if the leading 'h' is\naspirated, False if it isn't, and raises ValueError if there is no\nleading 'h'.\n\nPlease report any errors in the training data, keeping in mind than only\none possibility is returned even when both are attested.\n\n== 3. Training ==\n\nThe training data used by haspirater.py is loaded at runtime from the\nhaspirater.json file which has been trained from French texts taken from\nProject Gutenberg , from the list in the Wikipedia\narticle , from the\ncategories in the French Wiktionary\n and\n and from a custom\nset of exceptions. If you want to create your own data, or adapt the\napproach here to other linguistic features, read on.\n\nThe master script is make.sh which accepts French text on stdin and a\nlist of exceptions files as arguments. Included exception files are\nadditions and wikipedia. These exceptions are just like training data\nand are not stored as-is; they are just piped later on in the training\nphase. make.sh produces on stdout the json trie. Thus, you would run\nsomething like:\n\n $ cat corpus | ./make.sh exceptions > haspirater/haspirater.json\n\n== 4. Training details ==\n\n=== 4.1. Corpus preparation (prepare.sh) ===\n\nThis script removes useless characters, and separates words (one per\nline).\n\n=== 4.2. Property inference (detect.pl) ===\n\nThis script examines the output, notices occurrences of words for which\nthe preceding word indicates the aspirated or non-aspirated status, and\noutputs them.\n\nThe format used for the output of this phase is the same as that of the\nexceptions file.\n\n=== 4.3. Removing leading 'h' ===\n\nThis is a quick optimization.\n\n=== 4.4. Trie construction (buildtrie.py) ===\n\nThe occurrences are read one after the other and are used to populate a\ntrie carrying the value count for each occurrence having a given prefix.\n\n=== 4.5. Trie compression (compresstrie.py) ===\n\nThe trie is then compressed by removing branches which are not needed to\ninfer a value. This step could be followed by a removal of branches with\nvery little dissent from the majority value if we wanted to reduce the\ntrie size at the expense of accuracy: for aspirated h, this isn't\nneeded.\n\n=== 4.6. Trie majority relabeling (majoritytrie.py) ===\n\nInstead of the list of values with their counts, nodes are relabeled to\ncarry the most common value. This step could be skipped to keep\nconfidence values. We also drop useless leaf nodes there.\n\n== 5. Additional stuff ==\n\nYou can use trie2dot.py to convert the output of buildtrie.py or\ncompresstrie.py in the dot format which can be used to render a drawing\nof the trie (\"trie2dot.py h 0 1\"). The result of such a drawing is given\nas haspirater.pdf (before majoritytrie.py: contains frequency info, but\nmore nodes) and haspirater_majority.pdf (no frequency, less nodes).\n\nYou can use leavestrie.py to get the leaves of a trie.\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://gitlab.com/a3nm/haspirater", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "haspirater", "package_url": "https://pypi.org/project/haspirater/", "platform": "", "project_url": "https://pypi.org/project/haspirater/", "project_urls": { "Homepage": "https://gitlab.com/a3nm/haspirater" }, "release_url": "https://pypi.org/project/haspirater/0.2/", "requires_dist": null, "requires_python": "", "summary": "detect aspirated 'h' in French words", "version": "0.2" }, "last_serial": 5683060, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "999931827d1e65e07c5168ee74221bf1", "sha256": "6984dab9405693c0134cee0b47aa2efa4c8d4bb2b456c63ee9fbdef77129d322" }, "downloads": -1, "filename": "haspirater-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "999931827d1e65e07c5168ee74221bf1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 4173, "upload_time": "2019-08-15T16:15:12", "url": "https://files.pythonhosted.org/packages/e4/ad/2f72b933d4a258b5b3f19bde25a9a47e1b61084a30bed8bb193de9b89c71/haspirater-0.1-py3-none-any.whl" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "20655c1a417a458040fca73812991402", "sha256": "52cd04d5ec74acea24c1490836007e7e1e0ba145e130208df308c4483c1fd045" }, "downloads": -1, "filename": "haspirater-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "20655c1a417a458040fca73812991402", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 10211, "upload_time": "2019-08-15T16:29:59", "url": "https://files.pythonhosted.org/packages/4f/18/f451ccd267b3d0a97bd46e8854b3226164219b24afb6c669d89626df5653/haspirater-0.2-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "20655c1a417a458040fca73812991402", "sha256": "52cd04d5ec74acea24c1490836007e7e1e0ba145e130208df308c4483c1fd045" }, "downloads": -1, "filename": "haspirater-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "20655c1a417a458040fca73812991402", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 10211, "upload_time": "2019-08-15T16:29:59", "url": "https://files.pythonhosted.org/packages/4f/18/f451ccd267b3d0a97bd46e8854b3226164219b24afb6c669d89626df5653/haspirater-0.2-py3-none-any.whl" } ] }