{ "info": { "author": "Ayoub RMIDI", "author_email": "ayoub.rmidi@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "# French word segmentation\n\nWhen extracting text with open-source OCR tools such as Tesseract, we often encounter concatenated words due to imperfect OCR extraction quality.\n\nFor example:\ninstead of extracting **\"Tr\u00e8s bon service\"**, one might suddenly get **\"Tr\u00e8s bonservice\"**. So when doing feature engineering with BOW, TF-IDF, or even word2vec models, the algorithm will treat **\"bonservice\"** as a single feature, while it is not.\n\nTo deal with this problem, I built a module that performs semantic word segmentation without any predefined corpus.\n\n## Installation\n\nUse the package manager [pip](https://pypi.org/project/fr-word-segment/) to install fr_word_segment.\n\n```bash\npip3 install fr-word-segment\npython3 -m spacy download fr\n```\n\n## Usage\n\n```python\nfrom fr_word_segment import wordseg\n# suppose a French spellchecker has detected this token as misspelled\ntoken = \"soitmoinscompliqu\u00e9\"\n\n# apply the segmentation function to the given token\nresult = wordseg.segment_token(token)\n\n# show results\nprint(\"raw token is {}\".format(token)) # \"soitmoinscompliqu\u00e9\"\nprint(\"processed token is {}\".format(result)) # \"soit moins compliqu\u00e9\"\n```\n\n## Contributing\nPull requests are welcome. 
For major changes, please open an issue first to discuss what you would like to change.\n\n\n## License\n[MIT](https://choosealicense.com/licenses/mit/)\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/PaacMaan/semantic_word_segmentation.git", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "fr-word-segment", "package_url": "https://pypi.org/project/fr-word-segment/", "platform": "", "project_url": "https://pypi.org/project/fr-word-segment/", "project_urls": { "Homepage": "https://github.com/PaacMaan/semantic_word_segmentation.git" }, "release_url": "https://pypi.org/project/fr-word-segment/0.1.3/", "requires_dist": [ "spacy" ], "requires_python": ">3.5.2", "summary": "A package that splits misspelled words semantically", "version": "0.1.3" }, "last_serial": 5619142, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "d15b7e08afd58ba7eef99988cec35af4", "sha256": "b3993e2835392c6a3b6b551e278ff2c18c53578783b591df8c18c2ad8b866bcb" }, "downloads": -1, "filename": "fr_word_segment-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "d15b7e08afd58ba7eef99988cec35af4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8922, "upload_time": "2019-07-29T22:58:16", "url": "https://files.pythonhosted.org/packages/fa/3f/e68cc91f1a375d91ce9c1dbb71fb359aa4fa9f3edeb00d04ccddfda76b09/fr_word_segment-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6c004e9bbdb602f19bc263facdcf6c40", "sha256": "2eb3fdf163b56184482456722203617a336ab23a78e0580648b3a0ea87d292fe" }, "downloads": -1, "filename": "fr_word_segment-0.1.0.tar.gz", "has_sig": false, "md5_digest": "6c004e9bbdb602f19bc263facdcf6c40", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6898, "upload_time": "2019-07-29T22:58:18", "url": 
"https://files.pythonhosted.org/packages/a6/0a/5f4f86aea7aa59fcf2351d514ace8aa0b6a07cdb2d60b76b07b15c932d52/fr_word_segment-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "a33906ffeb454b2d4c96b7b51dc3bb56", "sha256": "eb9638fac796607baa924f5aa0c8f9b896514eb9ac7c714443a98ed99aab6193" }, "downloads": -1, "filename": "fr_word_segment-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "a33906ffeb454b2d4c96b7b51dc3bb56", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">3.5.2", "size": 8957, "upload_time": "2019-07-30T00:31:53", "url": "https://files.pythonhosted.org/packages/22/57/db545e06594708866deb3fa18bf4252354765791ec235216a72a6d6c6e39/fr_word_segment-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "7db5a98d56179907df54dae28586fc52", "sha256": "b2fda439e577585ebbe4b41590fd94d30e4246edeb589c4cea89c8214a12dbb4" }, "downloads": -1, "filename": "fr_word_segment-0.1.1.tar.gz", "has_sig": false, "md5_digest": "7db5a98d56179907df54dae28586fc52", "packagetype": "sdist", "python_version": "source", "requires_python": ">3.5.2", "size": 6996, "upload_time": "2019-07-30T00:31:55", "url": "https://files.pythonhosted.org/packages/5d/c9/5602ec32026c0729665dc0e85bb0331b26d1a9506a929088aa4d150b7cb2/fr_word_segment-0.1.1.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "4daac8bc71ba446a696b9a32ab3b4572", "sha256": "46910181fed43e0543a3c1c098faa42f4fb8d7ddb0cc772c41892720202dbd6d" }, "downloads": -1, "filename": "fr_word_segment-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "4daac8bc71ba446a696b9a32ab3b4572", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">3.5.2", "size": 8954, "upload_time": "2019-08-01T14:39:41", "url": "https://files.pythonhosted.org/packages/3f/f0/b36b01dcc644c7e508381075061376d8b1607af1c3963408e147c51b030c/fr_word_segment-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "55b0a1cd39a123b647f1abb267b148e9", 
"sha256": "8740d6a110994ca9368891fbdcf1a6d4e03a22fc462f69d08329893fb2766985" }, "downloads": -1, "filename": "fr_word_segment-0.1.3.tar.gz", "has_sig": false, "md5_digest": "55b0a1cd39a123b647f1abb267b148e9", "packagetype": "sdist", "python_version": "source", "requires_python": ">3.5.2", "size": 6967, "upload_time": "2019-08-01T14:39:43", "url": "https://files.pythonhosted.org/packages/37/94/6d912d1ba5d63b00772d0ca62356d4cd366addba66717e3ab2f9d573c381/fr_word_segment-0.1.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4daac8bc71ba446a696b9a32ab3b4572", "sha256": "46910181fed43e0543a3c1c098faa42f4fb8d7ddb0cc772c41892720202dbd6d" }, "downloads": -1, "filename": "fr_word_segment-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "4daac8bc71ba446a696b9a32ab3b4572", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">3.5.2", "size": 8954, "upload_time": "2019-08-01T14:39:41", "url": "https://files.pythonhosted.org/packages/3f/f0/b36b01dcc644c7e508381075061376d8b1607af1c3963408e147c51b030c/fr_word_segment-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "55b0a1cd39a123b647f1abb267b148e9", "sha256": "8740d6a110994ca9368891fbdcf1a6d4e03a22fc462f69d08329893fb2766985" }, "downloads": -1, "filename": "fr_word_segment-0.1.3.tar.gz", "has_sig": false, "md5_digest": "55b0a1cd39a123b647f1abb267b148e9", "packagetype": "sdist", "python_version": "source", "requires_python": ">3.5.2", "size": 6967, "upload_time": "2019-08-01T14:39:43", "url": "https://files.pythonhosted.org/packages/37/94/6d912d1ba5d63b00772d0ca62356d4cd366addba66717e3ab2f9d573c381/fr_word_segment-0.1.3.tar.gz" } ] }