{ "info": { "author": "nppoly", "author_email": "nppoly@foxmail.com", "bugtrack_url": null, "classifiers": [ "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX :: Linux", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Text Processing", "Topic :: Text Processing :: Linguistic" ], "description": "# cyac\nHigh performance Trie & Keyword Match & Replace Tool.\n\nIt is implemented in Cython and compiled to C++. The trie data structure is cedar, an optimized double-array trie. It supports Python 2.7 and 3.4+, and tries can be dumped and loaded with pickle.\n\nIf you find this useful, please give it a star!\n\n# Quick Start\nThis module is written in Cython. You need Cython installed.\n\n```\npip install cyac\n```\n\nThen create a Trie\n```\n>>> from cyac import Trie\n>>> trie = Trie()\n```\n\nAdd/get/remove a keyword\n```\n>>> trie.insert(u\"\u54c8\u54c8\") # insert the keyword and return its id in the trie\n>>> trie.get(u\"\u54c8\u54c8\") # return the keyword id in the trie, or -1 if it doesn't exist\n>>> trie.remove(u\"\u5475\u5475\") # return keyword in trie\n>>> trie[id] # return the word corresponding to the id\n>>> trie[u\"\u5475\u5475\"] # similar to get, but raises an exception if the keyword doesn't exist\n>>> u\"\u5475\u5475\" in trie # test whether the keyword is in the trie\n```\n\nGet all keywords\n```\n>>> for key, id_ in trie.items():\n...     print(key, id_)\n```\n\nPrefix/predict\n```\n>>> # iterate over the keywords in the trie that start with the given string\n>>> for id_ in trie.predict(u\"\u5475\u5475\"):\n...     print(id_)\n>>> # iterate over the prefixes of the given string that are in the trie\n>>> for id_, len_ in trie.prefix(u\"\u5475\u5475\"):\n...     print(id_, len_)\n```\n\nTrie extract/replace\n```\n>>> python_id = 
trie.insert(u\"python\")\n>>> trie.replace_longest(u\"python\", {python_id: u\"hahah\"}, set([ord(\" \")])) # the third parameter is the set of separators. If you specify separators, only strings between separators are matched, e.g. it won't match 'apython'\n>>> for id_, start, end in trie.match_longest(u\"python\", set([ord(\" \")])):\n...     print(id_, start, end)\n```\n\nAho Corasick extract\n```\n>>> from cyac import AC\n>>> ac = AC.build([u\"python\", u\"ruby\"])\n>>> for id_, start, end in ac.match(u\"python ruby\"):\n...     print(id_, start, end)\n```\n\n\n# Performance\nBenchmarks were run on Ubuntu 14.04.5 with an Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz.\n\n## Trie\nCompared with HatTrie. The horizontal axis is the number of tokens; the vertical axis is elapsed time (seconds).\n### Insert\n![insert performance](./bench/insert_performance.png)\n\n### Get\n![get performance](./bench/get_performance.png)\n\n### Remove\n![remove performance](./bench/remove_performance.png)\n\n## KeyWord Extract/Replace\n\nCompared with flashText. Regular expressions are too slow for this task (see flashText's benchmark). The horizontal axis is the number of characters to match; the vertical axis is elapsed time (seconds).\n\n![extract performance](./bench/extract_performance.png)\n![replace performance](./bench/replace_performance.png)\n\n## Aho Corasick Algorithm\nCompared with pyahocorasick. The horizontal axis is the number of characters to match; the vertical axis is elapsed time (seconds).\n![ac performance](./bench/ac_performance.png)\n\n# Unicode\n\n```\n>>> len(char.lower()) == len(char) # this is always true in python2, but not in python3\n>>> len(u\"\u0130stanbul\") != len(u\"\u0130stanbul\".lower()) # in python3\n```\n\nIn case-insensitive matching, this library takes this into account and returns the correct offsets.
\n\n# Run Test\n```bash\npython setup.py build\n\nPYTHONPATH=$(pwd)/build/BUILD_DST python3 tests/test_all.py\nPYTHONPATH=$(pwd)/build/BUILD_DST python3 bench/bench_*.py\n```\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/nppoly/cyac", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "cyac", "package_url": "https://pypi.org/project/cyac/", "platform": "", "project_url": "https://pypi.org/project/cyac/", "project_urls": { "Homepage": "https://github.com/nppoly/cyac" }, "release_url": "https://pypi.org/project/cyac/1.0/", "requires_dist": null, "requires_python": "", "summary": "High performance Trie and Ahocorasick automata (AC automata) for python", "version": "1.0" }, "last_serial": 4849452, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "c0f80bdd8167fb1f747f02036de67a19", "sha256": "5e41cd64708824ed9069bb838882bfc5cc21093dd597e5a6d073bb5b15a7a119" }, "downloads": -1, "filename": "cyac-1.0-cp37-cp37m-macosx_10_7_x86_64.whl", "has_sig": false, "md5_digest": "c0f80bdd8167fb1f747f02036de67a19", "packagetype": "bdist_wheel", "python_version": "cp37", "requires_python": null, "size": 155170, "upload_time": "2019-02-21T09:58:38", "url": "https://files.pythonhosted.org/packages/dc/97/f3510f8921ad35ce7d93b8aca459315fe8504e77a865473cecf629102546/cyac-1.0-cp37-cp37m-macosx_10_7_x86_64.whl" }, { "comment_text": "", "digests": { "md5": "8733dd40d07905ffad8d945b1a0bed3f", "sha256": "cf5b76101e457d11336eb92d21040c7ce829fff2c8707faa040ab890096d485a" }, "downloads": -1, "filename": "cyac-1.0.tar.gz", "has_sig": false, "md5_digest": "8733dd40d07905ffad8d945b1a0bed3f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14860, "upload_time": "2019-02-21T09:58:40", "url": 
"https://files.pythonhosted.org/packages/e4/31/df5bd99eb84b061b3bf8aeea9feed496ff5554147df8c3acdcb8186bb8de/cyac-1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c0f80bdd8167fb1f747f02036de67a19", "sha256": "5e41cd64708824ed9069bb838882bfc5cc21093dd597e5a6d073bb5b15a7a119" }, "downloads": -1, "filename": "cyac-1.0-cp37-cp37m-macosx_10_7_x86_64.whl", "has_sig": false, "md5_digest": "c0f80bdd8167fb1f747f02036de67a19", "packagetype": "bdist_wheel", "python_version": "cp37", "requires_python": null, "size": 155170, "upload_time": "2019-02-21T09:58:38", "url": "https://files.pythonhosted.org/packages/dc/97/f3510f8921ad35ce7d93b8aca459315fe8504e77a865473cecf629102546/cyac-1.0-cp37-cp37m-macosx_10_7_x86_64.whl" }, { "comment_text": "", "digests": { "md5": "8733dd40d07905ffad8d945b1a0bed3f", "sha256": "cf5b76101e457d11336eb92d21040c7ce829fff2c8707faa040ab890096d485a" }, "downloads": -1, "filename": "cyac-1.0.tar.gz", "has_sig": false, "md5_digest": "8733dd40d07905ffad8d945b1a0bed3f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14860, "upload_time": "2019-02-21T09:58:40", "url": "https://files.pythonhosted.org/packages/e4/31/df5bd99eb84b061b3bf8aeea9feed496ff5554147df8c3acdcb8186bb8de/cyac-1.0.tar.gz" } ] }