{ "info": { "author": "Timo Petmanson @Funderbeam", "author_email": "tpetmanson@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Text Processing", "Topic :: Text Processing :: Linguistic" ], "description": "# Aho\u2013Corasick automaton + keyword tree implementation for Python\n\nBy Timo Petmanson @ Funderbeam ( https://funderbeam.com )\n\nThis package is a C++ implementation of the Aho-Corasick automaton and wrapped in Python with the following features:\n\n* dictionary matching with linear O(n) complexity \n* efficient String -> String dictionary\n* serialization\n* functionality for removing overlaps while maximizing the number of matched tokens\n\nPlease refer to examples below for more details.\n\n\n## The data structure\n\nIn computer science, the Aho\u2013Corasick algorithm is a string searching algorithm invented by Alfred V. Aho and Margaret J. Corasick.\nIt is a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the \"dictionary\") within an input text.\nIt matches all strings simultaneously.\nThe complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches.\nNote that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa).\n\nInformally, the algorithm constructs a finite state machine that resembles a trie with additional links between the various internal nodes.\nThese extra internal links allow fast transitions between failed string matches (e.g. a search for cat in a trie that does not contain cat, but contains cart, and thus would fail at the node prefixed by ca), to other branches of the trie that share a common prefix (e.g., in the previous case, a branch for attribute might be the best lateral transition).\nThis allows the automaton to transition between string matches without the need for backtracking.\n\nWhen the string dictionary is known in advance (e.g. a computer virus database), the construction of the automaton can be performed once off-line and the compiled automaton stored for later use.\nIn this case, its run time is linear in the length of the input plus the number of matched entries.\n\nSee https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm for more information.\n\nSee https://github.com/tpetmanson/aca/blob/master/docs/slides04.pdf to learn more about how AC automatons work.\n\n\n## Example usage\n\n### Example 1: basic use case\n\nCreate a dictionary of medicines and find where they match in a text.\n\n```python\n# create a new AC automaton\nfrom aca import Automaton\nautomaton = Automaton()\n\n# add a dictionary of words to the automaton\npainkillers = ['paracetamol', 'ibuprofen', 'hydrocloride']\nautomaton.add_all(painkillers)\n\n# match the dictionary on a text\ntext = 'paracetamol and hydrocloride are a medications to relieve pain and fever. paracetamol is less efficient than ibuprofen'\n\nfor match in automaton.get_matches(text):\n print (match.start, match.end, match.elems)\n\n```\n\nOutput:\n\n```\n0 11 paracetamol\n16 28 hydrocloride\n74 85 paracetamol\n109 118 ibuprofen\n```\n\n### Example 2: use case with tokenized keys and labels\n\n```python\n# create a new AC automaton\nfrom aca import Automaton\nautomaton = Automaton()\n\n# instead of plain strings, you can also use lists of tokens\nnames = [\n (['Yuri', 'Artyukhin'], 'developer'),\n (['Tom', 'Anderson', 'Jr'], 'designer'),\n]\nautomaton.add_all(names)\n\n# you can add an item like this as well\nautomaton[['Tom', 'Anderson']] = 'manager'\n\n\n# if you are not using plain strings, make sure you tokenize the text as well\ntext = 'Tom Anderson Jr and Yuri Artyukhin work on my project'.split()\n\nprint ('matches that maximize the number of matched words')\nfor match in automaton.get_matches(text):\n print (match.start, match.end, match.elems, match.label)\n```\n\nOutput:\n\n```\nmatches that maximize the number of matched words\n0 3 ['Tom', 'Anderson', 'Jr'] designer\n4 6 ['Yuri', 'Artyukhin'] developer\n```\n\nNote that your dictionary contains both Tom Anderson and Tom Anderson Jr.\nBy default, the matcher removes the matches that overlap, but this feature\ncan be disabled.\n\n```python\nprint ('all matches')\nfor match in automaton.get_matches(text, exclude_overlaps=False):\n print (match.start, match.end, match.elems, match.label)\n```\n\nOutput:\n\n```\n0 2 ['Tom', 'Anderson'] manager\n0 3 ['Tom', 'Anderson', 'Jr'] designer\n4 6 ['Yuri', 'Artyukhin'] developer\n```\n\n### Example 3: dictionary use case\n\nYou can use the automaton as a space-efficient dictionary.\nHowever, there are some implementation specific constraints:\n* keys can be only strings or string lists\n* values must be non-empty strings (with length greater than 0)\n* deleting keys won't free up memory, to do that you need to rebuild the Automaton\n* items() will always yield a list of strings\n\n```python\n# create a new AC automaton\nfrom aca import Automaton\nmap = Automaton()\n\n# use the automaton as a map\nmap['electrify'] = 'verb'\nmap['elegant'] = 'adjective'\nmap['acid'] = 'noun'\nmap['acidic'] = 'adjective'\n\n# access it like a Python dictionary\nprint (map['acid'])\n```\n\nOutput:\n\n```\nnoun\n```\n\n---\n\n```python\n# Trying to access an non-existent key will raise KeyError\nprint (map['invalid key'])\n```\n\nOutput:\n\n```\nKeyError: 'invalid key'\n```\n\n---\n\n```python\n# you can use get to provide a default value when key is missing\nprint (map.get('invalid key', 'default value'))\n```\n\nOutput:\n```\ndefault value\n```\n\n---\n\n```python\n# NB! Implementation specific special case: empty strings\n# denote \"missing\" values, so you can't use these\nmap['special'] = ''\nprint (map['special'])\n```\n\nOutput:\n\n```\nKeyError: 'special'\n```\n\n---\n\n```python\n# you can delete items\ndel map['electrify']\n\n# trying to delete a non-existent item raises KeyError\ndel map['invalid key']\n```\n\nOutput:\n```\nKeyError: 'invalid key'\n```\n\n---\n\n```python\n# NB! Implementation specific special case: empty strings\n# denote \"missing\" values, so you can't use these\nmap['special'] = ''\nprint (map['special'])\n```\n\nOutput:\n\n```\nKeyError: 'special'\n```\n\n---\n\n```python\n# iterate items like a dict\n# NB! Due to implementation specifics, this will always yield list of strings.\nprint ('items:')\nfor key, value in map.items():\n print ('{}: {}'.format(key, value))\n```\n\nOutput:\n```\nitems:\n['a', 'c', 'i', 'd']: noun\n['a', 'c', 'i', 'd', 'i', 'c']: adjective\n['e', 'l', 'e', 'g', 'a', 'n', 't']: adjective\n```\n\n---\n\n```python\n# you can also iterate prefixes\nprint ('prefixes:')\nfor prefix, value in map.prefixes():\n print ('{}: {}'.format(prefix, value))\n```\n\nOutput:\n\n```\n[]:\n['a']:\n['a', 'c']:\n['a', 'c', 'i']:\n['a', 'c', 'i', 'd']: noun\n['a', 'c', 'i', 'd', 'i']:\n['a', 'c', 'i', 'd', 'i', 'c']: adjective\n['e']:\n['e', 'l']:\n['e', 'l', 'e']:\n['e', 'l', 'e', 'c']:\n['e', 'l', 'e', 'c', 't']:\n['e', 'l', 'e', 'c', 't', 'r']:\n['e', 'l', 'e', 'c', 't', 'r', 'i']:\n['e', 'l', 'e', 'c', 't', 'r', 'i', 'f']:\n['e', 'l', 'e', 'c', 't', 'r', 'i', 'f', 'y']:\n['e', 'l', 'e', 'g']:\n['e', 'l', 'e', 'g', 'a']:\n['e', 'l', 'e', 'g', 'a', 'n']:\n['e', 'l', 'e', 'g', 'a', 'n', 't']: adjective\n['s']:\n['s', 'p']:\n['s', 'p', 'e']:\n['s', 'p', 'e', 'c']:\n['s', 'p', 'e', 'c', 'i']:\n['s', 'p', 'e', 'c', 'i', 'a']:\n['s', 'p', 'e', 'c', 'i', 'a', 'l']:\n```\n\n### Example 4: saving and loading\n\n```python\n\nfrom aca import Automaton\n\nautomaton = Automaton()\nautomaton['Estonia'] = 'Tallinn'\nautomaton['Germany'] = 'Berlin'\nautomaton['Finland'] = 'Helsinki'\n\n# serialize to disk\nautomaton.save_to_file('myautomaton.bin')\n\n# load from disk\nautomaton2 = Automaton()\nautomaton2.load_from_file('myautomaton.bin')\n\n# save / load to binary string\nautomaton3 = Automaton()\nautomaton3.load_from_string(automaton.save_to_string())\n\nprint (automaton2['Estonia'])\nprint (automaton3['Germany'])\n```\n\nOutput:\n\n```\nTallinn\nBerlin\n```\n\n## Install\n\n```\npip install wheel\npip install cython\npip install aca\n```\n\n### Development\n\nFor write / test cycles, use the following command to build the code in the project folder.\n```\npython setup.py build_ext --inplace\n```\n\n### Distributing the library\n\n```\npython setup.py build\npython setup.py sdist bdist_wheel upload\n```\n\n### Debugging\n\nDefine ```ACA_DEBUG``` macro in ```aca.h``` header and recompile to see more debugging output.\n\n### License\n\nGPLv3", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/tpetmanson/aca", "keywords": "", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "aca", "package_url": "https://pypi.org/project/aca/", "platform": "", "project_url": "https://pypi.org/project/aca/", "project_urls": { "Homepage": "https://github.com/tpetmanson/aca" }, "release_url": "https://pypi.org/project/aca/1.1/", "requires_dist": null, "requires_python": "", "summary": "Aho-Corasick automaton implementation in C++", "version": "1.1" }, "last_serial": 3526581, "releases": { "0.5": [ { "comment_text": "", "digests": { "md5": "1f57774132de77e6eed43e9f32ae90c2", "sha256": "9609d43a4c0a30b1647095ad105dd543c46b38601b345ac95418c3b65fc97ba9" }, "downloads": -1, "filename": "aca-0.5-cp35-cp35m-macosx_10_6_intel.whl", "has_sig": false, "md5_digest": "1f57774132de77e6eed43e9f32ae90c2", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 264100, "upload_time": "2016-09-30T14:14:36", "url": "https://files.pythonhosted.org/packages/83/a1/cc04c99438ba63dafe329e8c317f8362f8ad8842ab5c1d445b9b23fec46d/aca-0.5-cp35-cp35m-macosx_10_6_intel.whl" }, { "comment_text": "", "digests": { "md5": "905134d224007d92ef9563d6ac6e3265", "sha256": "8f150d63c3d97a7ff9a142996bba5616803fcac47f859fb9dd51dd5600e643c8" }, "downloads": -1, "filename": "aca-0.5.tar.gz", "has_sig": false, "md5_digest": "905134d224007d92ef9563d6ac6e3265", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 82297, "upload_time": "2016-09-30T14:12:07", "url": "https://files.pythonhosted.org/packages/4c/ad/fe883151d6119e68a8ae13aa9103f702496cc76095fcffb7078627eba48f/aca-0.5.tar.gz" } ], "0.6": [ { "comment_text": "", "digests": { "md5": "e0ad394b0c9fa05e948473bf2cbeada4", "sha256": "7abc78cef71ba174d98c114618224d3267cbd5f0d160b74e3a3a019c353e95dd" }, "downloads": -1, "filename": "aca-0.6-cp35-cp35m-macosx_10_6_intel.whl", "has_sig": false, "md5_digest": "e0ad394b0c9fa05e948473bf2cbeada4", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 270168, "upload_time": "2016-10-03T20:21:10", "url": "https://files.pythonhosted.org/packages/03/8e/292fe07e9b2a3f1cd8ce34dba9866426be170a351942ffa1cc5505fcce3d/aca-0.6-cp35-cp35m-macosx_10_6_intel.whl" }, { "comment_text": "", "digests": { "md5": "54f08ed6cffe01c16e4f7c54576b7595", "sha256": "cb0084423e6b07dd0f07907d2bce28104f6ddfb03a0fd0af16ed0c55fbf36f97" }, "downloads": -1, "filename": "aca-0.6.tar.gz", "has_sig": false, "md5_digest": "54f08ed6cffe01c16e4f7c54576b7595", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 82888, "upload_time": "2016-10-03T20:21:21", "url": "https://files.pythonhosted.org/packages/b2/ec/78e930a1ae48101228c58e57d01dc891b5be87ac7d12d13c3bfd5ccb0e92/aca-0.6.tar.gz" } ], "0.7": [ { "comment_text": "", "digests": { "md5": "40a85fb23832801242444868403a0627", "sha256": "055aed0ad39e33ecf07e0306e234148d51ddcbc29790fe054195364a0653c3a6" }, "downloads": -1, "filename": "aca-0.7-cp35-cp35m-macosx_10_6_intel.whl", "has_sig": false, "md5_digest": "40a85fb23832801242444868403a0627", "packagetype": "bdist_wheel", "python_version": "3.5", "requires_python": null, "size": 264583, "upload_time": "2016-10-07T10:22:16", "url": "https://files.pythonhosted.org/packages/d2/f9/44bcaede621bec3faff6511858d800a5c136d1adb7f937707cf2fcff4495/aca-0.7-cp35-cp35m-macosx_10_6_intel.whl" }, { "comment_text": "", "digests": { "md5": "4b50439211e39eb118845538c41a0878", "sha256": "6c3947ed9ee306bcb16c844767908e5840fe1934dd00a4d4c956289b640559b0" }, "downloads": -1, "filename": "aca-0.7.tar.gz", "has_sig": false, "md5_digest": "4b50439211e39eb118845538c41a0878", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 82892, "upload_time": "2016-10-07T10:21:11", "url": "https://files.pythonhosted.org/packages/c5/47/050dcfed1493bd3370f1078e0cb30b90d7765c8114b96efd67c7f436d1ab/aca-0.7.tar.gz" } ], "0.8": [ { "comment_text": "", "digests": { "md5": "cc2fb2690279ad6a1f8b93afb329dd9a", "sha256": "bbaac0b325af85936f90a40b542925a146a7698cc9520d35d9f9ad74097e6fc3" }, "downloads": -1, "filename": "aca-0.8-cp34-cp34m-macosx_10_6_intel.whl", "has_sig": false, "md5_digest": "cc2fb2690279ad6a1f8b93afb329dd9a", "packagetype": "bdist_wheel", "python_version": "3.4", "requires_python": null, "size": 278318, "upload_time": "2016-12-22T13:45:53", "url": "https://files.pythonhosted.org/packages/1a/29/33232ec77bb73176781d0401b141176bcdf3f57b675f4d7dc6f590be3033/aca-0.8-cp34-cp34m-macosx_10_6_intel.whl" }, { "comment_text": "", "digests": { "md5": "b4f90bcde242b2c519560bec55e61c72", "sha256": "fb51828fdc98fef64ec8c38a88ed222e2c554d68b7506720a73be76fd89ff118" }, "downloads": -1, "filename": "aca-0.8.tar.gz", "has_sig": false, "md5_digest": "b4f90bcde242b2c519560bec55e61c72", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 88235, "upload_time": "2016-12-22T13:45:50", "url": "https://files.pythonhosted.org/packages/19/6c/03b791f490f8527fa162e526047c9a809fa5b9e7f46639322e335a2fde35/aca-0.8.tar.gz" } ], "0.9": [ { "comment_text": "", "digests": { "md5": "663bcba5ee3b4d50bec3c0b67297ffb1", "sha256": "2f8a4f2bfbad13104798155f372947def9632e3ae796eaedbdf425b09c6bb7ea" }, "downloads": -1, "filename": "aca-0.9-cp34-cp34m-macosx_10_6_intel.whl", "has_sig": false, "md5_digest": "663bcba5ee3b4d50bec3c0b67297ffb1", "packagetype": "bdist_wheel", "python_version": "3.4", "requires_python": null, "size": 284435, "upload_time": "2016-12-28T08:22:09", "url": "https://files.pythonhosted.org/packages/1d/3c/c0d8ee382ab009fa4a9f6bad8f4e5142119433f5b789c72e59ec99ae88bc/aca-0.9-cp34-cp34m-macosx_10_6_intel.whl" }, { "comment_text": "", "digests": { "md5": "7f7a343ca81a7b39116270c385ced05d", "sha256": "1a11c4551abf5bc9c78dda7f226b3d1b4c8b78bc3e627cc136386d078fca3ed5" }, "downloads": -1, "filename": "aca-0.9.tar.gz", "has_sig": false, "md5_digest": "7f7a343ca81a7b39116270c385ced05d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 92373, "upload_time": "2016-12-28T08:22:06", "url": "https://files.pythonhosted.org/packages/0f/fb/e24278ef7ec1b1bfa30ff602c0d27f7f06f11e1db9b45a0101e6d626967e/aca-0.9.tar.gz" } ], "1.0": [ { "comment_text": "", "digests": { "md5": "322ed2b3c2b61bdbae741659f1453c15", "sha256": "9d3acd765921758c5c5cc52b51b7c9c8e4bfce9b3a7ea9ba9093730ca305e2c8" }, "downloads": -1, "filename": "aca-1.0-cp36-cp36m-macosx_10_6_intel.whl", "has_sig": false, "md5_digest": "322ed2b3c2b61bdbae741659f1453c15", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 296842, "upload_time": "2018-01-17T13:00:27", "url": "https://files.pythonhosted.org/packages/50/6d/674ee2520f97ca1817aafe2c998ccb587251142825dc74764c0c65bdb8fa/aca-1.0-cp36-cp36m-macosx_10_6_intel.whl" }, { "comment_text": "", "digests": { "md5": "26adc6ad1217d2dfcce64b272e51f8ef", "sha256": "5c4db3f44d509935a0329b80938225d832ecf22b21e82018a53e34a0e964c3c7" }, "downloads": -1, "filename": "aca-1.0.tar.gz", "has_sig": false, "md5_digest": "26adc6ad1217d2dfcce64b272e51f8ef", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 97838, "upload_time": "2018-01-17T13:00:24", "url": "https://files.pythonhosted.org/packages/95/08/37ed1e45dd24825c2526cab335fc849b8f5e57d1ef2e655e2efd387938c9/aca-1.0.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "c9d8c88f5faa7e60f93c4969b9ebbb54", "sha256": "905ad29682db12931b42425a7632a8976b4e444e33564fd40c1097315199661c" }, "downloads": -1, "filename": "aca-1.1.tar.gz", "has_sig": false, "md5_digest": "c9d8c88f5faa7e60f93c4969b9ebbb54", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 97021, "upload_time": "2018-01-27T10:13:56", "url": "https://files.pythonhosted.org/packages/fe/36/c05024a6c2941a06f2386877e2c63373cf68bcc80046ed28f831069c25dc/aca-1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c9d8c88f5faa7e60f93c4969b9ebbb54", "sha256": "905ad29682db12931b42425a7632a8976b4e444e33564fd40c1097315199661c" }, "downloads": -1, "filename": "aca-1.1.tar.gz", "has_sig": false, "md5_digest": "c9d8c88f5faa7e60f93c4969b9ebbb54", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 97021, "upload_time": "2018-01-27T10:13:56", "url": "https://files.pythonhosted.org/packages/fe/36/c05024a6c2941a06f2386877e2c63373cf68bcc80046ed28f831069c25dc/aca-1.1.tar.gz" } ] }