{ "info": { "author": "Wojciech Mu\u0142a", "author_email": "wojciech_mula@poczta.onet.pl", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: BSD License", "Programming Language :: C", "Topic :: Software Development :: Libraries", "Topic :: Text Editors :: Text Processing" ], "description": "========================================================================\n pyDAWG\n========================================================================\n\n.. contents::\n\nIntroduction\n============\n\n``PyDAWG`` is a python module implements DAWG__ graph structure,\nwhich allow to store set of strings and check existence of a string in\nlinear time (in terms of a string length).\n\nDAWG is constructed by **incremental algorithm** described in *Incremental\nalgorithm for sorted data*, **Jan Daciuk**, **Stoyan Mihov**, **Bruce Watson**,\nand **Richard Watson**, Computational Linguistics, 26(1), March 2000.\nProf. Jan Daciuk offers also some useful documentation, presentations and\neven sample code on `his site`__.\n\nThe algorithm asserts that input words are **sorted** in\n`lexicographic order`__; default Python ``sort()``\norders strings correctly.\n\nAlso **minimal perfect hashing** (MPH) is supported, i.e. there is a function\nthat maps words to unique number; this function is bidirectional, its possible\nto find number for given word or get word from number.\n\n__ http://en.wikipedia.org/wiki/DAWG\n__ http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/\n__ http://en.wikipedia.org/wiki/lexicographic%20order\n\n------------------------------------------------------------------------\n\nThere are two versions of module:\n\n* **C extension**, compatible only with Python3;\n* pure python module, compatible with Python 2 and 3.\n\nPython module implements subset of C extension API.\n\n\nLicense\n=======\n\nLibrary is licensed under very liberal three-clauses BSD license.\nSome portions has been released into public domain.\n\nFull text of license is available in LICENSE file.\n\n\nAuthor\n======\n\nWojciech Mu\u0142a, wojciech_mula@poczta.onet.pl\n\n\nInstallation\n============\n\nCompile time settings (can be change in setup.py):\n\n* ``DAWG_UNICODE`` --- if defined, DAWG accepts and returns\n unicode strings, else bytes are supported\n\n* ``DAWG_PERFECT_HASHING`` --- when defined, minimal perfect\n hashing is enabled (methods word2index and index2word are\n available)\n\n\nJust run::\n\n\t\t$ python setup.py install\n\nIf compilation succed, module is ready to use.\n\n\nAPI\n===\n\n\nModule\n------\n\nModule ``pydawg`` provides class ``DAWG`` and following members:\n\n* ``EMPTY``, ``ACTIVE``, ``CLOSED`` --- symbolic constants for\n ``state`` member of ``DAWG`` object\n* ``perfect_hashing`` -- see `Minimal perfect hashing`_\n* ``unicode`` -- see `Unicode and bytes`_\n\n\nUnicode and bytes\n~~~~~~~~~~~~~~~~~\n\nType of strings accepted and returned by ``DAWG`` methods can be\neither **unicode** or **bytes**, depending on compile time\nsettings (preprocessor definition ``DAWG_UNICODE``). Value of\nmodule member ``unicode`` informs about chosen type.\n\n\n\n\n``DAWG`` class\n--------------\n\n``DAWG`` class is picklable__, and also provide independent\nway of marshaling with methods ``binload()`` and ``bindump()``.\n\n__ http://docs.python.org/py3k/library/pickle.html\n\n\nProperty\n~~~~~~~~\n\n``state`` [read-only integer]\n\tFollowing values are possible:\n\n\t* ``pydawg.EMPTY`` --- no words in a set;\n\t* ``pydawg.ACTIVE`` --- there is at least one word in a set,\n\t and adding new words is possible (see ``add_word`` & ``add_word_unchecked``);\n\t* ``pydawg.CLOSED`` --- there is at least one word in a set,\n\t but adding new words is not allowed (see ``close``/``freeze``).\n\n\nBasic mathods\n~~~~~~~~~~~~~\n\n``add_word(word) => bool``\n\tAdd word, returns True if word didn't exists in a set.\n\tProcedure checks if ``word`` is greater then previously \n\tadded word (in lexicography order).\n\n``add_word_unchecked(word) => bool``\n\tDoes the same thing as ``add_word`` but do not check ``word``\n\torder. Method should be used if one is sure, that input data\n\tsatisfy\talgorithm requirements, i.e. words order is valid.\n\n``exists(word) => bool`` or ``word in ...``\n\tCheck if word is in set.\n\n``match(word) => bool``\n\tCheck if word or any of its prefix is in a set.\n\n``longest_prefix(word) => int``\n\tReturns length of the longest prefix of word that exists in a set.\n\n``len()`` protocol\n\tReturns number of distinct words.\n\n``words() => list``\n\tReturns list of all words.\n\n``find_all([word, [wildchar, [how]]]) => iterator``\n\tReturns iterator that match words depending on ``word`` argument.\n\n\t``find_all()``\n\t\tdoes the same job as ``iter()``\n\n\t``find_all(prefix)``\n\t\tYields words that share a prefix\n\n\t``find_all(pattern, wildchar, [how])``\n\t\tYields words that match a ``pattern`` with given ``wildchar`` (wildchar\n\t\tmatches any char). Parameter ``how`` controls which words are matched:\n\t\t\n\t\t``MATCH_EXACT_LENGTH``\n\t\t\twords with the same length as a pattern\n\n\t\t``MATCH_AT_LEAST_PREFIX``\n\t\t\twords of length not less then pattern\n\n\t\t``MATCH_AT_MOST_PREFIX``\n\t\t\twords of length no greater then pattern\n\n\n``clear()``\n\tErase all words from set.\n\n``close()`` or ``freeze()``\n\tDon't allow to add any new words, ``state`` value become\n\t``pydawg.CLOSED``. Also free memory occupied by\ta hash table\n\tused to perform incremental algorithm (see also\t``get_hash_stats()``).\n\n\tCan be reverted only by ``clear()``.\n\n\nIterator\n~~~~~~~~\n\nClass supports ``iter`` protocol, i.e. ``iter(DAWGobject)`` returns\niterator, a lazy version of ``words()`` method.\n\n\nMinimal perfect hashing\n~~~~~~~~~~~~~~~~~~~~~~~\n\nMinimal `perfect hashing`__ (MPH) allows to find unique number\nrepresenting any word from DAWG, and also find word with given number.\nNumbers are in always in range 1 ... ``len(DAWG)``.\n\nFinally, this feature makes possible to perform fast lookups as\nin a regular dictionary.\n\nAlgorithm used for MPH is described in *Applications of Finite Automata\nRepresenting Large Vocabularies*, **Claudio Lucchesi** and **Tomasz\nKowaltowski**, Software Practice and Experience, 23(1), pp. 15--30, Jan.\n1993.\n\nMPH feature is enabled during compilation time if preprocessor\ndefinition ``DAWG_PERFECT_HASHING`` exists. Module member\n``perfect_hashing`` reflects this setting.\n\n__ http://en.wikipedia.org/wiki/perfect%20hashing\n\n.. warning::\n\tWords numbering is done for the whole DAWG. If new words\n\tare added with ``add_word`` or ``add_word_unchecked``,\n\tthen current numbering is lost and when method ``word2index``\n\tor ``index2word`` is called, then DAWG is renumbered.\n\t\n\tBecause of that frequent mixing these two groups of method\n\twill degrade performance.\n\n\n``word2index(word) => index``\n\tReturns index of word, or None if word is not present in a DAWG.\n\n``index2word(index) => word``\n\tReturns words associated with index, or None if index isn't valid.\n\n\nExample\n#######\n\n::\n\n\tD = pydawg.DAWG()\n\n\t# fill DAWG with keys\n\tfor key in sorted(dict):\n\t\tD.add_word_unchecked(key)\n\n\t# prepare values array\n\tV = [None] * len(D)\n\n\tfor key, value in dict.items():\n\t\tindex = D.word2index(key)\n\t\tassert index is not None\n\n\t\tV[index - 1] = value\n\t\t\n\t\n\t# lookups are possible now\n\tfor word in user_input:\n\t\tindex = D.word2index(word)\n\t\tif index is not None:\n\t\t\tprint(word, \"=>\", V[index - 1])\n\n\nOther\n~~~~~\n\n``dump() => (set of nodes, set of edges)``\n\tReturns sets describing DAWG, elements are tuples.\n\t\n\tNode tuple:\n\n\t* unique id of node (number)\n\t* end of word marker\n\n\tEdge tuple:\n\n\t* source node id\n\t* edge label --- letter\n\t* destination node id\n\n\tDistribution contains program ``dump2dot.py`` that shows how to\n\tconvert output of this function to `graphviz`__ DOT language.\n\n\t__ http://graphviz.org\n\n``bindump() => bytes``\n\tReturns binary DAWG data.\n\n``binload(bytes)``\n\tRestore DAWG from binary data. Example::\n\n\t\timport pydawg\n\n\t\tA = pydawg.DAWG()\n\t\twith open('dump', 'wb') as f:\n\t\t\tf.write(A.bindump())\n\n\t\tB = pydawg.DAWG()\n\t\twith open('dump', 'rb') as f:\n\t\t\tB.binload(f.read())\n\n``get_stats() => dict``\n\tReturns dictionary containing some statistics about\n\tunderlaying data structure:\n\n\t* ``words_count``\t--- number of distinct words (same as ``len(dawg)``)\n\t* ``longest_word``\t--- length of the longest word\n\t* ``nodes_count``\t--- number of nodes\n\t* ``edges_count``\t--- number of edges\n\t* ``sizeof_node``\t--- size of single node (in bytes)\n\t* ``sizeof_edge``\t--- size of single node (in bytes)\n\t* ``graph_size``\t--- size of whole graph (in bytes); it's about\n\t ``nodes_count * sizeof_node + edges_count * sizeof_edge``\n\n``get_hash_stats() => dict``\n\tReturns some statistics about hash table used by DAWG.\n\n\t* ``table_size`` --- number of table's elements\n\t* ``element_size`` --- size of single table item\n\t* ``items_count`` --- number of items saved in a table\n\t* ``item_size`` --- size of single item\n\n\tApprox memory occupied by hash table is\n\t``table_size * element_size + items_count * item_size``.\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/WojciechMula/pyDAWG", "keywords": "dawg", "license": "BSD (3 clauses)", "maintainer": "", "maintainer_email": "", "name": "pyDAWG", "package_url": "https://pypi.org/project/pyDAWG/", "platform": "Linux", "project_url": "https://pypi.org/project/pyDAWG/", "project_urls": { "Homepage": "http://github.com/WojciechMula/pyDAWG" }, "release_url": "https://pypi.org/project/pyDAWG/1.0.1/", "requires_dist": null, "requires_python": "", "summary": "Directed Acyclic Word Graph (DAWG) allows to store huge strings set in compacted form", "version": "1.0.1" }, "last_serial": 3810757, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "e4ad879fb6cfc0d9ff4c07e3cad8245e", "sha256": "d8d91178e4aab0642387ae77a72d900f842dfbc2963e262918af97cba53b2ed5" }, "downloads": -1, "filename": "pyDAWG-1.0.0.tar.gz", "has_sig": false, "md5_digest": "e4ad879fb6cfc0d9ff4c07e3cad8245e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 26378, "upload_time": "2014-11-26T17:19:22", "url": "https://files.pythonhosted.org/packages/db/15/494d4ad17f0755151bac865491334e915efddbed84bec848aaeb5ec7da1f/pyDAWG-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "bd627ed2a0694ffdeb4cb2f1de7d524f", "sha256": "20577c9bdb67ba244ce9d32a2f000394af1b789c6a5e4113fa43ff1f3eeed3b3" }, "downloads": -1, "filename": "pyDAWG-1.0.1.tar.gz", "has_sig": false, "md5_digest": "bd627ed2a0694ffdeb4cb2f1de7d524f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 28670, "upload_time": "2018-04-26T15:40:16", "url": "https://files.pythonhosted.org/packages/2e/5e/cf8fb76e2a74195dd4771d809d119ec32b75479dda7eac7c4df3f3db353d/pyDAWG-1.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "bd627ed2a0694ffdeb4cb2f1de7d524f", "sha256": "20577c9bdb67ba244ce9d32a2f000394af1b789c6a5e4113fa43ff1f3eeed3b3" }, "downloads": -1, "filename": "pyDAWG-1.0.1.tar.gz", "has_sig": false, "md5_digest": "bd627ed2a0694ffdeb4cb2f1de7d524f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 28670, "upload_time": "2018-04-26T15:40:16", "url": "https://files.pythonhosted.org/packages/2e/5e/cf8fb76e2a74195dd4771d809d119ec32b75479dda7eac7c4df3f3db353d/pyDAWG-1.0.1.tar.gz" } ] }