{ "info": { "author": "Chuancong Gao", "author_email": "chuancong@gmail.com", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3" ], "description": "[![PyPI version](https://img.shields.io/pypi/v/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)\n[![PyPI pyversions](https://img.shields.io/pypi/pyversions/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)\n[![PyPI license](https://img.shields.io/pypi/l/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)\n\n**Featured on ImportPython [Issue 173](http://importpython.com/newsletter/no/173/). Thank you so much for support!**\n\nThe shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm [PrefixSpan](https://ieeexplore.ieee.org/abstract/document/914830/), the famous frequent **closed** sequential pattern mining algorithm [BIDE](https://ieeexplore.ieee.org/abstract/document/1319986) (in `closed.py`), and the frequent **generator** sequential pattern mining algorithm [FEAT](https://dl.acm.org/citation.cfm?doid=1367497.1367651) (in `generator.py`), as a unified and holistic algorithm framework.\n\n- BIDE is usually much faster than PrefixSpan on large datasets, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.\n\n- FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.\n\nFor simpler code, some general purpose functions have been moved to be part of a new library [extratools](https://github.com/chuanconggao/extratools).\n\n## Reference\n\n### Research Papers\n\n``` text\nPrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.\nJian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.\nProceedings of the 17th International Conference on Data Engineering, 2001.\n```\n\n``` text\nBIDE: Efficient Mining of Frequent Closed Sequences.\nJianyong Wang, Jiawei Han.\nProceedings of the 20th International Conference on Data Engineering, 2004.\n```\n\n``` text\nEfficient mining of frequent sequence generators.\nChuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.\nProceedings of the 17th International Conference on World Wide Web, 2008.\n```\n\n### Alternative Implementations\n\nI created this project with the [original](https://github.com/chuanconggao/PrefixSpan-py/commit/441b04eca2174b3c92f6b6b2f50a30f1ffe4968c) minimal 15 lines implementation of PrefixSpan for educational purpose. However, as this project grows into a full feature library, its code size also inevitably grows. I have revised and reuploaded the original implementation as a GitHub Gist [here](https://gist.github.com/chuanconggao/4df9c1b06fa7f3ed854d5d96e2ae499f) for reference.\n\nYou can also try my Scala [version](https://github.com/chuanconggao/PrefixSpan-scala) of PrefixSpan.\n\n## Features\n\nOutputs traditional single-item sequential patterns, where gaps are allowed between items.\n\n- Mining top-k patterns is supported, with respective optimizations on efficiency.\n\n- You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.\n\n- Custom key function, custom filter function, and custom callback function can be applied.\n\n## Installation\n\nThis package is available on PyPI. Just use `pip3 install -U prefixspan` to install it.\n\n## CLI Usage\n\nYou can simply use the algorithms on terminal.\n\n``` text\nUsage:\n prefixspan-cli (frequent | top-k) [options] []\n\n prefixspan-cli --help\n\n\nOptions:\n --text Treat each item as text instead of integer.\n\n --closed Return only closed patterns.\n --generator Return only generator patterns.\n\n --key= Custom key function. [default: ]\n Must be a Python function in form of \"lambda patt, matches: ...\", returning an integer value.\n --bound= The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]\n Must be no less than the key function, i.e. bound(patt, matches) \u2265 key(patt, matches).\n Must be anti-monotone, i.e. for patt1 \u2291 patt2, bound(patt1, matches1) \u2265 bound(patt2, matches2).\n\n --filter= Custom filter function. [default: ]\n Must be a Python function in form of \"lambda patt, matches: ...\", returning a boolean value.\n\n --minlen= Minimum length of patterns. [default: 1]\n --maxlen= Maximum length of patterns. [default: 1000]\n```\n\n* Sequences are read from standard input. Each sequence is integers separated by space, like this example:\n\n``` text\ncat test.dat\n\n0 1 2 3 4\n1 1 1 3 4\n2 1 2 2 0\n1 1 1 2 2\n```\n\n- When dealing with text data, please use the `--text` option. Each sequence is words separated by space, assuming stop words have been removed, like this example:\n\n``` text\ncat test.txt\n\na b c d e\nb b b d e\nc b c c a\nb b b c c\n```\n\n* The patterns and their respective frequencies are printed to standard output.\n\n``` text\nprefixspan-cli frequent 2 test.dat\n\n0 : 2\n1 : 4\n1 2 : 3\n1 2 2 : 2\n1 3 : 2\n1 3 4 : 2\n1 4 : 2\n1 1 : 2\n1 1 1 : 2\n2 : 3\n2 2 : 2\n3 : 2\n3 4 : 2\n4 : 2\n```\n\n``` text\nprefixspan-cli frequent 2 --text test.txt\n\na : 2\nb : 4\nb c : 3\nb c c : 2\nb d : 2\nb d e : 2\nb e : 2\nb b : 2\nb b b : 2\nc : 3\nc c : 2\nd : 2\nd e : 2\ne : 2\n```\n\n## API Usage\n\nAlternatively, you can use the algorithms via API.\n\n``` python\nfrom prefixspan import PrefixSpan\n\ndb = [\n [0, 1, 2, 3, 4],\n [1, 1, 1, 3, 4],\n [2, 1, 2, 2, 0],\n [1, 1, 1, 2, 2],\n]\n\nps = PrefixSpan(db)\n```\n\nFor details of each parameter, please refer to the `PrefixSpan` class in `prefixspan/api.py`.\n\n``` python\nprint(ps.frequent(2))\n# [(2, [0]),\n# (4, [1]),\n# (3, [1, 2]),\n# (2, [1, 2, 2]),\n# (2, [1, 3]),\n# (2, [1, 3, 4]),\n# (2, [1, 4]),\n# (2, [1, 1]),\n# (2, [1, 1, 1]),\n# (3, [2]),\n# (2, [2, 2]),\n# (2, [3]),\n# (2, [3, 4]),\n# (2, [4])]\n\nprint(ps.topk(5))\n# [(4, [1]),\n# (3, [2]),\n# (3, [1, 2]),\n# (2, [1, 3]),\n# (2, [1, 3, 4])]\n\n\nprint(ps.frequent(2, closed=True))\n\nprint(ps.topk(5, closed=True))\n\n\nprint(ps.frequent(2, generator=True))\n\nprint(ps.topk(5, generator=True))\n```\n\n## Closed Patterns and Generator Patterns\n\nThe closed patterns are much more compact due to the smaller number.\n\n- A pattern is closed if there is no super-pattern with the same frequency.\n\n``` text\nprefixspan-cli frequent 2 --closed test.dat\n\n0 : 2\n1 : 4\n1 2 : 3\n1 2 2 : 2\n1 3 4 : 2\n1 1 1 : 2\n```\n\nThe generator patterns are even more compact due to both the smaller number and the shorter lengths.\n\n- A pattern is generator if there is no sub-pattern with the same frequency.\n\n- Due to the high compactness, generator patterns are useful as features for classification, etc.\n\n``` text\nprefixspan-cli frequent 2 --generator test.dat\n\n0 : 2\n1 1 : 2\n2 : 3\n2 2 : 2\n3 : 2\n4 : 2\n```\n\nThere are patterns that are both closed and generator.\n\n``` text\nprefixspan-cli frequent 2 --closed --generator test.dat\n\n0 : 2\n```\n\n## Custom Key Function\n\nFor both frequent and top-k algorithms, a custom key function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.\n \n- In default, `len(matches)` is used denoting the frequency of current pattern.\n\n- Alternatively, any key function can be used. As an example, `sum(len(db[i]) for i in matches)` can be used to find the satisfying patterns according to the number of matched items.\n\n- For efficiency, an anti-monotone upper-bound function should also be specified for pruning.\n\n - If unspecified, the key function is also the upper-bound function, and must be anti-monotone.\n\n``` python\nprint(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))\n# [(20, [1]),\n# (15, [2]),\n# (15, [1, 2]),\n# (10, [1, 3]),\n# (10, [1, 3, 4])]\n```\n\n## Custom Filter Function\n\nFor both frequent and top-k algorithms, a custom filter function `filter=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.\n\n- In default, `filter` is not applied and all the patterns are returned.\n\n- Alternatively, any function can be used. As an example, `matches[0][0] > 0` can be used to exclude the patterns covering the first sequence.\n\n``` python\nprint(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))\n# [(2, [1, 1]),\n# (2, [1, 1, 1]),\n# (2, [1, 2, 2]),\n# (2, [2, 2]),\n# (1, [1, 2, 2, 0])]\n```\n\n## Custom Callback Function\n\nFor both the frequent and the top-k algorithm, you can use a custom callback function `callback=lambda patt, matches: ...` instead of returning the normal results of patterns and their respective frequencies.\n\n- When callback function is specified, `None` is returned.\n\n- For large datasets, when mining frequent patterns, you can use callback function to process each pattern immediately, and avoid having a huge list of patterns consuming huge amount of memory.\n\n- The following example finds the longest frequent pattern covering each sequence.\n\n``` python\ncoverage = [[] for i in range(len(db))]\n\ndef cover(patt, matches):\n for i, _ in matches:\n coverage[i] = max(coverage[i], patt, key=len)\n\n\nps.frequent(2, callback=cover)\n\nprint(coverage)\n# [[1, 3, 4],\n# [1, 3, 4],\n# [1, 2, 2],\n# [1, 2, 2]]\n```\n\n## Tip\n\nI strongly encourage using [PyPy](http://pypy.org/) instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a [virtual environment](https://virtualenv.pypa.io/en/stable/) created for PyPy.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/chuanconggao/PrefixSpan-py/tarball/0.5.2", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/chuanconggao/PrefixSpan-py", "keywords": "data-mining,pattern-mining", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "prefixspan", "package_url": "https://pypi.org/project/prefixspan/", "platform": "", "project_url": "https://pypi.org/project/prefixspan/", "project_urls": { "Download": "https://github.com/chuanconggao/PrefixSpan-py/tarball/0.5.2", "Homepage": "https://github.com/chuanconggao/PrefixSpan-py" }, "release_url": "https://pypi.org/project/prefixspan/0.5.2/", "requires_dist": null, "requires_python": ">= 3", "summary": "PrefixSpan, BIDE, and FEAT in Python 3", "version": "0.5.2" }, "last_serial": 4322009, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "8e3f462aa7607965cd36af297a815276", "sha256": "a31cdc476bfd388cf5082c65105f7a1deecbcb81eeb78d7717bb343137347ddf" }, "downloads": -1, "filename": "prefixspan-0.1.1.tar.gz", "has_sig": false, "md5_digest": "8e3f462aa7607965cd36af297a815276", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3032, "upload_time": "2018-01-30T06:41:29", "url": "https://files.pythonhosted.org/packages/2b/00/6e71d01dbd20afb77b93aa730992350efffff80a86f95499352094602de8/prefixspan-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "5833c5a9a6edd66f1fec2c2a73621974", "sha256": "38a4f36c16b3631ff120ede0384a0ff6099089f78492e44dba64fbca716c04a7" }, "downloads": -1, "filename": "prefixspan-0.1.2.tar.gz", "has_sig": false, "md5_digest": "5833c5a9a6edd66f1fec2c2a73621974", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3189, "upload_time": "2018-03-03T18:57:47", "url": "https://files.pythonhosted.org/packages/f7/cb/3e77aefce409f6a565c5a7111b8e6bd851017e4bb6febd9d5b9952ec341c/prefixspan-0.1.2.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "cfade972bfe8ce8e0a6b9a4f7c22bf4d", "sha256": "ee603816304fa3fd6189a0e88b7dc8967e60604264c6e05849c74bf918f46fc1" }, "downloads": -1, "filename": "prefixspan-0.2.tar.gz", "has_sig": false, "md5_digest": "cfade972bfe8ce8e0a6b9a4f7c22bf4d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3781, "upload_time": "2018-04-16T09:05:58", "url": "https://files.pythonhosted.org/packages/ef/a6/abe853df4c9b2e69f2ac037050cbb3b2138b45c405a8b53928586a027728/prefixspan-0.2.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "ec79b20d1c48fd2d8047deec1054cd4d", "sha256": "50b9f098f0fc5abbed9c0656f6dff3b171d9f1291b0b5bfeb496b839fe8d3ec7" }, "downloads": -1, "filename": "prefixspan-0.2.2.tar.gz", "has_sig": false, "md5_digest": "ec79b20d1c48fd2d8047deec1054cd4d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4161, "upload_time": "2018-04-18T06:15:18", "url": "https://files.pythonhosted.org/packages/7d/c2/565fc13c8341c64c7b03ea304a1de516b0288d32a5c35fbd8c52df95d063/prefixspan-0.2.2.tar.gz" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "0c966906ab74f16ed7c443328f582bf9", "sha256": "531bfa7f7d80206b5e22ec321710115aec93b563f628f3cf3773f788ff717df7" }, "downloads": -1, "filename": "prefixspan-0.2.3.tar.gz", "has_sig": false, "md5_digest": "0c966906ab74f16ed7c443328f582bf9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4379, "upload_time": "2018-04-18T19:58:07", "url": "https://files.pythonhosted.org/packages/d3/cc/4684b63a5110f925180102ba2c93d105b7ac9ea9d4a62a7bffde901812bb/prefixspan-0.2.3.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "ed8edff113e11c318d239eb218836ce8", "sha256": "02f650cef247a1d1eea738e19b0477c9161105374f7a1cb7fb691b52f2f6181f" }, "downloads": -1, "filename": "prefixspan-0.3.1.tar.gz", "has_sig": false, "md5_digest": "ed8edff113e11c318d239eb218836ce8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5005, "upload_time": "2018-04-22T07:39:21", "url": "https://files.pythonhosted.org/packages/c2/38/6660a7747984375b21fb999d7ec0a1d0022f5007addff6c2a39a1ce07523/prefixspan-0.3.1.tar.gz" } ], "0.3.4": [ { "comment_text": "", "digests": { "md5": "e23309664ca4d5ed10132494bec6b76b", "sha256": "ef5f9b70c64211dcaf45a74b388ab529c36cdb0c83f4ba1d2c4a7e56918e540b" }, "downloads": -1, "filename": "prefixspan-0.3.4.tar.gz", "has_sig": false, "md5_digest": "e23309664ca4d5ed10132494bec6b76b", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 5847, "upload_time": "2018-04-23T00:23:13", "url": "https://files.pythonhosted.org/packages/f2/6a/24c389e9ac9adf7f3c96dac7bfa390f32d8dd6c2400421224770a610e0a8/prefixspan-0.3.4.tar.gz" } ], "0.3.5": [ { "comment_text": "", "digests": { "md5": "266d5423e6b551a784a0a80d503abe05", "sha256": "c347853cd27ce24deb20c3802670ae195e1521c12376cdb5bc288c23acdcb79e" }, "downloads": -1, "filename": "prefixspan-0.3.5.tar.gz", "has_sig": false, "md5_digest": "266d5423e6b551a784a0a80d503abe05", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 5818, "upload_time": "2018-04-23T02:34:01", "url": "https://files.pythonhosted.org/packages/b8/02/6237085d59401776ef82964de57ce7a75309694aa3d488f8ef8feb0d50bd/prefixspan-0.3.5.tar.gz" } ], "0.3.6": [ { "comment_text": "", "digests": { "md5": "bf4b24d7471822f862622ac5cc4dd75a", "sha256": "6e58477a8a94f92041489d31d46729e13c0fdb5c261b6b7f202f095ccebbbbb7" }, "downloads": -1, "filename": "prefixspan-0.3.6.tar.gz", "has_sig": false, "md5_digest": "bf4b24d7471822f862622ac5cc4dd75a", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 5895, "upload_time": "2018-04-24T16:27:18", "url": "https://files.pythonhosted.org/packages/b6/94/8e48e5396510b625c4529b06962c450d269e774c45ec0749c40216226ece/prefixspan-0.3.6.tar.gz" } ], "0.4": [ { "comment_text": "", "digests": { "md5": "9179ed6ac5fc166d1feefc918b3b883b", "sha256": "ac64da572371180c48ba501c9e42e982f903337f72239b7eeaa9ad122b185276" }, "downloads": -1, "filename": "prefixspan-0.4.tar.gz", "has_sig": false, "md5_digest": "9179ed6ac5fc166d1feefc918b3b883b", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 6252, "upload_time": "2018-04-28T04:32:22", "url": "https://files.pythonhosted.org/packages/b2/8a/431f967d5c5c641b713ef96e32aa61a4cb593ae0329a78663d71d340959a/prefixspan-0.4.tar.gz" } ], "0.5": [ { "comment_text": "", "digests": { "md5": "0d0ea56d53edc4d929faf4dcff7bd8bb", "sha256": "e6ac3835bbe7e1e8aca04931df2b6be0d7ef8674d5c162a4953f5050c3135f80" }, "downloads": -1, "filename": "prefixspan-0.5.tar.gz", "has_sig": false, "md5_digest": "0d0ea56d53edc4d929faf4dcff7bd8bb", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 7404, "upload_time": "2018-04-29T06:49:40", "url": "https://files.pythonhosted.org/packages/8f/2b/82f96f9c50b1012178eae48aa92bba97fcb4c6e03559d4a1fdd645a74c26/prefixspan-0.5.tar.gz" } ], "0.5.1": [ { "comment_text": "", "digests": { "md5": "40b4d22cd6f8799e93878f0e3558f83a", "sha256": "cbd7f5b92d71e8592b9d617b3efe7111b340c86feea2287374381a4d51d4e75e" }, "downloads": -1, "filename": "prefixspan-0.5.1.tar.gz", "has_sig": false, "md5_digest": "40b4d22cd6f8799e93878f0e3558f83a", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 8157, "upload_time": "2018-05-09T18:50:57", "url": "https://files.pythonhosted.org/packages/53/8d/f48a549b96f63234e0ff446be390cb8c915d7596401764ca76a0ff508a86/prefixspan-0.5.1.tar.gz" } ], "0.5.1.1": [ { "comment_text": "", "digests": { "md5": "e857d0e3f4f350b893f08f991e266dfd", "sha256": "5d372cc72149b6564d125c169b47df11a3652072f8f09976809873f8265d73e0" }, "downloads": -1, "filename": "prefixspan-0.5.1.1.tar.gz", "has_sig": false, "md5_digest": "e857d0e3f4f350b893f08f991e266dfd", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 8167, "upload_time": "2018-05-16T07:57:13", "url": "https://files.pythonhosted.org/packages/93/f8/2ec0a9617be6e0e4f1ecffcb8cf228d8b8af4ff830828ba7f46bc38bb67f/prefixspan-0.5.1.1.tar.gz" } ], "0.5.1.2": [ { "comment_text": "", "digests": { "md5": "108968230c9f95a7d68c016fd9a904c9", "sha256": "3013881f15cd3ec659dd8778b32fec0bf452069102923838c8ca8d8e8bdc30ee" }, "downloads": -1, "filename": "prefixspan-0.5.1.2.tar.gz", "has_sig": false, "md5_digest": "108968230c9f95a7d68c016fd9a904c9", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 8178, "upload_time": "2018-05-24T14:05:50", "url": "https://files.pythonhosted.org/packages/54/1e/59133723311810007340384e9205f6ad073d8e4f426ca9f39f2e3625fa2c/prefixspan-0.5.1.2.tar.gz" } ], "0.5.2": [ { "comment_text": "", "digests": { "md5": "671beb44a55202bace8e54ead620aa9e", "sha256": "8fbd2f94b3a7f4399d04f9bd6aa214b830fb7828799c472cd43dda10b03f671c" }, "downloads": -1, "filename": "prefixspan-0.5.2.tar.gz", "has_sig": false, "md5_digest": "671beb44a55202bace8e54ead620aa9e", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 10404, "upload_time": "2018-09-29T06:12:27", "url": "https://files.pythonhosted.org/packages/c5/b0/e66e9f6e07a0b37aa0f5703c46f54bafbdf65dfba63994247676b19076c4/prefixspan-0.5.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "671beb44a55202bace8e54ead620aa9e", "sha256": "8fbd2f94b3a7f4399d04f9bd6aa214b830fb7828799c472cd43dda10b03f671c" }, "downloads": -1, "filename": "prefixspan-0.5.2.tar.gz", "has_sig": false, "md5_digest": "671beb44a55202bace8e54ead620aa9e", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3", "size": 10404, "upload_time": "2018-09-29T06:12:27", "url": "https://files.pythonhosted.org/packages/c5/b0/e66e9f6e07a0b37aa0f5703c46f54bafbdf65dfba63994247676b19076c4/prefixspan-0.5.2.tar.gz" } ] }