{"id":13474681,"url":"https://github.com/chuanconggao/PrefixSpan-py","last_synced_at":"2025-03-26T22:31:06.919Z","repository":{"id":27459234,"uuid":"30938163","full_name":"chuanconggao/PrefixSpan-py","owner":"chuanconggao","description":"The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.","archived":false,"fork":false,"pushed_at":"2020-07-25T09:01:16.000Z","size":68,"stargazers_count":418,"open_issues_count":15,"forks_count":92,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-03-10T07:05:20.986Z","etag":null,"topics":["bide","data-mining","feat","pattern-mining","prefixspan"],"latest_commit_sha":null,"homepage":"https://git.io/prefixspan","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chuanconggao.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-02-17T21:15:59.000Z","updated_at":"2025-02-28T09:01:30.000Z","dependencies_parsed_at":"2022-08-19T06:10:53.131Z","dependency_job_id":null,"html_url":"https://github.com/chuanconggao/PrefixSpan-py","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chuanconggao%2FPrefixSpan-py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chuanconggao%2FPrefixSpan-py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chuanconggao%2FPrefixSpan-py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/chuanconggao%2FPrefixSpan-py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chuanconggao","download_url":"https://codeload.github.com/chuanconggao/PrefixSpan-py/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245747569,"owners_count":20665809,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bide","data-mining","feat","pattern-mining","prefixspan"],"created_at":"2024-07-31T16:01:14.013Z","updated_at":"2025-03-26T22:31:06.610Z","avatar_url":"https://github.com/chuanconggao.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![PyPI version](https://img.shields.io/pypi/v/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)\n[![PyPI pyversions](https://img.shields.io/pypi/pyversions/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)\n[![PyPI license](https://img.shields.io/pypi/l/prefixspan.svg)](https://pypi.python.org/pypi/prefixspan/)\n\n**Featured on ImportPython [Issue 173](http://importpython.com/newsletter/no/173/). 
Thank you so much for your support!**\n\nThe shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm [PrefixSpan](https://ieeexplore.ieee.org/abstract/document/914830/), the famous frequent **closed** sequential pattern mining algorithm [BIDE](https://ieeexplore.ieee.org/abstract/document/1319986) (in `closed.py`), and the frequent **generator** sequential pattern mining algorithm [FEAT](https://dl.acm.org/citation.cfm?doid=1367497.1367651) (in `generator.py`), as a unified and holistic algorithm framework.\n\n- BIDE is usually much faster than PrefixSpan on large datasets, as it returns only the closed patterns, a small subset that carries the same information as the full set of patterns.\n\n- FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.\n\nFor simpler code, some general-purpose functions have been moved into a new library, [extratools](https://github.com/chuanconggao/extratools).\n\n## Reference\n\n### Research Papers\n\n``` text\nPrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.\nJian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.\nProceedings of the 17th International Conference on Data Engineering, 2001.\n```\n\n``` text\nBIDE: Efficient Mining of Frequent Closed Sequences.\nJianyong Wang, Jiawei Han.\nProceedings of the 20th International Conference on Data Engineering, 2004.\n```\n\n``` text\nEfficient mining of frequent sequence generators.\nChuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.\nProceedings of the 17th International Conference on World Wide Web, 2008.\n```\n\n### Alternative Implementations\n\nI created this project with the [original](https://github.com/chuanconggao/PrefixSpan-py/commit/441b04eca2174b3c92f6b6b2f50a30f1ffe4968c) minimal 15-line implementation of PrefixSpan for educational purposes. However, as this project has grown into a full-featured library, its code size has inevitably grown as well. 
I have revised and reuploaded the original implementation as a GitHub Gist [here](https://gist.github.com/chuanconggao/4df9c1b06fa7f3ed854d5d96e2ae499f) for reference.\n\nYou can also try my Scala [version](https://github.com/chuanconggao/PrefixSpan-scala) of PrefixSpan.\n\n## Features\n\nOutputs traditional single-item sequential patterns, where gaps are allowed between items.\n\n- Mining top-k patterns is supported, with corresponding efficiency optimizations.\n\n- You can limit the length of mined patterns. Note that setting an appropriate maximum pattern length can significantly speed up the algorithm.\n\n- A custom key function, custom filter function, and custom callback function can be applied.\n\n## Installation\n\nThis package is available on PyPI. Just use `pip3 install -U prefixspan` to install it.\n\n## CLI Usage\n\nYou can run the algorithms directly from the terminal.\n\n``` text\nUsage:\n    prefixspan-cli (frequent | top-k) \u003cthreshold\u003e [options] [\u003cfile\u003e]\n\n    prefixspan-cli --help\n\n\nOptions:\n    --text             Treat each item as text instead of integer.\n\n    --closed           Return only closed patterns.\n    --generator        Return only generator patterns.\n\n    --key=\u003ckey\u003e        Custom key function. [default: ]\n                       Must be a Python function in form of \"lambda patt, matches: ...\", returning an integer value.\n    --bound=\u003cbound\u003e    The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]\n                       Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).\n                       Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).\n\n    --filter=\u003cfilter\u003e  Custom filter function. 
[default: ]\n                       Must be a Python function in form of \"lambda patt, matches: ...\", returning a boolean value.\n\n    --minlen=\u003cminlen\u003e  Minimum length of patterns. [default: 1]\n    --maxlen=\u003cmaxlen\u003e  Maximum length of patterns. [default: 1000]\n```\n\n* Sequences are read from standard input. Each sequence is integers separated by space, like this example:\n\n``` text\ncat test.dat\n\n0 1 2 3 4\n1 1 1 3 4\n2 1 2 2 0\n1 1 1 2 2\n```\n\n- When dealing with text data, please use the `--text` option. Each sequence is words separated by space, assuming stop words have been removed, like this example:\n\n``` text\ncat test.txt\n\na b c d e\nb b b d e\nc b c c a\nb b b c c\n```\n\n* The patterns and their respective frequencies are printed to standard output.\n\n``` text\nprefixspan-cli frequent 2 test.dat\n\n0 : 2\n1 : 4\n1 2 : 3\n1 2 2 : 2\n1 3 : 2\n1 3 4 : 2\n1 4 : 2\n1 1 : 2\n1 1 1 : 2\n2 : 3\n2 2 : 2\n3 : 2\n3 4 : 2\n4 : 2\n```\n\n``` text\nprefixspan-cli frequent 2 --text test.txt\n\na : 2\nb : 4\nb c : 3\nb c c : 2\nb d : 2\nb d e : 2\nb e : 2\nb b : 2\nb b b : 2\nc : 3\nc c : 2\nd : 2\nd e : 2\ne : 2\n```\n\n## API Usage\n\nAlternatively, you can use the algorithms via API.\n\n``` python\nfrom prefixspan import PrefixSpan\n\ndb = [\n    [0, 1, 2, 3, 4],\n    [1, 1, 1, 3, 4],\n    [2, 1, 2, 2, 0],\n    [1, 1, 1, 2, 2],\n]\n\nps = PrefixSpan(db)\n```\n\nFor details of each parameter, please refer to the `PrefixSpan` class in `prefixspan/api.py`.\n\n``` python\nprint(ps.frequent(2))\n# [(2, [0]),\n#  (4, [1]),\n#  (3, [1, 2]),\n#  (2, [1, 2, 2]),\n#  (2, [1, 3]),\n#  (2, [1, 3, 4]),\n#  (2, [1, 4]),\n#  (2, [1, 1]),\n#  (2, [1, 1, 1]),\n#  (3, [2]),\n#  (2, [2, 2]),\n#  (2, [3]),\n#  (2, [3, 4]),\n#  (2, [4])]\n\nprint(ps.topk(5))\n# [(4, [1]),\n#  (3, [2]),\n#  (3, [1, 2]),\n#  (2, [1, 3]),\n#  (2, [1, 3, 4])]\n\n\nprint(ps.frequent(2, closed=True))\n\nprint(ps.topk(5, closed=True))\n\n\nprint(ps.frequent(2, 
generator=True))\n\nprint(ps.topk(5, generator=True))\n```\n\n## Closed Patterns and Generator Patterns\n\nClosed patterns form a much more compact result set because there are fewer of them.\n\n- A pattern is closed if there is no super-pattern with the same frequency.\n\n``` text\nprefixspan-cli frequent 2 --closed test.dat\n\n0 : 2\n1 : 4\n1 2 : 3\n1 2 2 : 2\n1 3 4 : 2\n1 1 1 : 2\n```\n\nGenerator patterns are even more compact because they are both fewer in number and shorter in length.\n\n- A pattern is a generator if there is no sub-pattern with the same frequency.\n\n- Due to their high compactness, generator patterns are useful as features for classification, etc.\n\n``` text\nprefixspan-cli frequent 2 --generator test.dat\n\n0 : 2\n1 1 : 2\n2 : 3\n2 2 : 2\n3 : 2\n4 : 2\n```\n\nSome patterns are both closed and generator patterns.\n\n``` text\nprefixspan-cli frequent 2 --closed --generator test.dat\n\n0 : 2\n```\n\n## Custom Key Function\n\nFor both frequent and top-k algorithms, a custom key function `key=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.\n    \n- By default, `len(matches)` is used, which is the frequency of the current pattern.\n\n- Alternatively, any key function can be used. 
As an example, `sum(len(db[i]) for i, _ in matches)` can be used to score patterns by the total number of items in the sequences they match.\n\n- For efficiency, an anti-monotone upper-bound function should also be specified for pruning.\n\n    - If unspecified, the key function is also the upper-bound function, and must be anti-monotone.\n\n``` python\nprint(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))\n# [(20, [1]),\n#  (15, [2]),\n#  (15, [1, 2]),\n#  (10, [1, 3]),\n#  (10, [1, 3, 4])]\n```\n\n## Custom Filter Function\n\nFor both frequent and top-k algorithms, a custom filter function `filter=lambda patt, matches: ...` can be applied, where `patt` is the current pattern and `matches` is the current list of matching sequence `(id, position)` tuples.\n\n- By default, `filter` is not applied and all the patterns are returned.\n\n- Alternatively, any function can be used. As an example, `matches[0][0] \u003e 0` can be used to exclude the patterns covering the first sequence.\n\n``` python\nprint(ps.topk(5, filter=lambda patt, matches: matches[0][0] \u003e 0))\n# [(2, [1, 1]),\n#  (2, [1, 1, 1]),\n#  (2, [1, 2, 2]),\n#  (2, [2, 2]),\n#  (1, [1, 2, 2, 0])]\n```\n\n## Custom Callback Function\n\nFor both the frequent and the top-k algorithm, you can use a custom callback function `callback=lambda patt, matches: ...` instead of returning the normal results of patterns and their respective frequencies.\n\n- When a callback function is specified, `None` is returned.\n\n- For large datasets, when mining frequent patterns, you can use a callback function to process each pattern immediately and avoid building a huge list of patterns that consumes a large amount of memory.\n\n- The following example finds the longest frequent pattern covering each sequence.\n\n``` python\ncoverage = [[] for i in range(len(db))]\n\ndef cover(patt, matches):\n    for i, _ in matches:\n        coverage[i] = max(coverage[i], patt, key=len)\n\n\nps.frequent(2, 
callback=cover)\n\nprint(coverage)\n# [[1, 3, 4],\n#  [1, 3, 4],\n#  [1, 2, 2],\n#  [1, 2, 2]]\n```\n\n## Tip\n\nI strongly encourage using [PyPy](http://pypy.org/) instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster on average. To start, you can install this package in a [virtual environment](https://virtualenv.pypa.io/en/stable/) created for PyPy.\n\nNote that only the earlier version 0.4 works with the latest PyPy3 6.0.0 (compatible with Python 3.5.3). Please install it via `pip3 install prefixspan==0.4`. The latest version should work with the future PyPy3 (compatible with Python 3.6).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchuanconggao%2FPrefixSpan-py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchuanconggao%2FPrefixSpan-py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchuanconggao%2FPrefixSpan-py/lists"}