{"id":43182936,"url":"https://github.com/jacksonpradolima/gsp-py","last_synced_at":"2026-02-08T22:03:35.830Z","repository":{"id":57436238,"uuid":"108451832","full_name":"jacksonpradolima/gsp-py","owner":"jacksonpradolima","description":"GSP (Generalized Sequence Pattern) algorithm in Python ","archived":false,"fork":false,"pushed_at":"2026-02-06T11:06:33.000Z","size":1337,"stargazers_count":39,"open_issues_count":5,"forks_count":23,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-02-06T11:26:21.729Z","etag":null,"topics":["data-analysis","data-mining","data-mining-algorithms","gsp","pattern-recognition","python","sequence-mining","sequential-patterns"],"latest_commit_sha":null,"homepage":"https://jacksonpradolima.github.io/gsp-py/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jacksonpradolima.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["jacksonpradolima"],"buy_me_a_coffee":"pradolima"}},"created_at":"2017-10-26T18:43:24.000Z","updated_at":"2026-02-06T02:01:52.000Z","dependencies_parsed_at":"2026-02-01T04:01:48.956Z","dependency_job_id":null,"html_url":"https://github.com/jacksonpradolima/gsp-py","commit_stats":{"total_commits":18,"total_committers":3,"mean_commits":6.0,"dds":0.6666666666666667,"last_synced_commit":"bf979e58c9545f4df3582bf249d33a87ea705863"},"previous_names":[],"tags_count":31,"template":fa
lse,"template_full_name":null,"purl":"pkg:github/jacksonpradolima/gsp-py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonpradolima%2Fgsp-py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonpradolima%2Fgsp-py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonpradolima%2Fgsp-py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonpradolima%2Fgsp-py/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jacksonpradolima","download_url":"https://codeload.github.com/jacksonpradolima/gsp-py/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jacksonpradolima%2Fgsp-py/sbom","scorecard":{"id":500877,"data":{"date":"2025-08-11","repo":{"name":"github.com/jacksonpradolima/gsp-py","commit":"5fd578603713cb4739febd1e6411add5eff32ccc"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":5.3,"checks":[{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous 
patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/2 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Security-Policy","score":10,"reason":"security policy file detected","details":["Info: security policy file detected: SECURITY.md:1","Info: Found linked content: SECURITY.md:1","Info: Found disclosure, vulnerability, and/or timelines in security policy: SECURITY.md:1","Info: Found text in security policy: SECURITY.md:1"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/code_quality.yml:15: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/code_quality.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/code_quality.yml:19: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/code_quality.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/code_quality.yml:31: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/code_quality.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codecov.yml:12: update your workflow using 
https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/codecov.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/codecov.yml:17: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/codecov.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/codecov.yml:31: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/codecov.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/codecov.yml:40: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/codecov.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish.yml:17: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/publish.yml/master?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/publish.yml:20: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/publish.yml/master?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/publish.yml:34: update your workflow using https://app.stepsecurity.io/secureworkflow/jacksonpradolima/gsp-py/publish.yml/master?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/codecov.yml:23","Warn: pipCommand not pinned by hash: .github/workflows/publish.yml:26","Warn: pipCommand not pinned by hash: .github/workflows/publish.yml:27","Info:   0 out of   5 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   5 third-party GitHubAction dependencies pinned","Info:   0 out of   3 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build 
process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/code_quality.yml:1","Warn: no topLevel permission defined: .github/workflows/codecov.yml:1","Warn: no topLevel permission defined: .github/workflows/publish.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a 
license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Packaging","score":10,"reason":"packaging workflow detected","details":["Info: Project packages its releases by way of GitHub Actions.: .github/workflows/publish.yml:8"],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Branch-Protection","score":3,"reason":"branch protection is not maximal on development and all release branches","details":["Info: 'allow deletion' disabled on branch 'master'","Info: 'force pushes' disabled on branch 'master'","Warn: 'branch protection settings apply to administrators' is disabled on branch 'master'","Info: 'stale review dismissal' is required to merge on branch 'master'","Warn: branch 'master' does not require approvers","Info: codeowner review is required on branch 'master'","Warn: 'last push approval' is disabled on branch 'master'","Warn: no status checks found to merge onto branch 'master'","Info: PRs are required in order to make changes on branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":10,"reason":"SAST tool is run on all commits","details":["Info: all commits (30) are checked with a SAST tool"],"documentation":{"short":"Determines if the 
project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-19T21:49:01.701Z","repository_id":57436238,"created_at":"2025-08-19T21:49:01.701Z","updated_at":"2025-08-19T21:49:01.701Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29246439,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-08T21:42:34.334Z","status":"ssl_error","status_checked_at":"2026-02-08T21:41:38.468Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-mining","data-mining-algorithms","gsp","pattern-recognition","python","sequence-mining","sequential-patterns"],"created_at":"2026-02-01T04:01:34.613Z","updated_at":"2026-02-08T22:03:35.821Z","avatar_url":"https://github.com/jacksonpradolima.png","language":"Python","readme":"[![Docs](https://img.shields.io/badge/Docs-GSP--Py%20Site-3D9970?style=flat-square)](https://jacksonpradolima.github.io/gsp-py/)\n[![PyPI License](https://img.shields.io/pypi/l/gsppy.svg?style=flat-square)]()\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3333987.svg)](https://doi.org/10.5281/zenodo.3333987)\n\n[![PyPI Downloads](https://img.shields.io/pypi/dm/gsppy.svg?style=flat-square)](https://pypi.org/project/gsppy/)\n[![PyPI 
version](https://badge.fury.io/py/gsppy.svg)](https://pypi.org/project/gsppy)\n![](https://img.shields.io/badge/python-3.11+-blue.svg)\n\n[![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/jacksonpradolima/gsp-py/badge)](https://securityscorecards.dev/viewer/?uri=github.com/jacksonpradolima/gsp-py)\n[![SLSA provenance](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml/badge.svg)](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml)\n[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/11684/badge)](https://www.bestpractices.dev/projects/11684)\n\n[![Bugs](https://sonarcloud.io/api/project_badges/measure?project=jacksonpradolima_gsp-py\u0026metric=bugs)](https://sonarcloud.io/summary/new_code?id=jacksonpradolima_gsp-py)\n[![Vulnerabilities](https://sonarcloud.io/api/project_badges/measure?project=jacksonpradolima_gsp-py\u0026metric=vulnerabilities)](https://sonarcloud.io/summary/new_code?id=jacksonpradolima_gsp-py)\n[![Security Rating](https://sonarcloud.io/api/project_badges/measure?project=jacksonpradolima_gsp-py\u0026metric=security_rating)](https://sonarcloud.io/summary/new_code?id=jacksonpradolima_gsp-py)\n[![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=jacksonpradolima_gsp-py\u0026metric=sqale_rating)](https://sonarcloud.io/summary/new_code?id=jacksonpradolima_gsp-py)\n[![codecov](https://codecov.io/github/jacksonpradolima/gsp-py/graph/badge.svg?token=o1P0qXaYtJ)](https://codecov.io/github/jacksonpradolima/gsp-py)\n\n# GSP-Py\n\n**GSP-Py**: A Python-powered library to mine sequential patterns in large datasets, based on the robust **Generalized\nSequential Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.\n\n\u003e [!IMPORTANT]\n\u003e GSP-Py is compatible with Python 3.11 and later versions!\n\n---\n\n## 📚 Table of Contents\n\n1. [🔍 What is GSP?](#what-is-gsp)\n2. 
[🔧 Requirements](#requirements)\n3. [🚀 Installation](#installation)\n    - [❖ Clone Repository](#option-1-clone-the-repository)\n    - [❖ Install via PyPI](#option-2-install-via-pip)\n4. [🛠️ Developer Installation](#developer-installation)\n5. [📖 Documentation](#documentation)\n6. [💡 Usage](#usage)\n    - [✅ Example: Analyzing Sales Data](#example-analyzing-sales-data)\n    - [📊 Explanation: Support and Results](#explanation-support-and-results)\n    - [📊 DataFrame Input Support](#dataframe-input-support)\n    - [🔗 Itemset Support](#itemset-support)\n    - [⏱️ Temporal Constraints](#temporal-constraints)\n7. [⌨️ Typing](#typing)\n8. [🌟 Planned Features](#planned-features)\n9. [🤝 Contributing](#contributing)\n10. [📝 License](#license)\n11. [📖 Citation](#citation)\n\n---\n\n## 🔍 What is GSP?\n\nThe **Generalized Sequential Pattern (GSP)** algorithm is a sequential pattern mining technique based on **Apriori\nprinciples**. Using support thresholds, GSP identifies frequent sequences of items in transaction datasets.\n\n### Key Features:\n\n- **Ordered (non-contiguous) matching**: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. 
For example, the pattern `('A', 'C')` is found in the sequence `['A', 'B', 'C']`.\n- **Support-based pruning**: Only retains sequences that meet the minimum support threshold.\n- **Candidate generation**: Iteratively generates candidate sequences of increasing length.\n- **Temporal constraints**: Support for time-constrained pattern mining with `mingap`, `maxgap`, and `maxspan` parameters to find patterns within specific time windows.\n- **General-purpose**: Useful in retail, web analytics, social networks, temporal sequence mining, and more.\n\nFor example:\n\n- In a shopping dataset, GSP can identify patterns like \"Customers who buy bread and milk often purchase diapers next\" - even if other items appear between bread and milk.\n- In a website clickstream, GSP might find patterns like \"Users visit A, then eventually go to C\" - capturing user journeys with intermediate steps.\n\n---\n\n## 🔧 Requirements\n\nYou will need Python 3.11 or later installed on your system. On Debian- or Ubuntu-based Linux systems, you can install Python with:\n\n```bash\nsudo apt install python3\n```\n\nGSP-Py's package dependencies are installed automatically when you install it with `pip`.\n\n---\n\n## 🚀 Installation\n\nGSP-Py can be easily installed from either the **repository** or PyPI.\n\n### Option 1: Clone the Repository\n\nTo manually clone the repository and set up the environment:\n\n```bash\ngit clone https://github.com/jacksonpradolima/gsp-py.git\ncd gsp-py\n```\n\nRefer to the [Developer Installation](#developer-installation) section and run the setup with uv.\n\n### Option 2: Install via `pip`\n\nAlternatively, install GSP-Py from PyPI with:\n\n```bash\npip install gsppy\n```\n\n---\n\n## 🛠️ Developer Installation\n\nThis project now uses [uv](https://github.com/astral-sh/uv) for dependency management and virtual environments.\n\n#### 1. 
Install uv\n```bash\ncurl -Ls https://astral.sh/uv/install.sh | bash\n```\n\nMake sure uv is on your PATH (for most Linux setups):\n```bash\nexport PATH=\"$HOME/.local/bin:$PATH\"\n```\n\n#### 2. Set up the project environment\nCreate a local virtual environment and install dependencies from uv.lock (single source of truth):\n\n```bash\nuv venv .venv\nuv sync --frozen --extra dev  # uses uv.lock\nuv pip install -e .\n```\n\n#### 3. Optional: Enable Rust acceleration\n\nRust acceleration is optional and provides faster support counting using a PyO3 extension. Python fallback remains available.\n\nBuild the extension locally:\n```bash\nmake rust-build\n```\n\nSelect backend at runtime (auto tries Rust, then falls back to Python):\n```bash\nexport GSPPY_BACKEND=rust   # or python, or unset for auto\n```\n\nRun benchmarks (adjust to your machine):\n```bash\nmake bench-small\nmake bench-big   # may use significant memory/CPU\n# or customize:\nGSPPY_BACKEND=auto uv run --python .venv/bin/python --no-project \\\n  python benchmarks/bench_support.py --n_tx 1000000 --tx_len 8 --vocab 50000 --min_support 0.2 --warmup\n```\n\n#### 4. Optional: Enable GPU (CuPy) acceleration\n\nGPU acceleration is experimental and currently optimizes singleton (k=1) support counting using CuPy.\nNon-singleton candidates fall back to the Rust/Python backend.\n\nInstall the optional extra (choose a CuPy build that matches your CUDA/ROCm setup if needed):\n\n```bash\nuv run pip install -e .[gpu]\n```\n\nSelect the GPU backend at runtime:\n\n```bash\nexport GSPPY_BACKEND=gpu\n```\n\nIf a GPU isn't available, an error will be raised when GSPPY_BACKEND=gpu is set. Otherwise, the default \"auto\" uses CPU.\n\n#### 5. 
Common development tasks\nAfter the environment is ready, activate it and run tasks with standard tools:\n\n```bash\nsource .venv/bin/activate\npytest -n auto\nruff check .\npyright\n```\n\nIf you prefer, you can also prefix commands with uv without activating:\n\n```bash\nuv run pytest -n auto\nuv run ruff check .\nuv run pyright\n```\n\n#### 6. Makefile (shortcuts)\nYou can use the Makefile to automate common tasks:\n\n```bash\nmake setup               # create .venv with uv and pin Python\nmake install             # sync deps (from uv.lock) + install project (-e .)\nmake test                # pytest -n auto\nmake lint                # ruff check .\nmake format              # ruff --fix\nmake typecheck           # pyright + ty\nmake pre-commit-install  # install the pre-commit hook\nmake pre-commit-run      # run pre-commit on all files\n\n# Rust-specific shortcuts\nmake rust-setup          # install rustup toolchain\nmake rust-build          # build PyO3 extension with maturin\nmake bench-small         # run small benchmark\nmake bench-big           # run large benchmark\n```\n\n\u003e [!NOTE]\n\u003e Tox in this project uses the \"tox-uv\" plugin. When running `make tox` or `tox`, missing Python interpreters can be provisioned automatically via uv (no need to pre-install all versions). 
This makes local setup faster.\n\n## 🔏 Release assets and verification\n\nEvery GitHub release bundles artifacts to help you validate what you download:\n\n- Built wheels and source distributions produced by the automated publish workflow.\n- `sbom.json` (CycloneDX) generated with [Syft](https://github.com/anchore/syft).\n- Sigstore-generated `.sig` and `.pem` files for each artifact, created using GitHub OIDC identity.\n\nTo verify a downloaded artifact from a release:\n\n```bash\npython -m pip install sigstore  # installs the CLI\nsigstore verify identity \\\n  --certificate gsppy-\u003cversion\u003e-py3-none-any.whl.pem \\\n  --signature gsppy-\u003cversion\u003e-py3-none-any.whl.sig \\\n  --cert-identity \"https://github.com/jacksonpradolima/gsp-py/.github/workflows/publish.yml@refs/tags/v\u003cversion\u003e\" \\\n  --cert-oidc-issuer https://token.actions.githubusercontent.com \\\n  gsppy-\u003cversion\u003e-py3-none-any.whl\n```\n\nReplace `\u003cversion\u003e` with the numeric package version (for example, `3.1.1`) in the filenames; in `--cert-identity`, this becomes `v\u003cversion\u003e` (for example, `v3.1.1`). Adjust the filenames for the sdist (`.tar.gz`) if preferred. The same release page also hosts `sbom.json` for supply-chain inspection.\n\n## 📖 Documentation\n\n- **Live site:** https://jacksonpradolima.github.io/gsp-py/\n- **Build locally:**\n\n  ```bash\n  uv venv .venv\n  uv sync --extra docs\n  uv run mkdocs serve\n  ```\n\nThe docs use MkDocs with the Material theme and mkdocstrings to render the Python API directly from docstrings.\n\n## 💡 Usage\n\nThe library is designed to be easy to use and integrate with your own projects. You can use GSP-Py either programmatically (Python API) or directly from the command line (CLI).\n\n---\n\n## 🚦 Using GSP-Py via CLI\n\nGSP-Py provides a command-line interface (CLI) for running the Generalized Sequential Pattern algorithm on transactional data. 
This allows you to mine frequent sequential patterns from JSON, CSV, SPM, or Parquet/Arrow files without writing any code.\n\n### Installation\n\nFirst, install GSP-Py (if not already installed):\n\n```bash\npip install gsppy\n```\n\nThis will make the `gsppy` CLI command available in your environment.\n\n### Preparing Your Data\n\nYour input file should be in one of the following formats:\n\n- **JSON**: A list of transactions, where each transaction is a list of items. Example:\n  ```json\n  [\n    [\"Bread\", \"Milk\"],\n    [\"Bread\", \"Diaper\", \"Beer\", \"Eggs\"],\n    [\"Milk\", \"Diaper\", \"Beer\", \"Coke\"],\n    [\"Bread\", \"Milk\", \"Diaper\", \"Beer\"],\n    [\"Bread\", \"Milk\", \"Diaper\", \"Coke\"]\n  ]\n  ```\n\n- **CSV**: Each row is a transaction, with items separated by commas. Example:\n  ```csv\n  Bread,Milk\n  Bread,Diaper,Beer,Eggs\n  Milk,Diaper,Beer,Coke\n  Bread,Milk,Diaper,Beer\n  Bread,Milk,Diaper,Coke\n  ```\n\n- **SPM/GSP Format**: Uses delimiters to separate elements and sequences. This format is commonly used in sequential pattern mining datasets.\n  - `-1`: Marks the end of an element (itemset)\n  - `-2`: Marks the end of a sequence (transaction)\n  \n  Example:\n  ```text\n  1 2 -1 3 -1 -2\n  4 -1 5 6 -1 -2\n  1 -1 2 3 -1 -2\n  ```\n  \n  The above represents:\n  - Transaction 1: `[[1, 2], [3]]` → flattened to `[1, 2, 3]`\n  - Transaction 2: `[[4], [5, 6]]` → flattened to `[4, 5, 6]`\n  - Transaction 3: `[[1], [2, 3]]` → flattened to `[1, 2, 3]`\n  \n  String tokens are also supported:\n  ```text\n  A B -1 C -1 -2\n  D -1 E F -1 -2\n  ```\n\n- **Parquet/Arrow Files**: Modern columnar data formats (requires 'gsppy[dataframe]')\n  ```bash\n  pip install 'gsppy[dataframe]'\n  ```\n  This installs optional dependencies: `polars`, `pandas`, and `pyarrow` for DataFrame support.\n\n### Running the CLI\n\nUse the following command to run GSP-Py on your data:\n\n```bash\ngsppy --file path/to/transactions.json --min_support 0.3 --backend auto\n```\n\nOr for CSV files:\n\n```bash\ngsppy --file 
path/to/transactions.csv --min_support 0.3 --backend rust\n```\n\nFor SPM/GSP format files, use the `--format spm` option:\n\n```bash\ngsppy --file path/to/data.txt --format spm --min_support 0.3\n```\n\n#### CLI Options\n\n- `--file`: Path to your input file (JSON, CSV, or SPM format). **Required**.\n- `--format`: File format to use for input. Options: `auto` (default, auto-detect from extension), `json`, `csv`, `spm`, `parquet`, `arrow`.\n- `--min_support`: Minimum support threshold as a fraction (e.g., `0.3` for 30%). Default is `0.2`.\n- `--backend`: Backend to use for support counting. One of `auto` (default), `python`, `rust`, or `gpu`.\n- `--output`: Path to save mining results to a file. If not specified, results are printed to console.\n- `--output-format`: Output format for mining results. Options: `auto` (default, detect from extension), `parquet`, `arrow`, `csv`, `json`. Requires `--output` to be specified.\n- `--verbose`: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.\n- `--mingap`, `--maxgap`, `--maxspan`: Temporal constraints for time-aware pattern mining (requires timestamped transactions).\n\n#### Verbose Mode\n\nFor debugging or to track execution in CI/CD pipelines, use the `--verbose` flag:\n\n```bash\ngsppy --file transactions.json --min_support 0.3 --verbose\n```\n\nThis produces structured logging output with timestamps, log levels, and process information:\n\n```\nYYYY-MM-DDTHH:MM:SS | INFO     | PID:4179 | gsppy.gsp | Pre-processing transactions...\nYYYY-MM-DDTHH:MM:SS | DEBUG    | PID:4179 | gsppy.gsp | Unique candidates: [('Bread',), ('Milk',), ...]\nYYYY-MM-DDTHH:MM:SS | INFO     | PID:4179 | gsppy.gsp | Starting GSP algorithm with min_support=0.3...\nYYYY-MM-DDTHH:MM:SS | INFO     | PID:4179 | gsppy.gsp | Run 1: 6 candidates filtered to 5.\n...\n```\n\nFor complete logging documentation, see [docs/logging.md](docs/logging.md).\n\n#### Example\n\nSuppose you have a file 
`transactions.json` as shown above. To find patterns with at least 30% support:\n\n```bash\ngsppy --file transactions.json --min_support 0.3\n```\n\nSample output:\n\n```\nPre-processing transactions...\nStarting GSP algorithm with min_support=0.3...\nRun 1: 6 candidates filtered to 5.\nRun 2: 20 candidates filtered to 3.\nRun 3: 2 candidates filtered to 2.\nRun 4: 1 candidates filtered to 0.\nGSP algorithm completed.\nFrequent Patterns Found:\n\n1-Sequence Patterns:\nPattern: ('Bread',), Support: 4\nPattern: ('Milk',), Support: 4\nPattern: ('Diaper',), Support: 4\nPattern: ('Beer',), Support: 3\nPattern: ('Coke',), Support: 2\n\n2-Sequence Patterns:\nPattern: ('Bread', 'Milk'), Support: 3\nPattern: ('Milk', 'Diaper'), Support: 3\nPattern: ('Diaper', 'Beer'), Support: 3\n\n3-Sequence Patterns:\nPattern: ('Bread', 'Milk', 'Diaper'), Support: 2\nPattern: ('Milk', 'Diaper', 'Beer'), Support: 2\n```\n\n#### Exporting Results\n\nGSP-Py supports exporting mining results to various formats for further analysis or integration with data pipelines:\n\n**Export to Parquet** (efficient columnar format for large datasets):\n```bash\ngsppy --file transactions.json --min_support 0.3 --output results.parquet\n```\n\n**Export to CSV**:\n```bash\ngsppy --file transactions.json --min_support 0.3 --output results.csv\n```\n\n**Export to JSON**:\n```bash\ngsppy --file transactions.json --min_support 0.3 --output results.json\n```\n\n**Specify format explicitly**:\n```bash\ngsppy --file transactions.json --min_support 0.3 --output results.data --output-format parquet\n```\n\nThe exported files contain three columns:\n- `pattern`: The sequential pattern (e.g., `('Bread', 'Milk')`)\n- `support`: Number of transactions containing the pattern\n- `level`: Length of the pattern sequence\n\nExport formats are particularly useful for:\n- **Parquet/Arrow**: Integration with big data tools (Spark, Polars, Pandas), data lakes, and cloud analytics\n- **CSV**: Easy viewing in spreadsheets and 
compatibility with traditional tools\n- **JSON**: Structured data for web applications and APIs\n\n#### Error Handling\n\n- If the file does not exist or is in an unsupported format, a clear error message will be shown.\n- The `min_support` value must be between 0.0 and 1.0 (exclusive of 0.0, inclusive of 1.0).\n\n#### Advanced: Verbose Output\n\nTo see detailed logs for debugging, add the `--verbose` flag:\n\n```bash\ngsppy --file transactions.json --min_support 0.3 --verbose\n```\n\n---\n\nThe following example shows how to use GSP-Py programmatically in Python:\n\n### Example Input Data\n\nThe input to the algorithm is a sequence of transactions, where each transaction contains a sequence of items:\n\n```python\ntransactions = [\n    ['Bread', 'Milk'],\n    ['Bread', 'Diaper', 'Beer', 'Eggs'],\n    ['Milk', 'Diaper', 'Beer', 'Coke'],\n    ['Bread', 'Milk', 'Diaper', 'Beer'],\n    ['Bread', 'Milk', 'Diaper', 'Coke']\n]\n```\n\n### Importing and Initializing the GSP Algorithm\n\nImport the `GSP` class from the `gsppy` package and call the `search` method to find frequent patterns with a support\nthreshold (e.g., `0.3`):\n\n```python\nfrom gsppy.gsp import GSP\n\n# Example transactions: customer purchases\ntransactions = [\n    ['Bread', 'Milk'],  # Transaction 1\n    ['Bread', 'Diaper', 'Beer', 'Eggs'],  # Transaction 2\n    ['Milk', 'Diaper', 'Beer', 'Coke'],  # Transaction 3\n    ['Bread', 'Milk', 'Diaper', 'Beer'],  # Transaction 4\n    ['Bread', 'Milk', 'Diaper', 'Coke']  # Transaction 5\n]\n\n# Set minimum support threshold (30%)\nmin_support = 0.3\n\n# Find frequent patterns\nresult = GSP(transactions).search(min_support)\n\n# Output the results\nprint(result)\n```\n\n### Verbose Mode for Debugging\n\nEnable detailed logging to track algorithm progress and debug issues:\n\n```python\nfrom gsppy.gsp import GSP\n\n# Enable verbose logging for the entire instance\ngsp = GSP(transactions, verbose=True)\nresult = gsp.search(min_support=0.3)\n\n# Or enable verbose 
for a specific search only\ngsp = GSP(transactions)\nresult = gsp.search(min_support=0.3, verbose=True)\n```\n\nVerbose mode provides:\n- Detailed progress information during execution\n- Candidate generation and filtering statistics\n- Preprocessing and validation details\n- Useful for debugging, research, and CI/CD integration\n\nFor complete documentation on logging, see [docs/logging.md](docs/logging.md).\n\n### Using Sequence Objects for Rich Pattern Representation\n\nGSP-Py 4.0+ introduces a **Sequence abstraction class** that provides a richer, more maintainable way to work with sequential patterns. The Sequence class encapsulates pattern items, support counts, and optional metadata in an immutable, hashable object.\n\n#### Traditional Dict-based Output (Default)\n\n```python\nfrom gsppy import GSP\n\ntransactions = [\n    ['Bread', 'Milk'],\n    ['Bread', 'Diaper', 'Beer', 'Eggs'],\n    ['Milk', 'Diaper', 'Beer', 'Coke']\n]\n\ngsp = GSP(transactions)\nresult = gsp.search(min_support=0.3)\n\n# Returns one dict per pattern length, e.g. [{('Bread',): 2, ('Milk',): 2, ...}, ...]\nfor level_patterns in result:\n    for pattern, support in level_patterns.items():\n        print(f\"Pattern: {pattern}, Support: {support}\")\n```\n\n#### Sequence Objects (New Feature)\n\n```python\nfrom gsppy import GSP\n\ntransactions = [\n    ['Bread', 'Milk'],\n    ['Bread', 'Diaper', 'Beer', 'Eggs'],\n    ['Milk', 'Diaper', 'Beer', 'Coke']\n]\n\ngsp = GSP(transactions)\nresult = gsp.search(min_support=0.3, return_sequences=True)\n\n# Returns one list of Sequence objects per pattern length, e.g. [[Sequence(('Bread',), support=2), ...], ...]\nfor level_patterns in result:\n    for seq in level_patterns:\n        print(f\"Pattern: {seq.items}, Support: {seq.support}, Length: {seq.length}\")\n        # Access sequence properties\n        print(f\"  First item: {seq.first_item}, Last item: {seq.last_item}\")\n        # Check if item is in sequence\n        if \"Milk\" in seq:\n            
print(\"  Contains Milk!\")\n```\n\n#### Key Benefits of Sequence Objects\n\n1. **Rich API**: Access pattern properties like `length`, `first_item`, `last_item`\n2. **Type Safety**: IDE autocomplete and better type hints\n3. **Immutable \u0026 Hashable**: Can be used as dictionary keys\n4. **Extensible**: Add metadata for confidence, lift, or custom properties\n5. **Backward Compatible**: Convert to/from dict format as needed\n\n```python\nfrom gsppy import Sequence, sequences_to_dict, dict_to_sequences\n\n# Create custom sequences\nseq = Sequence.from_tuple((\"A\", \"B\", \"C\"), support=5)\n\n# Extend sequences\nextended = seq.extend(\"D\")  # Creates Sequence((\"A\", \"B\", \"C\", \"D\"))\n\n# Add metadata\nseq_with_meta = seq.with_metadata(confidence=0.85, lift=1.5)\n\n# Convert between formats for compatibility\n# (`gsp` is a GSP instance created as in the examples above)\nseq_result = gsp.search(min_support=0.3, return_sequences=True)\ndict_format = sequences_to_dict(seq_result[0])  # Convert the first level (1-item patterns) to a dict\n```\n\nFor a complete example, see [examples/sequence_example.py](examples/sequence_example.py).\n\n### Loading SPM/GSP Format Files\n\nGSP-Py supports loading datasets in the classical SPM/GSP delimiter format, which is widely used in sequential pattern mining research. 
This format uses:\n- `-1` to mark the end of an element (itemset)\n- `-2` to mark the end of a sequence (transaction)\n\n#### Using the SPM Loader\n\n```python\nfrom gsppy.utils import read_transactions_from_spm\nfrom gsppy import GSP\n\n# Load SPM format file\ntransactions = read_transactions_from_spm('data.txt')\n\n# Run GSP algorithm\ngsp = GSP(transactions)\nresult = gsp.search(min_support=0.3)\n```\n\n#### SPM Format Examples\n\n**Simple sequence file (`data.txt`):**\n```text\n1 2 -1 3 -1 -2\n4 -1 5 6 -1 -2\n1 -1 2 3 -1 -2\n```\n\nThis represents:\n- Transaction 1: Items [1, 2] followed by item [3] → flattened to [1, 2, 3]\n- Transaction 2: Item [4] followed by items [5, 6] → flattened to [4, 5, 6]\n- Transaction 3: Item [1] followed by items [2, 3] → flattened to [1, 2, 3]\n\n**String tokens are also supported:**\n```text\nA B -1 C -1 -2\nD -1 E F -1 -2\n```\n\n#### Token Mapping\n\nFor workflows requiring conversion between string tokens and integer IDs, use the `TokenMapper`:\n\n```python\nfrom gsppy.utils import read_transactions_from_spm\nfrom gsppy import TokenMapper\n\n# Load with mappings\ntransactions, str_to_int, int_to_str = read_transactions_from_spm(\n    'data.txt', \n    return_mappings=True\n)\n\nprint(\"String to Int:\", str_to_int)\n# Output: {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}\n\nprint(\"Int to String:\", int_to_str)\n# Output: {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6'}\n\n# Use the TokenMapper class directly\nmapper = TokenMapper()\nid_a = mapper.add_token(\"A\")\nid_b = mapper.add_token(\"B\")\nprint(f\"A -\u003e {id_a}, B -\u003e {id_b}\")\n# Output: A -\u003e 0, B -\u003e 1\n```\n\n#### Edge Cases Handled\n\nThe SPM loader gracefully handles:\n- Empty lines (skipped)\n- Missing `-2` delimiter at end of line\n- Extra or consecutive delimiters\n- Mixed-length elements in sequences\n- Both integer and string tokens\n\n### Output\n\nThe algorithm will return a list of patterns with their corresponding support.\n\nSample 
Output:\n\n```python\n[\n    {('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},\n    {('Bread', 'Milk'): 3, ('Bread', 'Diaper'): 3, ('Bread', 'Beer'): 2, ('Milk', 'Diaper'): 3, ('Milk', 'Beer'): 2, ('Milk', 'Coke'): 2, ('Diaper', 'Beer'): 3, ('Diaper', 'Coke'): 2},\n    {('Bread', 'Milk', 'Diaper'): 2, ('Bread', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Coke'): 2}\n]\n```\n\n- The **first dictionary** contains single-item sequences with their support counts (e.g., `('Bread',): 4` means \"Bread\"\n  appears in 4 transactions).\n- The **second dictionary** contains 2-item sequential patterns (e.g., `('Bread', 'Milk'): 3` means the sequence\n  \"Bread → Milk\" appears in 3 transactions). Note that patterns like `('Bread', 'Beer')` are detected even when the items are not adjacent in a transaction; they only need to appear in order.\n- The **third dictionary** contains 3-item sequential patterns (e.g., `('Bread', 'Milk', 'Diaper'): 2` means the\n  sequence \"Bread → Milk → Diaper\" appears in 2 transactions).\n\n\u003e [!NOTE]\n\u003e The **support** of a sequence is the fraction of transactions containing the sequence **in order** (not necessarily contiguously). The values in the output above are the corresponding transaction counts, e.g.,\n\u003e `('Bread', 'Milk')` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).\n\u003e Support helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user\n\u003e behavior.\n\n\u003e [!IMPORTANT]\n\u003e **Non-contiguous (ordered) matching**: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern `('Bread', 'Beer')` matches the transaction `['Bread', 'Milk', 'Diaper', 'Beer']` because Bread appears before Beer, even though they are not adjacent. 
This follows the standard GSP algorithm semantics for sequential pattern mining.\n\n### Understanding Non-Contiguous Pattern Matching\n\nGSP-Py follows the standard GSP algorithm semantics by detecting **ordered (non-contiguous)** subsequences. This means:\n\n- ✅ **Order matters**: Items must appear in the specified sequence order\n- ✅ **Gaps allowed**: Items don't need to be adjacent\n- ❌ **Wrong order rejected**: Items appearing in different order won't match\n\n**Example:**\n\n```python\nfrom gsppy.gsp import GSP\n\nsequences = [\n    ['a', 'b', 'c'],  # Contains: (a,b), (a,c), (b,c), (a,b,c)\n    ['a', 'c'],       # Contains: (a,c)\n    ['b', 'c', 'a'],  # Contains: (b,c), (b,a), (c,a)\n    ['a', 'b', 'c', 'd'],  # Contains: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), etc.\n]\n\ngsp = GSP(sequences)\nresult = gsp.search(min_support=0.5)  # Need at least 2/4 sequences\n\n# Pattern ('a', 'c') is found with support=3 because:\n# - It appears in ['a', 'b', 'c'] (with 'b' in between)\n# - It appears in ['a', 'c'] (adjacent)\n# - It appears in ['a', 'b', 'c', 'd'] (with 'b' in between)\n# Total: 3 out of 4 sequences = 75% support ✅\n```\n\n\n\u003e [!TIP]\n\u003e For more complex examples, find example scripts in the [`gsppy/tests`](gsppy/tests) folder.\n\n---\n\n## 📊 DataFrame Input Support\n\nGSP-Py supports **Polars and Pandas DataFrames** as input, enabling high-performance workflows with modern data formats like Arrow and Parquet. This feature is particularly useful for large-scale data engineering pipelines and integration with existing data processing workflows.\n\n### Installation\n\nInstall GSP-Py with DataFrame support:\n\n```bash\npip install 'gsppy[dataframe]'\n```\n\nThis installs the optional dependencies: `polars`, `pandas`, and `pyarrow`.\n\n### DataFrame Input Formats\n\nGSP-Py supports two DataFrame formats:\n\n#### 1. 
Grouped Format (Transaction ID + Item Columns)\n\nUse when your data has separate rows for each item in a transaction:\n\n```python\nimport polars as pl\nfrom gsppy import GSP\n\n# Polars DataFrame with transaction_id and item columns\ndf = pl.DataFrame({\n    \"transaction_id\": [1, 1, 2, 2, 2, 3, 3],\n    \"item\": [\"Bread\", \"Milk\", \"Bread\", \"Diaper\", \"Beer\", \"Milk\", \"Coke\"],\n})\n\n# Run GSP directly on the DataFrame\ngsp = GSP(df, transaction_col=\"transaction_id\", item_col=\"item\")\npatterns = gsp.search(min_support=0.3)\n\nfor level, freq_patterns in enumerate(patterns, start=1):\n    print(f\"\\n{level}-Sequence Patterns:\")\n    for pattern, support in freq_patterns.items():\n        print(f\"  {pattern}: {support}\")\n```\n\n#### 2. Sequence Format (List Column)\n\nUse when each row contains a complete transaction as a list:\n\n```python\nimport pandas as pd\nfrom gsppy import GSP\n\n# Pandas DataFrame with sequences as lists\ndf = pd.DataFrame({\n    \"transaction\": [\n        [\"Bread\", \"Milk\"],\n        [\"Bread\", \"Diaper\", \"Beer\"],\n        [\"Milk\", \"Coke\"],\n    ]\n})\n\ngsp = GSP(df, sequence_col=\"transaction\")\npatterns = gsp.search(min_support=0.3)\n```\n\n### DataFrame with Timestamps\n\nDataFrames support temporal constraints for time-aware pattern mining:\n\n```python\nimport polars as pl\nfrom gsppy import GSP\n\n# Grouped format with timestamps\ndf = pl.DataFrame({\n    \"transaction_id\": [1, 1, 1, 2, 2, 2],\n    \"item\": [\"Login\", \"Browse\", \"Purchase\", \"Login\", \"Browse\", \"Purchase\"],\n    \"timestamp\": [0, 2, 5, 0, 1, 15],  # Time in seconds\n})\n\n# Find patterns where consecutive events occur within 10 seconds\ngsp = GSP(\n    df,\n    transaction_col=\"transaction_id\",\n    item_col=\"item\",\n    timestamp_col=\"timestamp\",\n    maxgap=10\n)\npatterns = gsp.search(min_support=0.5)\n```\n\nFor sequence format with timestamps:\n\n```python\nimport pandas as pd\nfrom gsppy import GSP\n\ndf = 
pd.DataFrame({\n    \"sequence\": [[\"A\", \"B\", \"C\"], [\"A\", \"D\"]],\n    \"timestamps\": [[1, 2, 3], [1, 5]],  # Timestamps per item\n})\n\ngsp = GSP(df, sequence_col=\"sequence\", timestamp_col=\"timestamps\", maxgap=3)\npatterns = gsp.search(min_support=0.5)\n```\n\n### Working with Parquet and Arrow Files\n\nDataFrames enable seamless integration with columnar storage formats:\n\n```python\nimport polars as pl\nfrom gsppy import GSP\n\n# Read directly from Parquet\ndf = pl.read_parquet(\"transactions.parquet\")\n\n# Run GSP with automatic schema detection\ngsp = GSP(df, transaction_col=\"txn_id\", item_col=\"product\")\npatterns = gsp.search(min_support=0.2)\n\n# Or use Pandas with Arrow backend\nimport pandas as pd\ndf_pandas = pd.read_parquet(\"transactions.parquet\", engine=\"pyarrow\")\ngsp = GSP(df_pandas, transaction_col=\"txn_id\", item_col=\"product\")\npatterns = gsp.search(min_support=0.2)\n```\n\n### Performance Considerations\n\nDataFrames offer performance benefits for large datasets:\n\n- **Polars**: Leverages Arrow for zero-copy operations and parallel processing\n- **Pandas**: Compatible with Arrow backend for efficient memory usage\n- **Parquet/Arrow**: Columnar storage enables efficient filtering and reading\n- **Schema validation**: Errors are caught early with clear messages\n\n### DataFrame Schema Requirements\n\n**Grouped Format:**\n- `transaction_col`: Column containing transaction/sequence IDs (any type)\n- `item_col`: Column containing items (any type, converted to strings)\n- `timestamp_col` (optional): Column containing timestamps (numeric)\n\n**Sequence Format:**\n- `sequence_col`: Column containing lists of items\n- `timestamp_col` (optional): Column containing lists of timestamps (must match sequence lengths)\n\n### Error Handling\n\nGSP-Py provides clear error messages for schema issues:\n\n```python\nimport polars as pl\nfrom gsppy import GSP\n\ndf = pl.DataFrame({\n    \"txn_id\": [1, 2],\n    \"product\": [\"A\", 
\"B\"],\n})\n\n# ❌ Missing required column\ntry:\n    gsp = GSP(df, transaction_col=\"txn_id\", item_col=\"item\")  # 'item' doesn't exist\nexcept ValueError as e:\n    print(f\"Error: {e}\")  # \"Column 'item' not found in DataFrame\"\n\n# ❌ Invalid format specification\ntry:\n    gsp = GSP(df)  # Must specify either sequence_col or both transaction_col and item_col\nexcept ValueError as e:\n    print(f\"Error: {e}\")  # \"Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'\"\n```\n\n### Backward Compatibility\n\nTraditional list-based input continues to work:\n\n```python\nfrom gsppy import GSP\n\n# Lists still work as before\ntransactions = [[\"A\", \"B\"], [\"A\", \"C\"], [\"B\", \"C\"]]\ngsp = GSP(transactions)\npatterns = gsp.search(min_support=0.5)\n```\n\nDataFrame parameters cannot be mixed with list input:\n\n```python\ntransactions = [[\"A\", \"B\"], [\"C\", \"D\"]]\n\n# ❌ This raises an error\ngsp = GSP(transactions, transaction_col=\"txn\")  # ValueError: DataFrame parameters cannot be used with list input\n```\n\n### Examples and Tests\n\nFor complete examples and edge cases, see:\n- [`tests/test_dataframe.py`](tests/test_dataframe.py) - Comprehensive test suite\n- DataFrame adapter documentation in [`gsppy/dataframe_adapters.py`](gsppy/dataframe_adapters.py)\n\n---\n\n## 🔗 Itemset Support\n\nGSP-Py supports **itemsets** within sequence elements, enabling you to capture **co-occurrence** of multiple items at the same time step. 
This is crucial for applications where items occur together rather than in strict sequential order.\n\n### What are Itemsets?\n\n- **Flat sequences**: `['A', 'B', 'C']` - each item occurs at a separate time step\n- **Itemset sequences**: `[['A', 'B'], ['C']]` - items A and B occur together at the first time step, then C occurs later\n\n### Why Use Itemsets?\n\nItemsets are essential when temporal co-occurrence matters in your domain:\n\n- **Market basket analysis**: Customers buy multiple items in a single shopping trip, then return for more items later\n- **Web analytics**: Users open multiple pages in parallel tabs before moving to the next set of pages\n- **Event logs**: Multiple events can occur simultaneously in complex systems\n- **Purchase patterns**: Items bought together vs. items bought in sequence\n\n### Using Itemsets\n\n#### Basic Example\n\n```python\nfrom gsppy import GSP\n\n# Itemset format: nested lists where inner lists are items that occur together\ntransactions = [\n    [['Bread', 'Milk'], ['Eggs']],  # Bought Bread \u0026 Milk together, then Eggs later\n    [['Bread', 'Milk', 'Butter']],  # Bought all three items together\n    [['Bread', 'Milk'], ['Eggs']],  # Same pattern as customer 1\n]\n\ngsp = GSP(transactions)\npatterns = gsp.search(min_support=0.5)\n\n# Pattern ('Bread',) will match any itemset containing Bread\n# Pattern ('Bread', 'Eggs') will match sequences where Bread appears before Eggs\n# (even if they're in different itemsets)\n```\n\n#### Backward Compatibility with Flat Sequences\n\nGSP-Py automatically normalizes flat sequences to itemsets internally, ensuring full backward compatibility:\n\n```python\nfrom gsppy import GSP\n\n# These are equivalent after normalization:\nflat_transactions = [['A', 'B', 'C']]  # Flat format\nitemset_transactions = [[['A'], ['B'], ['C']]]  # Equivalent itemset format\n\n# Both produce the same results\ngsp1 = GSP(flat_transactions)\ngsp2 = GSP(itemset_transactions)\n\n# Patterns are 
identical\npatterns1 = gsp1.search(min_support=0.5)\npatterns2 = gsp2.search(min_support=0.5)\n```\n\n### Itemset Matching Semantics\n\nPattern matching with itemsets uses **subset semantics**:\n\n- A pattern element matches a sequence element if all items in the pattern element are present in the sequence element\n- Example: Pattern `[['A', 'B']]` matches sequence element `['A', 'B', 'C']` because {A, B} ⊆ {A, B, C}\n- Pattern elements must appear in order across the sequence\n\n```python\nfrom gsppy import GSP\n\ntransactions = [\n    [['A', 'B', 'D'], ['E'], ['C', 'F']],  # A,B,D together, then E, then C,F together\n]\n\ngsp = GSP(transactions)\n\n# Pattern ('A', 'C') will match because:\n# - 'A' is in first itemset ['A', 'B', 'D'] ✓\n# - 'C' appears later in third itemset ['C', 'F'] ✓\n# - Order is preserved ✓\n```\n\n### Reading Itemsets from SPM Format\n\nThe SPM/GSP format supports itemsets using delimiters:\n\n- `-1`: End of itemset\n- `-2`: End of sequence\n\n```python\nfrom gsppy.utils import read_transactions_from_spm\n\n# SPM file content:\n# 1 2 -1 3 -1 -2\n# 1 -1 3 4 -1 -2\n\n# Read with itemsets preserved\ntransactions = read_transactions_from_spm(\"data.txt\", preserve_itemsets=True)\n# Result: [[['1', '2'], ['3']], [['1'], ['3', '4']]]\n\n# Read with itemsets flattened (backward compatible)\ntransactions = read_transactions_from_spm(\"data.txt\", preserve_itemsets=False)\n# Result: [['1', '2', '3'], ['1', '3', '4']]\n```\n\n### Itemsets with Timestamps\n\nItemsets work seamlessly with temporal constraints:\n\n```python\nfrom gsppy import GSP\n\n# Itemsets with timestamps: [(item, timestamp), ...]\ntransactions = [\n    [[('Login', 0), ('Home', 0)], [('Product', 5)], [('Checkout', 10)]],\n    [[('Login', 0)], [('Home', 2), ('Product', 2)], [('Checkout', 15)]],\n]\n\n# Find patterns where events in the same itemset occur together\n# and subsequent itemsets occur within maxgap time units\ngsp = GSP(transactions, maxgap=10)\npatterns = 
gsp.search(min_support=0.5)\n```\n\n### Complete Example\n\nSee [examples/itemset_example.py](examples/itemset_example.py) for comprehensive examples including:\n\n- Market basket analysis with itemsets\n- Web clickstream with parallel page views\n- Comparison of flat vs. itemset semantics\n- Reading and processing SPM format files\n\n### Key Takeaways\n\n✓ **Itemsets capture co-occurrence** of items at the same time step  \n✓ **Flat sequences are automatically normalized** to itemsets internally  \n✓ **Both formats work seamlessly** with GSP-Py  \n✓ **Use itemsets when temporal co-occurrence matters** in your domain  \n✓ **SPM format supports** both flat and itemset representations\n\n---\n\n## ⏱️ Temporal Constraints\n\nGSP-Py supports **time-constrained sequential pattern mining** with three powerful temporal constraints: `mingap`, `maxgap`, and `maxspan`. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.\n\n### Temporal Constraint Parameters\n\n- **`mingap`**: Minimum time gap required between consecutive items in a pattern\n- **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern\n- **`maxspan`**: Maximum time span from the first to the last item in a pattern\n\n### Using Temporal Constraints\n\nTo use temporal constraints, your transactions must include timestamps as (item, timestamp) tuples:\n\n```python\nfrom gsppy.gsp import GSP\n\n# Transactions with timestamps (e.g., in seconds, hours, days, etc.)\ntimestamped_transactions = [\n    [(\"Login\", 0), (\"Browse\", 2), (\"AddToCart\", 5), (\"Purchase\", 7)],\n    [(\"Login\", 0), (\"Browse\", 1), (\"AddToCart\", 15), (\"Purchase\", 20)],\n    [(\"Login\", 0), (\"Browse\", 3), (\"AddToCart\", 6), (\"Purchase\", 8)],\n]\n\n# Find patterns where consecutive events occur within 10 time units\ngsp = GSP(timestamped_transactions, maxgap=10)\npatterns = gsp.search(min_support=0.6)\n\n# The pattern 
(\"Browse\", \"AddToCart\", \"Purchase\") will:\n# - Be found in transaction 1: gaps are 3 and 2 (both ≤ 10) ✅\n# - NOT be found in transaction 2: gap between Browse→AddToCart is 14 (exceeds maxgap) ❌\n# - Be found in transaction 3: gaps are 3 and 2 (both ≤ 10) ✅\n# Result: Support = 2/3 = 67% (above threshold of 60%)\n```\n\n### CLI Usage with Temporal Constraints\n\n```bash\n# Find patterns with maximum gap of 5 time units\ngsppy --file temporal_data.json --min_support 0.3 --maxgap 5\n\n# Find patterns with minimum gap of 2 time units\ngsppy --file temporal_data.json --min_support 0.3 --mingap 2\n\n# Find patterns that complete within 10 time units\ngsppy --file temporal_data.json --min_support 0.3 --maxspan 10\n\n# Combine multiple constraints\ngsppy --file temporal_data.json --min_support 0.3 --mingap 1 --maxgap 5 --maxspan 10\n```\n\n### Real-World Examples\n\n#### Medical Event Mining\n\n```python\nfrom gsppy.gsp import GSP\n\n# Medical events with timestamps in days\nmedical_sequences = [\n    [(\"Symptom\", 0), (\"Diagnosis\", 2), (\"Treatment\", 5), (\"Recovery\", 15)],\n    [(\"Symptom\", 0), (\"Diagnosis\", 1), (\"Treatment\", 20), (\"Recovery\", 30)],\n    [(\"Symptom\", 0), (\"Diagnosis\", 3), (\"Treatment\", 6), (\"Recovery\", 18)],\n]\n\n# Find patterns where treatment follows diagnosis within 10 days\ngsp = GSP(medical_sequences, maxgap=10)\nresult = gsp.search(min_support=0.5)\n\n# Pattern (\"Diagnosis\", \"Treatment\") found in sequences 1 \u0026 3 only\n# (sequence 2 has gap of 19 days, exceeding maxgap)\n```\n\n#### Retail Analytics\n\n```python\nfrom gsppy.gsp import GSP\n\n# Customer purchases with timestamps in hours\npurchase_sequences = [\n    [(\"Browse\", 0), (\"AddToCart\", 0.5), (\"Purchase\", 1)],\n    [(\"Browse\", 0), (\"AddToCart\", 1), (\"Purchase\", 25)],  # Long delay\n    [(\"Browse\", 0), (\"AddToCart\", 0.3), (\"Purchase\", 0.8)],\n]\n\n# Find purchase journeys that complete within 2 hours\ngsp = GSP(purchase_sequences, 
maxspan=2)\nresult = gsp.search(min_support=0.5)\n\n# Full sequence found in 2 out of 3 transactions\n# (sequence 2 has span of 25 hours, exceeding maxspan)\n```\n\n#### User Journey Discovery\n\n```python\nfrom gsppy.gsp import GSP\n\n# Website navigation with timestamps in seconds\nnavigation_sequences = [\n    [(\"Home\", 0), (\"Search\", 5), (\"Product\", 10), (\"Checkout\", 15)],\n    [(\"Home\", 0), (\"Search\", 3), (\"Product\", 8), (\"Checkout\", 180)],\n    [(\"Home\", 0), (\"Search\", 4), (\"Product\", 9), (\"Checkout\", 14)],\n]\n\n# Find navigation patterns with:\n# - Minimum 2 seconds between steps (mingap)\n# - Maximum 20 seconds between steps (maxgap)\n# - Complete within 30 seconds total (maxspan)\ngsp = GSP(navigation_sequences, mingap=2, maxgap=20, maxspan=30)\nresult = gsp.search(min_support=0.5)\n```\n\n### Important Notes\n\n- Temporal constraints require timestamped transactions (item-timestamp tuples)\n- If temporal constraints are specified but transactions don't have timestamps, a warning is logged and constraints are ignored\n- When using temporal constraints, the Python backend is automatically used (accelerated backends don't yet support temporal constraints)\n- Timestamps can be in any unit (seconds, minutes, hours, days) as long as they're consistent within your dataset\n\n---\n\n## 🔧 Flexible Candidate Pruning\n\nGSP-Py supports **flexible candidate pruning strategies** that allow you to customize how candidate sequences are filtered during pattern mining. This enables optimization for different dataset characteristics and mining requirements.\n\n### Built-in Pruning Strategies\n\n#### 1. 
Support-Based Pruning (Default)\n\nStandard GSP pruning based on a minimum support threshold:\n\n```python\nfrom gsppy.gsp import GSP\nfrom gsppy.pruning import SupportBasedPruning\n\n# Explicit support-based pruning\npruner = SupportBasedPruning(min_support_fraction=0.3)\ngsp = GSP(transactions, pruning_strategy=pruner)\nresult = gsp.search(min_support=0.3)\n```\n\n#### 2. Frequency-Based Pruning\n\nPrunes candidates based on absolute frequency (minimum number of occurrences):\n\n```python\nfrom gsppy.gsp import GSP\nfrom gsppy.pruning import FrequencyBasedPruning\n\n# Require patterns to appear at least 5 times\npruner = FrequencyBasedPruning(min_frequency=5)\ngsp = GSP(transactions, pruning_strategy=pruner)\nresult = gsp.search(min_support=0.2)\n```\n\n**Use case**: When you need patterns to occur a minimum absolute number of times, regardless of dataset size.\n\n#### 3. Temporal-Aware Pruning\n\nOptimizes pruning for time-constrained pattern mining by pre-filtering infeasible patterns:\n\n```python\nfrom gsppy.gsp import GSP\nfrom gsppy.pruning import TemporalAwarePruning\n\n# Prune patterns that cannot satisfy temporal constraints\npruner = TemporalAwarePruning(\n    mingap=1,\n    maxgap=5,\n    maxspan=10,\n    min_support_fraction=0.3\n)\ngsp = GSP(timestamped_transactions, mingap=1, maxgap=5, maxspan=10, pruning_strategy=pruner)\nresult = gsp.search(min_support=0.3)\n```\n\n**Use case**: Improves performance for temporal pattern mining by eliminating patterns that cannot satisfy temporal constraints.\n\n#### 4. 
Combined Pruning\n\nCombines multiple pruning strategies for aggressive filtering:\n\n```python\nfrom gsppy.gsp import GSP\nfrom gsppy.pruning import CombinedPruning, SupportBasedPruning, FrequencyBasedPruning\n\n# Apply both support and frequency constraints\nstrategies = [\n    SupportBasedPruning(min_support_fraction=0.3),\n    FrequencyBasedPruning(min_frequency=5)\n]\npruner = CombinedPruning(strategies)\ngsp = GSP(transactions, pruning_strategy=pruner)\nresult = gsp.search(min_support=0.3)\n```\n\n**Use case**: When you want to combine multiple filtering criteria for more selective pattern discovery.\n\n### Custom Pruning Strategies\n\nYou can create custom pruning strategies by implementing the `PruningStrategy` interface:\n\n```python\nfrom typing import Dict, Optional, Tuple\n\nfrom gsppy.gsp import GSP\nfrom gsppy.pruning import PruningStrategy\n\nclass MyCustomPruner(PruningStrategy):\n    def should_prune(\n        self,\n        candidate: Tuple[str, ...],\n        support_count: int,\n        total_transactions: int,\n        context: Optional[Dict] = None\n    ) -\u003e bool:\n        # Custom pruning logic\n        # Return True to prune (filter out), False to keep\n        pattern_length = len(candidate)\n        # Example: Prune very long patterns with low support\n        if pattern_length \u003e 5 and support_count \u003c 10:\n            return True\n        return False\n\n# Use your custom pruner\ncustom_pruner = MyCustomPruner()\ngsp = GSP(transactions, pruning_strategy=custom_pruner)\nresult = gsp.search(min_support=0.2)\n```\n\n### Performance Characteristics\n\nDifferent pruning strategies have different performance tradeoffs:\n\n| Strategy | Pruning Aggressiveness | Use Case | Performance Impact |\n|----------|----------------------|----------|-------------------|\n| **SupportBased** | Moderate | General-purpose mining | Baseline performance |\n| **FrequencyBased** | High (for large datasets) | Require absolute frequency | Faster on large datasets |\n| **TemporalAware** | High (for temporal 
data) | Time-constrained patterns | Significant speedup for temporal mining |\n| **Combined** | Very High | Selective pattern discovery | Fastest, but may miss edge cases |\n\n### Benchmarking Pruning Strategies\n\nTo compare pruning strategies on your dataset:\n\n```bash\n# Compare all strategies\npython benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all\n\n# Benchmark a specific strategy\npython benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy frequency\n\n# Run multiple rounds for averaging\npython benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all --rounds 3\n```\n\nSee `benchmarks/bench_pruning.py` for the complete benchmarking script.\n\n---\n\n## ⌨️ Typing\n\n`gsppy` ships inline type information (PEP 561) via a bundled `py.typed` marker. The public API is re-exported from\n`gsppy` directly—import `GSP` for programmatic use or reuse the CLI helpers (`detect_and_read_file`,\n`read_transactions_from_json`, `read_transactions_from_csv`, and `setup_logging`) when embedding the tool in\nlarger applications.\n\n---\n\n## 🌟 Planned Features\n\nWe are actively working to improve GSP-Py. Here are some exciting features planned for future releases:\n\n1. **Support for Preprocessing and Postprocessing**:\n    - Add hooks to allow users to transform datasets before mining and customize the output results.\n\nWant to contribute or suggest an\nimprovement? [Open a discussion or issue!](https://github.com/jacksonpradolima/gsp-py/issues)\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions from the community! 
If you'd like to help improve GSP-Py, read\nour [CONTRIBUTING.md](CONTRIBUTING.md) guide to get started.\n\nDevelopment dependencies (e.g., testing and linting tools) are handled via uv.\nTo set up and run the main tasks:\n\n```bash\nuv venv .venv\nuv sync --frozen --extra dev\nuv pip install -e .\n\n# Run tasks\nuv run pytest -n auto\nuv run ruff check .\nuv run pyright\n```\n\n### Testing \u0026 Fuzzing\n\nGSP-Py includes comprehensive test coverage, including property-based fuzzing tests using [Hypothesis](https://hypothesis.readthedocs.io/). These fuzzing tests automatically generate random inputs to verify algorithm invariants and discover edge cases. Run the fuzzing tests with:\n\n```bash\nuv run pytest tests/test_gsp_fuzzing.py -v\n```\n\n### General Steps:\n\n1. Fork the repository.\n2. Create a feature branch: `git checkout -b feature/my-feature`.\n3. Commit your changes using [Conventional Commits](https://www.conventionalcommits.org/) format: `git commit -m \"feat: add my feature\"`.\n4. Push to your branch: `git push origin feature/my-feature`.\n5. Submit a pull request to the main repository!\n\nLooking for ideas? Check out our [Planned Features](#planned-features) section.\n\n### Release Management\n\nGSP-Py uses automated release management with [Conventional Commits](https://www.conventionalcommits.org/). When commits are merged to `main`:\n- **Releases are triggered** by: `fix:` (patch), `feat:` (minor), `perf:` (patch), or `BREAKING CHANGE:` (major)\n- **No release** for: `docs:`, `style:`, `refactor:`, `test:`, `build:`, `ci:`, `chore:`\n- CHANGELOG.md is automatically updated with structured release notes\n- Git tags and GitHub releases are created automatically\n\nSee [Release Management Guide](docs/RELEASE_MANAGEMENT.md) for details on commit message format and release process.\n\n---\n\n## 📝 License\n\nThis project is licensed under the terms of the **MIT License**. 
For more details, refer to the [LICENSE](LICENSE) file.\n\n---\n\n## 📖 Citation\n\nIf GSP-Py contributed to your research or project that led to a publication, we kindly ask that you cite it as follows:\n\n```\n@misc{pradolima_gsppy,\n  author       = {Prado Lima, Jackson Antonio do},\n  title        = {{GSP-Py - Generalized Sequence Pattern algorithm in Python}},\n  month        = Dec,\n  year         = 2025,\n  doi          = {10.5281/zenodo.3333987},\n  url          = {https://doi.org/10.5281/zenodo.3333987}\n}\n```\n","funding_links":["https://github.com/sponsors/jacksonpradolima","https://buymeacoffee.com/pradolima"],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacksonpradolima%2Fgsp-py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjacksonpradolima%2Fgsp-py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjacksonpradolima%2Fgsp-py/lists"}