https://github.com/jacksonpradolima/gsp-py
GSP (Generalized Sequence Pattern) algorithm in Python
- Host: GitHub
- URL: https://github.com/jacksonpradolima/gsp-py
- Owner: jacksonpradolima
- License: MIT
- Created: 2017-10-26T18:43:24.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2026-01-26T02:55:25.000Z (6 days ago)
- Last Synced: 2026-01-26T06:03:16.624Z (6 days ago)
- Topics: data-analysis, data-mining, data-mining-algorithms, gsp, pattern-recognition, python, sequence-mining, sequential-patterns
- Language: Python
- Homepage: https://jacksonpradolima.github.io/gsp-py/
- Size: 997 KB
- Stars: 39
- Watchers: 1
- Forks: 23
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Citation: CITATION.cff
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
[Documentation](https://jacksonpradolima.github.io/gsp-py/) · [DOI](https://doi.org/10.5281/zenodo.3333987) · [PyPI](https://pypi.org/project/gsppy/) · [OpenSSF Scorecard](https://securityscorecards.dev/viewer/?uri=github.com/jacksonpradolima/gsp-py) · [SLSA Provenance](https://github.com/jacksonpradolima/gsp-py/actions/workflows/slsa-provenance.yml) · [OpenSSF Best Practices](https://www.bestpractices.dev/projects/11684) · [SonarCloud](https://sonarcloud.io/summary/new_code?id=jacksonpradolima_gsp-py) · [Codecov](https://codecov.io/github/jacksonpradolima/gsp-py)
# GSP-Py
**GSP-Py**: A Python-powered library to mine sequential patterns in large datasets, based on the robust **Generalized
Sequence Pattern (GSP)** algorithm. Ideal for market basket analysis, temporal mining, and user journey discovery.
> [!IMPORTANT]
> GSP-Py is compatible with Python 3.11 and later versions!
---
## Table of Contents
1. [What is GSP?](#what-is-gsp)
2. [Requirements](#requirements)
3. [Installation](#installation)
   - [Clone Repository](#option-1-clone-the-repository)
   - [Install via PyPI](#option-2-install-via-pip)
4. [Developer Installation](#developer-installation)
5. [Documentation](#documentation)
6. [Usage](#usage)
   - [Example: Analyzing Sales Data](#example-analyzing-sales-data)
   - [Explanation: Support and Results](#explanation-support-and-results)
   - [DataFrame Input Support](#dataframe-input-support)
   - [Temporal Constraints](#temporal-constraints)
7. [Typing](#typing)
8. [Planned Features](#planned-features)
9. [Contributing](#contributing)
10. [License](#license)
11. [Citation](#citation)
---
## What is GSP?
The **Generalized Sequential Pattern (GSP)** algorithm is a sequential pattern mining technique based on **Apriori
principles**. Using support thresholds, GSP identifies frequent sequences of items in transaction datasets.
### Key Features:
- **Ordered (non-contiguous) matching**: Detects patterns where items appear in order but not necessarily adjacent, following standard GSP semantics. For example, the pattern `('A', 'C')` is found in the sequence `['A', 'B', 'C']`.
- **Support-based pruning**: Only retains sequences that meet the minimum support threshold.
- **Candidate generation**: Iteratively generates candidate sequences of increasing length.
- **Temporal constraints**: Support for time-constrained pattern mining with `mingap`, `maxgap`, and `maxspan` parameters to find patterns within specific time windows.
- **General-purpose**: Useful in retail, web analytics, social networks, temporal sequence mining, and more.
For example:
- In a shopping dataset, GSP can identify patterns like "Customers who buy bread and milk often purchase diapers next" - even if other items appear between bread and milk.
- In a website clickstream, GSP might find patterns like "Users visit A, then eventually go to C" - capturing user journeys with intermediate steps.
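In code, this ordered-but-not-necessarily-adjacent containment test can be sketched as follows (`contains_in_order` is an illustrative helper written for this README, not part of the GSP-Py API):

```python
def contains_in_order(transaction, pattern):
    """Return True if `pattern` occurs in `transaction` in the given
    order, allowing gaps between items (standard GSP semantics)."""
    remaining = iter(transaction)
    # `item in remaining` advances the iterator, so each pattern item
    # must be found *after* the one matched before it.
    return all(item in remaining for item in pattern)

print(contains_in_order(['A', 'B', 'C'], ('A', 'C')))  # True: gap allowed
print(contains_in_order(['C', 'A'], ('A', 'C')))       # False: wrong order
```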
---
## Requirements
You will need Python installed on your system. On Debian- or Ubuntu-based Linux systems, you can install Python with:
```bash
sudo apt install python3
```
GSP-Py's own package dependencies are installed automatically when you install it with `pip`.
---
## Installation
GSP-Py can be easily installed from either the **repository** or PyPI.
### Option 1: Clone the Repository
To manually clone the repository and set up the environment:
```bash
git clone https://github.com/jacksonpradolima/gsp-py.git
cd gsp-py
```
Refer to the [Developer Installation](#developer-installation) section and run the setup with uv.
### Option 2: Install via `pip`
Alternatively, install GSP-Py from PyPI with:
```bash
pip install gsppy
```
---
## Developer Installation
This project now uses [uv](https://github.com/astral-sh/uv) for dependency management and virtual environments.
#### 1. Install uv
```bash
curl -Ls https://astral.sh/uv/install.sh | bash
```
Make sure uv is on your PATH (for most Linux setups):
```bash
export PATH="$HOME/.local/bin:$PATH"
```
#### 2. Set up the project environment
Create a local virtual environment and install dependencies from uv.lock (single source of truth):
```bash
uv venv .venv
uv sync --frozen --extra dev # uses uv.lock
uv pip install -e .
```
#### 3. Optional: Enable Rust acceleration
Rust acceleration is optional and provides faster support counting using a PyO3 extension. Python fallback remains available.
Build the extension locally:
```bash
make rust-build
```
Select backend at runtime (auto tries Rust, then falls back to Python):
```bash
export GSPPY_BACKEND=rust # or python, or unset for auto
```
Run benchmarks (adjust to your machine):
```bash
make bench-small
make bench-big # may use significant memory/CPU
# or customize:
GSPPY_BACKEND=auto uv run --python .venv/bin/python --no-project \
python benchmarks/bench_support.py --n_tx 1000000 --tx_len 8 --vocab 50000 --min_support 0.2 --warmup
```
#### 4. Optional: Enable GPU (CuPy) acceleration
GPU acceleration is experimental and currently optimizes singleton (k=1) support counting using CuPy.
Non-singleton candidates fall back to the Rust/Python backend.
Install the optional extra (choose a CuPy build that matches your CUDA/ROCm setup if needed):
```bash
uv run pip install -e .[gpu]
```
Select the GPU backend at runtime:
```bash
export GSPPY_BACKEND=gpu
```
If `GSPPY_BACKEND=gpu` is set but no GPU is available, an error is raised. Otherwise, the default `auto` mode uses the CPU backends.
#### 5. Common development tasks
After the environment is ready, activate it and run tasks with standard tools:
```bash
source .venv/bin/activate
pytest -n auto
ruff check .
pyright
```
If you prefer, you can also prefix commands with uv without activating:
```bash
uv run pytest -n auto
uv run ruff check .
uv run pyright
```
#### 6. Makefile (shortcuts)
You can use the Makefile to automate common tasks:
```bash
make setup # create .venv with uv and pin Python
make install # sync deps (from uv.lock) + install project (-e .)
make test # pytest -n auto
make lint # ruff check .
make format # ruff --fix
make typecheck # pyright + ty
make pre-commit-install # install the pre-commit hook
make pre-commit-run # run pre-commit on all files
# Rust-specific shortcuts
make rust-setup # install rustup toolchain
make rust-build # build PyO3 extension with maturin
make bench-small # run small benchmark
make bench-big # run large benchmark
```
> [!NOTE]
> Tox in this project uses the "tox-uv" plugin. When running `make tox` or `tox`, missing Python interpreters can be provisioned automatically via uv (no need to pre-install all versions). This makes local setup faster.
## Release assets and verification
Every GitHub release bundles artifacts to help you validate what you download:
- Built wheels and source distributions produced by the automated publish workflow.
- `sbom.json` (CycloneDX) generated with [Syft](https://github.com/anchore/syft).
- Sigstore-generated `.sig` and `.pem` files for each artifact, created using GitHub OIDC identity.
To verify a downloaded artifact from a release:
```bash
python -m pip install sigstore # installs the CLI
sigstore verify identity \
--certificate gsppy-<VERSION>-py3-none-any.whl.pem \
--signature gsppy-<VERSION>-py3-none-any.whl.sig \
--cert-identity "https://github.com/jacksonpradolima/gsp-py/.github/workflows/publish.yml@refs/tags/v<VERSION>" \
--cert-oidc-issuer https://token.actions.githubusercontent.com \
gsppy-<VERSION>-py3-none-any.whl
```
Replace `<VERSION>` with the numeric package version (for example, `3.1.1`) in the filenames; in `--cert-identity`, this becomes `v<VERSION>` (for example, `v3.1.1`). Adjust the filenames for the sdist (`.tar.gz`) if preferred. The same release page also hosts `sbom.json` for supply-chain inspection.
## Documentation
- **Live site:** https://jacksonpradolima.github.io/gsp-py/
- **Build locally:**
```bash
uv venv .venv
uv sync --extra docs
uv run mkdocs serve
```
The docs use MkDocs with the Material theme and mkdocstrings to render the Python API directly from docstrings.
## Usage
The library is designed to be easy to use and integrate with your own projects. You can use GSP-Py either programmatically (Python API) or directly from the command line (CLI).
---
## Using GSP-Py via CLI
GSP-Py provides a command-line interface (CLI) for running the Generalized Sequential Pattern algorithm on transactional data. This allows you to mine frequent sequential patterns from JSON or CSV files without writing any code.
### Installation
First, install GSP-Py (if not already installed):
```bash
pip install gsppy
```
This will make the `gsppy` CLI command available in your environment.
### Preparing Your Data
Your input file should be either:
- **JSON**: A list of transactions, each transaction is a list of items. Example:
```json
[
["Bread", "Milk"],
["Bread", "Diaper", "Beer", "Eggs"],
["Milk", "Diaper", "Beer", "Coke"],
["Bread", "Milk", "Diaper", "Beer"],
["Bread", "Milk", "Diaper", "Coke"]
]
```
- **CSV**: Each row is a transaction, items separated by commas. Example:
```csv
Bread,Milk
Bread,Diaper,Beer,Eggs
Milk,Diaper,Beer,Coke
Bread,Milk,Diaper,Beer
Bread,Milk,Diaper,Coke
```
- **SPM/GSP Format**: Uses delimiters to separate elements and sequences. This format is commonly used in sequential pattern mining datasets.
- `-1`: Marks the end of an element (itemset)
- `-2`: Marks the end of a sequence (transaction)
Example:
```text
1 2 -1 3 -1 -2
4 -1 5 6 -1 -2
1 -1 2 3 -1 -2
```
The above represents:
- Transaction 1: `[[1, 2], [3]]` → flattened to `[1, 2, 3]`
- Transaction 2: `[[4], [5, 6]]` → flattened to `[4, 5, 6]`
- Transaction 3: `[[1], [2, 3]]` → flattened to `[1, 2, 3]`
String tokens are also supported:
```text
A B -1 C -1 -2
D -1 E F -1 -2
```
- **Parquet/Arrow Files**: Modern columnar data formats (requires `gsppy[dataframe]`)
```bash
pip install 'gsppy[dataframe]'
```
This installs optional dependencies: `polars`, `pandas`, and `pyarrow` for DataFrame support.
### Running the CLI
Use the following command to run GSPPy on your data:
```bash
gsppy --file path/to/transactions.json --min_support 0.3 --backend auto
```
Or for CSV files:
```bash
gsppy --file path/to/transactions.csv --min_support 0.3 --backend rust
```
For SPM/GSP format files, use the `--format spm` option:
```bash
gsppy --file path/to/data.txt --format spm --min_support 0.3
```
#### CLI Options
- `--file`: Path to your input file (JSON, CSV, or SPM format). **Required**.
- `--format`: File format to use. Options: `auto` (default, auto-detect from extension), `json`, `csv`, `spm`, `parquet`, `arrow`.
- `--min_support`: Minimum support threshold as a fraction (e.g., `0.3` for 30%). Default is `0.2`.
- `--backend`: Backend to use for support counting. One of `auto` (default), `python`, `rust`, or `gpu`.
- `--verbose`: Enable detailed logging with timestamps, log levels, and process IDs for debugging and traceability.
- `--mingap`, `--maxgap`, `--maxspan`: Temporal constraints for time-aware pattern mining (requires timestamped transactions).
#### Verbose Mode
For debugging or to track execution in CI/CD pipelines, use the `--verbose` flag:
```bash
gsppy --file transactions.json --min_support 0.3 --verbose
```
This produces structured logging output with timestamps, log levels, and process information:
```
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Pre-processing transactions...
YYYY-MM-DDTHH:MM:SS | DEBUG | PID:4179 | gsppy.gsp | Unique candidates: [('Bread',), ('Milk',), ...]
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Starting GSP algorithm with min_support=0.3...
YYYY-MM-DDTHH:MM:SS | INFO | PID:4179 | gsppy.gsp | Run 1: 6 candidates filtered to 5.
...
```
For complete logging documentation, see [docs/logging.md](docs/logging.md).
#### Example
Suppose you have a file `transactions.json` as shown above. To find patterns with at least 30% support:
```bash
gsppy --file transactions.json --min_support 0.3
```
Sample output:
```
Pre-processing transactions...
Starting GSP algorithm with min_support=0.3...
Run 1: 6 candidates filtered to 5.
Run 2: 20 candidates filtered to 3.
Run 3: 2 candidates filtered to 2.
Run 4: 1 candidates filtered to 0.
GSP algorithm completed.
Frequent Patterns Found:
1-Sequence Patterns:
Pattern: ('Bread',), Support: 4
Pattern: ('Milk',), Support: 4
Pattern: ('Diaper',), Support: 4
Pattern: ('Beer',), Support: 3
Pattern: ('Coke',), Support: 2
2-Sequence Patterns:
Pattern: ('Bread', 'Milk'), Support: 3
Pattern: ('Milk', 'Diaper'), Support: 3
Pattern: ('Diaper', 'Beer'), Support: 3
3-Sequence Patterns:
Pattern: ('Bread', 'Milk', 'Diaper'), Support: 2
Pattern: ('Milk', 'Diaper', 'Beer'), Support: 2
```
#### Error Handling
- If the file does not exist or is in an unsupported format, a clear error message will be shown.
- The `min_support` value must be between 0.0 and 1.0 (exclusive of 0.0, inclusive of 1.0).
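That rule amounts to a simple range check; a sketch of such validation (illustrative only, not the CLI's actual code):

```python
def validate_min_support(value: float) -> float:
    """Accept a support threshold in the half-open interval (0.0, 1.0]."""
    if not 0.0 < value <= 1.0:
        raise ValueError(f"min_support must be in (0.0, 1.0], got {value}")
    return value

validate_min_support(0.3)  # accepted: 30% support threshold
```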
---
The following example shows how to use GSP-Py programmatically in Python:
### Example Input Data
The input to the algorithm is a sequence of transactions, where each transaction contains a sequence of items:
```python
transactions = [
['Bread', 'Milk'],
['Bread', 'Diaper', 'Beer', 'Eggs'],
['Milk', 'Diaper', 'Beer', 'Coke'],
['Bread', 'Milk', 'Diaper', 'Beer'],
['Bread', 'Milk', 'Diaper', 'Coke']
]
```
### Importing and Initializing the GSP Algorithm
Import the `GSP` class from the `gsppy` package and call the `search` method to find frequent patterns with a support
threshold (e.g., `0.3`):
```python
from gsppy.gsp import GSP
# Example transactions: customer purchases
transactions = [
['Bread', 'Milk'], # Transaction 1
['Bread', 'Diaper', 'Beer', 'Eggs'], # Transaction 2
['Milk', 'Diaper', 'Beer', 'Coke'], # Transaction 3
['Bread', 'Milk', 'Diaper', 'Beer'], # Transaction 4
['Bread', 'Milk', 'Diaper', 'Coke'] # Transaction 5
]
# Set minimum support threshold (30%)
min_support = 0.3
# Find frequent patterns
result = GSP(transactions).search(min_support)
# Output the results
print(result)
```
### Verbose Mode for Debugging
Enable detailed logging to track algorithm progress and debug issues:
```python
from gsppy.gsp import GSP
# Enable verbose logging for the entire instance
gsp = GSP(transactions, verbose=True)
result = gsp.search(min_support=0.3)
# Or enable verbose for a specific search only
gsp = GSP(transactions)
result = gsp.search(min_support=0.3, verbose=True)
```
Verbose mode provides:
- Detailed progress information during execution
- Candidate generation and filtering statistics
- Preprocessing and validation details
- Useful for debugging, research, and CI/CD integration
For complete documentation on logging, see [docs/logging.md](docs/logging.md).
### Loading SPM/GSP Format Files
GSP-Py supports loading datasets in the classical SPM/GSP delimiter format, which is widely used in sequential pattern mining research. This format uses:
- `-1` to mark the end of an element (itemset)
- `-2` to mark the end of a sequence (transaction)
#### Using the SPM Loader
```python
from gsppy.utils import read_transactions_from_spm
from gsppy import GSP
# Load SPM format file
transactions = read_transactions_from_spm('data.txt')
# Run GSP algorithm
gsp = GSP(transactions)
result = gsp.search(min_support=0.3)
```
#### SPM Format Examples
**Simple sequence file (`data.txt`):**
```text
1 2 -1 3 -1 -2
4 -1 5 6 -1 -2
1 -1 2 3 -1 -2
```
This represents:
- Transaction 1: Items [1, 2] followed by item [3] → flattened to [1, 2, 3]
- Transaction 2: Item [4] followed by items [5, 6] → flattened to [4, 5, 6]
- Transaction 3: Item [1] followed by items [2, 3] → flattened to [1, 2, 3]
**String tokens are also supported:**
```text
A B -1 C -1 -2
D -1 E F -1 -2
```
#### Token Mapping
For workflows requiring conversion between string tokens and integer IDs, use the `TokenMapper`:
```python
from gsppy.utils import read_transactions_from_spm
from gsppy import TokenMapper
# Load with mappings
transactions, str_to_int, int_to_str = read_transactions_from_spm(
'data.txt',
return_mappings=True
)
print("String to Int:", str_to_int)
# Output: {'1': 0, '2': 1, '3': 2, '4': 3, '5': 4, '6': 5}
print("Int to String:", int_to_str)
# Output: {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6'}
# Use the TokenMapper class directly
mapper = TokenMapper()
id_a = mapper.add_token("A")
id_b = mapper.add_token("B")
print(f"A -> {id_a}, B -> {id_b}")
# Output: A -> 0, B -> 1
```
#### Edge Cases Handled
The SPM loader gracefully handles:
- Empty lines (skipped)
- Missing `-2` delimiter at end of line
- Extra or consecutive delimiters
- Mixed-length elements in sequences
- Both integer and string tokens
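A simplified line parser illustrating these rules might look like the following (a sketch for clarity, not the library's `read_transactions_from_spm` implementation):

```python
def parse_spm_line(line):
    """Parse one SPM-format line into a flattened transaction (sketch)."""
    elements, current = [], []
    for token in line.split():
        if token == "-2":        # end of the sequence
            break
        if token == "-1":        # end of an element (itemset)
            if current:          # tolerate consecutive delimiters
                elements.append(current)
                current = []
        else:
            current.append(token)
    if current:                  # tolerate a missing trailing -1/-2
        elements.append(current)
    # flatten the itemsets into a single ordered transaction
    return [item for element in elements for item in element]

lines = ["1 2 -1 3 -1 -2", "4 -1 5 6 -1 -2", ""]
transactions = [parse_spm_line(l) for l in lines if l.strip()]  # empty lines skipped
```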
### Output
The algorithm will return a list of patterns with their corresponding support.
Sample Output:
```python
[
{('Bread',): 4, ('Milk',): 4, ('Diaper',): 4, ('Beer',): 3, ('Coke',): 2},
{('Bread', 'Milk'): 3, ('Bread', 'Diaper'): 3, ('Bread', 'Beer'): 2, ('Milk', 'Diaper'): 3, ('Milk', 'Beer'): 2, ('Milk', 'Coke'): 2, ('Diaper', 'Beer'): 3, ('Diaper', 'Coke'): 2},
{('Bread', 'Milk', 'Diaper'): 2, ('Bread', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Beer'): 2, ('Milk', 'Diaper', 'Coke'): 2}
]
```
- The **first dictionary** contains single-item sequences with their frequencies (e.g., `('Bread',): 4` means "Bread"
appears in 4 transactions).
- The **second dictionary** contains 2-item sequential patterns (e.g., `('Bread', 'Milk'): 3` means the sequence "Bread → Milk" appears in 3 transactions). Note that patterns like `('Bread', 'Beer')` are detected even when the items are not adjacent in a transaction; they only need to appear in order.
- The **third dictionary** contains 3-item sequential patterns (e.g., `('Bread', 'Milk', 'Diaper'): 2` means the sequence "Bread → Milk → Diaper" appears in 2 transactions).
> [!NOTE]
> The **support** of a sequence is calculated as the fraction of transactions containing the sequence **in order** (not necessarily contiguously), e.g., `('Bread', 'Milk')` appears in 3 out of 5 transactions → Support = `3 / 5 = 0.6` (60%).
> This insight helps identify frequently occurring sequential patterns in datasets, such as shopping trends or user
> behavior.
> [!IMPORTANT]
> **Non-contiguous (ordered) matching**: GSP-Py detects patterns where items appear in the specified order but not necessarily adjacent. For example, the pattern `('Bread', 'Beer')` matches the transaction `['Bread', 'Milk', 'Diaper', 'Beer']` because Bread appears before Beer, even though they are not adjacent. This follows the standard GSP algorithm semantics for sequential pattern mining.
### Understanding Non-Contiguous Pattern Matching
GSP-Py follows the standard GSP algorithm semantics by detecting **ordered (non-contiguous)** subsequences. This means:
- ✅ **Order matters**: Items must appear in the specified sequence order
- ✅ **Gaps allowed**: Items don't need to be adjacent
- ❌ **Wrong order rejected**: Items appearing in a different order won't match
**Example:**
```python
from gsppy.gsp import GSP
sequences = [
['a', 'b', 'c'], # Contains: (a,b), (a,c), (b,c), (a,b,c)
['a', 'c'], # Contains: (a,c)
['b', 'c', 'a'], # Contains: (b,c), (b,a), (c,a)
['a', 'b', 'c', 'd'], # Contains: (a,b), (a,c), (a,d), (b,c), (b,d), (c,d), etc.
]
gsp = GSP(sequences)
result = gsp.search(min_support=0.5) # Need at least 2/4 sequences
# Pattern ('a', 'c') is found with support=3 because:
# - It appears in ['a', 'b', 'c'] (with 'b' in between)
# - It appears in ['a', 'c'] (adjacent)
# - It appears in ['a', 'b', 'c', 'd'] (with 'b' in between)
# Total: 3 out of 4 sequences = 75% support ✅
```
> [!TIP]
> For more complex examples, see the example scripts in the [`gsppy/tests`](gsppy/tests) folder.
---
## DataFrame Input Support
GSP-Py supports **Polars and Pandas DataFrames** as input, enabling high-performance workflows with modern data formats like Arrow and Parquet. This feature is particularly useful for large-scale data engineering pipelines and integration with existing data processing workflows.
### Installation
Install GSP-Py with DataFrame support:
```bash
pip install 'gsppy[dataframe]'
```
This installs the optional dependencies: `polars`, `pandas`, and `pyarrow`.
### DataFrame Input Formats
GSP-Py supports two DataFrame formats:
#### 1. Grouped Format (Transaction ID + Item Columns)
Use when your data has separate rows for each item in a transaction:
```python
import polars as pl
from gsppy import GSP
# Polars DataFrame with transaction_id and item columns
df = pl.DataFrame({
"transaction_id": [1, 1, 2, 2, 2, 3, 3],
"item": ["Bread", "Milk", "Bread", "Diaper", "Beer", "Milk", "Coke"],
})
# Run GSP directly on the DataFrame
gsp = GSP(df, transaction_col="transaction_id", item_col="item")
patterns = gsp.search(min_support=0.3)
for level, freq_patterns in enumerate(patterns, start=1):
print(f"\n{level}-Sequence Patterns:")
for pattern, support in freq_patterns.items():
print(f" {pattern}: {support}")
```
#### 2. Sequence Format (List Column)
Use when each row contains a complete transaction as a list:
```python
import pandas as pd
from gsppy import GSP
# Pandas DataFrame with sequences as lists
df = pd.DataFrame({
"transaction": [
["Bread", "Milk"],
["Bread", "Diaper", "Beer"],
["Milk", "Coke"],
]
})
gsp = GSP(df, sequence_col="transaction")
patterns = gsp.search(min_support=0.3)
```
### DataFrame with Timestamps
DataFrames support temporal constraints for time-aware pattern mining:
```python
import polars as pl
from gsppy import GSP
# Grouped format with timestamps
df = pl.DataFrame({
"transaction_id": [1, 1, 1, 2, 2, 2],
"item": ["Login", "Browse", "Purchase", "Login", "Browse", "Purchase"],
"timestamp": [0, 2, 5, 0, 1, 15], # Time in seconds
})
# Find patterns where consecutive events occur within 10 seconds
gsp = GSP(
df,
transaction_col="transaction_id",
item_col="item",
timestamp_col="timestamp",
maxgap=10
)
patterns = gsp.search(min_support=0.5)
```
For sequence format with timestamps:
```python
import pandas as pd
from gsppy import GSP
df = pd.DataFrame({
"sequence": [["A", "B", "C"], ["A", "D"]],
"timestamps": [[1, 2, 3], [1, 5]], # Timestamps per item
})
gsp = GSP(df, sequence_col="sequence", timestamp_col="timestamps", maxgap=3)
patterns = gsp.search(min_support=0.5)
```
### Working with Parquet and Arrow Files
DataFrames enable seamless integration with columnar storage formats:
```python
import polars as pl
from gsppy import GSP
# Read directly from Parquet
df = pl.read_parquet("transactions.parquet")
# Run GSP with automatic schema detection
gsp = GSP(df, transaction_col="txn_id", item_col="product")
patterns = gsp.search(min_support=0.2)
# Or use Pandas with Arrow backend
import pandas as pd
df_pandas = pd.read_parquet("transactions.parquet", engine="pyarrow")
gsp = GSP(df_pandas, transaction_col="txn_id", item_col="product")
patterns = gsp.search(min_support=0.2)
```
### Performance Considerations
DataFrames offer performance benefits for large datasets:
- **Polars**: Leverages Arrow for zero-copy operations and parallel processing
- **Pandas**: Compatible with Arrow backend for efficient memory usage
- **Parquet/Arrow**: Columnar storage enables efficient filtering and reading
- **Schema validation**: Errors are caught early with clear messages
### DataFrame Schema Requirements
**Grouped Format:**
- `transaction_col`: Column containing transaction/sequence IDs (any type)
- `item_col`: Column containing items (any type, converted to strings)
- `timestamp_col` (optional): Column containing timestamps (numeric)
**Sequence Format:**
- `sequence_col`: Column containing lists of items
- `timestamp_col` (optional): Column containing lists of timestamps (must match sequence lengths)
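Conceptually, the grouped format is reduced to the same list-of-lists input used elsewhere in this README. A plain-Python sketch of that mapping (illustrative only, not the library's DataFrame adapter):

```python
# Rows of the grouped format: one (transaction_id, item) pair per row.
rows = [
    (1, "Bread"), (1, "Milk"),
    (2, "Bread"), (2, "Diaper"), (2, "Beer"),
    (3, "Milk"), (3, "Coke"),
]

# Group items by transaction id, preserving row order within each group.
grouped = {}
for txn_id, item in rows:
    grouped.setdefault(txn_id, []).append(item)

transactions = list(grouped.values())
# [['Bread', 'Milk'], ['Bread', 'Diaper', 'Beer'], ['Milk', 'Coke']]
```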
### Error Handling
GSP-Py provides clear error messages for schema issues:
```python
import polars as pl
from gsppy import GSP
df = pl.DataFrame({
"txn_id": [1, 2],
"product": ["A", "B"],
})
# ❌ Missing required column
try:
gsp = GSP(df, transaction_col="txn_id", item_col="item") # 'item' doesn't exist
except ValueError as e:
print(f"Error: {e}") # "Column 'item' not found in DataFrame"
# ❌ Invalid format specification
try:
gsp = GSP(df) # Must specify either sequence_col or both transaction_col and item_col
except ValueError as e:
print(f"Error: {e}") # "Must specify either 'sequence_col' or both 'transaction_col' and 'item_col'"
```
### Backward Compatibility
Traditional list-based input continues to work:
```python
from gsppy import GSP
# Lists still work as before
transactions = [["A", "B"], ["A", "C"], ["B", "C"]]
gsp = GSP(transactions)
patterns = gsp.search(min_support=0.5)
```
DataFrame parameters cannot be mixed with list input:
```python
transactions = [["A", "B"], ["C", "D"]]
# ❌ This raises an error
gsp = GSP(transactions, transaction_col="txn") # ValueError: DataFrame parameters cannot be used with list input
```
### Examples and Tests
For complete examples and edge cases, see:
- [`tests/test_dataframe.py`](tests/test_dataframe.py) - Comprehensive test suite
- DataFrame adapter documentation in [`gsppy/dataframe_adapters.py`](gsppy/dataframe_adapters.py)
---
## Temporal Constraints
GSP-Py supports **time-constrained sequential pattern mining** with three powerful temporal constraints: `mingap`, `maxgap`, and `maxspan`. These constraints enable domain-specific applications such as medical event mining, retail analytics, and temporal user journey discovery.
### Temporal Constraint Parameters
- **`mingap`**: Minimum time gap required between consecutive items in a pattern
- **`maxgap`**: Maximum time gap allowed between consecutive items in a pattern
- **`maxspan`**: Maximum time span from the first to the last item in a pattern
### Using Temporal Constraints
To use temporal constraints, your transactions must include timestamps as (item, timestamp) tuples:
```python
from gsppy.gsp import GSP
# Transactions with timestamps (e.g., in seconds, hours, days, etc.)
timestamped_transactions = [
[("Login", 0), ("Browse", 2), ("AddToCart", 5), ("Purchase", 7)],
[("Login", 0), ("Browse", 1), ("AddToCart", 15), ("Purchase", 20)],
[("Login", 0), ("Browse", 3), ("AddToCart", 6), ("Purchase", 8)],
]
# Find patterns where consecutive events occur within 10 time units
gsp = GSP(timestamped_transactions, maxgap=10)
patterns = gsp.search(min_support=0.6)
# The pattern ("Browse", "AddToCart", "Purchase") will:
# - Be found in transaction 1: gaps are 3 and 2 (both ≤ 10) ✅
# - NOT be found in transaction 2: the Browse→AddToCart gap is 14 (exceeds maxgap) ❌
# - Be found in transaction 3: gaps are 3 and 2 (both ≤ 10) ✅
# Result: Support = 2/3 ≈ 67% (above the 60% threshold)
```
### CLI Usage with Temporal Constraints
```bash
# Find patterns with maximum gap of 5 time units
gsppy --file temporal_data.json --min_support 0.3 --maxgap 5
# Find patterns with minimum gap of 2 time units
gsppy --file temporal_data.json --min_support 0.3 --mingap 2
# Find patterns that complete within 10 time units
gsppy --file temporal_data.json --min_support 0.3 --maxspan 10
# Combine multiple constraints
gsppy --file temporal_data.json --min_support 0.3 --mingap 1 --maxgap 5 --maxspan 10
```
### Real-World Examples
#### Medical Event Mining
```python
from gsppy.gsp import GSP
# Medical events with timestamps in days
medical_sequences = [
[("Symptom", 0), ("Diagnosis", 2), ("Treatment", 5), ("Recovery", 15)],
[("Symptom", 0), ("Diagnosis", 1), ("Treatment", 20), ("Recovery", 30)],
[("Symptom", 0), ("Diagnosis", 3), ("Treatment", 6), ("Recovery", 18)],
]
# Find patterns where treatment follows diagnosis within 10 days
gsp = GSP(medical_sequences, maxgap=10)
result = gsp.search(min_support=0.5)
# Pattern ("Diagnosis", "Treatment") found in sequences 1 & 3 only
# (sequence 2 has gap of 19 days, exceeding maxgap)
```
#### Retail Analytics
```python
from gsppy.gsp import GSP
# Customer purchases with timestamps in hours
purchase_sequences = [
[("Browse", 0), ("AddToCart", 0.5), ("Purchase", 1)],
[("Browse", 0), ("AddToCart", 1), ("Purchase", 25)], # Long delay
[("Browse", 0), ("AddToCart", 0.3), ("Purchase", 0.8)],
]
# Find purchase journeys that complete within 2 hours
gsp = GSP(purchase_sequences, maxspan=2)
result = gsp.search(min_support=0.5)
# Full sequence found in 2 out of 3 transactions
# (sequence 2 has span of 25 hours, exceeding maxspan)
```
#### User Journey Discovery
```python
from gsppy.gsp import GSP
# Website navigation with timestamps in seconds
navigation_sequences = [
[("Home", 0), ("Search", 5), ("Product", 10), ("Checkout", 15)],
[("Home", 0), ("Search", 3), ("Product", 8), ("Checkout", 180)],
[("Home", 0), ("Search", 4), ("Product", 9), ("Checkout", 14)],
]
# Find navigation patterns with:
# - Minimum 2 seconds between steps (mingap)
# - Maximum 20 seconds between steps (maxgap)
# - Complete within 30 seconds total (maxspan)
gsp = GSP(navigation_sequences, mingap=2, maxgap=20, maxspan=30)
result = gsp.search(min_support=0.5)
```
### Important Notes
- Temporal constraints require timestamped transactions (item-timestamp tuples)
- If temporal constraints are specified but transactions don't have timestamps, a warning is logged and constraints are ignored
- When using temporal constraints, the Python backend is automatically used (accelerated backends don't yet support temporal constraints)
- Timestamps can be in any unit (seconds, minutes, hours, days) as long as they're consistent within your dataset
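The three constraints boil down to simple checks on the timestamps of a matched occurrence. A sketch of that logic (illustrative only, not GSP-Py's internal implementation):

```python
def satisfies_constraints(timestamps, mingap=None, maxgap=None, maxspan=None):
    """Check gap/span constraints for one matched occurrence of a pattern,
    given the timestamps of its items in order (sketch)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if mingap is not None and any(g < mingap for g in gaps):
        return False   # some consecutive items are too close together
    if maxgap is not None and any(g > maxgap for g in gaps):
        return False   # some consecutive items are too far apart
    if maxspan is not None and timestamps[-1] - timestamps[0] > maxspan:
        return False   # the whole occurrence takes too long
    return True

# Browse -> AddToCart -> Purchase at t = 2, 5, 7: gaps 3 and 2, span 5
print(satisfies_constraints([2, 5, 7], maxgap=10))    # True
print(satisfies_constraints([1, 15, 20], maxgap=10))  # False: gap of 14
```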
---
## Flexible Candidate Pruning
GSP-Py supports **flexible candidate pruning strategies** that allow you to customize how candidate sequences are filtered during pattern mining. This enables optimization for different dataset characteristics and mining requirements.
### Built-in Pruning Strategies
#### 1. Support-Based Pruning (Default)
The standard GSP pruning based on minimum support threshold:
```python
from gsppy.gsp import GSP
from gsppy.pruning import SupportBasedPruning
# Explicit support-based pruning
pruner = SupportBasedPruning(min_support_fraction=0.3)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
```
#### 2. Frequency-Based Pruning
Prunes candidates based on absolute frequency (minimum number of occurrences):
```python
from gsppy.pruning import FrequencyBasedPruning
# Require patterns to appear at least 5 times
pruner = FrequencyBasedPruning(min_frequency=5)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.2)
```
**Use case**: When you need patterns to occur a minimum absolute number of times, regardless of dataset size.
#### 3. Temporal-Aware Pruning
Optimizes pruning for time-constrained pattern mining by pre-filtering infeasible patterns:
```python
from gsppy.pruning import TemporalAwarePruning
# Prune patterns that cannot satisfy temporal constraints
pruner = TemporalAwarePruning(
    mingap=1,
    maxgap=5,
    maxspan=10,
    min_support_fraction=0.3
)
gsp = GSP(timestamped_transactions, mingap=1, maxgap=5, maxspan=10, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
```
**Use case**: Improves performance for temporal pattern mining by eliminating patterns that cannot satisfy temporal constraints.
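
One way such pre-filtering can work, sketched as an assumption about the idea rather than GSP-Py's actual implementation: a candidate of length `k` spans at least `(k - 1) * mingap` time, so it can never fit inside `maxspan` when that lower bound is exceeded:

```python
def can_satisfy_span(pattern_length: int, mingap: float, maxspan: float) -> bool:
    """A pattern of length k needs at least (k - 1) * mingap time overall,
    so it is infeasible whenever that lower bound exceeds maxspan."""
    return (pattern_length - 1) * mingap <= maxspan

# With mingap=2 and maxspan=10, patterns longer than 6 items are impossible:
assert can_satisfy_span(6, mingap=2, maxspan=10)      # 5 gaps * 2 = 10 <= 10
assert not can_satisfy_span(7, mingap=2, maxspan=10)  # 6 gaps * 2 = 12 > 10
```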
#### 4. Combined Pruning
Combines multiple pruning strategies for aggressive filtering:
```python
from gsppy.pruning import CombinedPruning, SupportBasedPruning, FrequencyBasedPruning
# Apply both support and frequency constraints
strategies = [
    SupportBasedPruning(min_support_fraction=0.3),
    FrequencyBasedPruning(min_frequency=5)
]
pruner = CombinedPruning(strategies)
gsp = GSP(transactions, pruning_strategy=pruner)
result = gsp.search(min_support=0.3)
```
**Use case**: When you want to combine multiple filtering criteria for more selective pattern discovery.
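
A plausible combination rule, shown here with plain functions as a sketch (an assumption about the semantics, not necessarily how `CombinedPruning` is implemented): a candidate is pruned as soon as any member strategy votes to prune it.

```python
def combined_should_prune(strategies, candidate, support_count, total_transactions):
    # Prune if ANY strategy rejects the candidate.
    return any(s(candidate, support_count, total_transactions) for s in strategies)

# Two toy strategies expressed as plain functions:
support_rule = lambda c, sc, n: sc / n < 0.3  # prune below 30% support
frequency_rule = lambda c, sc, n: sc < 5      # prune below 5 occurrences

# 4 occurrences out of 10 passes the support rule (0.4 >= 0.3)
# but fails the frequency rule (4 < 5), so the candidate is pruned.
print(combined_should_prune([support_rule, frequency_rule], ("a", "b"), 4, 10))  # True
```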
### Custom Pruning Strategies
You can create custom pruning strategies by implementing the `PruningStrategy` interface:
```python
from gsppy.pruning import PruningStrategy
from typing import Dict, Optional, Tuple
class MyCustomPruner(PruningStrategy):
    def should_prune(
        self,
        candidate: Tuple[str, ...],
        support_count: int,
        total_transactions: int,
        context: Optional[Dict] = None
    ) -> bool:
        # Custom pruning logic:
        # return True to prune (filter out), False to keep.
        pattern_length = len(candidate)
        # Example: prune very long patterns with low support.
        if pattern_length > 5 and support_count < 10:
            return True
        return False
# Use your custom pruner
custom_pruner = MyCustomPruner()
gsp = GSP(transactions, pruning_strategy=custom_pruner)
result = gsp.search(min_support=0.2)
```
### Performance Characteristics
Different pruning strategies have different performance tradeoffs:
| Strategy | Pruning Aggressiveness | Use Case | Performance Impact |
|----------|----------------------|----------|-------------------|
| **SupportBased** | Moderate | General-purpose mining | Baseline performance |
| **FrequencyBased** | High (for large datasets) | Require absolute frequency | Faster on large datasets |
| **TemporalAware** | High (for temporal data) | Time-constrained patterns | Significant speedup for temporal mining |
| **Combined** | Very High | Selective pattern discovery | Fastest, but the most restrictive: candidates must pass every criterion |
### Benchmarking Pruning Strategies
To compare pruning strategies on your dataset:
```bash
# Compare all strategies
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all
# Benchmark a specific strategy
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy frequency
# Run multiple rounds for averaging
python benchmarks/bench_pruning.py --n_tx 1000 --vocab 100 --min_support 0.2 --strategy all --rounds 3
```
See `benchmarks/bench_pruning.py` for the complete benchmarking script.
---
## ⌨️ Typing
`gsppy` ships inline type information (PEP 561) via a bundled `py.typed` marker. The public API is re-exported from
`gsppy` directly: import `GSP` for programmatic use, or reuse the CLI helpers (`detect_and_read_file`,
`read_transactions_from_json`, `read_transactions_from_csv`, and `setup_logging`) when embedding the tool in
larger applications.
---
## 🚀 Planned Features
We are actively working to improve GSP-Py. Here are some exciting features planned for future releases:
1. **Support for Preprocessing and Postprocessing**:
- Add hooks to allow users to transform datasets before mining and customize the output results.
Want to contribute or suggest an
improvement? [Open a discussion or issue!](https://github.com/jacksonpradolima/gsp-py/issues)
---
## 🤝 Contributing
We welcome contributions from the community! If you'd like to help improve GSP-Py, read
our [CONTRIBUTING.md](CONTRIBUTING.md) guide to get started.
Development dependencies (e.g., testing and linting tools) are handled via uv.
To set up and run the main tasks:
```bash
uv venv .venv
uv sync --frozen --extra dev
uv pip install -e .
# Run tasks
uv run pytest -n auto
uv run ruff check .
uv run pyright
```
### Testing & Fuzzing
GSP-Py includes comprehensive test coverage, including property-based fuzzing tests using [Hypothesis](https://hypothesis.readthedocs.io/). These fuzzing tests automatically generate random inputs to verify algorithm invariants and discover edge cases. Run the fuzzing tests with:
```bash
uv run pytest tests/test_gsp_fuzzing.py -v
```
### General Steps:
1. Fork the repository.
2. Create a feature branch: `git checkout -b feature/my-feature`.
3. Commit your changes using [Conventional Commits](https://www.conventionalcommits.org/) format: `git commit -m "feat: add my feature"`.
4. Push to your branch: `git push origin feature/my-feature`.
5. Submit a pull request to the main repository!
Looking for ideas? Check out our [Planned Features](#planned-features) section.
### Release Management
GSP-Py uses automated release management with [Conventional Commits](https://www.conventionalcommits.org/). When commits are merged to `main`:
- **Releases are triggered** by: `fix:` (patch), `feat:` (minor), `perf:` (patch), or `BREAKING CHANGE:` (major)
- **No release** for: `docs:`, `style:`, `refactor:`, `test:`, `build:`, `ci:`, `chore:`
- CHANGELOG.md is automatically updated with structured release notes
- Git tags and GitHub releases are created automatically
See [Release Management Guide](docs/RELEASE_MANAGEMENT.md) for details on commit message format and release process.
---
## 📄 License
This project is licensed under the terms of the **MIT License**. For more details, refer to the [LICENSE](LICENSE) file.
---
## 📖 Citation
If GSP-Py contributed to your research or project that led to a publication, we kindly ask that you cite it as follows:
```bibtex
@misc{pradolima_gsppy,
author = {Prado Lima, Jackson Antonio do},
title = {{GSP-Py - Generalized Sequence Pattern algorithm in Python}},
month = dec,
year = 2025,
doi = {10.5281/zenodo.3333987},
url = {https://doi.org/10.5281/zenodo.3333987}
}
```