https://github.com/bmsuisse/rusket
rusket 🦀🧺
- Host: GitHub
- URL: https://github.com/bmsuisse/rusket
- Owner: bmsuisse
- License: mit
- Created: 2026-02-19T18:54:08.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-02-28T00:32:18.000Z (4 months ago)
- Last Synced: 2026-02-28T05:51:50.335Z (4 months ago)
- Topics: als, association-rules, bpr, collaborative-filtering, data-science, eclat, fp-growth, lightgcn, machine-learning, market-basket-analysis, matrix-factorization, personalization, pyo3, python, recommendation-engine, recommender-system, recsys, rust, sasrec, sequential-recommendation
- Language: Python
- Homepage: https://bmsuisse.github.io/rusket/
Ultra-fast Recommender Engines & Market Basket Analysis for Python, written in Rust.
Made with ❤️ by the Data & AI Team.
---
## 🎯 Goals
| Goal | Details |
|---|---|
| ⚡ **Blazing fast** | All algorithms run in compiled Rust (via PyO3) with multi-threaded Rayon parallelism and SIMD-accelerated kernels. ALS is **11×** and FP-Growth is **140×** faster than PySpark. |
| 📦 **Zero dependencies** | No TensorFlow, no PyTorch, no JVM. A single ~3 MB wheel is all you need: `pip install rusket` and go. |
| 🧑‍💻 **Easy to use** | Common cases are one-liners: `model.recommend_items(user_id)`, `model.recommend_users(item_id)`, `model.export_item_factors()` for vector/embedding export. No boilerplate. |
| 🏗️ **Modern data stack** | Native Pandas, Polars, and Apache Spark support with zero-copy Arrow transfers. Works seamlessly with Delta Lake, Databricks, Snowflake, and any dbt/Parquet pipeline. |
---
> **⚠️ Note:** `rusket` is currently under heavy construction. The API will probably change in upcoming versions.
**rusket** is a modern, Rust-powered library for Market Basket Analysis and Recommender Engines. It delivers significant speed-ups and lower memory usage compared to traditional Python implementations, while natively supporting Pandas, Polars, and Spark out of the box.
**Zero runtime dependencies.** No TensorFlow, no PyTorch, no JVM: just `pip install rusket` and go. The entire engine is compiled Rust, distributed as a single ~3 MB wheel.
It features Collaborative Filtering (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE), Sequential Recommendation (FPMC, SASRec), Context-aware Prediction (FM), Pattern Mining (FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan), and built-in Hyperparameter Tuning (Optuna + MLflow tracking) with high performance and low memory footprints. Both functional and OOP APIs are available for seamless integration.
---
## ✨ Highlights
| | `rusket` | `LibRecommender` | `implicit` | `pyspark.ml` |
|---|---|---|---|---|
| **Core language** | Rust (PyO3) | TF + PyTorch + Cython | Cython / C++ | Scala / Java (JVM) |
| **Runtime deps** | **0** | TF + PyTorch + gensim (~2 GB) | OpenBLAS / MKL | JVM + Spark |
| **Install size** | ~3 MB | ~2 GB | ~50 MB | ~300 MB |
| **Algorithms** | ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE, FM, FPMC, SASRec, FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan | ALS, BPR, SVD, LightGCN, ItemCF, FM, DeepFM, ... | ALS, BPR | ALS, FP-Growth, PrefixSpan |
| **Recommender API** | ✅ Hybrid Engine + i2i Similarity | ✅ | ✅ | ✅ (ALS only) |
| **Graph & Embeddings** | ✅ NetworkX Export, Vector DB Export | ❌ | ❌ | ❌ |
| **OOP class API** | ✅ `ALS.from_transactions(df).fit()` | ✅ | ✅ | ✅ |
| **Pandas / Polars / Spark** | ✅ / ✅ / ✅ | ✅ / ❌ / ❌ | ❌ / ❌ / ❌ | ❌ / ❌ / ✅ |
| **Parallel execution** | ✅ Rayon work-stealing | ✅ TF/PyTorch threads | ✅ OpenMP | ✅ Spark Cluster |
| **Memory** | Low (native Rust buffers) | High (TF/PyTorch graphs) | Low (C++ arrays) | High (JVM overhead) |
---
## 📦 Installation
```bash
pip install rusket
# or with uv:
uv add rusket
```
**Optional extras:**
```bash
# Polars support
pip install "rusket[polars]"
# Pandas/NumPy support (usually already installed)
pip install "rusket[pandas]"
```
---
## 🚀 Quick Start
### "Frequently Bought Together": Grocery Checkout Data
Identify which products co-occur most in customer baskets. These co-occurrences are the foundation of cross-sell widgets, promotional bundles, and shelf placement decisions.
```python
import pandas as pd
from rusket import FPGrowth
# One week of supermarket checkout data (1 row = 1 receipt, 1 col = 1 SKU)
receipts = pd.DataFrame({
    "milk": [1, 1, 0, 1, 1, 0, 1],
    "bread": [1, 0, 1, 1, 0, 1, 1],
    "butter": [1, 0, 1, 0, 0, 1, 0],
    "eggs": [0, 1, 1, 0, 1, 0, 1],
    "coffee": [0, 1, 0, 0, 1, 1, 0],
    "orange_juice": [1, 0, 0, 1, 0, 0, 1],
}, dtype=bool)
# Step 1: which SKU combinations appear in ≥40% of receipts?
model = FPGrowth(receipts, min_support=0.4)
freq = model.mine(use_colnames=True)
# Step 2: keep rules with ≥60% confidence
rules = model.association_rules(metric="confidence", min_threshold=0.6)
# Lift > 1 means customers buy these together more than chance alone
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("lift", ascending=False))
```
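To make the three rule metrics concrete, here is a minimal pure-Python sketch that recomputes support, confidence, and lift for one rule on the same seven receipts. It does not call `rusket`; the helper names `support` and `rule_metrics` are illustrative only, and the definitions are the standard textbook ones, which `rusket` is assumed (not confirmed) to follow.

```python
# The seven receipts from the example above, written as item sets.
receipts = [
    {"milk", "bread", "butter", "orange_juice"},
    {"milk", "eggs", "coffee"},
    {"bread", "butter", "eggs"},
    {"milk", "bread", "orange_juice"},
    {"milk", "eggs", "coffee"},
    {"bread", "butter", "coffee"},
    {"milk", "bread", "eggs", "orange_juice"},
]


def support(itemset):
    """Fraction of receipts that contain every item in `itemset`."""
    return sum(itemset <= receipt for receipt in receipts) / len(receipts)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    joint = support(antecedent | consequent)
    confidence = joint / support(antecedent)
    lift = confidence / support(consequent)
    return joint, confidence, lift


rule_support, confidence, lift = rule_metrics({"orange_juice"}, {"milk"})
print(f"support={rule_support:.3f} confidence={confidence:.2f} lift={lift:.2f}")
```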
---
### E-Commerce Order Lines (Long Format)
Real-world data arrives as `(order_id, sku)` rows from a database, not as one-hot matrices.
All mining algorithms expose a class-based API that goes straight from order lines to recommendations:
```python
import pandas as pd
from rusket import FPGrowth
# Order line export from your e-commerce backend
orders = pd.DataFrame({
    "order_id": [1001, 1001, 1001, 1002, 1002, 1003, 1003],
    "sku": ["HDPHONES", "USB_DAC", "AUX_CABLE",
            "HDPHONES", "CARRY_CASE",
            "USB_DAC", "AUX_CABLE"],
})
model = FPGrowth.from_transactions(
    orders,
    transaction_col="order_id",
    item_col="sku",
    min_support=0.3,
)
freq = model.mine(use_colnames=True)  # Miner classes: mine() never auto-fits
rules = model.association_rules(metric="confidence", min_threshold=0.6)
# Which accessories should be suggested when headphones are in the cart?
suggestions = model.recommend_items(["HDPHONES"], n=3)
# → e.g. ["USB_DAC", "AUX_CABLE", "CARRY_CASE"]
```
Or use the explicit type variants:
```python
from rusket import FPGrowth
ohe = FPGrowth.from_pandas(orders, transaction_col="order_id", item_col="sku")
ohe = FPGrowth.from_polars(pl_orders, transaction_col="order_id", item_col="sku")
ohe = FPGrowth.from_transactions([["HDPHONES", "USB_DAC"], ["HDPHONES", "CARRY_CASE"]]) # list of lists
```
> **Spark** is also supported: `FPGrowth.from_spark(spark_df)` calls `.toPandas()` internally.
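For readers unfamiliar with the two layouts, the following sketch shows the long-to-one-hot pivot that a loader such as `from_transactions` has to perform conceptually. It is plain Python on the order lines from the example above and is not `rusket`'s implementation; the names `order_lines`, `baskets`, and `one_hot` are illustrative.

```python
from collections import defaultdict

# Long format: one (order_id, sku) pair per order line.
order_lines = [
    (1001, "HDPHONES"), (1001, "USB_DAC"), (1001, "AUX_CABLE"),
    (1002, "HDPHONES"), (1002, "CARRY_CASE"),
    (1003, "USB_DAC"), (1003, "AUX_CABLE"),
]

# Group the order lines into one item set per order.
baskets = defaultdict(set)
for order_id, sku in order_lines:
    baskets[order_id].add(sku)

# Fix a column order, then emit one boolean row per order.
skus = sorted({sku for _, sku in order_lines})
one_hot = {
    order_id: [sku in items for sku in skus]
    for order_id, items in baskets.items()
}

print(skus)
for order_id, row in one_hot.items():
    print(order_id, row)
```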
---
### 🐻‍❄️ Polars Input: Reading from Data Lake Parquet
For teams running a modern data stack with Parquet files on S3/GCS/Azure Blob, `rusket` natively accepts [Polars](https://pola.rs/) DataFrames. Data is transferred via Arrow zero-copy buffers, with **no conversion overhead**.
The fastest path from a data lake to "Frequently Bought Together" rules:
```python
import polars as pl
from rusket import FPGrowth
# ── 1. Read a one-hot basket matrix directly from S3/GCS/local Parquet ──
# Columns = SKUs (bool), rows = receipts, produced by your dbt or Spark pipeline
baskets = pl.read_parquet("s3://data-lake/gold/basket_ohe.parquet")
print(f"Loaded {baskets.shape[0]:,} receipts × {baskets.shape[1]} SKUs")
# ── 2. Instantiate FPGrowth (zero-copy from Polars) ──
model = FPGrowth(baskets, min_support=0.02, max_len=3)
# ── 3. Mine frequent combinations ──
freq = model.mine(use_colnames=True)
print(f"Found {len(freq):,} frequent itemsets")
print(freq.sort_values("support", ascending=False).head(10))
# ── 4. Generate cross-sell rules ──
rules = model.association_rules(metric="lift", min_threshold=1.2)
print(f"Rules with lift > 1.2: {len(rules):,}")
print(
    rules[["antecedents", "consequents", "confidence", "lift"]]
    .sort_values("lift", ascending=False)
    .head(8)
)
```
> **How it works under the hood:**
> Polars → Arrow buffer → `np.uint8` (zero-copy) → Rust `fpgrowth_from_dense`
---
### High-Utility Pattern Mining (HUPM): Profit-Driven Bundle Discovery
Frequent items aren't always the most profitable. HUPM finds product combinations that generate the **highest total gross margin**, even if they appear rarely. `rusket` implements the state-of-the-art **EFIM** algorithm in Rust.
```python
import pandas as pd
from rusket import HUPM
# Specialty foods retailer: receipt line items with gross margin per unit sold
orders = pd.DataFrame({
    "receipt_id": [1, 1, 1, 2, 2, 3, 3],
    "product": ["aged_cheese", "wine_flight", "charcuterie",
                "aged_cheese", "charcuterie",
                "wine_flight", "charcuterie"],
    "margin": [8.50, 12.00, 6.50,  # receipt 1: margin per item
               8.50, 6.50,         # receipt 2
               12.00, 6.50],       # receipt 3
})
# Find all product bundles generating ≥ €20 total margin across all receipts
high_margin = HUPM.from_transactions(
    orders,
    transaction_col="receipt_id",
    item_col="product",
    utility_col="margin",
    min_utility=20.0,
).mine()
print(high_margin.head())
# e.g. aged_cheese + wine_flight + charcuterie → total margin 27.0 (receipt 1 only)
```
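The utility of a bundle can be checked by hand. The sketch below uses the usual high-utility-mining definition (sum the margins of the bundle's items over every receipt that contains the whole bundle) on the same seven line items. It is a brute-force illustration, not the EFIM algorithm and not `rusket` code; `utility` is a hypothetical helper name.

```python
from collections import defaultdict

# (receipt_id, product, margin) line items from the example above.
rows = [
    (1, "aged_cheese", 8.50), (1, "wine_flight", 12.00), (1, "charcuterie", 6.50),
    (2, "aged_cheese", 8.50), (2, "charcuterie", 6.50),
    (3, "wine_flight", 12.00), (3, "charcuterie", 6.50),
]

# receipt_id -> {product: margin}
receipts = defaultdict(dict)
for receipt_id, product, margin in rows:
    receipts[receipt_id][product] = margin


def utility(itemset):
    """Total margin of `itemset` over all receipts containing every item in it."""
    total = 0.0
    for items in receipts.values():
        if all(product in items for product in itemset):
            total += sum(items[product] for product in itemset)
    return total


for bundle in (
    {"wine_flight", "charcuterie"},
    {"aged_cheese", "charcuterie"},
    {"aged_cheese", "wine_flight", "charcuterie"},
):
    print(sorted(bundle), utility(bundle))
```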
---
### Sparse Pandas Input
For very sparse datasets (e.g. e-commerce with thousands of SKUs), use Pandas `SparseDtype` to minimize memory. `rusket` passes the raw CSR arrays straight to Rust, so **no densification ever happens**.
```python
import pandas as pd
import numpy as np
from rusket import FPGrowth
rng = np.random.default_rng(7)
n_rows, n_cols = 30_000, 500
# Very sparse: average basket size ≈ 3 items out of 500
p_buy = 3 / n_cols
matrix = rng.random((n_rows, n_cols)) < p_buy
products = [f"sku_{i:04d}" for i in range(n_cols)]
df_dense = pd.DataFrame(matrix.astype(bool), columns=products)
df_sparse = df_dense.astype(pd.SparseDtype("bool", fill_value=False))
dense_mb = df_dense.memory_usage(deep=True).sum() / 1e6
sparse_mb = df_sparse.memory_usage(deep=True).sum() / 1e6
print(f"Dense memory: {dense_mb:.1f} MB")
print(f"Sparse memory: {sparse_mb:.1f} MB ({dense_mb / sparse_mb:.1f}× smaller)")
# Same API, same results, just faster and lighter
freq = FPGrowth(df_sparse, min_support=0.01).mine(use_colnames=True)
print(f"Frequent itemsets: {len(freq):,}")
```
> **How it works under the hood:**
> Sparse DataFrame → COO → CSR → `(indptr, indices)` → Rust `fpgrowth_from_csr`
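As a reminder of what the `(indptr, indices)` pair contains, here is a small pure-Python sketch that builds both arrays from a boolean basket matrix. Row `i` owns the slice `indices[indptr[i]:indptr[i + 1]]`. The function `to_csr` is illustrative and is not the conversion path `rusket` uses internally.

```python
def to_csr(rows):
    """Build CSR (indptr, indices) from rows of 0/1 flags; values are implicit."""
    indptr, indices = [0], []
    for row in rows:
        # Store only the column positions of purchased items.
        indices.extend(col for col, flag in enumerate(row) if flag)
        # Record where this row ends (and the next one begins).
        indptr.append(len(indices))
    return indptr, indices


baskets = [
    [1, 0, 0, 1],  # basket 0 bought items 0 and 3
    [0, 0, 0, 0],  # empty basket
    [0, 1, 1, 0],  # basket 2 bought items 1 and 2
]
indptr, indices = to_csr(baskets)
print(indptr, indices)

# Items of basket 2, read back from the CSR arrays.
print(indices[indptr[2]:indptr[3]])
```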
---
### Out-of-Core Processing (FPMiner Streaming)
For **billion-row** datasets that don't fit in memory, use the `FPMiner` accumulator. It accepts chunks of `(txn_id, item_id)` pairs and sorts each chunk in place as it arrives. At mining time, a memory-safe **k-way merge** across all chunks builds the CSR matrix on the fly, avoiding large memory spikes.
```python
import numpy as np
from rusket import FPMiner
n_items = 5_000
miner = FPMiner(n_items=n_items)
# Feed chunks incrementally (e.g. from Parquet/CSV/SQL);
# `dataset` is any iterable of DataFrame chunks
for chunk in dataset:
    txn_ids = chunk["txn_id"].to_numpy(dtype=np.int64)
    item_ids = chunk["item_id"].to_numpy(dtype=np.int32)
    # Fast O(m log m) sort for a chunk of m pairs
    miner.add_chunk(txn_ids, item_ids)
# Stream the k-way merge and mine in one pass.
# Returns a DataFrame with 'support' and 'itemsets' just like fpgrowth()
freq = miner.mine(min_support=0.001, max_len=3)
```
**Memory efficiency:** The peak memory overhead at `mine()` time is just $O(k)$ for the cursors (where $k$ is the number of chunks), plus the final compressed CSR allocation.
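The sort-then-merge idea can be sketched with the standard library. Below, each chunk is sorted on its own and `heapq.merge` streams the globally ordered pairs while holding only one cursor per chunk, which is where the $O(k)$ overhead comes from. This is an illustration of the technique under that assumption, not `FPMiner`'s Rust implementation; `merge_chunks_to_csr` is a hypothetical name.

```python
import heapq


def merge_chunks_to_csr(chunks):
    """k-way merge of (txn_id, item_id) chunks into CSR (indptr, indices)."""
    sorted_chunks = [sorted(chunk) for chunk in chunks]  # per-chunk sort
    indptr, indices = [0], []
    current_txn = None
    previous = None
    # heapq.merge keeps one cursor per chunk and yields pairs in global order.
    for pair in heapq.merge(*sorted_chunks):
        if pair == previous:
            continue  # the same (txn, item) pair seen in two chunks
        txn_id, item_id = pair
        if current_txn is not None and txn_id != current_txn:
            indptr.append(len(indices))  # close the previous transaction
        current_txn = txn_id
        indices.append(item_id)
        previous = pair
    if current_txn is not None:
        indptr.append(len(indices))  # close the last transaction
    return indptr, indices


chunks = [
    [(2, 5), (1, 3)],
    [(1, 1), (3, 2), (2, 5)],
]
indptr, indices = merge_chunks_to_csr(chunks)
print(indptr, indices)
```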
---
### 🌩️ Distributed Computing with Apache Spark
`rusket` ships a full Spark integration layer in `rusket.spark`. All algorithms run as **native Arrow UDFs** via `applyInArrow`: Rust is called directly on each executor, with zero Python overhead per row.
#### How it works
```
PySpark DataFrame
  └─► groupby(group_col).applyInArrow(...)
        └─► Arrow Table (per partition / per group)
              └─► Polars zero-copy conversion
                    └─► rusket Rust extension (on the executor)
                          └─► results → PyArrow → PySpark DataFrame
```
#### Full Example β Retail Basket Analysis per Store
```python
from pyspark.sql import SparkSession
from rusket.spark import mine_grouped, rules_grouped
spark = SparkSession.builder.appName("rusket-demo").getOrCreate()
# ── 1. Load your OHE transaction table (one row = one basket) ──
# Schema: store_id (string), bread (bool), butter (bool), milk (bool), ...
spark_df = spark.read.parquet("s3://data/baskets/")
# ── 2. Mine frequent itemsets per store in parallel ──
# Each Spark task calls the Rust FP-Growth/Eclat engine on its Arrow batch.
freq_df = mine_grouped(
    spark_df,
    group_col="store_id",
    min_support=0.05,  # 5% support per store
)
# freq_df schema: store_id | support (double) | itemsets (array)
# ── 3. Count transactions per store (needed for rule support) ──
from pyspark.sql import functions as F
counts = (
    spark_df.groupby("store_id")
    .agg(F.count("*").alias("n"))
    .rdd.collectAsMap()  # {"store_1": 12000, "store_2": 8500, ...}
)
# ── 4. Generate association rules per store ──
rules_df = rules_grouped(
    freq_df,
    group_col="store_id",
    num_itemsets=counts,  # pass per-group counts as a dict
    metric="confidence",
    min_threshold=0.6,
)
# rules_df schema: store_id | antecedents | consequents | confidence | lift | ...
rules_df.orderBy("lift", ascending=False).show(10, truncate=False)
```
#### Sequential Patterns per Category
```python
from rusket.spark import prefixspan_grouped
# event_log schema: category_id, user_id, item_id, event_ts
event_log = spark.read.parquet("s3://data/events/")
seq_df = prefixspan_grouped(
    event_log,
    group_col="category_id",  # mine independently per product category
    user_col="user_id",       # sequence identifier within the group
    time_col="event_ts",      # ordering column
    item_col="item_id",
    min_support=50,           # absolute count: pattern must appear in ≥50 sessions
    max_len=4,
)
# seq_df schema: category_id | support (long) | sequence (array)
seq_df.show(5, truncate=False)
```
#### High-Utility Patterns per Region
```python
from rusket.spark import hupm_grouped
# profit_log schema: region_id, txn_id, item_id, profit
profit_log = spark.read.parquet("s3://data/profit/")
utility_df = hupm_grouped(
    profit_log,
    group_col="region_id",
    transaction_col="txn_id",
    item_col="item_id",
    utility_col="profit",
    min_utility=500.0,  # only itemsets with combined profit ≥ €500
)
# utility_df schema: region_id | utility (double) | itemset (array)
utility_df.show(5, truncate=False)
```
#### Batch Recommendations across the Cluster
```python
from rusket.spark import recommend_batches
from rusket import ALS
# 1. Train an ALS model locally (or load a pre-trained one)
als = ALS.from_transactions(
    events_pd,
    user_col="user_id",
    item_col="item_id",
).fit()  # ← always call .fit() after from_transactions()
# 2. Scale-out scoring: one recommendation row per user
user_df = spark.read.parquet("s3://data/users/").select("user_id")
recs_df = recommend_batches(user_df, model=als, user_col="user_id", k=10)
# recs_df schema: user_id (string) | recommended_items (array)
recs_df.show(5, truncate=False)
```
> **Tip (Databricks / Delta Lake):** All functions return a standard PySpark DataFrame, so you can write results back with `.write.format("delta").save(...)` or `.saveAsTable(...)` directly.
---
## API Reference
### OOP Class API
Every algorithm in `rusket` exposes a **class-based API** in addition to the functional helpers. All classes share a unified interface inherited from `BaseModel`:
| Class | Inherits from | Description |
|-------|--------------|-------------|
| `FPGrowth` | `Miner`, `RuleMinerMixin` | Frequent Pattern Growth (parallel FP-tree mining) |
| `Eclat` | `Miner`, `RuleMinerMixin` | Vertical bitset mining |
| `FIN` | `Miner`, `RuleMinerMixin` | FP-tree Node-list intersection mining |
| `LCM` | `Miner`, `RuleMinerMixin` | Linear-time Closed itemset Mining |
| `HUPM` | `Miner` | High-Utility Pattern Mining (EFIM) |
| `PrefixSpan` | `Miner` | Sequential pattern mining |
| `ALS` | `ImplicitRecommender` | Alternating Least Squares CF |
| `BPR` | `ImplicitRecommender` | Bayesian Personalized Ranking CF |
| `SVD` | `ImplicitRecommender` | Funk SVD (biased SGD) |
| `LightGCN` | `ImplicitRecommender` | Graph Convolutional CF |
| `ItemKNN` | `ImplicitRecommender` | Item-based k-NN CF |
| `UserKNN` | `ImplicitRecommender` | User-based k-NN CF |
| `EASE` | `ImplicitRecommender` | Embarrassingly Shallow Autoencoders |
| `FM` | `BaseModel` | Factorization Machines (CTR prediction) |
| `FPMC` | `SequentialRecommender` | Factorizing Personalized Markov Chains |
| `SASRec` | `SequentialRecommender` | Self-Attentive Sequential Recommendation |
| `HybridEmbeddingIndex` | (none) | CF + semantic embedding fusion |
All classes share the following data-ingestion class methods inherited from `BaseModel`:
```python
# Load from long-format (transaction_id, item_id) DataFrame or list of lists
model = FPGrowth.from_transactions(df, transaction_col="order_id", item_col="item", min_support=0.3)
# Typed convenience aliases, same result
model = FPGrowth.from_pandas(df, ...)
model = FPGrowth.from_polars(pl_df, ...)
model = FPGrowth.from_spark(spark_df, ...)
```
`Miner` subclasses that also inherit `RuleMinerMixin` (`FPGrowth`, `Eclat`, `FIN`, `LCM`) give a fluent pipeline:
```python
model = FPGrowth.from_transactions(df, min_support=0.3)
freq = model.mine(use_colnames=True) # pd.DataFrame [support, itemsets]
rules = model.association_rules(metric="lift") # pd.DataFrame [antecedents, consequents, ...]
recs = model.recommend_items(["bread", "milk"]) # list of suggested items
```
`SequentialRecommender` subclasses (`FPMC`, `SASRec`) use `from_transactions(..., time_col=...).fit()` for sequential next-item prediction.
`ImplicitRecommender` subclasses (`ALS`, `BPR`, `SVD`, `LightGCN`, `ItemKNN`, `UserKNN`, `EASE`) follow the **scikit-learn** `fit()`/`predict()` pattern:
```python
# Option A: construct, then fit with a sparse matrix
model = ALS(factors=64, iterations=15)
model.fit(user_item_csr)
# Option B: from an event log, then explicit .fit()
model = ALS(factors=64).from_transactions(
    df, user_col="user_id", item_col="item_id"
).fit()  # ← .fit() is always required
# Predict / recommend
items, scores = model.recommend_items(user_id=42, n=10, exclude_seen=True)
users, scores = model.recommend_users(item_id=99, n=5)
```
> **Breaking change vs older versions:** `from_transactions()` no longer auto-fits.
> Always chain `.fit()` after it.
## 🧠 Advanced Pattern & Recommendation Algorithms
`rusket` provides more than just basic market basket analysis. It includes an entire suite of modern algorithms and a high-level Business Recommender API.
### 🎯 ItemKNN & UserKNN: Nearest-Neighbor Collaborative Filtering
Two complementary memory-based methods that consistently rank among the **top performers** in academic benchmarks (see [Anelli et al. 2022](https://arxiv.org/abs/2203.01155)).
- **ItemKNN**: Finds items similar to what the user already liked. Fast, stable, and scales well with pre-computed item-item similarity.
- **UserKNN**: Finds users similar to the target user and recommends what they liked. Often more serendipitous and performs particularly well on dense datasets.
Both support **BM25**, **TF-IDF**, **Cosine**, and raw **Count** weighting, with the top-K neighbor pruning running in parallel Rust.
```python
from rusket import ItemKNN, UserKNN
# ── Item-based: "Customers who bought X also bought Y" ──
item_knn = ItemKNN.from_transactions(
    purchases, user_col="user_id", item_col="item_id",
    method="bm25", k=100,
).fit()
items, scores = item_knn.recommend_items(user_id=42, n=10)
# ── User-based: "Users similar to you enjoyed these items" ──
user_knn = UserKNN.from_transactions(
    purchases, user_col="user_id", item_col="item_id",
    method="cosine", k=50,
).fit()
items, scores = user_knn.recommend_items(user_id=42, n=10)
```
> **Which one to choose?** Start with `ItemKNN(method="bm25")`: it's the fastest and most stable. Switch to `UserKNN` if you have a dense dataset or want more diverse recommendations. In production, try both and evaluate with `rusket.evaluate()`.
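The item-based variant boils down to an item-item similarity table pruned to the top-K neighbours. The sketch below shows that idea with plain cosine similarity on binary purchases in pure Python. It is an illustration only: the data, `cosine`, and `top_k_neighbours` are hypothetical, and `rusket`'s BM25/TF-IDF weighting and parallel pruning are not reproduced here.

```python
import math
from collections import defaultdict

# Toy (user_id, item_id) purchase log.
purchases = [(1, "A"), (1, "B"), (2, "A"), (2, "B"), (2, "C"), (3, "B"), (3, "C")]

# item -> set of users who bought it (a binary item vector).
users_by_item = defaultdict(set)
for user, item in purchases:
    users_by_item[item].add(user)


def cosine(item_a, item_b):
    """Cosine similarity between two binary item vectors."""
    a, b = users_by_item[item_a], users_by_item[item_b]
    return len(a & b) / math.sqrt(len(a) * len(b))


def top_k_neighbours(item, k):
    """Return the k most similar other items as (item, score) pairs."""
    scored = [(other, cosine(item, other)) for other in users_by_item if other != item]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


print(top_k_neighbours("A", k=2))
```

Keeping only the top-K neighbours per item is what bounds memory and scoring time as the catalogue grows.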
### 🎯 ALS & BPR Collaborative Filtering
Both models learn user and item embeddings from **implicit feedback** (purchases, clicks, plays) and power personalised recommendations at scale. Use **ALS** for broad serendipitous discovery; use **BPR** when you care only about top-N ranking.
```python
import pandas as pd
from rusket import ALS, BPR
# ── "For You" homepage: music streaming platform ──
# event log: user_id | track_id | plays (optional weight)
plays = pd.DataFrame({
    "user_id": [101, 101, 102, 102, 103, 103, 103],
    "track_id": ["T01", "T03", "T01", "T05", "T02", "T03", "T05"],
    "plays": [12, 5, 8, 3, 20, 1, 7],  # play count as confidence weight
})
als = ALS(factors=64, iterations=15, alpha=40.0).from_transactions(
    plays, user_col="user_id", item_col="track_id", rating_col="plays"
).fit()  # ← always call .fit() after from_transactions()
# Top-10 tracks for user 101, excluding already-played tracks
tracks, scores = als.recommend_items(user_id=101, n=10, exclude_seen=True)
# Which users are most likely to enjoy track T05? Useful for email campaigns
users, scores = als.recommend_users(item_id="T05", n=50)
# BPR: optimise ranking directly rather than reconstruction
# (user_item_csr is a sparse user-item matrix built elsewhere)
bpr = BPR(factors=64, learning_rate=0.05, iterations=150).fit(user_item_csr)
```
### 🎯 Hybrid Recommender API
Combine **Collaborative Filtering** (ALS/BPR) with **Frequent Pattern Mining** to cover every placement surface, from the personalised homepage ("For You") to the active cart ("Frequently Bought Together"), in a single engine.
```python
from rusket import ALS, Recommender, FPGrowth
# 1. Train on purchase history (implicit feedback)
als = ALS(factors=64, iterations=15).fit(user_item_csr)
# 2. Mine co-purchase rules from basket data
miner = FPGrowth(basket_ohe, min_support=0.01)
freq = miner.mine()
rules = miner.association_rules()
# 3. Create the Hybrid Engine
rec = Recommender(model=als, rules_df=rules)
# "For You" homepage: personalised for customer 1001
items, scores = rec.recommend_for_user(user_id=1001, n=5)
# Blend CF + product embeddings (e.g. from a PIM or sentence-transformer)
items, scores = rec.recommend_for_user(user_id=1001, n=5, alpha=0.7,
                                       target_item_for_semantic="HDPHONES")
# Active cart cross-sell: "Frequently Bought Together"
add_ons = rec.recommend_for_cart(["USB_DAC", "AUX_CABLE"], n=3)
# Overnight batch: score all customers, write to CRM
batch_df = rec.predict_next_chunk(user_history_df, user_col="customer_id", k=5)
```
### 🧬 Hybrid Embedding Fusion: CF + Semantic in One Vector Space
Collaborative filtering embeddings capture *behavioral* signals (who bought what); semantic text embeddings capture *content* meaning (product descriptions). Fusing them into a **single vector space** lets you do ANN retrieval, vector DB export, and clustering in one shot.
```python
import rusket
# 1. Train ALS on implicit feedback
als = rusket.ALS(factors=64, iterations=15).fit(interactions)
# 2. Get semantic embeddings (e.g. from sentence-transformers)
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("all-MiniLM-L6-v2")
text_vectors = encoder.encode(product_descriptions) # (n_items, 384)
# 3. Fuse into a single hybrid vector space
hybrid = rusket.HybridEmbeddingIndex(
    cf_embeddings=als.item_factors,    # (n_items, 64)
    semantic_embeddings=text_vectors,  # (n_items, 384)
    strategy="weighted_concat",        # "concat" | "weighted_concat" | "projection"
    alpha=0.6,                         # 60% CF, 40% semantic
)
# 4. Similar items via cosine on the fused space
ids, scores = hybrid.query(item_id=42, n=10)
# 5. Build an ANN index for sub-millisecond retrieval
ann = hybrid.build_ann_index(backend="native") # or "faiss"
# 6. Export to a vector DB for production serving
hybrid.export_vectors(qdrant_client, collection_name="hybrid_items")
# 7. Or export as separate named vectors for DB-side fusion
hybrid.export_vectors(qdrant_client, mode="multi", collection_name="hybrid_items")
# → Qdrant/Meilisearch/Weaviate store "cf" and "semantic" as separate named vectors
```
Three fusion strategies:
| Strategy | Description | Use Case |
|---|---|---|
| `"concat"` | L2-normalise each space, concatenate | Equal importance, no tuning |
| `"weighted_concat"` | Scale by `α` / `1−α`, then concat | **Default**: tune `alpha` to balance CF vs semantic |
| `"projection"` | Concat + PCA to `projection_dim` | Compact vectors for large-scale deployment |
> **Standalone function:** If you just need the fused matrix without an index, use `rusket.fuse_embeddings(cf, sem, strategy="weighted_concat", alpha=0.6)`.
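A minimal sketch of the `"weighted_concat"` idea for a single item, in pure Python. It assumes each space is L2-normalised first and then scaled by `alpha` and `1 - alpha` before concatenation; whether `rusket` normalises first or uses a different scaling (for example square roots of the weights) is not stated above, so treat the helper names and the exact formula as assumptions.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def weighted_concat(cf_vec, sem_vec, alpha=0.6):
    """Fuse one CF vector and one semantic vector into a single vector."""
    cf_part = [alpha * x for x in l2_normalise(cf_vec)]
    sem_part = [(1.0 - alpha) * x for x in l2_normalise(sem_vec)]
    return cf_part + sem_part  # dimension = len(cf_vec) + len(sem_vec)


fused = weighted_concat([3.0, 4.0], [0.0, 2.0, 0.0], alpha=0.6)
print(len(fused), [round(x, 2) for x in fused])
```

Normalising before weighting keeps one space from dominating simply because its raw vectors have larger magnitudes.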
### 🎯 Multi-Stage Recommendation Pipeline
For production systems requiring advanced retrieval and ranking, use the `Pipeline` class. This mirrors the "retrieve → rerank → filter" paradigm used by Twitter/X and modern ML stacks.
It chains multiple models together:
1. **Retrieve:** Candidate generation
2. **Rerank:** Re-score candidates using a heavier scoring function
3. **Filter:** Apply business rules (e.g. exclude out-of-stock items, diversify)
```python
from rusket import ALS, BPR, Pipeline, RuleBasedRecommender
import pandas as pd
# 1. Train multiple base models
als = ALS(factors=64).fit(interactions)
bpr = BPR(factors=128).fit(interactions)
# 2. Define explicit business rules (e.g. promoting warranties with laptops)
rules_df = pd.DataFrame({
    "antecedent": ["102"],  # Laptop SKU
    "consequent": ["999"],  # Warranty SKU
    "score": [2.0],
})
rules = RuleBasedRecommender.from_transactions(
    interactions, rules=rules_df, user_col="user", item_col="item"
).fit()
# 3. Compose the Pipeline (retrieve from ALS and BPR, rerank with the higher-dimensional BPR vectors)
# Items from the `rules` model receive an artificial +1,000,000 score,
# ensuring they rank at the top *after* the algorithmic reranking.
pipeline = Pipeline(
    retrieve=[als, bpr],
    merge_strategy="max",  # how to combine candidate scores
    rerank=bpr,
    rules=rules,
)
# Recommend for a user
items, scores = pipeline.recommend(user_id=42, n=10, exclude_seen=True)
# Fast batch scoring using Rust inner loops
batch_recs = pipeline.recommend_batch(
    user_ids=[1, 2, 3],
    n=10,
    format="polars",  # returns a native Polars DataFrame
)
```
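The three stages can be sketched in a few lines of plain Python. The example assumes that `merge_strategy="max"` keeps the highest score per candidate, that the reranker re-scores every merged candidate, and that rule items get the +1,000,000 boost afterwards. The function names and toy scores are hypothetical; this is not how `Pipeline` is implemented.

```python
RULE_BOOST = 1_000_000.0  # mirrors the artificial boost described above


def merge_max(candidate_lists):
    """Union the retrievers' candidates, keeping the highest score per item."""
    merged = {}
    for candidates in candidate_lists:
        for item, score in candidates.items():
            merged[item] = max(score, merged.get(item, float("-inf")))
    return merged


def recommend(candidate_lists, rerank_score, rule_items, n):
    """Retrieve (merge), rerank, then boost rule-driven items to the top."""
    merged = merge_max(candidate_lists)
    final = {item: rerank_score(item) for item in merged}
    for item in rule_items:
        final[item] = final.get(item, 0.0) + RULE_BOOST
    return sorted(final, key=final.get, reverse=True)[:n]


als_candidates = {"a": 0.9, "b": 0.4}
bpr_candidates = {"b": 0.7, "c": 0.2}
rerank_table = {"a": 0.1, "b": 0.8, "c": 0.5}

merged = merge_max([als_candidates, bpr_candidates])
top = recommend([als_candidates, bpr_candidates], rerank_table.get, ["w"], n=3)
print(merged, top)
```

Applying the boost after reranking is what guarantees that business rules win over the learned scores.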
### 💾 Saving, Loading and Serving (LanceDB / Vector DBs)
`rusket` models share a unified `BaseModel` that provides `.save()` and `.load()`. Trained models can also be exported to a vector database for fast, real-time serving in production. The generic `load_model` helper infers the model architecture from the pickle file.
```python
import rusket
# 1. Train the model
model = rusket.ALS(factors=32).fit(interactions)
# 2. Save your trained model to disk
model.save("my_als_model.pkl")
# 3. Load it back using the generic loader
loaded_model = rusket.load_model("my_als_model.pkl")
# 4. Export the embeddings for a Vector Database
items_df = rusket.export_item_factors(
    loaded_model,
    normalize=True,  # Best for Cosine Similarity search
    format="pandas",
)
# 5. Serve it in real-time (Example using LanceDB)
import lancedb
# Create a local vector database
db = lancedb.connect("./lancedb_store")
table = db.create_table("items", data=items_df)
# Query the table with a specific user's latent factors
user_emb = loaded_model.user_factors[0]
# Retrieve top 5 item recommendations for this user using L2-normalized vector search!
results = table.search(user_emb).limit(5).to_pandas()
```
### Analytics Helpers
```python
from rusket import find_substitutes, customer_saturation
# Identify cannibalizing SKUs (lift < 1.0) for assortment rationalisation
subs = find_substitutes(rules_df, max_lift=0.8)
# antecedents consequents lift
# (Cola A,) (Cola B,) 0.61 ← these products hurt each other's sales
# Segment customers by category penetration (decile 10 = buy everything; 1 = barely engaged)
saturation = customer_saturation(
    purchases_df, user_col="customer_id", category_col="category_id"
)
```
### BPR & Sequential Patterns
- **BPR (Bayesian Personalized Ranking):** Directly optimises ranking of positive interactions over negative ones. Ideal for newsfeeds, playlists, and app recommendation surfaces that prioritise top-N precision.
- **Sequential Pattern Mining (PrefixSpan):** Discovers ordered patterns across time (e.g., "Subscriber signed up for broadband → mobile plan → premium bundle" or "Customer viewed Camera → 2 weeks later bought Lens").
`rusket` natively extracts PrefixSpan sequences from **Pandas, Polars, and PySpark** event logs with zero-copy Arrow mapping:
```python
from rusket import PrefixSpan
# Telco product adoption journeys: what sequence of subscriptions do customers follow?
# df: customer_id | subscription_date | product_id
model = PrefixSpan.from_transactions(
    subscription_events,
    transaction_col="customer_id",
    item_col="product_id",
    time_col="subscription_date",
    min_support=50,  # at least 50 customers follow this path
    max_len=4,
)
freq_seqs = model.mine()
# e.g. [broadband] → [mobile] → [tv_bundle] appears in 312 journeys
```
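The support of a sequential pattern is the number of journeys that contain its steps in order, with gaps allowed. The brute-force sketch below counts that directly; PrefixSpan reaches the same counts far more efficiently by growing patterns from projected suffixes. The helper names and the toy journeys are illustrative and not part of `rusket`.

```python
def contains_in_order(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps can only match later positions.
    return all(step in remaining for step in pattern)


def sequence_support(sequences, pattern):
    """Number of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(seq, pattern) for seq in sequences)


journeys = [
    ["broadband", "mobile", "tv_bundle"],
    ["broadband", "tv_bundle"],
    ["mobile", "broadband"],
]
print(sequence_support(journeys, ["broadband", "tv_bundle"]))
print(sequence_support(journeys, ["broadband", "mobile"]))
```

Order matters: the third journey contains both `broadband` and `mobile`, but not in that order, so it does not count towards the second pattern.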
### 🕸️ Graph Analytics & Embeddings
Integrate natively with the modern GenAI/LLM stack:
- **Vector Export:** Export user/item factors to a Pandas `DataFrame` ready for FAISS/Qdrant using `model.export_item_factors()`.
- **Item-to-Item Similarity:** Fast Cosine Similarity on embeddings using `model.similar_items(item_id)`.
- **Graph Generation:** Automatically convert association rules into a `networkx` directed Graph for community detection using `rusket.viz.to_networkx(rules)`.
---
### 🔬 MLOps: MLflow Tracking & Hyperparameter Tuning
`rusket` has built-in support for [MLflow](https://mlflow.org/) experiment tracking, `mlflow.pyfunc` packaging, and Bayesian hyperparameter optimisation using [Optuna](https://optuna.org/)'s TPE sampler. For **ALS/eALS** models, each Optuna trial runs the Rust-native cross-validation backend, which keeps the whole search fast.
```python
import rusket
import rusket.mlflow
from rusket import OptunaSearchSpace
# ── 1. Enable MLflow autologging ──
rusket.mlflow.autolog()
# ── 2. Train a single model with automatic tracking ──
# Hyperparameters (factors, iterations) and training_duration_seconds are logged
import mlflow
with mlflow.start_run():
    model = rusket.ALS(factors=64, iterations=15).fit(df)
# Save/load models as native MLflow pyfunc artifacts for easy deployment
rusket.mlflow.save_model(model, "my_als_model")
loaded_model = mlflow.pyfunc.load_model("my_als_model")  # Has a .predict(df) method
# ── 3. Quick hyperparameter search with sensible defaults ──
result = rusket.optuna_optimize(
    rusket.ALS,
    df,
    user_col="user_id",
    item_col="item_id",
    n_trials=50,
    metric="ndcg",
    k=10,
)
print(f"Best ndcg@10: {result.best_score:.4f}")
print(f"Best params: {result.best_params}")
# ── Custom search space + refit best model ──
result = rusket.optuna_optimize(
    rusket.eALS,
    df,
    user_col="user_id",
    item_col="item_id",
    search_space=[
        OptunaSearchSpace.int("factors", 16, 256, log=True),
        OptunaSearchSpace.float("alpha", 1.0, 100.0, log=True),
        OptunaSearchSpace.float("regularization", 1e-4, 1.0, log=True),
        OptunaSearchSpace.int("iterations", 5, 30),
    ],
    n_trials=100,
    n_folds=3,
    metric="precision",
    refit_best=True,  # result.best_model comes back already fitted
)
items, scores = result.best_model.recommend_items(user_id=42, n=10)

# -- MLflow experiment tracking --
# pip install mlflow optuna-integration
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("als-tuning")
result = rusket.optuna_optimize(
    rusket.ALS, df,
    user_col="user_id", item_col="item_id",
    n_trials=50, metric="ndcg",
    mlflow_tracking=True,  # every trial is logged to MLflow
)

# -- Custom callbacks --
result = rusket.optuna_optimize(
    rusket.ALS, df,
    user_col="user_id", item_col="item_id",
    n_trials=50,
    callbacks=[my_custom_callback],  # any Optuna-compatible callback
)
```
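
The `my_custom_callback` name in the snippet above is left undefined. A minimal sketch of what such a callback could look like follows, assuming `rusket` forwards the list unchanged to Optuna, where a callback is a callable taking `(study, trial)` after each trial. The `history` list and the stand-in objects are illustrative only, so that the sketch runs without Optuna installed.

```python
from types import SimpleNamespace

history = []  # (trial number, score) pairs collected by the callback


def my_custom_callback(study, trial):
    """Optuna-style callback: invoked once after each finished trial."""
    if trial.value is None:  # pruned or failed trials carry no score
        return
    history.append((trial.number, trial.value))
    print(f"trial {trial.number}: score={trial.value:.4f} best so far={study.best_value:.4f}")


# Stand-in objects mimicking the attributes used above (not real Optuna objects)
fake_study = SimpleNamespace(best_value=0.42)
my_custom_callback(fake_study, SimpleNamespace(number=7, value=0.40))
my_custom_callback(fake_study, SimpleNamespace(number=8, value=None))
```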
---
### GPU Acceleration (CUDA)
`rusket` supports optional GPU acceleration via **CuPy** or **PyTorch CUDA** for models that benefit from large matrix operations. Enable it globally with a single call; there is no need to pass `use_gpu=True` to every model.
```python
import rusket
# Enable GPU globally: every model created after this uses CUDA
rusket.enable_gpu()
# All models now default to GPU
als = rusket.ALS(factors=128, iterations=20).fit(interactions)
ease = rusket.EASE(regularization=500).fit(interactions)
bpr = rusket.BPR(factors=64).fit(interactions)
# Per-model override: force a specific model to CPU
small_model = rusket.SVD(factors=16, use_gpu=False)
# Turn it off globally
rusket.disable_gpu()
# Check the current state
rusket.is_gpu_enabled()  # → False
```
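
The "global default with per-model override" behaviour shown above can be illustrated with a small stand-alone sketch. This is not `rusket`'s implementation; the module-level flag and the `Model` class are hypothetical stand-ins, and the sketch assumes that `use_gpu=None` means "follow the global flag at construction time".

```python
_GPU_ENABLED = False  # hypothetical module-level default


def enable_gpu():
    global _GPU_ENABLED
    _GPU_ENABLED = True


def disable_gpu():
    global _GPU_ENABLED
    _GPU_ENABLED = False


def is_gpu_enabled():
    return _GPU_ENABLED


class Model:
    """Hypothetical model: an explicit use_gpu wins, otherwise the global flag applies."""

    def __init__(self, use_gpu=None):
        self.use_gpu = is_gpu_enabled() if use_gpu is None else use_gpu


enable_gpu()
on_gpu = Model()                # picks up the global flag
on_cpu = Model(use_gpu=False)   # per-model override
disable_gpu()
after_disable = Model()         # created after the flag was turned off
```

Resolving the flag in the constructor means models created while the flag was on keep their setting after `disable_gpu()`, which matches the "every model created after this" wording above.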
#### Supported Models
All 12 recommender models respect the global GPU flag:
| Model | GPU-accelerated operations |
|-------|--------------------------|
| **ALS / eALS** | Gramian, Cholesky solve, batch scoring |
| **BPR** | SGD updates, batch recommend |
| **SVD** | Factor updates, batch scoring |
| **EASE** | Gram matrix inversion |
| **ItemKNN / UserKNN** | Similarity scoring |
| **LightGCN** | Graph convolution, scoring |
| **FM** | Prediction |
| **FPMC** | Factor updates |
| **SASRec / BERT4Rec** | Attention forward pass |
| **NMF** | Multiplicative updates |
#### Installation
```bash
# CuPy (recommended, fastest)
pip install cupy-cuda12x
# Or PyTorch
pip install torch
```
> **No GPU? No problem.** `rusket` auto-detects whether a GPU backend is available. If neither CuPy nor PyTorch CUDA is installed, `enable_gpu()` will still succeed but models will raise an `ImportError` at fit-time. Use `rusket.check_gpu_available()` to test beforehand.
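
Independently of `rusket.check_gpu_available()`, you can check with the standard library whether either backend package is importable at all. The helper name `find_gpu_backend` is made up for this sketch, and the check only confirms that the package is installed, not that a working CUDA device is present.

```python
import importlib.util


def find_gpu_backend():
    """Return the name of the first installed GPU package, or None."""
    for name in ("cupy", "torch"):
        # find_spec locates the package without importing it
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = find_gpu_backend()
print(f"GPU backend package: {backend or 'none found'}")
```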
---
## ⚡ Benchmarks
> **Benchmark environment:** Apple Silicon MacBook Air (M-series, arm64, 8 GB RAM). All timings are single-run wall-clock measurements.
### Scale Benchmarks (1M to 200M rows)
> **What's measured:** `from_transactions()` converts long-format `(txn_id, item_id)` rows into a sparse OHE matrix. `fpgrowth()` then mines that matrix. Both input paths (long-format rows or a prebuilt sparse matrix) have the same Rust mining cost; the only difference at large scale is whether you pay the conversion cost upfront.
| Scale | `from_transactions` (conversion) | `fpgrowth` (mining) | **Total** |
|---|:---:|:---:|:---:|
| 1M rows | 4.9s | **0.1s** | **5.0s** |
| 10M rows | 23.2s | **1.2s** | **24.4s** |
| 50M rows | 59.1s | **4.0s** | **63.1s** |
| 100M rows (20M txns × 200k items) | 124.1s | **10.1s** | **134.2s** |
| **200M rows** (40M txns × 200k items) | 229.2s | **17.6s** | **246.8s** |
The mining step is fast; the bottleneck at scale is the long-format → sparse-matrix conversion. If your pipeline already produces a CSR/sparse matrix (e.g., from a Parquet/warehouse export), you skip the conversion entirely and only pay the mining cost.
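
To make the bottleneck concrete, the short script below computes the share of total time spent in conversion, using only the numbers from the table above. Conversion accounts for more than 90% of the total in every row.

```python
# (scale, conversion seconds, mining seconds) copied from the table above
timings = [
    ("1M", 4.9, 0.1),
    ("10M", 23.2, 1.2),
    ("50M", 59.1, 4.0),
    ("100M", 124.1, 10.1),
    ("200M", 229.2, 17.6),
]

shares = {}
for scale, convert_s, mine_s in timings:
    shares[scale] = convert_s / (convert_s + mine_s)
    print(f"{scale:>5} rows: conversion is {shares[scale]:.1%} of total time")
```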
#### Power-user path: Direct CSR → Rust
```python
import numpy as np
from scipy import sparse as sp
from rusket import FPGrowth
# Build CSR directly from integer IDs (no pandas!)
# txn_ids, item_ids, n_transactions, n_items and item_names come from your own pipeline
csr = sp.csr_matrix(
    (np.ones(len(txn_ids), dtype=np.int8), (txn_ids, item_ids)),
    shape=(n_transactions, n_items),
)
freq = FPGrowth(csr, item_names=item_names).mine(
    min_support=0.001, max_len=3, use_colnames=True
)
```
> At 100M rows, the mining step itself takes **10.1 seconds**. Building the CSR directly skips the `from_transactions` conversion cost (~124s) but does not change the mining time.
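
For readers unfamiliar with the CSR layout, here is a pure-Python sketch of what the long-format → CSR conversion produces. It is illustrative only and is not how `from_transactions()` is implemented; the helper name `long_to_csr` and the toy data are assumptions.

```python
from collections import defaultdict


def long_to_csr(txn_ids, item_ids, n_transactions):
    """Group (txn_id, item_id) pairs into CSR-style indptr/indices lists."""
    rows = defaultdict(set)
    for txn, item in zip(txn_ids, item_ids):
        rows[txn].add(item)  # a set drops repeated items within a basket
    indptr, indices = [0], []
    for txn in range(n_transactions):
        indices.extend(sorted(rows.get(txn, ())))
        indptr.append(len(indices))  # row `txn` spans indices[indptr[txn]:indptr[txn + 1]]
    return indptr, indices


# Toy data: three baskets, with item 1 repeated in basket 0
txn_ids = [0, 0, 1, 2, 2, 2, 0]
item_ids = [1, 3, 0, 1, 2, 3, 1]
indptr, indices = long_to_csr(txn_ids, item_ids, n_transactions=3)
print(indptr, indices)
```

Row `t` of the matrix is the slice `indices[indptr[t]:indptr[t + 1]]`, which is why a pipeline that already has these arrays can skip the conversion step.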
### Real-World Datasets
| Dataset | Transactions | Items | `rusket` |
|---------|:----------:|:-----:|:--------:|
| [andi_data.txt](https://github.com/andi611/Apriori-and-Eclat-Frequent-Itemset-Mining) | 8,416 | 119 | **9.7 s** (22.8M itemsets) |
| [andi_data2.txt](https://github.com/andi611/Apriori-and-Eclat-Frequent-Itemset-Mining) | 540,455 | 2,603 | **7.9 s** |
Run benchmarks yourself:
```bash
uv run pytest benchmarks/bench_scale.py -v -s # Scale benchmark
uv run python benchmarks/bench_realworld.py # Real-world datasets
uv run pytest tests/test_benchmark.py -v -s # pytest-benchmark
```
### Recommender Benchmarks vs LibRecommender
> **Measured with `pytest-benchmark`** (5 rounds, warmed up, GC disabled). MovieLens 100k dataset (943 users, 1,682 items, 100k ratings). Only `model.fit()` is timed; startup and data loading overhead are excluded.
| Benchmark | rusket | LibRecommender | **Speedup** |
|---|:---:|:---:|:---:|
| **ALS (Cholesky)** (64 factors, 15 epochs) | **427 ms** | 1,324 ms | **3.1×** |
| **ALS (eALS)** (64 factors, 15 epochs) | **360 ms** | *N/A* | - |
| **BPR** (64 factors, 10 epochs) | **33 ms** | 681 ms | **20.4×** |
| **ItemKNN** (k=100) | **55 ms** | 287 ms | **5.2×** |
| **SVD** (64 factors, 20 epochs) | **55 ms** | TF-only (broken) | - |
| **EASE** | **71 ms** | *N/A* | - |
> **Note:** LibRecommender requires TensorFlow + PyTorch + gensim + Cython (~2 GB of dependencies). rusket has **zero runtime dependencies**.
```bash
uv run pytest benchmarks/bench_pytest_librecommender.py -v --benchmark-columns=mean,stddev,rounds
```
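
The measurement protocol described above (warm-up, several rounds, GC disabled, only the fit call timed) is handled by `pytest-benchmark`. A hand-rolled, stdlib-only equivalent is sketched below to show what is and is not inside the timed region. `time_fit` and `dummy_fit` are hypothetical names; replace the dummy workload with a real `model.fit(...)` call.

```python
import gc
import time


def time_fit(fit, rounds=5, warmup=1):
    """Time `fit()` alone: warm up first, then run `rounds` timed calls with GC off."""
    for _ in range(warmup):
        fit()
    gc_was_enabled = gc.isenabled()
    gc.disable()
    try:
        timings = []
        for _ in range(rounds):
            start = time.perf_counter()
            fit()
            timings.append(time.perf_counter() - start)
    finally:
        if gc_was_enabled:
            gc.enable()  # restore the collector even if fit() raised
    return timings


def dummy_fit():
    # Placeholder workload standing in for model.fit()
    sum(i * i for i in range(50_000))


timings = time_fit(dummy_fit)
print(f"mean {sum(timings) / len(timings) * 1e3:.2f} ms over {len(timings)} rounds")
```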
---
## Architecture
### Data Flow
```
pandas dense          ──► np.uint8 array (C-contiguous)  ──► Rust fpgrowth_from_dense
pandas Arrow backend  ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
pandas sparse         ──► CSR int32 arrays               ──► Rust fpgrowth_from_csr
polars                ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense
numpy ndarray         ──► np.uint8 (C-contiguous)        ──► Rust fpgrowth_from_dense
```
All mining and rule generation happen **inside Rust**. No Python loops, no round-trips.
### The 1 Billion Row Architecture
To pass the "1 Billion Row" threshold without OOM crashes, `rusket` employs a zero-allocation mining loop:
- **Eclat Scratch Buffers:** `intersect_count_into` writes intersections directly into thread-local, pre-allocated byte buffers and computes `popcnt` in a single pass. It **exits early** the moment it proves a combination cannot reach `min_support`.
- **FPGrowth Parallel Tree Build:** Conditional FP-trees are collected concurrently inside the Rayon parallel mining step, replacing the standard sequential loop and eliminating memory contention bottlenecks.
- **`AHashMap` Deduplication:** Extremely fast O(N) duplicate basket counting replaces standard O(N log N) unstable sorts in the core pipeline.
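
Two of the ideas above, the scratch-buffer intersection with early exit and the hash-based duplicate counting, can be sketched in plain Python. This is a conceptual illustration and not the Rust code: bitsets are modelled as lists of integers, `WORD_BITS` and `count_duplicate_baskets` are assumed names, and `collections.Counter` stands in for `AHashMap`.

```python
from collections import Counter

WORD_BITS = 64  # each list element models one 64-bit word of a transaction bitset


def intersect_count_into(a, b, out, min_count):
    """AND bitsets `a` and `b` into the preallocated `out`.

    Returns the popcount of the intersection, or None as soon as the
    remaining words cannot lift the count to `min_count`.
    """
    count = 0
    n_words = len(a)
    for w in range(n_words):
        word = a[w] & b[w]
        out[w] = word                      # reuse the scratch buffer, no new list
        count += bin(word).count("1")      # popcount of this word
        max_remaining = (n_words - w - 1) * WORD_BITS
        if count + max_remaining < min_count:
            return None                    # early exit: support is unreachable
    return count


def count_duplicate_baskets(baskets):
    """Hash-based O(N) counting of identical baskets, with no sort of the basket list."""
    return Counter(frozenset(basket) for basket in baskets)


scratch = [0, 0]
support = intersect_count_into([0b1011, 0b0110], [0b0011, 0b0100], scratch, min_count=2)
pruned = intersect_count_into([0b1011, 0b0110], [0b0011, 0b0100], scratch, min_count=200)
basket_counts = count_duplicate_baskets([[1, 2], [2, 1], [3], [1, 2]])
print(support, pruned, basket_counts[frozenset({1, 2})])
```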
---
## 🧑‍💻 Development
### Prerequisites
- **Rust** 1.83+ (`rustup update`)
- **Python** 3.10+
- [**uv**](https://docs.astral.sh/uv/) (recommended package manager)
### Getting Started
```bash
# Clone
git clone https://github.com/bmsuisse/rusket.git
cd rusket
# Build Rust extension in dev mode
uv run maturin develop --release
# Run the full test suite
uv run pytest tests/ -x -q
# Type-check the Python layer
uv run pyright rusket/
# Cargo check (Rust)
cargo check
```
### Run Examples
```bash
# Getting started
uv run python examples/01_getting_started.py
# Market basket analysis with Faker
uv run python examples/02_market_basket_faker.py
# Polars input
uv run python examples/03_polars_input.py
# Sparse input
uv run python examples/04_sparse_input.py
# Large-scale mining (100k+ rows)
uv run python examples/05_large_scale.py
```
---
## 🤖 AI Disclosure
A large part of this library (including the Rust core algorithms, the Python wrappers, the OOP class hierarchy, and the Spark integration layer) was written with substantial assistance from **AI pair-programming tools** (specifically [Google Gemini / Antigravity](https://deepmind.google/technologies/gemini/)). Human review, benchmarking, and architectural decisions were applied throughout.
We believe in transparency about AI-assisted development. The algorithms are correct, the tests pass, and the performance numbers are real, but if you find a bug or a piece of "AI slop", please open an issue!
---
## License
[MIT License](LICENSE)