{"id":45190199,"url":"https://github.com/bmsuisse/rusket","last_synced_at":"2026-03-12T23:03:34.155Z","repository":{"id":339480265,"uuid":"1161995546","full_name":"bmsuisse/rusket","owner":"bmsuisse","description":"rusket 🦀🧺","archived":false,"fork":false,"pushed_at":"2026-02-28T00:32:18.000Z","size":303997,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-28T05:51:50.335Z","etag":null,"topics":["als","association-rules","bpr","collaborative-filtering","data-science","eclat","fp-growth","lightgcn","machine-learning","market-basket-analysis","matrix-factorization","personalization","pyo3","python","recommendation-engine","recommender-system","recsys","rust","sasrec","sequential-recommendation"],"latest_commit_sha":null,"homepage":"https://bmsuisse.github.io/rusket/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bmsuisse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-19T18:54:08.000Z","updated_at":"2026-02-28T00:32:20.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/bmsuisse/rusket","commit_stats":null,"previous_names":["bmsuisse/rusket"],"tags_count":70,"template":false,"template_full_name":null,"purl":"pkg:github/bmsuisse/rusket","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmsuisse%2Frusket","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/repositories/bmsuisse%2Frusket/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmsuisse%2Frusket/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmsuisse%2Frusket/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bmsuisse","download_url":"https://codeload.github.com/bmsuisse/rusket/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmsuisse%2Frusket/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29998203,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-02T09:59:02.300Z","status":"ssl_error","status_checked_at":"2026-03-02T09:59:02.001Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["als","association-rules","bpr","collaborative-filtering","data-science","eclat","fp-growth","lightgcn","machine-learning","market-basket-analysis","matrix-factorization","personalization","pyo3","python","recommendation-engine","recommender-system","recsys","rust","sasrec","sequential-recommendation"],"created_at":"2026-02-20T12:01:47.362Z","updated_at":"2026-03-12T23:03:34.147Z","avatar_url":"https://github.com/bmsuisse.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/logo_wide.svg\" 
alt=\"rusket logo\" width=\"520\" height=\"200\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eUltra-fast Recommender Engines \u0026 Market Basket Analysis for Python, written in Rust.\u003c/strong\u003e\u003cbr\u003e\n  \u003cem\u003eMade with ❤️ by the Data \u0026 AI Team.\u003c/em\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/rusket/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/rusket?color=%2334D058\u0026logo=pypi\u0026logoColor=white\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.python.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/python-3.10%2B-blue?logo=python\u0026logoColor=white\" alt=\"Python\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.rust-lang.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/rust-1.83%2B-orange?logo=rust\" alt=\"Rust\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-green\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://bmsuisse.github.io/rusket/\"\u003e\u003cimg src=\"https://img.shields.io/badge/docs-Zensical-blue\" alt=\"Docs\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n## 🎯 Goals\n\n| Goal | Details |\n|---|---|\n| ⚡ **Blazing fast** | All algorithms run in compiled Rust (via PyO3) with multi-threaded Rayon parallelism and SIMD-accelerated kernels. ALS is **11×**, and FP-Growth is **140×** faster than PySpark. |\n| 📦 **Zero dependencies** | No TensorFlow, no PyTorch, no JVM. A single ~3 MB wheel is all you need — `pip install rusket` and go. |\n| 🧑‍💻 **Easy to use** | Common cases are one-liners: `model.recommend_items(user_id)`, `model.recommend_users(item_id)`, `model.export_item_factors()` for vector/embedding export. No boilerplate. |\n| 🏗️ **Modern data stack** | Native Pandas, Polars, and Apache Spark support with zero-copy Arrow transfers. 
Works seamlessly with Delta Lake, Databricks, Snowflake, and any dbt/Parquet pipeline. |\n\n---\n\n\u003e **⚠️ Note:** `rusket` is currently under heavy construction. The API will probably change in upcoming versions.\n\n**rusket** is a modern, Rust-powered library for Market Basket Analysis and Recommender Engines. It delivers significant speed-ups and lower memory usage compared to traditional Python implementations, while natively supporting Pandas, Polars, and Spark out of the box.\n\n**Zero runtime dependencies.** No TensorFlow, no PyTorch, no JVM — just `pip install rusket` and go. The entire engine is compiled Rust, distributed as a single ~3 MB wheel.\n\nIt features Collaborative Filtering (ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE), Sequential Recommendation (FPMC, SASRec), Context-aware Prediction (FM), Pattern Mining (FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan), and built-in Hyperparameter Tuning (Optuna + MLflow tracking) with high performance and low memory footprints. Both functional and OOP APIs are available for seamless integration.\n\n---\n\n## ✨ Highlights\n\n| | `rusket` | `LibRecommender` | `implicit` | `pyspark.ml` |\n|---|---|---|---|---|\n| **Core language** | Rust (PyO3) | TF + PyTorch + Cython | Cython / C++ | Scala / Java (JVM) |\n| **Runtime deps** | **0** | TF + PyTorch + gensim (~2 GB) | OpenBLAS / MKL | JVM + Spark |\n| **Install size** | ~3 MB | ~2 GB | ~50 MB | ~300 MB |\n| **Algorithms** | ALS, BPR, SVD, LightGCN, ItemKNN, UserKNN, EASE, FM, FPMC, SASRec, FP-Growth, Eclat, FIN, LCM, HUPM, PrefixSpan | ALS, BPR, SVD, LightGCN, ItemCF, FM, DeepFM, ... 
| ALS, BPR | ALS, FP-Growth, PrefixSpan |\n| **Recommender API** | ✅ Hybrid Engine + i2i Similarity | ✅ | ✅ | ✅ (ALS only) |\n| **Graph \u0026 Embeddings** | ✅ NetworkX Export, Vector DB Export | ❌ | ❌ | ❌ |\n| **OOP class API** | ✅ `ALS.from_transactions(df).fit()` | ✅ | ✅ | ✅ |\n| **Pandas / Polars / Spark** | ✅ / ✅ / ✅ | ✅ / ❌ / ❌ | ❌ / ❌ / ❌ | ❌ / ❌ / ✅ |\n| **Parallel execution** | ✅ Rayon work-stealing | ✅ TF/PyTorch threads | ✅ OpenMP | ✅ Spark Cluster |\n| **Memory** | Low (native Rust buffers) | High (TF/PyTorch graphs) | Low (C++ arrays) | High (JVM overhead) |\n\n---\n\n## 📦 Installation\n\n```bash\npip install rusket\n# or with uv:\nuv add rusket\n```\n\n**Optional extras:**\n\n```bash\n# Polars support\npip install \"rusket[polars]\"\n\n# Pandas/NumPy support (usually already installed)\npip install \"rusket[pandas]\"\n```\n\n---\n\n## 🚀 Quick Start\n\n### \"Frequently Bought Together\" — Grocery Checkout Data\n\nIdentify which products co-occur most in customer baskets — the foundation of cross-sell widgets, promotional bundles, and shelf placement decisions.\n\n```python\nimport pandas as pd\nfrom rusket import FPGrowth\n\n# One week of supermarket checkout data (1 row = 1 receipt, 1 col = 1 SKU)\nreceipts = pd.DataFrame({\n    \"milk\":         [1, 1, 0, 1, 1, 0, 1],\n    \"bread\":        [1, 0, 1, 1, 0, 1, 1],\n    \"butter\":       [1, 0, 1, 0, 0, 1, 0],\n    \"eggs\":         [0, 1, 1, 0, 1, 0, 1],\n    \"coffee\":       [0, 1, 0, 0, 1, 1, 0],\n    \"orange_juice\": [1, 0, 0, 1, 0, 0, 1],\n}, dtype=bool)\n\n# Step 1 — which SKU combinations appear in ≥40% of receipts?\n\nmodel = FPGrowth(receipts, min_support=0.4)\nfreq = model.mine(use_colnames=True)\n\n# Step 2 — keep rules with ≥60% confidence\nrules = model.association_rules(metric=\"confidence\", min_threshold=0.6)\n\n# Lift \u003e 1 means customers buy these together more than chance alone\nprint(rules[[\"antecedents\", \"consequents\", \"support\", \"confidence\", \"lift\"]]\n      
.sort_values(\"lift\", ascending=False))\n```\n\n---\n\n### 🛒 E-Commerce Order Lines (Long Format)\n\nReal-world data arrives as `(order_id, sku)` rows from a database — not one-hot matrices.\n\nAll mining algorithms expose a class-based API that goes straight from order lines to recommendations:\n\n```python\nimport pandas as pd\nfrom rusket import FPGrowth\n\n# Order line export from your e-commerce backend\norders = pd.DataFrame({\n    \"order_id\": [1001, 1001, 1001, 1002, 1002, 1003, 1003],\n    \"sku\":      [\"HDPHONES\", \"USB_DAC\", \"AUX_CABLE\",\n                 \"HDPHONES\", \"CARRY_CASE\",\n                 \"USB_DAC\",  \"AUX_CABLE\"],\n})\n\nmodel = FPGrowth.from_transactions(\n    orders,\n    transaction_col=\"order_id\",\n    item_col=\"sku\",\n    min_support=0.3,\n)\n\nfreq  = model.mine(use_colnames=True)              # Miner classes: mine() never auto-fits\nrules = model.association_rules(metric=\"confidence\", min_threshold=0.6)\n\n# Which accessories should be suggested when headphones are in the cart?\nsuggestions = model.recommend_items([\"HDPHONES\"], n=3)\n# → e.g. [\"USB_DAC\", \"AUX_CABLE\", \"CARRY_CASE\"]\n```\n\nOr use the explicit type variants:\n\n```python\nfrom rusket import FPGrowth\n\nohe = FPGrowth.from_pandas(orders, transaction_col=\"order_id\", item_col=\"sku\")\nohe = FPGrowth.from_polars(pl_orders, transaction_col=\"order_id\", item_col=\"sku\")\nohe = FPGrowth.from_transactions([[\"HDPHONES\", \"USB_DAC\"], [\"HDPHONES\", \"CARRY_CASE\"]])  # list of lists\n```\n\n\u003e **Spark** is also supported: `FPGrowth.from_spark(spark_df)` calls `.toPandas()` internally.\n\n---\n\n### 🐻‍❄️ Polars Input — Reading from Data Lake Parquet\n\nFor teams running a modern data stack with Parquet files on S3/GCS/Azure Blob, `rusket` natively accepts [Polars](https://pola.rs/) DataFrames. 
Data is transferred via Arrow zero-copy buffers — **no conversion overhead**.\n\nThe fastest path from a data lake to \"Frequently Bought Together\" rules:\n\n```python\nimport polars as pl\nfrom rusket import FPGrowth\n\n# ── 1. Read a one-hot basket matrix directly from S3/GCS/local Parquet ──\n# Columns = SKUs (bool), rows = receipts — produced by your dbt or Spark pipeline\nbaskets = pl.read_parquet(\"s3://data-lake/gold/basket_ohe.parquet\")\nprint(f\"Loaded {baskets.shape[0]:,} receipts × {baskets.shape[1]} SKUs\")\n\n# ── 2. Instantiate FPGrowth (zero-copy from Polars) ─────────────────\nmodel = FPGrowth(baskets, min_support=0.02, max_len=3)\n\n# ── 3. Mine frequent combinations ────────────────────────────────────\nfreq = model.mine(use_colnames=True)\nprint(f\"Found {len(freq):,} frequent itemsets\")\nprint(freq.sort_values(\"support\", ascending=False).head(10))\n\n# ── 4. Generate cross-sell rules ────────────────────────────────────\nrules = model.association_rules(metric=\"lift\", min_threshold=1.2)\nprint(f\"Rules with lift \u003e 1.2: {len(rules):,}\")\nprint(\n    rules[[\"antecedents\", \"consequents\", \"confidence\", \"lift\"]]\n    .sort_values(\"lift\", ascending=False)\n    .head(8)\n)\n```\n\n\u003e **How it works under the hood:**  \n\u003e Polars → Arrow buffer → `np.uint8` (zero-copy) → Rust `fpgrowth_from_dense`\n\n---\n\n### 💎 High-Utility Pattern Mining (HUPM) — Profit-Driven Bundle Discovery\n\nFrequent items aren't always the most profitable. HUPM finds product combinations that generate the **highest total gross margin** — even if they appear rarely. 
`rusket` implements the state-of-the-art **EFIM** algorithm in Rust.\n\n```python\nimport pandas as pd\nfrom rusket import HUPM\n\n# Specialty foods retailer: receipt line items with gross margin per unit sold\norders = pd.DataFrame({\n    \"receipt_id\": [1, 1, 1, 2, 2, 3, 3],\n    \"product\": [\"aged_cheese\", \"wine_flight\", \"charcuterie\",\n                \"aged_cheese\", \"charcuterie\",\n                \"wine_flight\", \"charcuterie\"],\n    \"margin\": [8.50, 12.00, 6.50,   # receipt 1 — margin per item\n               8.50, 6.50,           # receipt 2\n               12.00, 6.50],         # receipt 3\n})\n\n# Find all product bundles generating ≥ €20 total margin across all receipts\nhigh_margin = HUPM.from_transactions(\n    orders,\n    transaction_col=\"receipt_id\",\n    item_col=\"product\",\n    utility_col=\"margin\",\n    min_utility=20.0,\n).mine()\nprint(high_margin.head())\n# e.g. aged_cheese + wine_flight + charcuterie → total margin 27.0 (the full bundle occurs only in receipt 1)\n```\n\n---\n\n### 📊 Sparse Pandas Input\n\nFor very sparse datasets (e.g. e-commerce with thousands of SKUs), use Pandas `SparseDtype` to minimize memory. 
`rusket` passes the raw CSR arrays straight to Rust — **no densification ever happens**.\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom rusket import FPGrowth\n\nrng = np.random.default_rng(7)\nn_rows, n_cols = 30_000, 500\n\n# Very sparse: average basket size ≈ 3 items out of 500\np_buy = 3 / n_cols\nmatrix = rng.random((n_rows, n_cols)) \u003c p_buy\nproducts = [f\"sku_{i:04d}\" for i in range(n_cols)]\n\ndf_dense = pd.DataFrame(matrix.astype(bool), columns=products)\ndf_sparse = df_dense.astype(pd.SparseDtype(\"bool\", fill_value=False))\n\ndense_mb = df_dense.memory_usage(deep=True).sum() / 1e6\nsparse_mb = df_sparse.memory_usage(deep=True).sum() / 1e6\nprint(f\"Dense  memory: {dense_mb:.1f} MB\")\nprint(f\"Sparse memory: {sparse_mb:.1f} MB  ({dense_mb / sparse_mb:.1f}× smaller)\")\n\n# Same API, same results — just faster and lighter\nfreq = FPGrowth(df_sparse, min_support=0.01).mine(use_colnames=True)\nprint(f\"Frequent itemsets: {len(freq):,}\")\n```\n\n\u003e **How it works under the hood:**  \n\u003e Sparse DataFrame → COO → CSR → `(indptr, indices)` → Rust `fpgrowth_from_csr`\n\n---\n\n### 🌊 Out-of-Core Processing (FPMiner Streaming)\n\nFor datasets scaling to **billion-row** sizes that don't fit in memory, use the `FPMiner` accumulator. It accepts chunks of `(txn_id, item_id)` pairs, sorts each chunk in place as it arrives, and uses a memory-safe **k-way merge** across all chunks to build the CSR matrix on the fly, avoiding large memory spikes.\n\n```python\nimport numpy as np\nfrom rusket import FPMiner\n\nn_items = 5_000\nminer = FPMiner(n_items=n_items)\n\n# Feed chunks incrementally (e.g. 
from Parquet/CSV/SQL)\nfor chunk in dataset:\n    txn_ids = chunk[\"txn_id\"].to_numpy(dtype=np.int64)\n    item_ids = chunk[\"item_id\"].to_numpy(dtype=np.int32)\n    \n    # Fast O(k log k) per-chunk sort\n    miner.add_chunk(txn_ids, item_ids)\n\n# Stream k-way merge and mine in one pass!\n# Returns a DataFrame with 'support' and 'itemsets' just like fpgrowth()\nfreq = miner.mine(min_support=0.001, max_len=3)\n```\n\n**Memory efficiency:** The peak memory overhead at `mine()` time is just $O(k)$ for the cursors (where $k$ is the number of chunks), plus the final compressed CSR allocation. \n\n---\n\n### 🌩️ Distributed Computing with Apache Spark\n\n`rusket` ships a full Spark integration layer in `rusket.spark`. All algorithms run as **Native Arrow UDFs** via `applyInArrow` — Rust is called directly on each executor, with zero Python overhead per row.\n\n#### How it works\n\n```\nPySpark DataFrame\n  └─► groupby(group_col).applyInArrow(...)\n        └─► Arrow Table (per partition / per group)\n              └─► Polars zero-copy conversion\n                    └─► rusket Rust extension (on the executor)\n                          └─► results → PyArrow → PySpark DataFrame\n```\n\n#### Full Example — Retail Basket Analysis per Store\n\n```python\nfrom pyspark.sql import SparkSession\nfrom rusket.spark import mine_grouped, rules_grouped\n\nspark = SparkSession.builder.appName(\"rusket-demo\").getOrCreate()\n\n# ── 1. Load your OHE transaction table (one row = one basket) ──────────────\n#    Schema: store_id (string), bread (bool), butter (bool), milk (bool), ...\nspark_df = spark.read.parquet(\"s3://data/baskets/\")\n\n# ── 2. 
Mine frequent itemsets per store in parallel ──────────────────────────\n#    Each Spark task calls the Rust FP-Growth/Eclat engine on its Arrow batch.\nfreq_df = mine_grouped(\n    spark_df,\n    group_col=\"store_id\",\n    min_support=0.05,    # 5% support per store\n\n)\n# freq_df schema: store_id | support (double) | itemsets (array\u003cstring\u003e)\n\n# ── 3. Count transactions per store (needed for rule support) ────────────────\nfrom pyspark.sql import functions as F\ncounts = (\n    spark_df.groupby(\"store_id\")\n    .agg(F.count(\"*\").alias(\"n\"))\n    .rdd.collectAsMap()          # {\"store_1\": 12000, \"store_2\": 8500, ...}\n)\n\n# ── 4. Generate association rules per store ──────────────────────────────────\nrules_df = rules_grouped(\n    freq_df,\n    group_col=\"store_id\",\n    num_itemsets=counts,         # pass per-group counts as a dict\n    metric=\"confidence\",\n    min_threshold=0.6,\n)\n# rules_df schema: store_id | antecedents | consequents | confidence | lift | ...\n\nrules_df.orderBy(\"lift\", ascending=False).show(10, truncate=False)\n```\n\n#### Sequential Patterns per Category\n\n```python\nfrom rusket.spark import prefixspan_grouped\n\n# event_log schema: category_id, user_id, item_id, event_ts\nevent_log = spark.read.parquet(\"s3://data/events/\")\n\nseq_df = prefixspan_grouped(\n    event_log,\n    group_col=\"category_id\",   # mine independently per product category\n    user_col=\"user_id\",        # sequence identifier within the group\n    time_col=\"event_ts\",       # ordering column\n    item_col=\"item_id\",\n    min_support=50,            # absolute count: pattern must appear in ≥50 sessions\n    max_len=4,\n)\n# seq_df schema: category_id | support (long) | sequence (array\u003cstring\u003e)\nseq_df.show(5, truncate=False)\n```\n\n#### High-Utility Patterns per Region\n\n```python\nfrom rusket.spark import hupm_grouped\n\n# profit_log schema: region_id, txn_id, item_id, profit\nprofit_log = 
spark.read.parquet(\"s3://data/profit/\")\n\nutility_df = hupm_grouped(\n    profit_log,\n    group_col=\"region_id\",\n    transaction_col=\"txn_id\",\n    item_col=\"item_id\",\n    utility_col=\"profit\",\n    min_utility=500.0,         # only itemsets with combined profit ≥ €500\n)\n# utility_df schema: region_id | utility (double) | itemset (array\u003clong\u003e)\nutility_df.show(5, truncate=False)\n```\n\n#### Batch Recommendations across the Cluster\n\n```python\nfrom rusket.spark import recommend_batches\nfrom rusket import ALS\n\n# 1. Train an ALS model locally (or load a pre-trained one)\nals = ALS.from_transactions(\n    events_pd,\n    user_col=\"user_id\",\n    item_col=\"item_id\",\n).fit()  # ← always call .fit() after from_transactions()\n\n# 2. Scale-out scoring: one recommendation row per user\nuser_df = spark.read.parquet(\"s3://data/users/\").select(\"user_id\")\n\nrecs_df = recommend_batches(user_df, model=als, user_col=\"user_id\", k=10)\n# recs_df schema: user_id (string) | recommended_items (array\u003cint\u003e)\nrecs_df.show(5, truncate=False)\n```\n\n\u003e **Tip — Databricks / Delta Lake:** All functions return a standard PySpark DataFrame, so you can write results back with `.write.format(\"delta\").save(...)` or `.saveAsTable(...)` directly.\n\n---\n\n## 📖 API Reference\n\n### OOP Class API\n\nEvery algorithm in `rusket` exposes a **class-based API** in addition to the functional helpers. 
All classes share a unified interface inherited from `BaseModel`:\n\n| Class | Inherits from | Description |\n|-------|--------------|-------------|\n| `FPGrowth` | `Miner`, `RuleMinerMixin` | FP-Growth (Frequent Pattern Growth): parallel FP-tree mining |\n| `Eclat` | `Miner`, `RuleMinerMixin` | Vertical bitset mining |\n| `FIN` | `Miner`, `RuleMinerMixin` | FP-tree Node-list intersection mining |\n| `LCM` | `Miner`, `RuleMinerMixin` | Linear-time Closed itemset Mining |\n| `HUPM` | `Miner` | High-Utility Pattern Mining (EFIM) |\n| `PrefixSpan` | `Miner` | Sequential pattern mining |\n| `ALS` | `ImplicitRecommender` | Alternating Least Squares CF |\n| `BPR` | `ImplicitRecommender` | Bayesian Personalized Ranking CF |\n| `SVD` | `ImplicitRecommender` | Funk SVD (biased SGD) |\n| `LightGCN` | `ImplicitRecommender` | Graph Convolutional CF |\n| `ItemKNN` | `ImplicitRecommender` | Item-based k-NN CF |\n| `UserKNN` | `ImplicitRecommender` | User-based k-NN CF |\n| `EASE` | `ImplicitRecommender` | Embarrassingly Shallow Autoencoders |\n| `FM` | `BaseModel` | Factorization Machines (CTR prediction) |\n| `FPMC` | `SequentialRecommender` | Factorizing Personalized Markov Chains |\n| `SASRec` | `SequentialRecommender` | Self-Attentive Sequential Recommendation |\n| `HybridEmbeddingIndex` | — | CF + semantic embedding fusion |\n\nAll classes share the following data-ingestion class methods inherited from `BaseModel`:\n\n```python\n# Load from long-format (transaction_id, item_id) DataFrame or list of lists\nmodel = FPGrowth.from_transactions(df, transaction_col=\"order_id\", item_col=\"item\", min_support=0.3)\n\n# Typed convenience aliases — same result\nmodel = FPGrowth.from_pandas(df,  ...)\nmodel = FPGrowth.from_polars(pl_df, ...)\nmodel = FPGrowth.from_spark(spark_df, ...)\n```\n\n`Miner` subclasses that also include `RuleMinerMixin` (`FPGrowth`, `Eclat`, `FIN`, `LCM`) support a fluent pipeline:\n\n```python\nmodel  = 
FPGrowth.from_transactions(df, min_support=0.3)\nfreq   = model.mine(use_colnames=True)             # pd.DataFrame [support, itemsets]\nrules  = model.association_rules(metric=\"lift\")    # pd.DataFrame [antecedents, consequents, ...]\nrecs   = model.recommend_items([\"bread\", \"milk\"])  # list of suggested items\n```\n\n`ImplicitRecommender` subclasses (`ALS`, `BPR`, `SVD`, `LightGCN`, `ItemKNN`, `UserKNN`, `EASE`) follow the **scikit-learn** `fit()`/`predict()` pattern.\n`SequentialRecommender` subclasses (`FPMC`, `SASRec`) use `from_transactions(..., time_col=...).fit()` for sequential next-item prediction:\n\n```python\n# Option A — construct then fit with a sparse matrix\nmodel = ALS(factors=64, iterations=15)\nmodel.fit(user_item_csr)\n\n# Option B — from event log, then explicit .fit()\nmodel = ALS(factors=64).from_transactions(\n    df, user_col=\"user_id\", item_col=\"item_id\"\n).fit()  # ← .fit() is always required\n\n# Predict / recommend\nitems, scores = model.recommend_items(user_id=42, n=10, exclude_seen=True)\nusers, scores = model.recommend_users(item_id=99, n=5)\n```\n\n\u003e **Breaking change vs older versions:** `from_transactions()` no longer auto-fits.\n\u003e Always chain `.fit()` after it.\n\n\n\n## 🧠 Advanced Pattern \u0026 Recommendation Algorithms\n\n`rusket` provides more than just basic market basket analysis. It includes an entire suite of modern algorithms and a high-level Business Recommender API.\n\n### 🎯 ItemKNN \u0026 UserKNN — Nearest-Neighbor Collaborative Filtering\n\nTwo complementary memory-based methods that consistently rank among the **top performers** in academic benchmarks (see [Anelli et al. 2022](https://arxiv.org/abs/2203.01155)).\n\n- **ItemKNN** — Finds items similar to what the user already liked. Fast, stable, and scales well with pre-computed item-item similarity.\n- **UserKNN** — Finds users similar to the target user and recommends what they liked. 
Often more serendipitous and performs particularly well on dense datasets.\n\nBoth support **BM25**, **TF-IDF**, **Cosine**, and raw **Count** weighting, with the top-K neighbor pruning running in parallel Rust.\n\n```python\nfrom rusket import ItemKNN, UserKNN\n\n# ── Item-based: \"Customers who bought X also bought Y\" ────────────\nitem_knn = ItemKNN.from_transactions(\n    purchases, user_col=\"user_id\", item_col=\"item_id\",\n    method=\"bm25\", k=100,\n).fit()\nitems, scores = item_knn.recommend_items(user_id=42, n=10)\n\n# ── User-based: \"Users similar to you enjoyed these items\" ────────\nuser_knn = UserKNN.from_transactions(\n    purchases, user_col=\"user_id\", item_col=\"item_id\",\n    method=\"cosine\", k=50,\n).fit()\nitems, scores = user_knn.recommend_items(user_id=42, n=10)\n```\n\n\u003e **Which one to choose?** Start with `ItemKNN(method=\"bm25\")` — it's the fastest and most stable. Switch to `UserKNN` if you have a dense dataset or want more diverse recommendations. In production, try both and evaluate with `rusket.evaluate()`.\n\n### 🎯 ALS \u0026 BPR Collaborative Filtering\n\nBoth models learn user and item embeddings from **implicit feedback** (purchases, clicks, plays) and power personalised recommendations at scale. 
Use **ALS** for broad serendipitous discovery; use **BPR** when you care only about top-N ranking.\n\n```python\nfrom rusket import ALS, BPR\n\n# ── \"For You\" homepage — music streaming platform ────────────────────\n# event log: user_id | track_id | plays (optional weight)\nplays = pd.DataFrame({\n    \"user_id\":  [101, 101, 102, 102, 103, 103, 103],\n    \"track_id\": [\"T01\", \"T03\", \"T01\", \"T05\", \"T02\", \"T03\", \"T05\"],\n    \"plays\":    [12, 5, 8, 3, 20, 1, 7],  # play count as confidence weight\n})\n\nals = ALS(factors=64, iterations=15, alpha=40.0).from_transactions(\n    plays, user_col=\"user_id\", item_col=\"track_id\", rating_col=\"plays\"\n).fit()  # ← always call .fit() after from_transactions()\n\n# Top-10 tracks for user 101, excluding already-played tracks\ntracks, scores = als.recommend_items(user_id=101, n=10, exclude_seen=True)\n\n# Which users are most likely to enjoy track T05? — useful for email campaigns\nusers, scores = als.recommend_users(item_id=\"T05\", n=50)\n\n# BPR — optimise ranking directly rather than reconstruction\nbpr = BPR(factors=64, learning_rate=0.05, iterations=150).fit(user_item_csr)\n```\n\n### 🎯 Hybrid Recommender API\n\nCombine **Collaborative Filtering** (ALS/BPR) with **Frequent Pattern Mining** to cover every placement surface — personalised homepage (\"For You\") and active cart (\"Frequently Bought Together\") — in a single engine.\n\n```python\nfrom rusket import ALS, Recommender, FPGrowth\n\n# 1. Train on purchase history (implicit feedback)\nals = ALS(factors=64, iterations=15).fit(user_item_csr)\n\n# 2. Mine co-purchase rules from basket data\nminer = FPGrowth(basket_ohe, min_support=0.01)\nfreq  = miner.mine()\nrules = miner.association_rules()\n\n# 3. Create the Hybrid Engine\nrec = Recommender(model=als, rules_df=rules)\n\n# \"For You\" homepage — personalised for customer 1001\nitems, scores = rec.recommend_for_user(user_id=1001, n=5)\n\n# Blend CF + product embeddings (e.g. 
from a PIM or sentence-transformer)\nitems, scores = rec.recommend_for_user(user_id=1001, n=5, alpha=0.7,\n                                       target_item_for_semantic=\"HDPHONES\")\n\n# Active cart cross-sell — \"Frequently Bought Together\"\nadd_ons = rec.recommend_for_cart([\"USB_DAC\", \"AUX_CABLE\"], n=3)\n\n# Overnight batch — score all customers, write to CRM\nbatch_df = rec.predict_next_chunk(user_history_df, user_col=\"customer_id\", k=5)\n```\n\n### 🧬 Hybrid Embedding Fusion — CF + Semantic in One Vector Space\n\nCollaborative filtering embeddings capture *behavioral* signals (who bought what); semantic text embeddings capture *content* meaning (product descriptions). Fusing them into a **single vector space** lets you do ANN retrieval, vector DB export, and clustering in one shot.\n\n```python\nimport rusket\n\n# 1. Train ALS on implicit feedback\nals = rusket.ALS(factors=64, iterations=15).fit(interactions)\n\n# 2. Get semantic embeddings (e.g. from sentence-transformers)\nfrom sentence_transformers import SentenceTransformer\nencoder = SentenceTransformer(\"all-MiniLM-L6-v2\")\ntext_vectors = encoder.encode(product_descriptions)  # (n_items, 384)\n\n# 3. Fuse into a single hybrid vector space\nhybrid = rusket.HybridEmbeddingIndex(\n    cf_embeddings=als.item_factors,       # (n_items, 64)\n    semantic_embeddings=text_vectors,      # (n_items, 384)\n    strategy=\"weighted_concat\",            # \"concat\" | \"weighted_concat\" | \"projection\"\n    alpha=0.6,                             # 60% CF, 40% semantic\n)\n\n# 4. Similar items via cosine on the fused space\nids, scores = hybrid.query(item_id=42, n=10)\n\n# 5. Build an ANN index for sub-millisecond retrieval\nann = hybrid.build_ann_index(backend=\"native\")  # or \"faiss\"\n\n# 6. Export to a vector DB for production serving\nhybrid.export_vectors(qdrant_client, collection_name=\"hybrid_items\")\n\n# 7. 
Or export as separate named vectors for DB-side fusion\nhybrid.export_vectors(qdrant_client, mode=\"multi\", collection_name=\"hybrid_items\")\n# → Qdrant/Meilisearch/Weaviate store \"cf\" and \"semantic\" as separate named vectors\n```\n\nThree fusion strategies:\n\n| Strategy | Description | Use Case |\n|---|---|---|\n| `\"concat\"` | L2-normalise each space, concatenate | Equal importance, no tuning |\n| `\"weighted_concat\"` | Scale by `α` / `1−α`, then concat | **Default** — tune `alpha` to balance CF vs semantic |\n| `\"projection\"` | Concat + PCA to `projection_dim` | Compact vectors for large-scale deployment |\n\n\u003e **Standalone function:** If you just need the fused matrix without an index, use `rusket.fuse_embeddings(cf, sem, strategy=\"weighted_concat\", alpha=0.6)`.\n\n### 🎯 Multi-Stage Recommendation Pipeline\n\nFor production systems requiring advanced retrieval and ranking, use the `Pipeline` class. This mirrors the \"retrieve → rerank → filter\" paradigm used by Twitter/X and modern ML stacks.\n\nIt chains multiple models together:\n1. **Retrieve:** Candidate generation\n2. **Rerank:** Re-score candidates using a heavier scoring function\n3. **Filter:** Apply business rules (e.g. exclude out-of-stock items, diversify)\n\n```python\nfrom rusket import ALS, BPR, Pipeline, RuleBasedRecommender\nimport pandas as pd\n\n# 1. Train multiple base models\nals = ALS(factors=64).fit(interactions)\nbpr = BPR(factors=128).fit(interactions)\n\n# 2. Define explicit business rules (e.g. promoting warranties with laptops)\nrules_df = pd.DataFrame({\n    \"antecedent\": [\"102\"],   # Laptop SKU\n    \"consequent\": [\"999\"],   # Warranty SKU\n    \"score\": [2.0]\n})\nrules = RuleBasedRecommender.from_transactions(\n    interactions, rules=rules_df, user_col=\"user\", item_col=\"item\"\n).fit()\n\n# 3. 
Compose the Pipeline (Retrieve from ALS, rerank with deeper BPR vectors)\n# Items from the `rules` model receive an artificial +1,000,000 score \n# ensuring they rank at the top *after* the algorithmic reranking.\npipeline = Pipeline(\n    retrieve=[als, bpr],\n    merge_strategy=\"max\",  # how to combine candidate scores\n    rerank=bpr,\n    rules=rules, \n)\n\n# Recommend for a user\nitems, scores = pipeline.recommend(user_id=42, n=10, exclude_seen=True)\n\n# Blazing-fast Batch Scoring utilizing Rust inner loops\nbatch_recs = pipeline.recommend_batch(\n    user_ids=[1, 2, 3],\n    n=10,\n    format=\"polars\"  # Returns a native Polars DataFrame instantly\n)\n```\n\n### 💾 Saving, Loading and Serving (LanceDB / Vector DBs)\n\n`rusket` models use a unified `BaseModel` that provides `.save()` and `.load()` functionality. You can also export trained models to a Vector Database for fast, real-time serving in production. We even provide `load_model` which automatically infers the model architecture from the pickle file.\n\n```python\nimport rusket\n\n# 1. Train the model\nmodel = rusket.ALS(factors=32).fit(interactions)\n\n# 2. Save your trained model to disk\nmodel.save(\"my_als_model.pkl\")\n\n# 3. Load it back using the generic loader\nloaded_model = rusket.load_model(\"my_als_model.pkl\")\n\n# 4. Export the embeddings for a Vector Database\nitems_df = rusket.export_item_factors(\n    loaded_model, \n    normalize=True,     # Best for Cosine Similarity search\n    format=\"pandas\"\n)\n\n# 5. 
Serve it in real-time (Example using LanceDB)\nimport lancedb\n\n# Create a local vector database\ndb = lancedb.connect(\"./lancedb_store\")\ntable = db.create_table(\"items\", data=items_df)\n\n# Query the table with a specific user's latent factors\nuser_emb = loaded_model.user_factors[0]\n\n# Retrieve top 5 item recommendations for this user using L2-normalized vector search!\nresults = table.search(user_emb).limit(5).to_pandas()\n```\n\n### 🔍 Analytics Helpers\n\n```python\nfrom rusket import find_substitutes, customer_saturation\n\n# Identify cannibalizing SKUs (lift \u003c 1.0) for assortment rationalisation\nsubs = find_substitutes(rules_df, max_lift=0.8)\n#  antecedents  consequents  lift\n#  (Cola A,)    (Cola B,)    0.61   ← these products hurt each other's sales\n\n# Segment customers by category penetration (decile 10 = buy everything; 1 = barely engaged)\nsaturation = customer_saturation(\n    purchases_df, user_col=\"customer_id\", category_col=\"category_id\"\n)\n```\n\n### 📈 BPR \u0026 Sequential Patterns\n\n- **BPR (Bayesian Personalized Ranking):** Directly optimises ranking of positive interactions over negative ones — ideal for newsfeeds, playlists, and app recommendation surfaces that prioritise top-N precision.\n- **Sequential Pattern Mining (PrefixSpan):** Discovers ordered patterns across time (e.g., \"Subscriber signed up for broadband → mobile plan → premium bundle\" or \"Customer viewed Camera → 2 weeks later bought Lens\"). 
\n\n`rusket` natively extracts PrefixSpan sequences from **Pandas, Polars, and PySpark** event logs with zero-copy Arrow mapping:\n\n```python\nfrom rusket import PrefixSpan\n\n# Telco product adoption journeys — what sequence of subscriptions do customers follow?\n# df: customer_id | subscription_date | product_id\nmodel = PrefixSpan.from_transactions(\n    subscription_events,\n    transaction_col=\"customer_id\",\n    item_col=\"product_id\",\n    time_col=\"subscription_date\",\n    min_support=50,    # at least 50 customers follow this path\n    max_len=4,\n)\nfreq_seqs = model.mine()\n# e.g. [broadband] → [mobile] → [tv_bundle] appears in 312 journeys\n```\n\n\n\n### 🕸️ Graph Analytics \u0026 Embeddings\n\nIntegrate natively with the modern GenAI/LLM stack:\n\n- **Vector Export:** Export user/item factors to a Pandas `DataFrame` ready for FAISS/Qdrant using `model.export_item_factors()`.\n- **Item-to-Item Similarity:** Fast Cosine Similarity on embeddings using `model.similar_items(item_id)`.\n- **Graph Generation:** Automatically convert association rules into a `networkx` directed Graph for community detection using `rusket.viz.to_networkx(rules)`.\n\n---\n\n### 🔬 MLOps: MLflow Tracking \u0026 Hyperparameter Tuning\n\n`rusket` has built-in support for [MLflow](https://mlflow.org/) experiment tracking, `mlflow.pyfunc` packaging, and Bayesian hyperparameter optimisation using [Optuna](https://optuna.org/)'s TPE sampler. For **ALS/eALS** models, each Optuna trial runs the Rust-native cross-validation backend — making the entire search blazingly fast.\n\n```python\nimport rusket\nimport rusket.mlflow\nfrom rusket import OptunaSearchSpace\n\n# ── 1. Enable MLflow Autologging ─────────────────────────────────────\nrusket.mlflow.autolog()\n\n# ── 2. 
Train a single model with automatic tracking ──────────────────\n# Hyperparameters (factors, iterations) and training_duration_seconds are logged!\nimport mlflow\nwith mlflow.start_run():\n    model = rusket.ALS(factors=64, iterations=15).fit(df)\n\n# Save/Load models as native MLflow pyfunc artifacts for easy deployment\nrusket.mlflow.save_model(model, \"my_als_model\")\nloaded_model = mlflow.pyfunc.load_model(\"my_als_model\")  # Has a .predict(df) method\n\n# ── 3. Quick hyperparameter search with sensible defaults ───────────\nresult = rusket.optuna_optimize(\n    rusket.ALS,\n    df,\n    user_col=\"user_id\",\n    item_col=\"item_id\",\n    n_trials=50,\n    metric=\"ndcg\",\n    k=10,\n)\nprint(f\"Best ndcg@10: {result.best_score:.4f}\")\nprint(f\"Best params:  {result.best_params}\")\n\n# ── Custom search space + refit best model ───────────────────────────\nresult = rusket.optuna_optimize(\n    rusket.eALS,\n    df,\n    user_col=\"user_id\",\n    item_col=\"item_id\",\n    search_space=[\n        OptunaSearchSpace.int(\"factors\", 16, 256, log=True),\n        OptunaSearchSpace.float(\"alpha\", 1.0, 100.0, log=True),\n        OptunaSearchSpace.float(\"regularization\", 1e-4, 1.0, log=True),\n        OptunaSearchSpace.int(\"iterations\", 5, 30),\n    ],\n    n_trials=100,\n    n_folds=3,\n    metric=\"precision\",\n    refit_best=True,  # best model is already fitted\n)\nitems, scores = result.best_model.recommend_items(user_id=42, n=10)\n\n# ── MLflow experiment tracking ───────────────────────────────────────\n# pip install mlflow optuna-integration\nimport mlflow\n\nmlflow.set_tracking_uri(\"http://localhost:5000\")\nmlflow.set_experiment(\"als-tuning\")\n\nresult = rusket.optuna_optimize(\n    rusket.ALS, df,\n    user_col=\"user_id\", item_col=\"item_id\",\n    n_trials=50, metric=\"ndcg\",\n    mlflow_tracking=True,   # ← every trial logged to MLflow\n)\n\n# ── Custom callbacks ─────────────────────────────────────────────────\nresult = 
rusket.optuna_optimize(\n    rusket.ALS, df,\n    user_col=\"user_id\", item_col=\"item_id\",\n    n_trials=50,\n    callbacks=[my_custom_callback],  # any Optuna-compatible callback\n)\n```\n---\n\n### 🚀 GPU Acceleration (CUDA)\n\n`rusket` supports optional GPU acceleration via **CuPy** or **PyTorch CUDA** for models that benefit from large matrix operations. Enable it globally with a single call — no need to pass `use_gpu=True` to every model.\n\n```python\nimport rusket\n\n# Enable GPU globally — every model created after this uses CUDA\nrusket.enable_gpu()\n\n# All models now default to GPU\nals = rusket.ALS(factors=128, iterations=20).fit(interactions)\nease = rusket.EASE(regularization=500).fit(interactions)\nbpr = rusket.BPR(factors=64).fit(interactions)\n\n# Per-model override: force a specific model to CPU\nsmall_model = rusket.SVD(factors=16, use_gpu=False)\n\n# Turn it off globally\nrusket.disable_gpu()\n\n# Check the current state\nrusket.is_gpu_enabled()  # → False\n```\n\n#### Supported Models\n\nAll 12 recommender models respect the global GPU flag:\n\n| Model | GPU-accelerated operations |\n|-------|--------------------------|\n| **ALS / eALS** | Gramian, Cholesky solve, batch scoring |\n| **BPR** | SGD updates, batch recommend |\n| **SVD** | Factor updates, batch scoring |\n| **EASE** | Gram matrix inversion |\n| **ItemKNN / UserKNN** | Similarity scoring |\n| **LightGCN** | Graph convolution, scoring |\n| **FM** | Prediction |\n| **FPMC** | Factor updates |\n| **SASRec / BERT4Rec** | Attention forward pass |\n| **NMF** | Multiplicative updates |\n\n#### Installation\n\n```bash\n# CuPy (recommended — fastest)\npip install cupy-cuda12x\n\n# Or PyTorch\npip install torch\n```\n\n\u003e **No GPU? No problem.** `rusket` auto-detects whether a GPU backend is available. If neither CuPy nor PyTorch CUDA is installed, `enable_gpu()` will still succeed but models will raise an `ImportError` at fit-time. 
Use `rusket.check_gpu_available()` to test beforehand.\n\n---\n\n## ⚡ Benchmarks\n\n\u003e **Benchmark environment:** Apple Silicon MacBook Air (M-series, arm64, 8 GB RAM). All timings are single-run wall-clock measurements.\n\n### Scale Benchmarks (1M → 200M rows)\n\n\u003e **What's measured:** `from_transactions()` converts long-format `(txn_id, item_id)` rows into a sparse one-hot-encoded (OHE) matrix. `fpgrowth()` then mines that matrix. The Rust mining cost is the same whether the matrix comes from `from_transactions()` or is built directly; the only difference at large scale is whether you pay the conversion cost upfront.\n\n| Scale | `from_transactions` (conversion) | `fpgrowth` (mining) | **Total** |\n|---|:---:|:---:|:---:|\n| 1M rows | 4.9s | **0.1s** | **5.0s** |\n| 10M rows | 23.2s | **1.2s** | **24.4s** |\n| 50M rows | 59.1s | **4.0s** | **63.1s** |\n| 100M rows (20M txns × 200k items) | 124.1s | **10.1s** | **134.2s** |\n| **200M rows** (40M txns × 200k items) | 229.2s | **17.6s** | **246.8s** |\n\nThe mining step is fast — the bottleneck at scale is the long-format → sparse-matrix conversion. If your pipeline already produces a CSR/sparse matrix (e.g., from a Parquet/warehouse export), you skip the conversion entirely and only pay the mining cost.\n\n#### Power-user path: Direct CSR → Rust\n\n```python\nimport numpy as np\nfrom scipy import sparse as sp\nfrom rusket import FPGrowth\n\n# Build CSR directly from integer IDs (no pandas!)\ncsr = sp.csr_matrix(\n    (np.ones(len(txn_ids), dtype=np.int8), (txn_ids, item_ids)),\n    shape=(n_transactions, n_items),\n)\nfreq = FPGrowth(csr, item_names=item_names).mine(\n    min_support=0.001, max_len=3, use_colnames=True\n)\n```\n\n\u003e At 100M rows, the mining step itself takes **10.1 seconds**. 
Building the CSR directly skips the `from_transactions` conversion cost (~124s) but does not change the mining time.\n\n### Real-World Datasets\n\n| Dataset | Transactions | Items | `rusket` |\n|---------|:----------:|:-----:|:--------:|\n| [andi_data.txt](https://github.com/andi611/Apriori-and-Eclat-Frequent-Itemset-Mining) | 8,416 | 119 | **9.7 s** (22.8M itemsets) |\n| [andi_data2.txt](https://github.com/andi611/Apriori-and-Eclat-Frequent-Itemset-Mining) | 540,455 | 2,603 | **7.9 s** |\n\nRun benchmarks yourself:\n\n```bash\nuv run pytest benchmarks/bench_scale.py -v -s   # Scale benchmark\nuv run python benchmarks/bench_realworld.py     # Real-world datasets\nuv run pytest tests/test_benchmark.py -v -s      # pytest-benchmark\n```\n\n### Recommender Benchmarks vs LibRecommender\n\n\u003e **Measured with `pytest-benchmark`** (5 rounds, warmed up, GC disabled). MovieLens 100k dataset (943 users, 1,682 items, 100k ratings). Only `model.fit()` is timed — no startup or data loading overhead.\n\n| Benchmark | rusket | LibRecommender | **Speedup** |\n|---|:---:|:---:|:---:|\n| **ALS (Cholesky)** (64 factors, 15 epochs) | **427 ms** | 1,324 ms | **3.1×** |\n| **ALS (eALS)** (64 factors, 15 epochs) | **360 ms** | *N/A* | — |\n| **BPR** (64 factors, 10 epochs) | **33 ms** | 681 ms | **20.4×** |\n| **ItemKNN** (k=100) | **55 ms** | 287 ms | **5.2×** |\n| **SVD** (64 factors, 20 epochs) | **55 ms** | ❌ TF-only (broken) | — |\n| **EASE** | **71 ms** | *N/A* | — |\n\n\u003e **Note:** LibRecommender requires TensorFlow + PyTorch + gensim + Cython (~2 GB of dependencies). 
rusket has **zero runtime dependencies**.\n\n```bash\nuv run pytest benchmarks/bench_pytest_librecommender.py -v --benchmark-columns=mean,stddev,rounds\n```\n\n---\n\n## 🏗 Architecture\n\n### Data Flow\n\n```\npandas dense         ──► np.uint8 array (C-contiguous)  ──► Rust fpgrowth_from_dense\npandas Arrow backend ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense\npandas sparse        ──► CSR int32 arrays               ──► Rust fpgrowth_from_csr\npolars               ──► Arrow → np.uint8 (zero-copy)   ──► Rust fpgrowth_from_dense\nnumpy ndarray        ──► np.uint8 (C-contiguous)        ──► Rust fpgrowth_from_dense\n```\n\nAll mining and rule generation happen **inside Rust**. No Python loops, no round-trips.\n\n### The 1 Billion Row Architecture\n\nTo pass the \"1 Billion Row\" threshold without OOM crashes, `rusket` employs a zero-allocation mining loop:\n- **Eclat Scratch Buffers:** `intersect_count_into` writes intersections directly into thread-local pre-allocated byte buffers and computes `popcnt` in a single pass. 
It implements **early-exit** loop termination the moment it proves a combination cannot reach `min_support`.\n- **FPGrowth Parallel Tree Build:** Conditional FP-trees are collected concurrently inside the rayon parallel mining step, replacing the standard sequential loop and eliminating memory contention bottlenecks.\n- **`AHashMap` Deduplication:** Extremely fast O(N) duplicate basket counting replaces standard O(N log N) unstable sorts in the core pipeline.\n\n\n---\n\n## 🧑‍💻 Development\n\n### Prerequisites\n\n- **Rust** 1.83+ (`rustup update`)\n- **Python** 3.10+\n- [**uv**](https://docs.astral.sh/uv/) (recommended package manager)\n\n### Getting Started\n\n```bash\n# Clone\ngit clone https://github.com/bmsuisse/rusket.git\ncd rusket\n\n# Build Rust extension in dev mode\nuv run maturin develop --release\n\n# Run the full test suite\nuv run pytest tests/ -x -q\n\n# Type-check the Python layer\nuv run pyright rusket/\n\n# Cargo check (Rust)\ncargo check\n```\n\n### Run Examples\n\n```bash\n# Getting started\nuv run python examples/01_getting_started.py\n\n# Market basket analysis with Faker\nuv run python examples/02_market_basket_faker.py\n\n# Polars input\nuv run python examples/03_polars_input.py\n\n# Sparse input\nuv run python examples/04_sparse_input.py\n\n# Large-scale mining (100k+ rows)\nuv run python examples/05_large_scale.py\n\n```\n\n---\n\n## 🤖 AI Disclosure\n\nA large part of this library — including the Rust core algorithms, the Python wrappers, the OOP class hierarchy, and the Spark integration layer — was written with substantial assistance from **AI pair-programming tools** (specifically [Google Gemini / Antigravity](https://deepmind.google/technologies/gemini/)). Human review, benchmarking, and architectural decisions were applied throughout.\n\nWe believe in transparency about AI-assisted development. 
The algorithms are correct, the tests pass, and the performance numbers are real — but if you find a bug or a piece of \"AI slop\", please open an issue!\n\n---\n\n## 📜 License\n\n[MIT License](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbmsuisse%2Frusket","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbmsuisse%2Frusket","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbmsuisse%2Frusket/lists"}