{"id":36975541,"url":"https://github.com/ml-rust/treeboost","last_synced_at":"2026-01-13T22:05:08.469Z","repository":{"id":330813360,"uuid":"1123569079","full_name":"ml-rust/treeboost","owner":"ml-rust","description":"High-performance Gradient Boosted Decision Tree engine for large-scale tabular data","archived":false,"fork":false,"pushed_at":"2026-01-10T20:44:07.000Z","size":3084,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-10T21:33:46.904Z","etag":null,"topics":["automl","era-splitting","gbdt","machine-learning","random-forest","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ml-rust.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-27T06:31:09.000Z","updated_at":"2026-01-10T20:44:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ml-rust/treeboost","commit_stats":null,"previous_names":["ml-rust/treeboost"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ml-rust/treeboost","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Ftreeboost","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Ftreeboost/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Ftreeboost/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Ftreeboost/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ml-rust","download_url":"https://codeload.github.com/ml-rust/treeboost/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ml-rust%2Ftreeboost/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28400445,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-13T14:36:09.778Z","status":"ssl_error","status_checked_at":"2026-01-13T14:35:19.697Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","era-splitting","gbdt","machine-learning","random-forest","rust"],"created_at":"2026-01-13T22:05:07.749Z","updated_at":"2026-01-13T22:05:08.459Z","avatar_url":"https://github.com/ml-rust.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TreeBoost\n\n[![Crates.io](https://img.shields.io/crates/v/treeboost.svg)](https://crates.io/crates/treeboost) [![Docs](https://img.shields.io/docsrs/treeboost)](https://docs.rs/treeboost) [![License: Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)\n\n![TreeBoost](images/treeboost.jpeg)\n\n\u003e **Practical tabular ML for messy, real-world data. Fast baselines first, deep control when you need it.**\n\nTreeBoost is a Rust-first library for tabular machine learning that starts simple and scales to expert use. It is built for the reality of real-world datasets: time series, missing values, drift, mixed feature types, and noisy labels. You get a clean path from “just give me a working model” to full control over training, backends, and constraints.\n\n## At a Glance\n\n- AutoML and AutoTuner for fast, explainable baselines\n- Multi-label, multi-class, and regression in one API\n- Hybrid Linear+Tree mode for trend extrapolation + non-linear interactions\n- Built-in preprocessing that serializes with your model\n- GPU acceleration (WebGPU, CUDA) plus AVX-512/SVE2 CPU backends\n- Zero-copy serialization and incremental TRB updates\n\n## Why TreeBoost\n\nMost libraries are tuned for leaderboard-style modeling. TreeBoost is built for shipping models:\n\n- **Fast baseline in one call** for beginners and teams under time pressure.\n- **White-box AutoML** that explains why it chose a mode and lets you iterate.\n- **Upgradeable control** without rewriting your data pipeline.\n- **Deployment-friendly**: zero-copy serialization, fast inference, and a CLI for batch jobs.\n\n## Three API Levels (Start Simple, Go Deep)\n\n- **AutoModel** — One call trains a solid baseline and produces a model you can ship. Export a `config.json` when you want to improve it later.\n- **UniversalModel** — Choose the learning mode (PureTree, LinearThenTree, RandomForest) and tune it without leaving a high-level API.\n- **GBDTModel** — Lowest-level API for maximum control, backend selection, and benchmarking.\n\nYou can move between these levels without changing your dataset format.\n\n📖 **See [docs/API.md](docs/API.md) for complete API documentation with examples.**\n\n## Recommended Workflow\n\n1. **AutoModel** for a strong baseline and a training report.\n2. **Inspect the report and config** to understand the model choice.\n3. **Refine with UniversalModel or GBDTModel** for extra accuracy, constraints, or incremental updates.\n\n## Quick Start (AutoModel)\n\n```rust\nuse treeboost::{auto_train, AutoModel};\n\nlet model = auto_train(\u0026df, \"target\")?;\nprintln!(\"{}\", model.summary());\n\nmodel.save(\"model.rkyv\")?;\nmodel.save_config(\"config.json\")?;\n```\n\nThis gives you a deployable model plus a config you can tweak later.\n\n## Expected Outputs (After Training)\n\nAfter training, you typically save:\n\n- `model.rkyv` for fast inference and deployment\n- `config.json` for reproducible retraining or fine-tuning\n- `model.trb` if you want incremental updates later\n\n```rust\nlet model = auto_train(\u0026df, \"target\")?;\nmodel.save(\"model.rkyv\")?;\nmodel.save_config(\"config.json\")?;\nmodel.save_trb(\"model.trb\", \"initial training\")?;\n```\n\n**Example `config.json` (abridged):**\n\n```json\n{\n  \"mode\": \"LinearThenTree\",\n  \"num_rounds\": 120,\n  \"learning_rate\": 0.1,\n  \"subsample\": 0.9,\n  \"validation_ratio\": 0.1,\n  \"early_stopping_rounds\": 20,\n  \"linear_rounds\": 10,\n  \"tree_config\": {\n    \"max_depth\": 6,\n    \"max_leaves\": 31,\n    \"lambda\": 1.0,\n    \"min_samples_leaf\": 20,\n    \"colsample\": 0.8\n  },\n  \"linear_config\": {\n    \"lambda\": 1.0,\n    \"l1_ratio\": 0.0,\n    \"shrinkage_factor\": 0.3\n  }\n}\n```\n\nThe actual file includes the full set of tree/linear fields so you can tweak every detail.\n\n## Inference (Simple and Fast)\n\n```rust\nuse treeboost::UniversalModel;\nuse treeboost::dataset::DatasetLoader;\n\n// Use either a static model (.rkyv) or incremental model (.trb)\nlet model = UniversalModel::load(\"model.rkyv\")?;\n// let model = UniversalModel::load_trb(\"model.trb\")?;\n\nlet loader = DatasetLoader::new(255);\nlet dataset = loader.load_parquet(\"new_data.parquet\", \"target\", None)?;\n\nlet predictions = model.predict(\u0026dataset);\n```\n\n## Quick Start (UniversalModel)\n\n```rust\nuse treeboost::{UniversalConfig, UniversalModel, BoostingMode};\nuse treeboost::dataset::DatasetLoader;\nuse treeboost::loss::MseLoss;\n\nlet loader = DatasetLoader::new(255);\nlet dataset = loader.load_parquet(\"data.parquet\", \"target\", None)?;\n\nlet config = UniversalConfig::new()\n    .with_mode(BoostingMode::LinearThenTree)\n    .with_num_rounds(100)\n    .with_linear_rounds(10)\n    .with_learning_rate(0.1);\n\nlet model = UniversalModel::train(\u0026dataset, config, \u0026MseLoss)?;\nlet predictions = model.predict(\u0026dataset);\n```\n\n**Quick mode selection:**\n\n| Your Data                                  | Use This Mode                  |\n| ------------------------------------------ | ------------------------------ |\n| General tabular, categoricals              | `BoostingMode::PureTree`       |\n| Time-series, trending, needs extrapolation | `BoostingMode::LinearThenTree` |\n| Noisy data, want robustness                | `BoostingMode::RandomForest`   |\n\n## Quick Start (Multi-Label Classification)\n\nFor problems where each sample can belong to multiple independent labels (e.g., multi-tag classification, multiple disease diagnoses):\n\n```rust\nuse treeboost::AutoModel;\n\n// Train multi-label model (default: LinearThenTree mode)\nlet target_cols = vec![\"tag_finance\", \"tag_tech\", \"tag_urgent\"];\nlet model = AutoModel::train_multilabel(\u0026df, \u0026target_cols)?;\n\n// Predictions with probabilities\nlet probs = model.predict_proba_multilabel(\u0026df)?;  // Vec\u003cVec\u003cf32\u003e\u003e\n\n// Predictions with labels (threshold 0.5)\nlet labels = model.predict_labels(\u0026df)?;  // Vec\u003cVec\u003cbool\u003e\u003e\n\n// Tune thresholds on validation set for better F1 scores\nlet mut model = AutoModel::train_multilabel(\u0026df, \u0026target_cols)?;\nlet tune_result = model.tune_thresholds(\u0026val_df, \u0026target_cols)?;\nlet labels = model.predict_labels_tuned(\u0026df)?;  // Uses tuned thresholds\n```\n\n## Quick Start (GBDTModel)\n\n```rust\nuse treeboost::{GBDTConfig, GBDTModel};\n\nlet config = GBDTConfig::new()\n    .with_num_rounds(200)\n    .with_max_depth(6)\n    .with_learning_rate(0.05);\n\nlet model = GBDTModel::train(\u0026features, num_features, \u0026targets, config, None)?;\n```\n\n## Python (GBDTModel)\n\n```python\nimport numpy as np\nfrom treeboost import GBDTConfig, GBDTModel\n\nX = np.random.randn(10000, 20).astype(np.float32)\ny = (X[:, 0] + X[:, 1] * 2 + np.random.randn(10000) * 0.1).astype(np.float32)\n\nconfig = GBDTConfig()\nconfig.num_rounds = 100\nconfig.max_depth = 6\nconfig.learning_rate = 0.1\n\nmodel = GBDTModel.train(X, y, config)\n```\n\n## What You Get\n\n- **AutoML mode selection** that evaluates probes and explains its choice.\n- **Hybrid Linear+Tree architecture** for trend extrapolation and interactions.\n- **Built-in preprocessing**: encoders, scalers, and imputers that serialize with the model.\n- **Linear Trees** for piecewise-linear data with far fewer trees.\n- **Conformal prediction** for uncertainty intervals.\n- **Incremental learning** via TRB format with drift detection.\n\n## Advanced Features\n\nTreeBoost includes battle-tested capabilities for real-world deployments.\n\n### Feature Matrix\n\n| Category           | Capability                        | Use Case                              | API Entry Point                              |\n| ------------------ | --------------------------------- | ------------------------------------- | -------------------------------------------- |\n| **Classification** | Multi-Label Classification        | Multi-tag prediction, multi-diagnosis | `AutoModel::train_multilabel()`              |\n|                    | Threshold Tuning                  | Per-label F1 optimization             | `model.tune_thresholds()`                    |\n|                    | Multi-Class Classification        | Single category prediction            | `AutoModel::train()` with softmax loss       |\n| **Model Updates**  | Incremental Learning (TRB format) | Daily model updates, streaming data   | `UniversalModel::update()`                   |\n|                    | O(1) Append Updates               | Efficient model versioning            | `save_trb_update()`                          |\n|                    | Memory-Mapped I/O                 | Large model inference                 | `MmapTrbReader` (mmap feature)               |\n| **Monitoring**     | Drift Detection (PSI, KL, KS)     | Distribution shift alerts             | `IncrementalDriftDetector`                   |\n|                    | Drift History Tracking            | Long-term monitoring                  | `DriftHistory`                               |\n| **Ensembles**      | Multi-Seed Training               | Variance reduction                    | `with_ensemble_seeds()`                      |\n|                    | Stacked Blending                  | Meta-learner combination              | `StackingStrategy::Ridge`                    |\n| **Constraints**    | Monotonic Constraints             | Domain knowledge enforcement          | `TreeConfig::with_monotonic_constraints()`   |\n|                    | Interaction Constraints           | Feature interaction control           | `TreeConfig::with_interaction_constraints()` |\n| **Encoding**       | Ordered Target Encoding           | High-cardinality categoricals         | `OrderedTargetEncoder`                       |\n|                    | Count-Min Sketch Filtering        | Rare category handling                | `CategoryFilter`                             |\n| **Features**       | Time-Series (Lag/Rolling/EWMA)    | Panel data, forecasting               | `LagGenerator`, `RollingGenerator`           |\n|                    | Cross-Sectional (Poly/Ratio)      | Feature engineering                   | `PolynomialGenerator`, `RatioGenerator`      |\n| **Preprocessing**  | Incremental Scaler (Welford)      | Adaptive preprocessing                | `StandardScaler::with_forget_factor()`       |\n|                    | Outlier Detection (IQR/Z-score)   | Robust pipelines                      | `OutlierDetector`, `RobustScaler`            |\n| **Uncertainty**    | Split Conformal Prediction        | Distribution-free intervals           | `GBDTConfig::with_conformal()`               |\n\n### Example: Incremental Learning Workflow\n\n```rust\nuse treeboost::{AutoModel, UniversalModel};\nuse treeboost::monitoring::IncrementalDriftDetector;\nuse treeboost::loss::MseLoss;\n\n// 1. Initial training\nlet auto = AutoModel::train(\u0026df, \"target\")?;\nauto.inner().save_trb(\"model.trb\", \"Initial training\")?;\n\n// 2. Production: Load and monitor for drift\nlet mut model = UniversalModel::load_trb(\"model.trb\")?;\nlet detector = IncrementalDriftDetector::from_dataset(\u0026train_data);\n\n// 3. Before updating, check for drift\nlet result = detector.check_update(\u0026new_data);\nif !result.has_critical_drift() {\n    let report = model.update(\u0026new_data, \u0026MseLoss, 10)?;\n    model.save_trb_update(\"model.trb\", new_data.num_rows(), \"Weekly update\")?;\n} else {\n    eprintln!(\"Critical drift detected: {}\", result);\n}\n```\n\n### Why These Features Matter\n\n- **Incremental Learning**: Update models in O(new_data) instead of O(total_data) - essential for daily retraining\n- **Drift Detection**: Catch distribution shifts before they degrade model performance\n- **Ensemble Methods**: Reduce variance and improve stability in noisy environments\n- **Constraints**: Enforce domain knowledge (e.g., \"age must increase risk\") for trust and interpretability\n- **High-Cardinality Encoding**: Handle millions of categories without memory explosion\n- **Time-Series Features**: Automatic lag/rolling/EWMA generation for panel data\n- **Conformal Prediction**: Valid uncertainty estimates regardless of data distribution\n\n📖 **For detailed API documentation with examples, see [docs/API.md](docs/API.md)**\n\n## Backends (Automatic by Default)\n\nTreeBoost auto-selects the fastest backend. You can override it if needed.\n\n```rust\nuse treeboost::{GBDTConfig, GBDTModel};\nuse treeboost::backend::BackendType;\n\nlet config = GBDTConfig::new()\n    .with_backend(BackendType::Scalar);\n\nlet model = GBDTModel::train(\u0026features, num_features, \u0026targets, config, None)?;\n```\n\nSupported backends: Scalar, AVX-512, SVE2, WGPU, CUDA.\n\n## CLI\n\n```bash\n# Train a model\ntreeboost train --data data.csv --target price --output model.rkyv\n\n# Predict\ntreeboost predict --model model.rkyv --data test.csv --output predictions.json\n\n# Predict (.trb)\ntreeboost predict --model model.trb --data test.csv --output predictions.json\n```\n\nRun `treeboost --help` for full options.\n\n## Installation\n\n```bash\ncargo add treeboost\n```\n\n```bash\n# Python bindings (requires Rust toolchain + maturin)\npip install treeboost\n```\n\nFeature flags: `gpu`, `cuda`, `mmap`, `python`.\n\n## Project Links\n\n- **API Reference**: [docs/API.md](docs/API.md) - Complete API documentation with examples\n- Docs: https://docs.rs/treeboost\n- Crate: https://crates.io/crates/treeboost\n- GitHub: https://github.com/ml-rust/treeboost\n\n## License\n\nApache License 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-rust%2Ftreeboost","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fml-rust%2Ftreeboost","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fml-rust%2Ftreeboost/lists"}