{"id":48529997,"url":"https://github.com/jzombie/rust-triplets","last_synced_at":"2026-04-07T23:32:34.773Z","repository":{"id":340165162,"uuid":"1164849445","full_name":"jzombie/rust-triplets","owner":"jzombie","description":"Composable data sampling primitives for deterministic multi-source ML/AI training-data orchestration.","archived":false,"fork":false,"pushed_at":"2026-03-30T02:25:40.000Z","size":471,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-30T05:25:48.979Z","etag":null,"topics":["algorithms","artificial-intelligence","bm25","dataset-sampling","science","text-processing","train-test-split","training-data","triplet-mining"],"latest_commit_sha":null,"homepage":"https://crates.io/crates/triplets","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jzombie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-23T14:54:44.000Z","updated_at":"2026-03-30T02:25:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jzombie/rust-triplets","commit_stats":null,"previous_names":["jzombie/rust-triplets"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jzombie/rust-triplets","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jzombie%2Frust-triplets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jzombie%2Frust-triplets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jzombie%2Frust-triplets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jzombie%2Frust-triplets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jzombie","download_url":"https://codeload.github.com/jzombie/rust-triplets/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jzombie%2Frust-triplets/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31215262,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-30T15:24:02.938Z","status":"ssl_error","status_checked_at":"2026-03-30T15:23:44.804Z","response_time":138,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithms","artificial-intelligence","bm25","dataset-sampling","science","text-processing","train-test-split","training-data","triplet-mining"],"created_at":"2026-04-07T23:32:34.632Z","updated_at":"2026-04-07T23:32:34.753Z","avatar_url":"https://github.com/jzombie.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ch1 align=\"center\"\u003e⛏️ triplets\u003c/h1\u003e\n  \u003cp align=\"center\"\u003e\u003cstrong\u003eComposable data sampling primitives for deterministic multi-source ML/AI training-data orchestration.\u003c/strong\u003e\u003c/p\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ca href=\"#getting-started\"\u003eGetting Started\u003c/a\u003e \u0026middot;\n    \u003ca href=\"#cargo-features\"\u003eCargo Features\u003c/a\u003e \u0026middot;\n    \u003ca href=\"#configuring-sources\"\u003eSources\u003c/a\u003e \u0026middot;\n    \u003ca href=\"#sampling-and-mixing\"\u003eSampling \u0026amp; Mixing\u003c/a\u003e \u0026middot;\n    \u003ca href=\"#epochs-and-determinism\"\u003eEpochs\u003c/a\u003e \u0026middot;\n    \u003ca href=\"#license\"\u003eLicense\u003c/a\u003e\n  \u003c/p\u003e\n  \u003cp align=\"center\"\u003e\n    \u003ca href=\"https://www.rust-lang.org/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Made%20with-Rust-black\" alt=\"Made with Rust\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://crates.io/crates/triplets\"\u003e\u003cimg src=\"https://img.shields.io/crates/v/triplets.svg\" alt=\"crates.io\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/jzombie/rust-triplets/blob/main/LICENSE-MIT\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-blue.svg\" alt=\"MIT licensed\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/jzombie/rust-triplets/blob/main/LICENSE-APACHE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache%202.0-blue.svg\" alt=\"Apache 2.0 licensed\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://coveralls.io/github/jzombie/rust-triplets?branch=main\"\u003e\u003cimg src=\"https://coveralls.io/repos/github/jzombie/rust-triplets/badge.svg?branch=main\" alt=\"Coverage Status\"\u003e\u003c/a\u003e\n    \u003cbr\u003e\u003csub\u003e\u003cem\u003eTested on macOS, Linux, and Windows.\u003c/em\u003e\u003c/sub\u003e\n  \u003c/p\u003e\n\u003c/p\u003e\n\n---\n\nGenerate an effectively unlimited stream of [training triplets](https://en.wikipedia.org/wiki/Triplet_loss), pairs, or plaintext samples from your existing corpus. This crate handles ingestion, multi-source mixing, deterministic train/validation/test splitting, and optional [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) hard-negative mining.\n\n**Designed as a data-pipeline layer for a training loop.**\n\n\u003e A training loop has two halves: the *data side* and the *model side*. `triplets` owns the data side — deterministic and reproducible train/validation/test splitting, seeded shuffling across epochs, weighted multi-source mixing, BM25 hard-negative mining, and static per-record KVP metadata for input conditioning. What it intentionally does *not* include is the model side: forward passes, loss computation, and optimizer steps. The design goal is that you plug this crate's output stream directly into your training framework (crates like [Candle](https://github.com/huggingface/candle), [burn](https://crates.io/crates/burn), [tch](https://crates.io/crates/tch), [PyO3](https://crates.io/crates/pyo3)) and it already handles the parts of the data pipeline that are hardest to get right — correctness, reproducibility, and scale.\n\n**Work in progress.**\n\n## Overview\n\nIn metric learning and language model training, a **triplet** consists of an **anchor**, a **positive** example (similar to the anchor), and a **negative** example (dissimilar to the anchor).\n\n`triplets` provides a high-throughput streaming pipeline to:\n1. **Ingest** data from local text/CSV files, Hugging Face, or custom backends.\n2. **Mix** sources with configurable weights to balance your training data.\n3. **Split** data deterministically into train, validation, and test sets.\n4. **Sample** triplets or pairs using rule-based \"recipes\".\n5. **Mine** hard negatives using BM25 to improve model discrimination.\n\n```text\n      Anchor\n      /    \\\n Positive Negative\n\n Triplet: (Anchor, Positive, Negative)\n```\n\n## Getting Started\n\nA `TripletSampler` needs a `SplitStore` for record-to-split assignments and a `SamplerConfig` for runtime behavior.\n\n```rust\nuse std::sync::Arc;\nuse triplets::{\n    BatchPrefetcher, SamplerConfig, TripletSampler, TripletBatch,\n    SplitRatios, DeterministicSplitStore, SplitLabel,\n};\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    // 1. Define your train/validation/test ratios (e.g., 80/10/10).\n    let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\n\n    // 2. Initialize a deterministic split store.\n    // The seed ensures record IDs are always assigned to the same split.\n    let seed = 42;\n    let store = Arc::new(DeterministicSplitStore::new(ratios, seed)?);\n\n    // 3. Create the sampler wrapped in Arc — required for prefetching.\n    let sampler = Arc::new(TripletSampler::new(SamplerConfig::default(), store));\n\n    // 4. Register one or more sources (CSV, text files, Hugging Face, or custom).\n    //    See the [Configuring Sources](#configuring-sources) section for full examples.\n    //    sampler.register_source(Box::new(my_source));\n\n    // 5. Spawn a background prefetcher with a queue depth of 4.\n    //    The worker thread starts filling the queue immediately; your training\n    //    loop calls prefetcher.next() and blocks only when the queue is empty.\n    let prefetcher: BatchPrefetcher\u003cTripletBatch\u003e =\n        Arc::clone(\u0026sampler).prefetch_triplet_batches(SplitLabel::Train, 4);\n\n    // 6. Pull batches in your training loop.\n    for _step in 0..10 {\n        let batch = prefetcher.next()?;\n        for triplet in batch.triplets {\n            println!(\"anchor:   {}\", triplet.anchor.text);\n            println!(\"positive: {}\", triplet.positive.text);\n            println!(\"negative: {}\", triplet.negative.text);\n        }\n    }\n    // The prefetcher's background thread shuts down automatically when dropped.\n\n    Ok(())\n}\n```\n\n## Cargo Features\n\n| Feature            | What it enables                                                               | Default |\n| ------------------ | ----------------------------------------------------------------------------- | ------- |\n| `huggingface`      | [Streaming from Hugging Face dataset repositories.](#hugging-face-source)     | No      |\n| `bm25-mining`      | [BM25 hard-negative ranking within strategy-defined pools.](#negative-mining) | No      |\n| `extended-metrics` | Additional per-triplet diagnostics for debugging.                             | No      |\n\n\u003e _[CSV](#csv-source), [text file](#text-file-source), and [custom source](#custom-source) support are enabled in all builds._\n\n## Configuring Sources\n\n### Hugging Face Source\n\nStreams rows directly from the Hugging Face Hub without requiring a full dataset download. Map dataset columns to anchor, positive, or plain-text roles the same way as the CSV source.\n\n```rust,no_run\n#[cfg(feature = \"huggingface\")]\n{\n    use std::sync::Arc;\n    use triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, Sampler};\n    use triplets::{HuggingFaceRowSource, HuggingFaceRowsConfig};\n\n    fn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n        let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\n        let store = Arc::new(DeterministicSplitStore::new(ratios, 42)?);\n        let mut sampler = TripletSampler::new(SamplerConfig::default(), store);\n        // Configure the source to pull the \"train\" split of a dataset.\n        // Note: While we specify \"train\" here as the ingestion source, the crate\n        // automatically handles its own deterministic split assignments (train/val/test)\n        // at the record level across all loaded data.\n        let config = HuggingFaceRowsConfig::new(\n            \"hf_finance\",          // Source identifier\n            \"financial_phrasebank\", // HF Dataset name\n            \"default\",             // Dataset config\n            \"train\",               // Dataset split\n            \"cache/hf_snapshots\"   // Local cache for downloaded shards\n        );\n\n        let source = HuggingFaceRowSource::new(config)?;\n        sampler.register_source(Box::new(source));\n        Ok(())\n    }\n}\n```\n\n#### Column Mapping Modes\n\nThe HF source supports two exclusive extraction modes, selected by which fields are populated on `HuggingFaceRowsConfig`:\n\n**Role mode** — activated when `anchor_columns`, `positive_columns`, or `context_columns` is non-empty. Each row produces a `DataRecord` with explicitly assigned section roles:\n\n| Config field       | Coalesces? | `SectionRole` produced          | Behaviour when missing / empty                   |\n| ------------------ | ---------- | ------------------------------- | ------------------------------------------------ |\n| `anchor_columns`   | Yes        | `Anchor`                        | Row is skipped                                   |\n| `positive_columns` | Yes        | `Context`                       | Row is skipped                                   |\n| `context_columns`  | No         | `Context` (one section per col) | Row is skipped if **any** column is absent/blank |\n\n*Coalescing* means multiple candidate column names can be supplied; the first with a non-empty value is used and the rest are ignored. `context_columns` does **not** coalesce — every listed column is strictly required and each contributes its own independent section.\n\n**Text mode** — used when `anchor_columns` is empty and `text_columns` is non-empty. The first non-empty candidate column supplies the sole content for the row. This is the SimCSE-style path where the model learns from augmented views of the same text.\n\n##### Role mode: three-column datasets (question / answer / context)\n\nDatasets that pair a question with both an answer and a passage of supporting context — common in RAG evaluation sets — can be ingested with a single source-list line:\n\n```\n# in hf_sources.txt\nhf://zeitgeist-ai/financial-rag-nvidia-sec/default/train anchor=question positive=answer context=context\n```\n\nOr programmatically via `context_columns`:\n\n```rust,no_run\n#[cfg(feature = \"huggingface\")]\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    use triplets::{HuggingFaceRowSource, HuggingFaceRowsConfig};\n\n    let mut config = HuggingFaceRowsConfig::new(\n        \"hf_fin_rag\",\n        \"zeitgeist-ai/financial-rag-nvidia-sec\",\n        \"default\",\n        \"train\",\n        \"cache/hf_snapshots\",\n    );\n    config.anchor_columns   = vec![\"question\".to_string()];\n    config.positive_columns = vec![\"answer\".to_string()];\n    config.context_columns  = vec![\"context\".to_string()];\n\n    let source = HuggingFaceRowSource::new(config)?;\n    let _ = source;\n    Ok(())\n}\n```\n\nEach ingested row produces a `DataRecord` with three sections in declaration order:\n\n| Section | Source column | `SectionRole` |\n| ------- | ------------- | ------------- |\n| 0       | `question`    | `Anchor`      |\n| 1       | `answer`      | `Context`     |\n| 2       | `context`     | `Context`     |\n\nBecause both the positive column and every context column are emitted as `SectionRole::Context` sections, a recipe using `Selector::Role(SectionRole::Context)` will see all of them as candidates.\n\n\u003e **Row-skipping**: if any column listed in `context_columns` is absent from a row or contains an empty string, that row is silently dropped. This hard requirement prevents partially-populated rows from appearing in training batches. `anchor_columns` and `positive_columns` behave the same way — a row is skipped if the coalesced result is empty.\n\nMultiple context columns are supported and each produces its own section, in the order they are declared:\n\n```\nhf://my-org/my-dataset/default/train anchor=title positive=summary context=body,tags\n```\n\n#### Source-list file format\n\nWhen using `build_hf_sources` / `load_hf_sources_from_list`, sources are described one per line in a plain-text file. Lines starting with `#` are comments; blank lines are ignored.\n\n```\nhf://\u003corg\u003e/\u003cdataset\u003e/\u003cconfig\u003e/\u003csplit\u003e  key=value  [key=value ...]\n```\n\nEvery accepted key and its semantics:\n\n| Key                       | Value                       | Accepts commas? | Required?                                                              | Description                                                                                                                                                                              |\n| ------------------------- | --------------------------- | --------------- | ---------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `anchor=`                 | one or more column names    | Yes             | At least one of `anchor`, `positive`, `context`, or `text` is required | Activates role mode. Columns are tried in order; the first non-empty value is used as the `Anchor` section. Row skipped if all candidates are absent/empty.                              |\n| `positive=`               | one or more column names    | Yes             | No                                                                     | Activates role mode. Columns are tried in order; the first non-empty value becomes a `Context` section. Row skipped if all candidates are absent/empty.                                  |\n| `context=`                | one or more column names    | Yes             | No                                                                     | Activates role mode. Every listed column is required — if any is absent or blank the row is dropped. Each column becomes its own `Context` section, in declaration order. No coalescing. |\n| `text=` / `text_columns=` | one or more column names    | Yes             | At least one mapping key is required                                   | Activates text mode (SimCSE). Columns are tried in order; the first non-empty value is the sole content of the record. Ignored when role mode is active. Both spellings are equivalent.  |\n| `trust=`                  | float in `[0.0, 1.0]`       | No              | No (default: `0.5`)                                                    | Overrides the quality trust score stamped on every record produced by this source. Out-of-range values or non-float strings are hard errors at parse time.                               |\n| `source_id=`              | non-empty identifier string | No              | No (auto-derived when absent)                                          | Overrides the automatically generated source identifier. Must not be empty.                                                                                                              |\n\n**Auto-derived `source_id`**\n\nWhen `source_id=` is omitted, an identifier is derived from the URI:\n\n1. The short dataset name (the part after the last `/` in the org/dataset pair) is taken as the base.\n2. If the config is not `\"default\"`, it is appended as `.config`.\n3. If the split is not `\"train\"`, it is appended as `.split`.\n4. Special characters are sanitized to underscores.\n5. If two sources produce the same auto-slug, `.{index}` is appended to the second and subsequent collisions.\n\nExamples: `hf://org/wikipedia/20231101.en/train` → `wikipedia.20231101_en`; `hf://org/dataset/default/validation` → `dataset.validation`.\n\n**Error behaviour**\n\nUnknown keys (including typos such as `positve=`) are hard errors — the parser rejects the line immediately rather than silently ignoring the key. This prevents misconfigured sources from being silently loaded with missing column mappings. A line with no recognised mapping key (`anchor=`, `positive=`, `context=`, or `text=`) is also rejected.\n\n#### Authenticating with Private Datasets\n\nTo access private or gated datasets set the `HF_TOKEN` environment variable to a valid\nHugging Face API token. Tokens with at least **read** scope are sufficient and can be\ngenerated at \u003chttps://huggingface.co/settings/tokens\u003e.\n\nWhen `HF_TOKEN` is set to a non-empty value, `HuggingFaceRowsConfig::new()` picks it up\nautomatically and sends it as a `Bearer` credential on every API request and shard\ndownload. If the token is invalid or expired, `HuggingFaceRowSource::new()` returns an\nerror immediately rather than silently degrading later.\n\n| Platform                 | Command                                                |\n| ------------------------ | ------------------------------------------------------ |\n| macOS / Linux            | `export HF_TOKEN=\"hf_...\"`                             |\n| Windows — Command Prompt | `set HF_TOKEN=hf_...`                                  |\n| Windows — PowerShell     | `$env:HF_TOKEN = \"hf_...\"`                             |\n| Windows — persistent     | *System Properties → Advanced → Environment Variables* |\n\nThe token can also be set programmatically on the config struct if you prefer not to rely on\nthe process environment:\n\n```rust,no_run\n#[cfg(feature = \"huggingface\")]\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    use triplets::{HuggingFaceRowSource, HuggingFaceRowsConfig};\n\n    let mut config = HuggingFaceRowsConfig::new(\n        \"private_dataset\",\n        \"my-org/private-dataset\",\n        \"default\",\n        \"train\",\n        \"cache/hf_snapshots\",\n    );\n    // Override after construction (or set HF_TOKEN env var before calling new()).\n    config.hf_token = Some(\"hf_...\".to_string());\n    // new() validates the token immediately; an invalid token returns an error.\n    let source = HuggingFaceRowSource::new(config)?;\n    let _ = source;\n    Ok(())\n}\n```\n\n\u003e **Security**: never commit tokens to source control. Use environment variables, a secrets\n\u003e manager, or a credential file listed in `.gitignore`.\n\n### CSV Source\n\nLoad rows from a CSV file with explicit column mappings. The file **must have a named header row** — columns are always selected by name. Supports two modes:\n\n- **Role mode** — map separate columns to anchor and positive (context) roles.\n- **Text mode** — map a single column for SimCSE-style contrastive pre-training.\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore};\nuse triplets::source::{CsvSource, CsvSourceConfig};\n\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\n\n// Role mode: map \"question\" → anchor, \"answer\" → positive.\nlet config = CsvSourceConfig::new(\"qna\", \"data/qna.csv\")\n    .with_anchor_column(\"question\")\n    .with_positive_column(\"answer\")\n    .with_trust(0.9);\nlet source = CsvSource::new(config).unwrap();\nsampler.register_source(Box::new(source));\n\n// Text mode (SimCSE): single column used for both anchor and context.\nlet config2 = CsvSourceConfig::new(\"corpus\", \"data/corpus.csv\")\n    .with_text_column(\"text\");\nlet source2 = CsvSource::new(config2).unwrap();\nsampler.register_source(Box::new(source2));\n```\n\nRows with empty required fields are skipped. Column name matching is case-insensitive.\n\n### Text File Source\n\nRecursively indexes plain-text files from a directory. Each file's stem (filename without extension) becomes the **anchor** and its body content becomes the **context**. Useful for local corpora where files are already titled meaningfully.\n\n```rust\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore};\nuse triplets::source::{FileSource, FileSourceConfig};\n\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\n// Point at a directory; all text files are indexed recursively.\n// The filename stem is the anchor; the file body is the context.\nlet config = FileSourceConfig::new(\"docs\", \"./data/corpus\")\n    .with_text_files_only(true)\n    .with_trust(0.9); // Assign a quality score to this source\n\nlet source = FileSource::new(config);\nsampler.register_source(Box::new(source));\n```\n\n### Custom Source\n\nImplement the `IndexableSource` trait to integrate any backend that can fetch records by a stable integer index.\n\n```rust\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore};\nuse chrono::Utc;\nuse triplets::{DataRecord, SamplerError};\nuse triplets::data::{RecordSection, SectionRole};\nuse triplets::source::{IndexableSource, IndexableAdapter};\n\nstruct MyApiSource;\n\nimpl IndexableSource for MyApiSource {\n    fn id(\u0026self) -\u003e \u0026str { \"api_source\" }\n    fn len_hint(\u0026self) -\u003e Option\u003cusize\u003e { Some(1000) }\n    fn record_at(\u0026self, idx: usize) -\u003e Result\u003cOption\u003cDataRecord\u003e, SamplerError\u003e {\n        // Fetch record 'idx' from your database or API.\n        // Return Ok(None) to skip a record (e.g. deleted rows or filtered entries).\n        Ok(Some(DataRecord {\n            id: format!(\"api_{idx}\"),\n            source: self.id().into(),\n            created_at: Utc::now(),\n            updated_at: Utc::now(),\n            quality: Default::default(),\n            // Optional free-form tags for filtering or recipe targeting.\n            // Examples: domain labels, year strings, content-type markers.\n            taxonomy: vec![\"finance\".into(), \"2025\".into()],\n            // Each section represents one logical view of the record's content.\n            // SectionRole::Anchor  — the primary subject text (e.g. a question, title, or key passage).\n            // SectionRole::Context — supporting or related text (e.g. an answer, body, or description).\n            // Recipes select sections by role: Selector::Role(SectionRole::Anchor / Context).\n            //\n            // `sentences` is an optional pre-split list of individual sentences within `text`.\n            // Providing it gives the chunker more accurate boundaries when creating token windows.\n            // Leave it as vec![] and the chunker will split `text` automatically.\n            sections: vec![\n                RecordSection {\n                    role: SectionRole::Anchor,\n                    heading: Some(\"Title\".into()),\n                    text: format!(\"Primary content for record {idx}.\"),\n                    sentences: vec![], // or: vec![\"Sentence one.\".into(), \"Sentence two.\".into()]\n                },\n                RecordSection {\n                    role: SectionRole::Context,\n                    heading: None,\n                    text: format!(\"Supporting context for record {idx}.\"),\n                    sentences: vec![],\n                },\n            ],\n            // Optional: attach a KvpPrefixSampler to inject structured key-value\n            // metadata into sampled chunk text at training time. For example:\n            //\n            //   meta: source=api | date=2025-01-01\n            //   \u003cactual chunk text\u003e\n            //\n            // The sampler controls dropout (how often the prefix appears) and\n            // per-field presence probability, so the model learns to handle both\n            // prefixed and plain chunks. See the \"Metadata Prefixes and Tag Dropout\"\n            // section for full usage.\n            meta_prefix: None,\n        }))\n    }\n}\n\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\nlet adapter = IndexableAdapter::new(MyApiSource);\nsampler.register_source(Box::new(adapter));\n```\n\n## Sampling and Mixing\n\n### Weighted Sampling\n\nAdjust per-source sampling frequency to handle class imbalance or dataset quality differences.\n\n```rust,no_run\nuse std::sync::Arc;\nuse std::collections::HashMap;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, SplitLabel, Sampler};\nuse triplets::source::{CsvSource, CsvSourceConfig, FileSource, FileSourceConfig};\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\n    let store = Arc::new(DeterministicSplitStore::new(ratios, 42)?);\n    let mut sampler = TripletSampler::new(SamplerConfig::default(), store);\n\n    // Source 1: structured Q\u0026A pairs from a CSV file.\n    // Each row maps a \"question\" column → anchor, \"answer\" column → positive.\n    let csv_config = CsvSourceConfig::new(\"hf_finance\", \"data/finance_qa.csv\")\n        .with_anchor_column(\"question\")\n        .with_positive_column(\"answer\")\n        .with_trust(0.9);\n    sampler.register_source(Box::new(CsvSource::new(csv_config)?));\n\n    // Source 2: local plain-text corpus of internal documentation.\n    // Files are indexed recursively; filename stem → anchor, body → context.\n    let file_config = FileSourceConfig::new(\"docs\", \"./data/internal_docs\")\n        .with_text_files_only(true)\n        .with_trust(0.7); // lower trust — unreviewed internal docs\n    sampler.register_source(Box::new(FileSource::new(file_config)));\n\n    // Override the mixing ratio for this batch: pull from the high-quality\n    // CSV source 70% of the time and the local docs 30% of the time.\n    // Sources not listed here fall back to uniform sampling.\n    let mut weights = HashMap::new();\n    weights.insert(\"hf_finance\".to_string(), 0.7);\n    weights.insert(\"docs\".to_string(), 0.3);\n\n    let batch = sampler.next_triplet_batch_with_weights(SplitLabel::Train, \u0026weights)?;\n    Ok(())\n}\n```\n\n### Recipe Selection Weights\n\nThe `weight` field on `TripletRecipe` controls **how often a recipe is selected** relative to other active recipes. The sampler expands each recipe into a proportional number of selection slots, shuffles them, and cycles through — so a recipe with `weight = 3.0` is drawn approximately three times as often as one with `weight = 1.0`.\n\n| `weight` value                            | Effect                                                                                                  |\n| ----------------------------------------- | ------------------------------------------------------------------------------------------------------- |\n| Equal across all recipes (e.g. all `1.0`) | Uniform round-robin — each recipe is selected equally often (default behavior).                         |\n| `2.0` vs `1.0`                            | The `2.0` recipe is tried ~2× as often per batch.                                                       |\n| `0.0` or negative                         | Recipe is **excluded entirely** — useful for disabling a recipe without removing it from configuration. |\n\n```rust,no_run\nuse triplets::{SamplerConfig, TripletRecipe, NegativeStrategy, Selector, SectionRole};\n\nlet config = SamplerConfig {\n    recipes: vec![\n        // High-signal structured pairs: tried 3× as often as the fallback.\n        TripletRecipe {\n            name: \"structured\".into(),\n            anchor: Selector::Role(SectionRole::Anchor),\n            positive_selector: Selector::Role(SectionRole::Context),\n            negative_selector: Selector::Random,\n            negative_strategy: NegativeStrategy::WrongArticle,\n            weight: 3.0,\n            instruction: None, // See the Instruction Tuning section to attach a task prompt.\n            allow_same_anchor_positive: false,\n        },\n        // Fallback recipe with random chunk selection.\n        TripletRecipe {\n            name: \"random_fallback\".into(),\n            anchor: Selector::Random,\n            positive_selector: Selector::Random,\n            negative_selector: Selector::Random,\n            negative_strategy: NegativeStrategy::WrongArticle,\n            weight: 1.0,\n            instruction: None,\n            allow_same_anchor_positive: false,\n        },\n        // Disabled recipe — excluded from sampling until weight is set above zero.\n        TripletRecipe {\n            name: \"experimental\".into(),\n            anchor: Selector::Random,\n            positive_selector: Selector::Random,\n            negative_selector: Selector::Random,\n            negative_strategy: NegativeStrategy::WrongArticle,\n            weight: 0.0,\n            instruction: None,\n            allow_same_anchor_positive: false,\n        },\n    ],\n    ..SamplerConfig::default()\n};\n```\n\n\u003e **Sampling frequency vs. output score**: `TripletRecipe::weight` controls how often the recipe is *selected*. It is also one factor in the output `SampleTriplet::weight`, but the two serve different roles — see [Output Format](#output-format) below.\n\n### Instruction Tuning\n\nThe `instruction` field on `TripletRecipe` attaches a static task prompt to every triplet, pair, or text sample produced by that recipe. It is copied verbatim into `SampleTriplet::instruction` (and the equivalent field on `SamplePair` / `TextSample`) so your training loop can prepend it to the anchor text before passing it to the model.\n\nThis lets different recipes express different task hypotheses over the same underlying data — for example, a retrieval recipe and a similarity recipe can share the same source but carry different prompts:\n\n```rust,no_run\nuse triplets::{SamplerConfig, TripletRecipe, NegativeStrategy, Selector, SectionRole};\n\nlet config = SamplerConfig {\n    recipes: vec![\n        // Retrieval recipe: every triplet from this recipe carries a task prompt.\n        TripletRecipe {\n            name: \"retrieval\".into(),\n            anchor: Selector::Role(SectionRole::Anchor),\n            positive_selector: Selector::Role(SectionRole::Context),\n            negative_selector: Selector::Random,\n            negative_strategy: NegativeStrategy::WrongArticle,\n            weight: 1.0,\n            instruction: Some(\"Retrieve a passage that answers the question:\".into()),\n            allow_same_anchor_positive: false,\n        },\n        // Plain contrastive recipe: no prompt — model sees bare chunk text.\n        TripletRecipe {\n            name: \"similarity\".into(),\n            anchor: Selector::Role(SectionRole::Context),\n            positive_selector: Selector::Role(SectionRole::Context),\n            negative_selector: Selector::Random,\n            negative_strategy: NegativeStrategy::WrongArticle,\n            weight: 1.0,\n            instruction: None,\n            allow_same_anchor_positive: false,\n        },\n    ],\n    ..SamplerConfig::default()\n};\n```\n\nIn your training loop, prepend the instruction to the anchor when present:\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, SplitLabel, Sampler};\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\nlet batch = sampler.next_triplet_batch(SplitLabel::Train).unwrap();\nfor triplet in batch.triplets {\n    // Prepend the task instruction to the anchor when the recipe specifies one.\n    // Recipes without an instruction pass the anchor text through unchanged.\n    //\n    // With instruction:    \"Retrieve a passage that answers the question:\\nWhat is X?\"\n    // Without instruction: \"What is X?\"\n    let anchor_input = match \u0026triplet.instruction {\n        Some(instr) =\u003e format!(\"{instr}\\n{}\", triplet.anchor.text),\n        None =\u003e triplet.anchor.text.clone(),\n    };\n\n    // The positive and negative slots are never prefixed with the instruction —\n    // only the anchor carries the task prompt.\n    let positive_input = triplet.positive.text.clone();\n    let negative_input = triplet.negative.text.clone();\n\n    // Pass all three to your model's embedding function and compute triplet loss.\n    // let loss = model.triplet_loss(\u0026anchor_input, \u0026positive_input, \u0026negative_input);\n}\n```\n\n### Output Format\n\nEach `SampleTriplet` contains the sampled text and a computed training score.\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, SplitLabel, Sampler};\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\nlet batch = sampler.next_triplet_batch(SplitLabel::Train).unwrap();\nfor triplet in batch.triplets {\n    // Primary content\n    let anchor_text = \u0026triplet.anchor.text;\n    let pos_text    = \u0026triplet.positive.text;\n    let neg_text    = \u0026triplet.negative.text;\n\n    // Metadata\n    let recipe      = \u0026triplet.recipe;      // which recipe produced this triplet\n    let weight      = triplet.weight;       // training score — see below\n    let instruction = triplet.instruction;  // task prompt set on the recipe, if any — see Instruction Tuning\n}\n```\n\n#### What `triplet.weight` means and how it is calculated\n\n`SampleTriplet::weight` is a **per-triplet training score** in the range `(0.0, recipe.weight]`. Use it to scale each triplet's contribution to the loss — triplets that are more structurally coherent or come from higher-trust sources receive a higher score.\n\nThe value is computed as `triplet.weight = recipe.weight × chunk_quality`, where `chunk_quality` is the average of three per-slot signals (one per chunk: anchor, positive, negative). Each signal is the product of two independent factors:\n\n| Factor                    | What it measures                                                                                                          | How it is set                                    |\n| ------------------------- | ------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------ |\n| **Window position score** | `1 / (window_index + 1)` — earlier chunks in a section score higher (1.0 at index 0, 0.5 at index 1, 0.25 at index 3, …). | Automatic.                                       |\n| **Source trust**          | Configured quality signal for the originating source (clamped to `[0, 1]`).                                               | Set via `.with_trust(0.9)` on the source config. |\n\nThe resulting raw signal is clamped to `[chunk_weight_floor, 1.0]` (default floor: `0.1`) before averaging.\n\nThe anchor/positive pair additionally has a **proximity multiplier** applied: chunks that are closer together within the same section receive a higher multiplier (two adjacent windows score 1.0; the score decreases as window distance grows). This rewards pairs that share local context.\n\nA practical reading: a triplet from a high-trust source where all three chunks come from the opening windows of their sections will have `chunk_quality ≈ 1.0`, so `triplet.weight ≈ recipe.weight`. A triplet with chunks deep in long documents from a lower-trust source will have a noticeably smaller score.\n\nIn a training loop pass the weight straight into your criterion:\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, SplitLabel, Sampler};\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\nlet batch = sampler.next_triplet_batch(SplitLabel::Train).unwrap();\n// Example: accumulate weighted loss over a batch.\nlet _weighted_loss: f32 = batch.triplets.iter().map(|t| {\n    let triplet_loss = 0.0_f32; // replace with your model's per-triplet loss\n    triplet_loss * t.weight\n}).sum();\n```\n\n### Source Within a Source\n\nEach `TripletRecipe` is an **independent code path** over the sections of a record. Two recipes registered against the same source can express completely different training hypotheses about the same underlying data — no second source registration needed.\n\nThe mechanism is straightforward:\n\n- Populate each `DataRecord::sections` with as many `RecordSection` entries as your data has natural views.\n- Assign each section a `SectionRole` (or let position carry the meaning with `Selector::Paragraph(n)`).\n- Write one `TripletRecipe` per hypothesis; each recipe independently specifies which sections fill the anchor, positive, and negative slots.\n- Sources declare their own recipes via `default_triplet_recipes()` so callers need no recipe configuration at all.\n\n**Sparse sections — optional data in the same record pool**\n\nNot every record needs to have all sections. If a recipe targets `Selector::Paragraph(2)` (the third section) and a record only has two sections, the sampler simply skips that record *for that recipe only* — the record continues to serve all other recipes normally. This lets you mix densely-covered and sparsely-covered training hypotheses in a single source without any record filtering logic in your data pipeline.\n\n**Example — financial data source with two recipe strategies**\n\nImagine each record represents one publicly-traded company with up to three sections:\n\n| Index | Role           | Content                                                       | Always present?                       |\n| ----- | -------------- | ------------------------------------------------------------- | ------------------------------------- |\n| 0     | `Anchor`       | Linearized financial metrics — view A (a random tag subset)   | Yes                                   |\n| 1     | `Context`      | Linearized financial metrics — view B (a disjoint tag subset) | Yes                                   |\n| 2     | *(positional)* | Earnings-call transcript for the same period                  | No — only when a transcript was found |\n\nTwo recipes target different aspects of the same records:\n\n```rust,no_run\nuse triplets::config::{NegativeStrategy, Selector, TripletRecipe};\nuse triplets::data::SectionRole;\n\n/// Cross-view recipe: both metric views are always present, so every record\n/// participates. Teaches the model that two different linearized views of the\n/// same company are semantically closer than any view of a different company.\nfn metrics_cross_view_recipe() -\u003e TripletRecipe {\n    TripletRecipe {\n        name: \"metrics_cross_view\".into(),\n        // Anchor: metric view A.\n        anchor: Selector::Role(SectionRole::Anchor),\n        // Positive: metric view B — disjoint tags, same company and period.\n        positive_selector: Selector::Role(SectionRole::Context),\n        // Negative: metric view A of a different company.\n        negative_selector: Selector::Role(SectionRole::Anchor),\n        negative_strategy: NegativeStrategy::WrongArticle,\n        weight: 1.0,\n        instruction: None,\n        allow_same_anchor_positive: false,\n    }\n}\n\n/// Transcript recipe: targets an optional third section (index 2).\n/// Records without a transcript are skipped for *this recipe only* —\n/// they still serve the metrics_cross_view recipe above without any\n/// record filtering logic in the data pipeline.\n///\n/// Lower weight reflects partial coverage: fewer records satisfy this\n/// recipe, so letting it drive the same number of gradient steps as the\n/// dense recipe would over-represent the companies with transcripts.\nfn metrics_to_transcript_recipe() -\u003e TripletRecipe {\n    TripletRecipe {\n        name: \"metrics_to_transcript\".into(),\n        // Anchor: metric view A.\n        anchor: Selector::Role(SectionRole::Anchor),\n        // Positive: earnings-call transcript at section index 2.\n        // Records that lack this section are skipped for this recipe.\n        positive_selector: Selector::Paragraph(2),\n        // Negative: metric view A of a different company.\n        negative_selector: Selector::Role(SectionRole::Anchor),\n        negative_strategy: NegativeStrategy::WrongArticle,\n        // Half the weight of the dense recipe; adjust as transcript coverage grows.\n        weight: 0.5,\n        instruction: None,\n        allow_same_anchor_positive: false,\n    }\n}\n```\n\nThe source returns both recipes from `default_triplet_recipes()` so that no recipe configuration is needed at the call site:\n\n```rust,no_run\nuse triplets::config::TripletRecipe;\nuse triplets::source::{DataSource, IndexablePager, IndexableSource, SourceCursor, SourceSnapshot};\nuse triplets::{DataRecord, SamplerConfig, SamplerError};\n\n# use triplets::config::{NegativeStrategy, Selector};\n# use triplets::data::SectionRole;\n# fn metrics_cross_view_recipe() -\u003e TripletRecipe { TripletRecipe { name: \"\".into(), anchor: Selector::Random, positive_selector: Selector::Random, negative_selector: Selector::Random, negative_strategy: NegativeStrategy::WrongArticle, weight: 1.0, instruction: None, allow_same_anchor_positive: false } }\n# fn metrics_to_transcript_recipe() -\u003e TripletRecipe { metrics_cross_view_recipe() }\nstruct FinancialReportsSource { /* store handle, symbol index, … */ }\n\nimpl IndexableSource for FinancialReportsSource {\n    fn id(\u0026self) -\u003e \u0026str { \"financial_reports\" }\n    fn len_hint(\u0026self) -\u003e Option\u003cusize\u003e { Some(5000) }\n\n    fn record_at(\u0026self, _idx: usize) -\u003e Result\u003cOption\u003cDataRecord\u003e, SamplerError\u003e {\n        // Build a record with 2 or 3 sections depending on transcript availability.\n        // Sparse records (None returns) are skipped entirely by the pager.\n        Ok(None) // replace with real record construction\n    }\n}\n\nimpl DataSource for FinancialReportsSource {\n    fn id(\u0026self) -\u003e \u0026str { \"financial_reports\" }\n\n    fn refresh(\n        \u0026self,\n        _config: \u0026SamplerConfig,\n        cursor: Option\u003c\u0026SourceCursor\u003e,\n        limit: Option\u003cusize\u003e,\n    ) -\u003e Result\u003cSourceSnapshot, SamplerError\u003e {\n        IndexablePager::new(DataSource::id(self)).refresh(self, cursor, limit)\n    }\n\n    fn reported_record_count(\u0026self, _config: \u0026SamplerConfig) -\u003e Result\u003cu128, SamplerError\u003e {\n        Ok(5000)\n    }\n\n    /// Source declares its own recipes — no recipe config required at call site.\n    fn default_triplet_recipes(\u0026self) -\u003e Vec\u003cTripletRecipe\u003e {\n        vec![\n            metrics_cross_view_recipe(),      // dense: all records, weight 1.0\n            metrics_to_transcript_recipe(),   // sparse: records with transcripts, weight 0.5\n        ]\n    }\n}\n```\n\nWhen the sampler processes a record that has only two sections, it attempts each recipe in weighted order: `metrics_cross_view` succeeds (both `Role(Anchor)` and `Role(Context)` sections are present), while `metrics_to_transcript` returns no candidate for that slot (section index 2 is absent). The sampler moves on without any special handling in the data pipeline.\n\nThe same single `register_source` call enables both training hypotheses:\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, SplitLabel, Sampler};\n\n# struct FinancialReportsSource;\n# impl triplets::source::DataSource for FinancialReportsSource {\n#   fn id(\u0026self) -\u003e \u0026str { \"financial_reports\" }\n#   fn refresh(\u0026self, _: \u0026SamplerConfig, _: Option\u003c\u0026triplets::source::SourceCursor\u003e, _: Option\u003cusize\u003e) -\u003e Result\u003ctriplets::source::SourceSnapshot, triplets::SamplerError\u003e { unimplemented!() }\n#   fn reported_record_count(\u0026self, _: \u0026SamplerConfig) -\u003e Result\u003cu128, triplets::SamplerError\u003e { Ok(0) }\n# }\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\n\n// One registration — the source provides both recipes.\nsampler.register_source(Box::new(FinancialReportsSource { /* … */ }));\n\nlet batch = sampler.next_triplet_batch(SplitLabel::Train).unwrap();\n// batch.triplets is a mix of \"metrics_cross_view\" and \"metrics_to_transcript\"\n// samples, proportional to their configured weights and record coverage.\n```\n\n## Metadata Prefixes and Tag Dropout\n\n`KvpPrefixSampler` attaches structured key-value metadata to a record. When a chunk is selected for training, the sampler may prepend a `meta:` line to the chunk text before it reaches the model. What that line looks like varies per sample — a variant is selected at random, each field picks one value from its declared list, and the field order within the line is shuffled:\n\n```text\nmeta: source=daily-update | date=2025-01-01\n\u003cactual chunk content begins here\u003e\n\n# same record, different sample — different value, different field order:\nmeta: date=Jan 1, 2025 | source=daily-update\n\u003cactual chunk content begins here\u003e\n```\n\n### Tag dropout\n\nThe `dropout` parameter controls how often the prefix is included at all:\n\n| `dropout` | Effect                                                                              |\n| --------- | ----------------------------------------------------------------------------------- |\n| `1.0`     | Prefix is **always** prepended.                                                     |\n| `0.5`     | Prefix is prepended ~half the time; the rest of the time the model sees plain text. |\n| `0.0`     | Prefix is **never** prepended.                                                      |\n\nTraining with `dropout \u003c 1.0` teaches the model to handle both cases — chunks with metadata context and chunks without. This prevents the model from becoming dependent on the tags being present at inference time.\n\nIndividual fields also have their own **presence probability** controlled by `.with_presence(p)`. A field with `presence = 0.7` is omitted from a given prefix 30% of the time, independently of the sampler-level dropout.\n\n```rust\nuse triplets::kvp::{KvpField, KvpPrefixSampler};\n\n// dropout=0.8: 80% of chunks get a prefix, 20% see plain text.\nlet mut sampler = KvpPrefixSampler::new(0.8);\n\nsampler.add_variant_fields([\n    // \"date\" appears in every emitted prefix (presence=1.0 is the default).\n    KvpField::many(\"date\", [\"2025-01-01\", \"Jan 1, 2025\"]),\n    // \"source\" is omitted from ~30% of emitted prefixes.\n    KvpField::one(\"source\", \"daily-update\").with_presence(0.7),\n]);\n```\n\nThe two value options for `date` are chosen at random each time the prefix is rendered, and — when a variant has more than one field — the order the fields appear in the line is also shuffled. The model therefore never sees a consistent positional signal for any individual tag.\n\nYou can call `add_variant` / `add_variant_fields` multiple times to register alternative field sets. One set is selected uniformly at random per sample — useful when you want to teach the model different metadata \"views\" of the same record:\n\n```rust\nuse triplets::kvp::{KvpField, KvpPrefixSampler};\n\nlet mut sampler = KvpPrefixSampler::new(1.0);\n// Variant A: structural tags\nsampler.add_variant([(\"type\", \"earnings-call\"), (\"quarter\", \"Q1-2025\")]);\n// Variant B: temporal tags\nsampler.add_variant_fields([KvpField::many(\"date\", [\"2025-01-15\", \"Jan 15, 2025\"])]);\n```\n\n### Attaching a prefix to a record\n\nSet `DataRecord::meta_prefix` on any record before registering it with a source:\n\n```rust\nuse chrono::Utc;\nuse triplets::DataRecord;\nuse triplets::kvp::{KvpField, KvpPrefixSampler};\n\nlet mut prefix = KvpPrefixSampler::new(0.9);\nprefix.add_variant_fields([\n    KvpField::many(\"date\", [\"2025-01-01\", \"Jan 1, 2025\"]),\n    KvpField::one(\"source\", \"daily-update\").with_presence(0.7),\n]);\n\nlet record = DataRecord {\n    id: \"rec-001\".into(),\n    source: \"news\".into(),\n    created_at: Utc::now(),\n    updated_at: Utc::now(),\n    quality: Default::default(),\n    taxonomy: vec![],\n    sections: vec![],\n    meta_prefix: Some(prefix),\n};\n```\n\n### Inspecting metadata on output chunks\n\nEvery `RecordChunk` carries a `kvp_meta: HashMap\u003cString, Vec\u003cString\u003e\u003e` field containing **all** declared keys and every possible value across all variants. This is populated unconditionally — even when dropout suppresses the prefix text for that particular chunk:\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, SplitLabel, Sampler};\nlet ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\nlet store = Arc::new(DeterministicSplitStore::new(ratios, 42).unwrap());\nlet mut sampler = TripletSampler::new(SamplerConfig::default(), store);\nlet batch = sampler.next_triplet_batch(SplitLabel::Train).unwrap();\nfor triplet in \u0026batch.triplets {\n    // All declared keys and values are here regardless of dropout.\n    println!(\"{:?}\", triplet.anchor.kvp_meta);\n}\n```\n\n## Epochs and Determinism\n\n### Iterating Epochs\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, SplitRatios, DeterministicSplitStore, SplitLabel, Sampler};\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\n    let store = Arc::new(DeterministicSplitStore::new(ratios, 42)?);\n    let mut sampler = TripletSampler::new(SamplerConfig::default(), store);\n    let mut batches_left = 1;\n    let mut training_not_finished = || {\n        let ret = batches_left \u003e 0;\n        batches_left -= 1;\n        ret\n    };\n    // In your training loop:\n    for epoch in 0..10 {\n        sampler.set_epoch(epoch)?;\n\n        while training_not_finished() {\n            let batch = sampler.next_triplet_batch(SplitLabel::Train)?;\n            // ... pass batch to your model ...\n        }\n\n        // Save state at the end of each epoch to allow resuming if training is interrupted.\n        sampler.save_sampler_state(None)?;\n    }\n\n    Ok(())\n}\n```\n\n### Deterministic Resuming\n\nTo resume training, initialize a `FileSplitStore` at the same path. The sampler automatically restores cursors, RNG state, and epoch progress from that store.\n\n```rust,no_run\nuse std::sync::Arc;\nuse triplets::{SamplerConfig, TripletSampler, FileSplitStore, SplitRatios, Sampler};\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    let ratios = SplitRatios { train: 0.8, validation: 0.1, test: 0.1 };\n    let seed = 42;\n\n    // Opening an existing FileSplitStore automatically loads its persisted state.\n    let store = Arc::new(FileSplitStore::open(\"checkpoints/splits.bin\", ratios, seed)?);\n\n    // The sampler will resume from the exact record and recipe it was on.\n    let mut sampler = TripletSampler::new(SamplerConfig::default(), store);\n    Ok(())\n}\n```\n\n\u003e **Note**: Sampler state is intentionally lightweight. It persists source identifiers, integer record cursors, and compact RNG state vectors, not full data records. This keeps frequent checkpointing practical in long-running training jobs.\n\n## Technical Details\n\n### Threading Model\n\nConcurrency is handled at multiple levels for high throughput:\n- **Prefetching**: `BatchPrefetcher` runs a dedicated background worker thread that fills a bounded queue.\n- **Parallel Ingestion**: Source refresh executes concurrently across registered sources during ingestion cycles.\n- **Synchronous API**: Sampling calls are synchronous at the API boundary for straightforward training-loop integration.\n- **Thread-Safe Shared Use**: `TripletSampler` is safe to share across threads (for example via `Arc`); concurrent calls are internally synchronized with a mutex, so a single sampler instance is callable from multiple threads without data races.\n\n### Chunking and Windows\n\nLong documents are handled through a pluggable `ChunkingAlgorithm`. The default `SlidingWindowChunker` splits sections into fixed-size token windows with configurable overlap, preserving full coverage of long text.\n\n### Negative Mining\n\nNegative selection is delegated to a pluggable backend.\n- **DefaultBackend**: Uniform random selection from the candidate pool.\n- **Bm25Backend**: (Requires `bm25-mining`) Ranks candidates by lexical overlap with the anchor to provide harder training examples.\n\n## Capabilities\n\n| Capability              | Description                                                                   |\n| ----------------------- | ----------------------------------------------------------------------------- |\n| **Source Agnostic**     | Implement `DataSource` or `IndexableSource` for any DB or API.                |\n| **Weighted Sampling**   | Tune source and recipe frequencies to handle class imbalance.                 |\n| **Epoch Shuffling**     | Deterministic pseudo-random shuffling that re-permutes per epoch.             |\n| **Instruction Tuning**  | Attach task-specific prompts (e.g., \"Summarize this...\") to specific recipes. |\n| **Metadata Decorators** | Inject structured prefixes into sampled text via `KvpPrefixSampler`.          |\n| **Anti-Shortcut**       | Includes anchor/positive swapping to avoid asymmetric slot bias.              |\n\n## License\n\n`triplets` is distributed under both the MIT license and the Apache License (Version 2.0).\n\nSee [LICENSE-APACHE](https://github.com/jzombie/rust-triplets/blob/main/LICENSE-APACHE) and [LICENSE-MIT](https://github.com/jzombie/rust-triplets/blob/main/LICENSE-MIT) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjzombie%2Frust-triplets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjzombie%2Frust-triplets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjzombie%2Frust-triplets/lists"}