https://github.com/codeyousef/trainer

Host: GitHub
URL: https://github.com/codeyousef/trainer
Owner: codeyousef
Created: 2026-06-10T20:12:42.000Z (16 days ago)
Default Branch: feat/trainer-seen-native-sinai
Last Pushed: 2026-06-13T17:50:18.000Z (13 days ago)
Last Synced: 2026-06-26T01:30:44.471Z (about 16 hours ago)
Size: 1.08 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Seen Trainer

Seen-native Sinai trainer for MiniLM/SentenceTransformer-style embedding models.

This project is implemented in Seen. It consumes local JSONL/source exports and local model artifacts, uses mean pooling for training, evaluation, and inference, can dispatch tensor kernels through Seen's Vulkan GPU runtime, and emits a SentenceTransformer-compatible package with Seen manifests.
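
For readers unfamiliar with mean pooling, the following is a minimal Python sketch of the idea, not the trainer's Seen implementation. The function name `mean_pool` and the attention-mask convention (1 for real tokens, 0 for padding) are assumptions made for illustration.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is non-zero; padding is ignored."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0]) if token_embeddings else 0
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    # Guard against an all-padding sequence to avoid dividing by zero.
    denom = max(count, 1)
    return [t / denom for t in total]
```

The third token in `mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])` is masked out, so only the first two vectors contribute to the sentence embedding.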

## CLI

```sh
trainer mine --config config.json
trainer train --config config.json
trainer calibrate --config config.json
trainer eval --config config.json
trainer package --config config.json
trainer run-all --config config.json
```

If `--config` is omitted, the CLI reads `config/example.config.json`.

## Data

Training JSONL rows use this triplet schema:

```json
{
  "query_text": "question text",
  "positive_chunk_text": "matching answer chunk",
  "hard_negative_chunk_text": "hard negative chunk",
  "domain": "domain name",
  "source": "source id"
}
```
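
A row in this schema can be checked before training with a small validator. The sketch below is in Python for illustration only; `parse_triplet_row` is a hypothetical helper, and treating all five fields as required strings is an assumption, since the schema above does not say which fields are optional.

```python
import json

# Field names come from the schema above; requiring all of them is an assumption.
REQUIRED_FIELDS = (
    "query_text",
    "positive_chunk_text",
    "hard_negative_chunk_text",
    "domain",
    "source",
)


def parse_triplet_row(line: str) -> dict:
    """Parse one JSONL line and keep only the triplet fields."""
    row = json.loads(line)
    if not isinstance(row, dict):
        raise ValueError("a triplet row must be a JSON object")
    bad = [k for k in REQUIRED_FIELDS if not isinstance(row.get(k), str)]
    if bad:
        raise ValueError("missing or non-string fields: " + ", ".join(bad))
    return {k: row[k] for k in REQUIRED_FIELDS}
```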

Source adapters can also normalize CSV/TSV/JSONL rows into `(query, positive)` pairs before mining.
Input JSONL/CSV/TSV readers use chunked byte reads and stop once the configured accepted-row cap is reached.
Set any cap to `0` for unbounded processing:

```json
{
  "max_source_pairs": 0,
  "max_mined_triplets": 0,
  "max_train_triplets": 0,
  "max_calibration_triplets": 0,
  "max_eval_triplets": 0
}
```

These caps are intended for memory-safe real-data smoke runs: `max_source_pairs` bounds the candidate
pool loaded for mining, `max_mined_triplets` bounds accepted mined output, and the train,
calibration, and eval caps bound the examples consumed by their respective phases.

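
The combination of chunked reads and an accepted-row cap can be sketched as follows. This is a Python illustration under stated assumptions, not the Seen reader: `iter_jsonl_chunked`, `take_capped`, and the default chunk size are hypothetical, and only the "cap of `0` means unbounded" rule comes from the text above.

```python
import json
from typing import Iterable, Iterator


def iter_jsonl_chunked(path: str, chunk_bytes: int = 65536) -> Iterator[dict]:
    """Yield JSON objects from a JSONL file, reading fixed-size byte chunks."""
    pending = b""
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_bytes)
            if not chunk:
                break
            pending += chunk
            # Everything before the last newline is complete; keep the tail.
            *lines, pending = pending.split(b"\n")
            for raw in lines:
                if raw.strip():
                    yield json.loads(raw)
    if pending.strip():
        # Final line without a trailing newline.
        yield json.loads(pending)


def take_capped(rows: Iterable[dict], cap: int) -> Iterator[dict]:
    """Yield rows until `cap` rows were accepted; a cap of 0 means unbounded."""
    accepted = 0
    for row in rows:
        yield row
        accepted += 1
        if cap and accepted >= cap:
            return  # stop before pulling another row from the reader
```
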
Hard-negative mining precomputes positive embeddings, normalizes query/candidate rows through
`tensorNormalizeRows` when `backend` is `gpu`, then uses Seen's `tensorTopKInnerProduct` GPU
kernel; if GPU dispatch is unavailable, it falls back to a scalar top-k with the same domain/source
exclusion semantics.

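
A scalar reference for this mining step might look like the Python sketch below. It is an illustration only: the function names are hypothetical stand-ins for the Seen kernels, and the exclusion rule is passed in as a predicate because the exact domain/source exclusion semantics are not spelled out here.

```python
import math
from typing import Callable, List, Sequence, Tuple


def normalize_rows(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out


def top_k_inner_product(
    queries: Sequence[Sequence[float]],
    candidates: Sequence[Sequence[float]],
    k: int,
    is_excluded: Callable[[int, int], bool],
) -> List[List[Tuple[int, float]]]:
    """For each query, return up to k (candidate_index, score) pairs, best first.

    `is_excluded(query_index, candidate_index)` is a hypothetical hook for
    rules such as "skip the query's own positive".
    """
    results = []
    for qi, q in enumerate(queries):
        scored = [
            (sum(a * b for a, b in zip(q, c)), ci)
            for ci, c in enumerate(candidates)
            if not is_excluded(qi, ci)
        ]
        # Highest score first; ties are broken by the lower candidate index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        results.append([(ci, score) for score, ci in scored[:k]])
    return results
```
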
Loaded MiniLM forward passes also thread the configured backend into the Q/K/V, attention-output,
intermediate, and output dense projections through Seen's `Tensor.matmul` dispatch, and into
LayerNorm forward normalization/affine through `tensorLayerNormRows` and elementwise kernels,
while keeping the scalar path as the correctness reference.

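
As a reminder of what a row-wise LayerNorm computes (normalization followed by the affine step), here is a minimal scalar Python sketch. It is not the `tensorLayerNormRows` kernel; the function name and the `eps` default of `1e-12` (the value commonly used in BERT-style configs) are assumptions.

```python
import math
from typing import List, Sequence


def layer_norm_rows(rows: Sequence[Sequence[float]],
                    gamma: Sequence[float],
                    beta: Sequence[float],
                    eps: float = 1e-12) -> List[List[float]]:
    """Normalize each row to zero mean and unit variance, then apply gamma/beta."""
    out = []
    for row in rows:
        mean = sum(row) / len(row)
        # Biased (population) variance, as used by LayerNorm.
        var = sum((x - mean) ** 2 for x in row) / len(row)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(x - mean) * inv_std * g + b
                    for x, g, b in zip(row, gamma, beta)])
    return out
```
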
MiniLM backward tail gradients thread the configured backend into the dense projection input
gradients for the FFN, output, and attention paths, reusing `Tensor.matmul` through the
dense-gradient helpers. GELU backward dispatches derivative and product evaluation through
`tensorGeluBackward` when `backend` is `gpu`. Fused attention context dispatches through
`tensorAttentionContext`, with the Tensor matmul/scale/softmax composition retained as a fallback.
Q/K/V attention-gradient products route through Tensor matmul/scale/elementwise/reduction kernels
for GPU configs. Triplet margin loss evaluation dispatches through `tensorTripletMarginLoss` for
GPU configs, while gradient-producing training paths keep their scalar-stat reference math.
These paths retain scalar fallbacks for shape diagnostics and tests.
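
Two of the element-wise pieces named above can be written as scalar references in a few lines. The Python sketch below is illustrative only and rests on assumptions: it uses the exact erf-based GELU (the trainer may use a tanh approximation), and it defines the triplet margin loss on inner-product similarity with a caller-supplied margin, which may differ from the formulation behind `tensorTripletMarginLoss`.

```python
import math
from typing import Sequence


def gelu_backward(x: float, grad_out: float) -> float:
    """Upstream gradient times d/dx [x * Phi(x)] = Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return grad_out * (cdf + x * pdf)


def triplet_margin_loss(query: Sequence[float],
                        positive: Sequence[float],
                        negative: Sequence[float],
                        margin: float) -> float:
    """max(0, margin - sim(q, p) + sim(q, n)) with inner-product similarity."""
    sim_pos = sum(a * b for a, b in zip(query, positive))
    sim_neg = sum(a * b for a, b in zip(query, negative))
    return max(0.0, margin - sim_pos + sim_neg)
```

The loss is zero once the positive is more similar to the query than the negative by at least `margin`, so only violating triplets contribute.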

## Model Outputs

The package step writes SentenceTransformer-compatible files plus Seen manifests. When local MiniLM safetensors are loaded, training updates sparse embedding rows, embedding LayerNorm, and all ready encoder layer surfaces. These trainable base surfaces apply AdamW to the loaded safetensors base value plus the Seen delta, then persist the resulting delta so it can be materialized into `seen_trained_base_model.safetensors`.
Packages also include `minilm_parameter_registry.json`, a Seen-native object graph of loaded MiniLM safetensors tensors as `Parameter` buffers with value, gradient, and Adam moment arrays. When MiniLM slots are loaded, sparse embedding, embedding LayerNorm, and encoder layer-surface AdamW updates run through those registry buffers and sync the resulting deltas back into the package-compatible delta artifacts.
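
The "base value plus delta" update can be sketched as below: the optimizer sees the effective weight, and only the offset from the frozen base is stored. This is a Python illustration, not the registry's actual layout; the `Parameter` fields, the `adamw_step` name, and the hyperparameter defaults are assumptions.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Parameter:
    base: List[float]                              # loaded values, never modified
    delta: List[float] = field(default_factory=list)  # trained offset
    m: List[float] = field(default_factory=list)      # first Adam moment
    v: List[float] = field(default_factory=list)      # second Adam moment
    step: int = 0

    def __post_init__(self) -> None:
        n = len(self.base)
        self.delta = self.delta or [0.0] * n
        self.m = self.m or [0.0] * n
        self.v = self.v or [0.0] * n


def adamw_step(p: Parameter, grad: Sequence[float], lr: float = 1e-3,
               beta1: float = 0.9, beta2: float = 0.999,
               eps: float = 1e-8, weight_decay: float = 0.01) -> None:
    """Apply one AdamW step to base + delta and store the new delta."""
    p.step += 1
    bias1 = 1.0 - beta1 ** p.step
    bias2 = 1.0 - beta2 ** p.step
    for i, g in enumerate(grad):
        value = p.base[i] + p.delta[i]            # effective weight
        p.m[i] = beta1 * p.m[i] + (1.0 - beta1) * g
        p.v[i] = beta2 * p.v[i] + (1.0 - beta2) * g * g
        value -= lr * weight_decay * value        # decoupled weight decay
        value -= lr * (p.m[i] / bias1) / (math.sqrt(p.v[i] / bias2) + eps)
        p.delta[i] = value - p.base[i]            # persist only the offset
```

Materializing a trained tensor is then just `base[i] + delta[i]` for each element, which is why the base file can stay read-only during training.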

Safetensors metadata is inspected through header reads, tensor loads use byte-range reads, and materialization patches tensor-sized slices or sparse rows instead of loading the whole model file into a Seen byte array.
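
The safetensors layout makes this kind of access straightforward: the file starts with an 8-byte little-endian header length, followed by a JSON header whose `data_offsets` are relative to the end of that header. The Python sketch below illustrates a header read and a byte-range tensor read; the function names are hypothetical and this is not the Seen loader.

```python
import json
import struct
from typing import Tuple


def read_safetensors_header(path: str) -> Tuple[dict, int]:
    """Return the parsed JSON header and the file offset where tensor data starts."""
    with open(path, "rb") as fh:
        (header_len,) = struct.unpack("<Q", fh.read(8))
        header = json.loads(fh.read(header_len))
    return header, 8 + header_len


def read_tensor_bytes(path: str, name: str) -> bytes:
    """Read only the byte range of one tensor instead of the whole file."""
    header, data_start = read_safetensors_header(path)
    begin, end = header[name]["data_offsets"]
    with open(path, "rb") as fh:
        fh.seek(data_start + begin)
        return fh.read(end - begin)
```
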

For constrained real-model smoke runs, the following options bound memory use:

- `weight_load_cap_elements` controls MiniLM embedding/forward tensor eligibility.
- `parameter_registry_load_cap_elements` independently controls whether safetensors values are loaded into trainable registry buffers.
- `train_all_minilm_layers` can be set to `false` to exercise adapter-only MiniLM training without allocating the all-layer delta surfaces.
- `train_minilm_deltas` can be set to `false` to use MiniLM embeddings while training only the projection adapters.
- `max_minilm_layers` bounds the runtime forward/training layer count while leaving packaged model metadata intact.
- `cache_minilm_tensors` holds the bounded forward tensors once per model load to avoid repeated safetensors allocations.
- `resume_output_artifacts` defaults to `false` so that repeated `run-all` executions start from the base model instead of parsing previous large JSON delta artifacts; set it to `true` when an explicit resume run is needed.

The direct update manifest is written to:

```text
/seen_base_weight_update_manifest.json
```

It reports whether the full MiniLM encoder surface was materialized for the loaded model.

## Capped Verification

Always run Seen builds/checks/tests under a memory cap:

```sh
CAP_KB=$(awk '/MemAvailable/ { v=int($2/2); if (v>8388608) v=8388608; print v }' /proc/meminfo)
ulimit -v "$CAP_KB"
SEEN_JOBS=1 SEEN_OPT_JOBS=1 seen check src/main.seen
SEEN_JOBS=1 SEEN_OPT_JOBS=1 seen compile src/main.seen target/trainer --fast --no-fork --emit-glsl --no-cache --jobs=1 --opt-jobs=1
```
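
The `awk` expression above takes half of `MemAvailable` and clamps it to 8388608 KiB (8 GiB). For readers who find the one-liner hard to parse, the same arithmetic is shown below as a Python sketch; `cap_kb_from_meminfo` is a hypothetical helper used only to illustrate the calculation.

```python
def cap_kb_from_meminfo(meminfo_text: str,
                        ceiling_kb: int = 8 * 1024 * 1024) -> int:
    """Return min(MemAvailable / 2, ceiling) in KiB, the unit `ulimit -v` expects."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable"):
            available_kb = int(line.split()[1])
            return min(available_kb // 2, ceiling_kb)
    raise ValueError("MemAvailable not found")
```

On a machine reporting `MemAvailable: 4000000 kB` the cap is 2000000 KiB; with 20000000 kB available, the 8 GiB ceiling applies instead.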

Test sources can be checked and run from the test project:

```sh
cd tests
CAP_KB=$(awk '/MemAvailable/ { v=int($2/2); if (v>8388608) v=8388608; print v }' /proc/meminfo)
ulimit -v "$CAP_KB"
for test in test_*.seen; do SEEN_JOBS=1 SEEN_OPT_JOBS=1 seen check "$test" || exit 1; done
for test in test_*.seen; do
  name=${test%.seen}
  SEEN_JOBS=1 SEEN_OPT_JOBS=1 seen compile "$test" "../target/$name" --fast --no-fork --emit-glsl --no-cache --jobs=1 --opt-jobs=1 || exit 1
  "../target/$name" || exit 1
done
```
