{"id":51144238,"url":"https://github.com/codeyousef/trainer","last_synced_at":"2026-06-26T01:30:47.102Z","repository":{"id":364262260,"uuid":"1265478776","full_name":"codeyousef/trainer","owner":"codeyousef","description":null,"archived":false,"fork":false,"pushed_at":"2026-06-13T17:50:18.000Z","size":1129,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"feat/trainer-seen-native-sinai","last_synced_at":"2026-06-26T01:30:44.471Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codeyousef.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-10T20:12:42.000Z","updated_at":"2026-06-12T18:10:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/codeyousef/trainer","commit_stats":null,"previous_names":["codeyousef/trainer"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/codeyousef/trainer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codeyousef%2Ftrainer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codeyousef%2Ftrainer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codeyousef%2Ftrainer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codeyousef%2Ftrainer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/codeyousef","download_url":"https://codeload.github.com/codeyousef/trainer/tar.gz/refs/heads/feat/trainer-seen-native-sinai","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codeyousef%2Ftrainer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34799570,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-25T02:00:05.521Z","response_time":101,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-26T01:30:43.939Z","updated_at":"2026-06-26T01:30:47.099Z","avatar_url":"https://github.com/codeyousef.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Seen Trainer\n\nSeen-native Sinai trainer for MiniLM/SentenceTransformer-style embedding models.\n\nThis project is implemented in Seen. 
It consumes local JSONL/source exports and local model artifacts, uses mean pooling for training, evaluation, and inference, can dispatch tensor kernels through Seen's Vulkan GPU runtime, and emits a SentenceTransformer-compatible package with Seen manifests.\n\n## CLI\n\n```sh\ntrainer mine --config config.json\ntrainer train --config config.json\ntrainer calibrate --config config.json\ntrainer eval --config config.json\ntrainer package --config config.json\ntrainer run-all --config config.json\n```\n\nIf `--config` is omitted, the CLI reads `config/example.config.json`.\n\n## Data\n\nTraining JSONL rows use this triplet schema:\n\n```json\n{\n  \"query_text\": \"question text\",\n  \"positive_chunk_text\": \"matching answer chunk\",\n  \"hard_negative_chunk_text\": \"hard negative chunk\",\n  \"domain\": \"domain name\",\n  \"source\": \"source id\"\n}\n```\n\nSource adapters can also normalize CSV/TSV/JSONL rows into `(query, positive)` pairs before mining.\nInput JSONL/CSV/TSV readers use chunked byte reads and stop after configured accepted-row caps.\nSet any cap to `0` for unbounded processing:\n\n```json\n{\n  \"max_source_pairs\": 0,\n  \"max_mined_triplets\": 0,\n  \"max_train_triplets\": 0,\n  \"max_calibration_triplets\": 0,\n  \"max_eval_triplets\": 0\n}\n```\n\nThese caps are intended for memory-safe real-data smoke runs: `max_source_pairs` bounds the candidate\npool loaded for mining, `max_mined_triplets` bounds accepted mined output, and the train,\ncalibration, and eval caps bound the examples consumed by their respective phases.\nHard-negative mining precomputes positive embeddings, normalizes query/candidate rows through\n`tensorNormalizeRows` when `backend` is `gpu`, then uses Seen's `tensorTopKInnerProduct` GPU\nkernel; if dispatch is unavailable it falls back to the same scalar top-k and domain/source\nexclusion semantics.\nLoaded MiniLM forward passes also thread the configured backend into Q/K/V, attention-output,\nintermediate, and output dense 
projections through Seen's `Tensor.matmul` dispatch, plus\nLayerNorm forward normalization/affine through `tensorLayerNormRows` and elementwise kernels,\nwhile keeping the scalar path as the correctness reference.\nMiniLM backward tail gradients now thread the configured backend into dense projection input\ngradients for FFN/output/attention paths, reusing `Tensor.matmul` through the dense-gradient\nhelpers. GELU backward dispatches derivative and product evaluation through `tensorGeluBackward`\nwhen `backend` is `gpu`; fused attention context dispatches through `tensorAttentionContext`,\nwith the Tensor matmul/scale/softmax composition retained as a fallback. Q/K/V attention-gradient\nproducts route through Tensor matmul/scale/elementwise/reduction kernels for GPU configs. Triplet\nmargin loss evaluation dispatches through `tensorTripletMarginLoss` for GPU configs while\ngradient-producing training paths keep their scalar-stat reference math.\nThese paths retain scalar fallbacks for shape diagnostics and tests.\n\n## Model Outputs\n\nThe package step writes SentenceTransformer-compatible files plus Seen manifests. When local MiniLM safetensors are loaded, training updates sparse embedding rows, embedding LayerNorm, and all ready encoder layer surfaces. These trainable base surfaces apply AdamW to the loaded safetensors base value plus the Seen delta, then persist the resulting delta so it can be materialized into `seen_trained_base_model.safetensors`.\nPackages also include `minilm_parameter_registry.json`, a Seen-native object graph of loaded MiniLM safetensors tensors as `Parameter` buffers with value, gradient, and Adam moment arrays. 
When MiniLM slots are loaded, sparse embedding, embedding LayerNorm, and encoder layer-surface AdamW updates run through those registry buffers and sync the resulting deltas back into the package-compatible delta artifacts.\n\nSafetensors metadata is inspected through header reads, tensor loads use byte-range reads, and materialization patches tensor-sized slices or sparse rows instead of loading the whole model file into a Seen byte array.\nFor constrained real-model smoke runs, `weight_load_cap_elements` controls MiniLM embedding/forward tensor eligibility, while `parameter_registry_load_cap_elements` independently controls whether safetensors values are loaded into trainable registry buffers. Set `train_all_minilm_layers` to `false` to exercise adapter-only MiniLM training without allocating the all-layer delta surfaces, and set `train_minilm_deltas` to `false` to use MiniLM embeddings while training only the projection adapters. `max_minilm_layers` can bound the runtime forward/training layer count while leaving packaged model metadata intact, and `cache_minilm_tensors` can hold the bounded forward tensors once per model load to avoid repeated safetensors allocations. 
`resume_output_artifacts` defaults to `false` so repeated `run-all` executions start from the base model instead of parsing previous large JSON delta artifacts; set it to `true` when an explicit resume run is needed.\n\nThe direct update manifest is written to:\n\n```text\n\u003coutput_model_dir\u003e/seen_base_weight_update_manifest.json\n```\n\nIt reports whether the full MiniLM encoder surface was materialized for the loaded model.\n\n## Capped Verification\n\nAlways run Seen builds/checks/tests under a memory cap:\n\n```sh\nCAP_KB=$(awk '/MemAvailable/ { v=int($2/2); if (v\u003e8388608) v=8388608; print v }' /proc/meminfo)\nulimit -v \"$CAP_KB\"\nSEEN_JOBS=1 SEEN_OPT_JOBS=1 seen check src/main.seen\nSEEN_JOBS=1 SEEN_OPT_JOBS=1 seen compile src/main.seen target/trainer --fast --no-fork --emit-glsl --no-cache --jobs=1 --opt-jobs=1\n```\n\nTest sources can be checked and run from the test project:\n\n```sh\ncd tests\nCAP_KB=$(awk '/MemAvailable/ { v=int($2/2); if (v\u003e8388608) v=8388608; print v }' /proc/meminfo)\nulimit -v \"$CAP_KB\"\nfor test in test_*.seen; do SEEN_JOBS=1 SEEN_OPT_JOBS=1 seen check \"$test\" || exit 1; done\nfor test in test_*.seen; do\n  name=${test%.seen}\n  SEEN_JOBS=1 SEEN_OPT_JOBS=1 seen compile \"$test\" \"../target/$name\" --fast --no-fork --emit-glsl --no-cache --jobs=1 --opt-jobs=1 || exit 1\n  \"../target/$name\" || exit 1\ndone\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodeyousef%2Ftrainer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodeyousef%2Ftrainer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodeyousef%2Ftrainer/lists"}