{"id":37061350,"url":"https://github.com/abdulvahapmutlu/reprokit-ml","last_synced_at":"2026-01-14T06:56:45.984Z","repository":{"id":315384909,"uuid":"1059274325","full_name":"abdulvahapmutlu/reprokit-ml","owner":"abdulvahapmutlu","description":"One-command determinism + manifest for ML projects.","archived":false,"fork":false,"pushed_at":"2025-09-18T09:49:00.000Z","size":42,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-27T04:22:34.326Z","etag":null,"topics":["cli","data-integrity","dataset-versioning","determinism","github-actions","hashing","jax","manifest","merkle-tree","mlops","pre-commit","provenance","python","pytorch","reproducibility","reproducible-research","tensorflow","xxhash"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abdulvahapmutlu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-18T08:23:14.000Z","updated_at":"2025-09-19T06:56:59.000Z","dependencies_parsed_at":"2025-09-18T10:29:04.374Z","dependency_job_id":"9c340d80-8d4a-47f7-9bea-fa708af89a88","html_url":"https://github.com/abdulvahapmutlu/reprokit-ml","commit_stats":null,"previous_names":["abdulvahapmutlu/reprokit-ml"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/abdulvahapmutlu/reprokit-ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdu
lvahapmutlu%2Freprokit-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdulvahapmutlu%2Freprokit-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdulvahapmutlu%2Freprokit-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdulvahapmutlu%2Freprokit-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abdulvahapmutlu","download_url":"https://codeload.github.com/abdulvahapmutlu/reprokit-ml/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abdulvahapmutlu%2Freprokit-ml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28412473,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T05:26:33.345Z","status":"ssl_error","status_checked_at":"2026-01-14T05:21:57.251Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","data-integrity","dataset-versioning","determinism","github-actions","hashing","jax","manifest","merkle-tree","mlops","pre-commit","provenance","python","pytorch","reproducibility","reproducible-research","tensorflow","xxhash"],"created_at":"2026-01-14T06:56:45.188Z","updated_at":"2026-01-14T06:56:45.978Z","avatar_url":"https://github.com/abdulvahapmutlu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ReproKit-ML\n\n[![PyPI version](https://img.shields.io/pypi/v/reprokit-ml.svg)](https://pypi.org/project/reprokit-ml/)\n![Python versions](https://img.shields.io/pypi/pyversions/reprokit-ml.svg)\n[![CI](https://github.com/abdulvahapmutlu/reprokit-ml/actions/workflows/ci.yml/badge.svg)](https://github.com/abdulvahapmutlu/reprokit-ml/actions/workflows/ci.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)\n\n**One-command determinism + manifest for ML projects.**\n\n\u003e Scope is intentionally narrow: *deterministic seeds + environment freeze + Merkle data hash + single manifest*.  
\n\u003e Keep using MLflow/W\u0026B for experiment tracking and DVC for dataset versioning/remotes.\n\n---\n\n## ✨ Features\n\n- **Determinism on demand:** Seeds Python/NumPy/Torch/TF/JAX and sets framework flags (e.g., cuDNN deterministic).\n- **Environment freeze:** Detects pip/conda/poetry/uv and outputs lock artifacts plus `system.json` (OS/CPU/GPU/CUDA snapshot).\n- **Merkle data hash:** Fast, sampled hashing (first/last N bytes + random chunks) with a **sqlite cache**; directory-level Merkle root.\n- **Single manifest:** JSON binding **code (git)** + **env** + **data hash** + **determinism** + **config** for re-runs and audits.\n- **Hooks:** Pre-commit hooks to guard seeds and prompt manifest freshness.\n\n---\n\n## 🚀 Install\n\nFrom PyPI:\n\n```\npip install reprokit-ml\n```\n\nFrom source (dev):\n\n```\npip install -e .\npre-commit install\n```\n\n\u003e Windows users: the examples below use **cmd.exe**.\n\n---\n\n## ⚡ Quickstart (Windows — cmd.exe)\n\n```\n:: in your project\nreprokit init --data .\\data\n\n:: set deterministic behavior\nreprokit seed --seed 42\n\n:: freeze environment + system snapshot\nreprokit env-freeze\n\n:: hash your data (quote globs on Windows)\nmkdir data \u0026 echo hello\u003e data\\a.txt\nreprokit hash-data .\\data --exclude \"**/.ipynb_checkpoints/**\"\n\n:: write a manifest that ties everything together\nreprokit manifest --config reprokit.toml\n```\n\nArtifacts created:\n\n* `.repro/seeds.json`\n* `repro/environment/requirements.lock` (or `environment.yml` if conda)\n* `repro/environment/system.json` (platform/CPU/GPU/CUDA)\n* `repro/data_hash.json` (Merkle root + stats)\n* `repro/manifest.json` (single source of truth)\n\n---\n\n## 🧰 Commands\n\n### `reprokit init`\n\nBootstraps config and hooks.\n\n```\nreprokit init --data .\\data --data .\\datasets\\cifar10\n```\n\n* Creates `reprokit.toml` with paths and hashing defaults.\n* Creates `.repro/` and `repro/`.\n* Writes `.pre-commit-config.yaml` (if missing).\n\n### 
`reprokit seed`\n\nApplies deterministic knobs in the **current process**.\n\n```\nreprokit seed --seed 42\n```\n\nSets:\n\n* Python `PYTHONHASHSEED`, `random.seed`\n* NumPy `np.random.seed`\n* Torch (if installed): `torch.use_deterministic_algorithms(True)`, cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=\":16:8\"`, etc.\n* TensorFlow (if installed): `TF_DETERMINISTIC_OPS=1`, `tf.random.set_seed`\n* JAX (if installed): initializes `PRNGKey`\n\nOutputs `.repro/seeds.json`.\n\n### `reprokit env-freeze`\n\nDetects the manager and exports lock files + system snapshot.\n\n```\nreprokit env-freeze --out repro\\environment\n```\n\n### `reprokit hash-data`\n\nComputes a Merkle root over one or more directories.\n\n```\nreprokit hash-data .\\data .\\datasets\\cifar10 ^\n  --exclude \"**/.ipynb_checkpoints/**\" --exclude \"*.tmp\" ^\n  --workers 8 --out repro\\data_hash.json\n```\n\n* Uses sampled hashing for speed (first/last N bytes + random 4KB chunks), with a deterministic per-file RNG seed.\n* Caches per-file digests in `.repro/hash-cache.sqlite`.\n\n### `reprokit manifest`\n\nWrites the single manifest binding code/env/data/seeds/config.\n\n```\nreprokit manifest --config conf\\train.yaml --out repro\\manifest.json\n```\n\n---\n\n## ⚙️ Configuration (`reprokit.toml`)\n\nGenerated on `init`. 
Example:\n\n```\n[data]\npaths = [\"./data\", \"./datasets/cifar10\"]\n\n[hash]\nexclude = [\"**/.ipynb_checkpoints/**\", \"*.tmp\", \"**/__pycache__/**\"]\nalgorithm = \"xxh3+sha256-merkle\"\nsample_bytes = 1048576\nworkers = 8\ncache_path = \".repro/hash-cache.sqlite\"\n```\n\n---\n\n## 📦 Manifest (schema sketch)\n\n```\n{\n  \"run_id\": \"2025-09-17T19:12:33Z\",\n  \"code\": {\"commit\": \"3d9e…\", \"branch\": \"main\", \"remote\": \"…\", \"dirty\": false},\n  \"environment\": {\n    \"python\": \"3.11.6\",\n    \"platform\": \"Windows-10-10.0.22631\",\n    \"gpus\": [\"NVIDIA RTX …\"],\n    \"artifacts\": {\n      \"requirements.lock\": \"repro/environment/requirements.lock\",\n      \"environment.yml\": \"repro/environment/environment.yml\"\n    }\n  },\n  \"data\": {\"paths\": [\"./data\"], \"merkle_root\": \"a4f1…\"},\n  \"determinism\": {\"seed\": 42, \"PYTHONHASHSEED\": \"42\", \"torch\": true, \"tensorflow\": false, \"jax\": false},\n  \"config\": {\"files\": [\"conf/train.yaml\"], \"hashes\": [\"sha256:…\"]},\n  \"runtime\": {\"hostname\": \"WIN-…\", \"container\": false, \"ts\": 1758136353.12}\n}\n```\n\n---\n\n## 🪄 Tips \u0026 FAQs\n\n* **Windows globs**: always quote patterns, e.g. `--exclude \"**/.ipynb_checkpoints/**\"`.\n* **Speed vs safety**: sampled hashing is fast and stable, but it can miss edits that fall outside the sampled byte ranges. Full-file hashing for critical subsets (a future `--full-file` option) is not available yet; issues are welcome.\n* **Pre-commit hooks**:\n\n  * `guard_seed` fails if you didn’t run `reprokit seed`.\n  * `check_manifest` warns if `repro/manifest.json` is older than 24h (configurable via an environment variable).\n* **Conda vs Poetry vs pip**: `env-freeze` auto-detects the manager; if multiple are installed, Poetry takes precedence by design. Override with `--manager pip|conda|poetry|uv`.\n\n---\n\n## 🧪 Development\n\n```\npython -m venv .venv\n.\\.venv\\Scripts\\activate.bat\npip install -e . 
ruff mypy pytest pre-commit\npre-commit install\n\nruff check .\nruff format --check .\nmypy src\npytest -q\n```\n\n---\n\n## 🤝 Contributing\n\nIssues and PRs welcome! Good first issues: `--no-cache`, full-file hashing per path, manifest verification command, MLflow/DVC plugins.\n\n---\n\n## 📜 License\n\nMIT — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdulvahapmutlu%2Freprokit-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabdulvahapmutlu%2Freprokit-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabdulvahapmutlu%2Freprokit-ml/lists"}