{"id":51302352,"url":"https://github.com/zer0contextlost/graft","last_synced_at":"2026-06-30T21:01:58.718Z","repository":{"id":359093056,"uuid":"1244155244","full_name":"zer0contextlost/graft","owner":"zer0contextlost","description":"Detect foreign code grafted into your supply chain. Byte-level anomaly detection for CI/CD pipelines.","archived":false,"fork":false,"pushed_at":"2026-05-20T10:11:16.000Z","size":43,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-20T14:29:46.913Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zer0contextlost.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-20T02:33:48.000Z","updated_at":"2026-05-20T10:11:22.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/zer0contextlost/graft","commit_stats":null,"previous_names":["zer0contextlost/graft"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/zer0contextlost/graft","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zer0contextlost%2Fgraft","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zer0contextlost%2Fgraft/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zer0contextlost%2Fgraft/releases","manifests_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/zer0contextlost%2Fgraft/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zer0contextlost","download_url":"https://codeload.github.com/zer0contextlost/graft/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zer0contextlost%2Fgraft/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34983171,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-30T02:00:05.919Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-30T21:01:56.445Z","updated_at":"2026-06-30T21:01:58.706Z","avatar_url":"https://github.com/zer0contextlost.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GRAFT\n\nDetect foreign code grafted into your supply chain.\n\nGRAFT is a CI tool that trains a byte-level compression model on your repo's\nown source history, then scores every PR-changed file for byte patterns that\ndeviate from the learned baseline. No signature database. No rule engine.\nNo labeled attack examples.\n\n---\n\n## How it works\n\nA lossless compression model assigns short codes to sequences it has learned\nto predict and long codes to sequences it has not. 
GRAFT trains an autoregressive\nnext-byte predictor on your repo's benign-only source files, then measures\n**bits-per-byte (BPB)** on each changed file at PR time:\n\n- **Low BPB** — model is not surprised — byte patterns are consistent with your codebase\n- **High BPB** — model is surprised — byte patterns deviate from your codebase\n\nSupply chain attacks graft foreign content into legitimate files: obfuscated\neval chains, base64-encoded payloads, binary blobs stuffed into comments,\nminified implants in build scripts. These all look radically different from\nnormal source code at the byte level. GRAFT catches them because they score\nhigh BPB against a model trained on nothing but your own code.\n\nThe underlying architecture is ByteFlow Net (Deng et al., ICLR 2026), implemented\nin [SUBSTRATE](https://github.com/zer0contextlost/substrate). GRAFT adapts it as\na per-file anomaly scorer rather than a language model.\n\n---\n\n## Detection results\n\nTrained on ~46 KB of Python and YAML source (3000 steps, ~3 min on `ubuntu-latest`):\n\n| Payload type | Max BPB | Sigma above baseline | Verdict |\n|---|---|---|---|\n| Normal Python source | 2.31 | +2.6 | MEDIUM |\n| Normal YAML workflow | 1.97 | +1.8 | CLEAN |\n| Base64 blob embedded in .py | 9.64 | **+17.9** | CRITICAL |\n| Eval obfuscation chain | 11.89 | **+22.6** | CRITICAL |\n| Null-byte stuffed config | 13.33 | **+25.6** | CRITICAL |\n| Minified / obfuscated JS | 9.04 | **+16.6** | CRITICAL |\n\nBaseline: mean=1.09 BPB, std=0.48, threshold=2.52 (3σ).\nAll four attack types score roughly 17–26σ above the benign baseline, with no overlap\nbetween attack and benign scores.\n\n---\n\n## Quick start\n\n### GitHub Actions\n\nAdd to `.github/workflows/graft.yml` in any repo you want to protect:\n\n```yaml\nname: GRAFT Supply Chain Scan\n\non:\n  pull_request:\n    types: [opened, synchronize, reopened]\n\npermissions:\n  pull-requests: write\n  contents: read\n\njobs:\n  graft-scan:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n   
     with:\n          fetch-depth: 0\n\n      - uses: zer0contextlost/graft@v1\n        with:\n          github-token: ${{ secrets.GITHUB_TOKEN }}\n```\n\nFirst PR after setup trains the baseline model (~3 min) and caches it.\nSubsequent PRs reuse the cache. Model retrains weekly automatically.\n\n### Gitea Actions\n\nCopy `ci/gitea-workflow.yml` to `.gitea/workflows/graft.yml`.\nRequires Gitea 1.21+ with Actions enabled.\n\n### GitLab CI\n\nCopy `ci/gitlab-ci.yml` into `.gitlab-ci.yml` (or include from it).\nSet `GITLAB_TOKEN` as a CI/CD variable with `api` scope.\n\n### Generic CI (Jenkins, Drone, Woodpecker, Forgejo, pre-push hook)\n\n```bash\nexport BASE_SHA=\"$(git merge-base origin/main HEAD)\"\nexport HEAD_SHA=\"$(git rev-parse HEAD)\"\nbash ci/generic.sh\n```\n\nTo post comments on a Gitea-compatible forge (Forgejo, Codeberg, etc.):\n\n```bash\nexport PR_NUMBER=42\nexport REPO=\"owner/myrepo\"\nexport API_TOKEN=\"your-token\"\nexport API_URL=\"https://forgejo.example.com/api/v1/repos/{repo}/issues/{pr}/comments\"\nbash ci/generic.sh\n```\n\n---\n\n## Configuration\n\n| Input | Default | Description |\n|---|---|---|\n| `github-token` | required | `secrets.GITHUB_TOKEN` — for posting PR comments |\n| `threshold-sigma` | `3.0` | Flag files with BPB \u003e `mean + sigma * std`. Lower = more sensitive |\n| `train-steps` | `3000` | Training steps when building baseline. 
3000 is sufficient for most repos |\n| `max-files` | `50` | Max changed files to scan per PR |\n| `fail-on-anomaly` | `false` | Set to `true` to fail the check when HIGH/CRITICAL files are found |\n\nSeverity tiers:\n\n| Tier | Condition |\n|---|---|\n| CRITICAL | BPB \u003e baseline + 5σ |\n| HIGH | BPB \u003e baseline + 3σ |\n| MEDIUM | BPB \u003e baseline + 2σ |\n| CLEAN | within 2σ |\n\n---\n\n## What GRAFT scans\n\nChanged files with these extensions are scored:\n\n**Source code** — `.py` `.js` `.ts` `.mjs` `.jsx` `.tsx` `.go` `.rs` `.java` `.kt` `.scala` `.c` `.cpp` `.h` `.rb` `.php` `.lua` `.swift`\n\n**Build / CI / infra** — `.sh` `.bash` `.ps1` `.yml` `.yaml` `.mk` `.cmake` `.tf` `.hcl` `.dockerfile` and named files `Makefile` `Dockerfile` `Gemfile` `Pipfile`\n\n**Package manifests and lockfiles** — `.toml` `.lock` `.json` `.xml` `.gradle` `.gemspec` `requirements.txt`\n\nBinary files, files under 64 bytes, and files over 512 KB are skipped automatically.\n\n---\n\n## PR comment format\n\nGRAFT posts a comment on each scanned PR with a ranked table of findings:\n\n```\n## GRAFT Supply Chain Scan\n\nBaseline: mean=1.086 std=0.478 threshold=2.520 BPB (3.0s) | 8 file(s) scanned\n\n3 file(s) flagged:\n\n| Severity | File | Max BPB | Mean BPB | Sigma | Windows |\n|----------|------|---------|---------|-------|---------|\n| CRITICAL | src/utils.py | 9.641 | 8.804 | +17.9 | 4 |\n| HIGH     | setup.py     | 3.812 | 3.104 | +5.7  | 2 |\n| MEDIUM   | Makefile     | 2.683 | 2.401 | +3.3  | 1 |\n\n5 file(s) within normal range.\n```\n\n---\n\n## How the baseline is built\n\n1. `git ls-files` enumerates all source files tracked at HEAD\n2. Binary files and files under 64 bytes are excluded\n3. Files are packed into a binary corpus (SUBSTRATE format)\n4. A MICRO model (~3M params, T=512, K=64) trains for `train-steps` steps on CPU\n5. A random 20% holdout calibrates `baseline_mean` and `baseline_std`\n6. 
The checkpoint is cached with a weekly key; retrained weekly automatically\n\nThe model learns your repo's byte distribution — not generic source code patterns.\nA repo that mixes Python, Rust, and YAML will have a baseline that reflects all three.\nA repo that's all Go will have a Go-specific baseline.\n\n---\n\n## Limitations\n\n**False positives:** generated files (protobuf output, minified vendored JS, binary test\nfixtures) will score high even when legitimate. Add a `.graftignore` to exclude those\npaths, or raise `threshold-sigma`. Note that `max-files` only caps how many changed files\nare scanned per PR; it does not exclude specific paths.\n\n**False negatives:** attacks that carefully mimic the repo's existing code style at the\nbyte level — e.g., a backdoor that uses the same variable naming, whitespace conventions,\nand import patterns as the surrounding file — may score within the normal range. GRAFT\nis a surprise-based detector, not a semantic one.\n\n**Cold start:** the first PR after setup trains the model. For repos with fewer than\n~4 KB of source, training may not produce a meaningful baseline.\n\n**Threshold is per-repo:** the default 3σ threshold is calibrated against the training\ncorpus holdout. Repos with highly heterogeneous file types (mixing minified CSS with Go\nsource, for instance) may see elevated false positive rates until the model has enough\ntraining data to learn both distributions.\n\n---\n\n## Architecture\n\nGRAFT uses the SUBSTRATE framework, which implements ByteFlow Net\n(Deng et al., ICLR 2026 — [arXiv:2603.03583](https://arxiv.org/abs/2603.03583))\nas an anomaly scoring engine. 
The MICRO config used here:\n\n| Parameter | Value |\n|---|---|\n| Params | ~3.2M |\n| Window (T) | 512 bytes |\n| Chunk positions (K) | 64 |\n| Local dim | 128 |\n| Global dim | 256 |\n| Layers | 1 / 2 / 1 (enc / global / dec) |\n| Training | CPU, AdamW, cosine LR, ~3 min |\n\nFull architecture documentation: [SUBSTRATE](https://github.com/zer0contextlost/substrate)\n\n---\n\n*zer0contextlost — 2026*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzer0contextlost%2Fgraft","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzer0contextlost%2Fgraft","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzer0contextlost%2Fgraft/lists"}