https://github.com/basilisk-labs/agentplane-bitgn-adapter
BitGN adapter for benchmarking AgentPlane-backed Codex execution
https://github.com/basilisk-labs/agentplane-bitgn-adapter
agentplane agents benchmark bitgn codex
Last synced: 6 days ago
JSON representation
BitGN adapter for benchmarking AgentPlane-backed Codex execution
- Host: GitHub
- URL: https://github.com/basilisk-labs/agentplane-bitgn-adapter
- Owner: basilisk-labs
- License: mit
- Created: 2026-05-14T14:25:45.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-05-14T15:06:40.000Z (about 1 month ago)
- Last Synced: 2026-05-14T16:41:58.202Z (about 1 month ago)
- Topics: agentplane, agents, benchmark, bitgn, codex
- Language: Python
- Homepage: https://bitgn.com
- Size: 48.8 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# AgentPlane BitGN Adapter
[](https://github.com/basilisk-labs/agentplane-bitgn-adapter/actions/workflows/ci.yml)
Adapter scaffold for running AgentPlane-backed Codex execution against BitGN
benchmarks.
AgentPlane is not submitted as a model. It is used as a control-plane profile
around an executor:
- benchmark runtime: BitGN PCM or ECOM
- executor: Codex CLI
- control layer: policy, step loop, proof bundle, score-detail capture
## Status
Experimental. Current proven coverage is deliberately narrow:
- `bitgn/sandbox t01`: pass, score `1.00`.
- `bitgn/pac1-dev t01`: pass, score `1.00`.
- `bitgn/ecom1-dev t01`: pass, score `1.00`.
All non-`t01` PAC1 and ECOM1 tasks are not passing in current evidence and must
be treated as failing/unsupported until a live run proves otherwise. This repo
is not leaderboard-ready.
## Why this exists
BitGN evaluates observable agent behavior: runtime tool calls, files, task
state, side effects, outcome codes, compliance, and security posture. That is
the same surface where AgentPlane can add value: bounded policy, traceability,
explicit outcomes, and failure evidence.
The near-term goal is not "AgentPlane beats everyone". The useful public claim
is narrower:
> AgentPlane can wrap a strong executor, preserve BitGN benchmark validity, and
> produce auditable evidence for why trials passed or failed.
## Install
```bash
make sync
```
Install BitGN SDK dependencies from the same Buf registry used by the upstream
samples:
```bash
make sync-bitgn
```
The SDK currently tracks Python 3.14 in the sample agents, so the Make targets
create a Python 3.14 uv environment.
## Authentication
Codex can use ChatGPT subscription auth:
```bash
codex login
codex login status
```
That path is useful for local smoke runs because the adapter invokes `codex
exec`. For reproducible public runs, API-key auth is still cleaner because it is
easier to document and recreate in CI or another machine.
BitGN official runs still need:
```bash
export BITGN_API_KEY="..."
```
## PAC1 smoke
```bash
cp .env.example .env.local
$EDITOR .env.local
make oauth
make sandbox
```
`scripts/bitgn_smoke.sh` loads `.env` and then `.env.local`; keep secrets in
one of those ignored files, not in committed config.
Sandbox is the first end-to-end check because it does not require a BitGN
Platform key. PAC1 is the next check:
```bash
make pac1
```
## ECOM smoke
Set:
```bash
BENCHMARK_ID=bitgn/ecom1-dev
BITGN_RUNTIME=ecom
```
Then run a single task:
```bash
make ecom
```
## Proof bundle
Each trial writes:
```text
.agentplane-bitgn/////
AGENTS.md
proof.json
```
The proof bundle captures:
- benchmark id and runtime
- model id
- task id and trial id
- each JSON tool command requested by Codex
- runtime observations, truncated for readability
- final status
## Documentation
- [Runbook](docs/runbook.md)
- [Test strategy](docs/test-strategy.md)
- [Coverage matrix](docs/coverage.md)
- [Leaderboard plan](docs/leaderboard.md)
- [Evidence report template](docs/evidence-template.md)
- [Cost notes](docs/cost.md)
- [OAuth notes](docs/oauth.md)
- [Smoke results](docs/test-results.md)
## Leaderboard realism
PAC1 live already has multiple 104/104 runs. A naive scaffold is unlikely to
stand out there. The best AgentPlane path is:
1. Use PAC1 DEV to harden outcome selection, grounding refs, structured writes,
and injection refusal.
2. Mine `score_detail` into regression cases.
3. Move to ECOM1, where policy books, payment state, SQL, fraud controls, and
audit trails are closer to AgentPlane's control-plane strengths.
4. Publish a proof-backed run rather than only a score screenshot.
## Integrity rules
Do not:
- fetch benchmark solutions from the internet;
- inspect hidden graders or oracle solutions;
- alter BitGN scoring, task sets, or runtime contracts;
- inject task-specific hints into the adapter policy;
- claim leaderboard readiness without a reproducible run id and proof bundle.