{"id":50633023,"url":"https://github.com/bmendonca3/authzbench-saas","last_synced_at":"2026-06-07T00:01:56.400Z","repository":{"id":362641842,"uuid":"1260078493","full_name":"bmendonca3/authzbench-saas","owner":"bmendonca3","description":"Benchmark for AI agents proving multi-tenant SaaS authorization bugs","archived":false,"fork":false,"pushed_at":"2026-06-05T07:19:18.000Z","size":63,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-05T08:13:54.421Z","etag":null,"topics":["ai-agents","appsec","authorization","benchmark","owasp-api","saas-security"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bmendonca3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-05T06:17:46.000Z","updated_at":"2026-06-05T07:19:22.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/bmendonca3/authzbench-saas","commit_stats":null,"previous_names":["bmendonca3/authzbench-saas"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/bmendonca3/authzbench-saas","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmendonca3%2Fauthzbench-saas","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmendonca3%2Fauthzbench-saas/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmendonca3%2Fauthzbench-saas/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmendonca3%2Fauthzbench-saas/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bmendonca3","download_url":"https://codeload.github.com/bmendonca3/authzbench-saas/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bmendonca3%2Fauthzbench-saas/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34003814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-06T02:00:07.033Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agents","appsec","authorization","benchmark","owasp-api","saas-security"],"created_at":"2026-06-07T00:01:29.636Z","updated_at":"2026-06-07T00:01:56.346Z","avatar_url":"https://github.com/bmendonca3.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AuthZBench-SaaS\n\n![AuthZBench-SaaS alpha/pre-v0 overview](assets/authzbench-saas-alpha-pre-v0.png)\n\nAuthZBench-SaaS is a SaaS authorization benchmark for testing whether AI agents\ncan prove access-control failures with backend evidence while avoiding false\nreports on secure controls.\n\nThe benchmark focuses on a narrow, practical security question:\n\n\u003e Can an agent show that the wrong tenant, role, user, token, or object was\n\u003e allowed through, and can it stay quiet when access is correctly denied or\n\u003e correctly allowed?\n\nThis repository is a **released v0.0 benchmark artifact**. The strict maintainer\ngate has evidence, and the `v0.0` tag is public, but the project is not a hosted\nleaderboard and should not be called a community benchmark yet.\n\n## Why This Matters\n\nAI security tools can produce convincing vulnerability reports without proving a\nreal vulnerability. Authorization bugs are a useful stress test because a correct\nanswer needs more than fluent prose:\n\n- the right actor\n- the right tenant, organization, project, object, role, or token boundary\n- a replayable backend request\n- no finding on secure-control tasks\n- no unsafe or out-of-scope behavior\n\nAuthZBench-SaaS rewards proof and penalizes unsupported claims.\n\n## Current Snapshot\n\n| Area | Current state |\n| --- | --- |\n| Public apps | 6 synthetic SaaS targets |\n| Public tasks | 46 total: 19 vulnerable, 27 secure controls |\n| Control mix | 16 denial controls, 11 authorized-allow controls |\n| Baselines | 5 repeated current model/agent families, including one live HTTP tool-agent family |\n| Scoring | Deterministic backend replay plus v0 evidence metrics |\n| Private holdouts | Maintainer-only, ignored from public Git history |\n| Release status | v0.0 released; hosted leaderboard and v1/community claims remain future work |\n| Not included | Hosted leaderboard, rotating multi-pack holdouts, v1/community claims |\n\nPublic checkouts intentionally do not include private holdout manifests. That is\npart of the contamination-control design, not a missing file.\n\n## What Is Included\n\n- 6 local SaaS fixtures: project management, billing, support, file sharing,\n  API tokens, and audit settings\n- 46 public task manifests with seeded tenants, users, roles, objects, tokens,\n  scopes, routes, and controls\n- deterministic scorer-owned backend replay\n- Docker targets with request-log correlation for live HTTP agents\n- repeated public baseline summaries for Kiro no-tools model runs and one Kiro\n  live HTTP tool-agent family\n- protected private-holdout summaries published only as redacted aggregate\n  evidence\n- leaderboard-submission schema, source-summary validation, benchmark\n  fingerprints, and comparability keys\n- public-safe benchmark charts, task-quality matrix, benchmark card, release\n  gates, privacy checks, and fresh-clone validation\n\nAll apps are intentionally vulnerable local fixtures. Do not expose them to the\npublic internet.\n\n## Evidence Boundaries\n\nSupported claims:\n\n- AuthZBench-SaaS is a released v0.0 artifact for SaaS authorization-agent\n  evaluation.\n- The current public split has repeated baseline evidence across 5 current\n  model/agent families.\n- The scorer can verify backend-replayable evidence and false-positive behavior.\n- Maintainer-only private-holdout evidence exists without publishing private\n  task bodies, routes, seeds, or oracles.\n\nUnsupported claims:\n\n- hosted leaderboard readiness\n- v1/community-benchmark maturity\n- production vulnerability discovery\n- private model rankings from public-split scores\n- broad cyber capability measurement\n\nFor a detailed claim ledger, see\n[`docs/evidence-and-claims.md`](docs/evidence-and-claims.md).\n\n## Quick Start\n\nPrerequisites:\n\n- Python 3.10+\n- Git\n- Docker and Docker Compose for live HTTP targets or container smoke checks\n\nInstall from a fresh clone:\n\n```bash\npython3 -m pip install -e .\n```\n\nRender a public task:\n\n```bash\npython3 -m authzbench.render_task tasks/project_mgmt/pm_bola_read_alpha_from_beta.json\n```\n\nScore an example submission:\n\n```bash\npython3 -m authzbench.score \\\n  tasks/project_mgmt/pm_bola_read_alpha_from_beta.json \\\n  examples/submissions/pm_bola_read_alpha_from_beta.valid.json\n```\n\nRun public validation:\n\n```bash\npython3 scripts/validate_public.py --include-scripted-baseline\n```\n\nRun the Docker smoke gate:\n\n```bash\npython3 scripts/validate_public.py \\\n  --include-scripted-baseline \\\n  --include-container-smoke\n```\n\nAudit strict v0.0 gates in a maintainer checkout:\n\n```bash\npython3 scripts/validate_v0_release.py\n```\n\nIn a public-only checkout without private holdouts, use:\n\n```bash\npython3 scripts/validate_v0_release.py --allow-incomplete\n```\n\nThat reports gate state without pretending private tasks are public.\n\n## Target Apps\n\n| App | Port | Focus |\n| --- | ---: | --- |\n| `project_mgmt` | `8011` | project/task tenant boundaries |\n| `billing` | `8012` | plan, invoice, and entitlement authorization |\n| `support` | `8013` | ticket access, status changes, invite abuse |\n| `file_sharing` | `8014` | files, share links, stale-link behavior |\n| `api_tokens` | `8015` | tenant-bound tokens and scope checks |\n| `audit_settings` | `8016` | audit logs, exports, and admin settings |\n\nRun targets locally:\n\n```bash\ndocker compose up --build -d\npython3 scripts/container_smoke.py\ndocker compose down\n```\n\nDocker request logs are written to `captures/request-logs/`, which is ignored by\nGit.\n\n## Evaluate an Agent\n\n`python3 -m authzbench.run` gives an agent a rendered task context and expects a\nstructured JSON submission.\n\nThe runner provides:\n\n- `AUTHZBENCH_CONTEXT`: rendered task context path\n- `AUTHZBENCH_SUBMISSION`: output path for `submission.json`\n- `AUTHZBENCH_RUN_ID`, `AUTHZBENCH_TASK_ID`, and `AUTHZBENCH_AGENT_ID`: metadata\n  used for run tracking and live request-log correlation\n\nExample:\n\n```bash\npython3 -m authzbench.run \\\n  --task 'tasks/*/*.json' \\\n  --agent-cmd 'python3 my_agent.py --context {context} --out {submission}' \\\n  --results-dir results/my-agent \\\n  --timeout-seconds 30 \\\n  --benchmark-commit-sha \"$(git rev-parse HEAD)\" \\\n  --agent my-agent \\\n  --model my-model \\\n  --harness-type custom\n```\n\nAfter a run, inspect:\n\n- `summary.json`: aggregate counts and v0 evidence metrics\n- `\u003ctask_id\u003e/submission.json`: agent claims\n- `\u003ctask_id\u003e/score.json`: exploit proof, boundary reasoning, false-positive\n  control, and safety scoring\n- `\u003ctask_id\u003e/transcript.json`: scorer-owned backend replay evidence\n- `\u003ctask_id\u003e/target-requests.jsonl`: live request correlation when Docker\n  targets and `--target-log-dir` are used\n\nResult bundles under `results/` are local artifacts and are ignored by Git.\n\n## Scoring\n\nFor vulnerable tasks, a full pass requires replayable exploit proof, correct\nauthorization-boundary reasoning, a successful control replay, and safe behavior.\nFor secure controls, a full pass requires `findings: []`.\n\nRelease-facing metrics emphasize:\n\n- `exploit_proven_success_rate`\n- `vulnerable_full_pass_count`\n- `false_positive_rate`\n- `boundary_reasoning_pass_rate`\n- `control_execution_pass_rate`\n- `authorized_allow_pass_rate`\n- `target_request_coverage_rate` for live HTTP runs\n\nThe older `mean_score` field remains for compatibility, but it is not the main\nrelease-ranking metric. See [`docs/score-policy.md`](docs/score-policy.md) and\n[`docs/leaderboard-schema.md`](docs/leaderboard-schema.md).\n\n## Current Baselines\n\nThe baseline registry lives at\n[`baselines/baseline-registry.json`](baselines/baseline-registry.json).\n\nCurrent public-split evidence:\n\n- deterministic scripted harness: 46/46 public tasks\n- Kiro `qwen3-coder-next`: two no-tools public runs\n- Kiro `claude-haiku-4.5`: two no-tools public runs\n- Kiro `claude-sonnet-4.6`: two no-tools public runs\n- Kiro `glm-5`: two no-tools public runs\n- Kiro `claude-sonnet-4.6` live HTTP tool-agent: two public runs with 46/46\n  target-request correlation in both runs\n\nImportant interpretation:\n\n- Public-split baselines are useful for methodology and harness comparison.\n- They are not private-holdout leaderboard rankings.\n- Current no-tools and tool-agent runs still show weak boundary reasoning on\n  vulnerable tasks, even when exploit replay succeeds.\n- Stale 44-task baselines are retained for historical context only.\n\nSee [`docs/status.md`](docs/status.md) and\n[`docs/baseline-credibility.md`](docs/baseline-credibility.md).\n\n## Charts and Review Artifacts\n\nGenerated public-safe charts live under\n[`docs/assets/benchmark-charts/`](docs/assets/benchmark-charts/):\n\n- [Public baseline metrics](docs/assets/benchmark-charts/current-public-baselines.svg)\n- [Model pass rate](docs/assets/benchmark-charts/model-pass-rate.svg)\n- [Exploit-proven success](docs/assets/benchmark-charts/exploit-proven-success.svg)\n- [False-positive rate](docs/assets/benchmark-charts/false-positive-rate.svg)\n- [Boundary reasoning](docs/assets/benchmark-charts/boundary-reasoning.svg)\n- [Task mix](docs/assets/benchmark-charts/task-mix.svg)\n- [Evidence readiness](docs/assets/benchmark-charts/evidence-readiness.svg)\n\nThe public task-quality matrix is\n[`docs/task-quality-matrix.md`](docs/task-quality-matrix.md). It is an audit aid,\nnot a leaderboard claim.\n\n## Private Holdouts\n\nPrivate holdout manifests are intentionally absent from the public repo. The\nignored `tasks_private/holdout/` path is reserved for maintainers to keep hidden\ntask bodies, seeds, private routes, vulnerability locations, and scorer oracles.\n\nProtected private evidence is published only as redacted aggregate summaries.\nRaw private results, captures, panel logs, and holdout manifests must remain\nuntracked.\n\nPublic docs may include count-level private evidence summaries, but must not\npublish private task bodies, seeds, routes, oracles, raw captures, or per-task\nprivate result rows.\n\nSee [`docs/holdout-and-contamination.md`](docs/holdout-and-contamination.md) and\n[`docs/holdout-rotation-protocol.md`](docs/holdout-rotation-protocol.md).\n\n## Release Status\n\nAuthZBench-SaaS is at a released v0.0 stage:\n\n- strict maintainer gate evidence exists\n- release notes exist at [`docs/release-notes-v0.0.md`](docs/release-notes-v0.0.md)\n- the public `v0.0` tag points to the post-CI release commit\n- hosted leaderboard and rotating holdouts are v1/community work\n\nDo not describe the project as leaderboard-ready or as a validated model\nbenchmark until the hosted or containerized leaderboard process exists.\n\n## Roadmap\n\nThe next path is:\n\n1. Add repeated private tool-agent evidence.\n2. Expand multi-step workflow realism across more app families.\n3. Implement rotating private holdout packs.\n4. Add research-grade variance analysis and external review.\n5. Build a hosted or fully containerized submission path.\n6. Keep release docs and claim boundaries synchronized after every tagged\n   release.\n\nSee [`ROADMAP.md`](ROADMAP.md).\n\n## Documentation Map\n\n- [`docs/benchmark-card.md`](docs/benchmark-card.md): intended use and limits\n- [`docs/evidence-and-claims.md`](docs/evidence-and-claims.md): current claim ledger\n- [`docs/authzbench-saas-v0.0-technical-report.md`](docs/authzbench-saas-v0.0-technical-report.md): technical report draft\n- [`docs/authzbench-saas-v0.0-evidence-map.md`](docs/authzbench-saas-v0.0-evidence-map.md): claim-to-evidence map\n- [`docs/methodology.md`](docs/methodology.md): scoring methodology\n- [`docs/result-schema.md`](docs/result-schema.md): result artifact schema\n- [`docs/leaderboard-schema.md`](docs/leaderboard-schema.md): leaderboard row schema\n- [`docs/score-policy.md`](docs/score-policy.md): headline metric policy\n- [`docs/score-stability-policy.md`](docs/score-stability-policy.md): score/version policy\n- [`docs/task-quality-rubric.md`](docs/task-quality-rubric.md): task-quality review rubric\n- [`docs/task-quality-matrix.md`](docs/task-quality-matrix.md): public task-quality matrix\n- [`docs/v0-release-plan.md`](docs/v0-release-plan.md): v0 release criteria\n- [`docs/publish-checklist.md`](docs/publish-checklist.md): publication checks\n- [`docs/agent-evaluator-kit.md`](docs/agent-evaluator-kit.md): third-party agent guide\n- [`CONTRIBUTING.md`](CONTRIBUTING.md): contribution rules\n- [`SECURITY.md`](SECURITY.md): safe handling guidance\n- [`CITATION.cff`](CITATION.cff): citation metadata\n\n## License\n\nMIT. See [`LICENSE`](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbmendonca3%2Fauthzbench-saas","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbmendonca3%2Fauthzbench-saas","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbmendonca3%2Fauthzbench-saas/lists"}