{"id":48109382,"url":"https://github.com/labelbox/multichallenge_rubric_generation","last_synced_at":"2026-04-04T16:00:41.608Z","repository":{"id":341144306,"uuid":"1157419333","full_name":"Labelbox/multichallenge_rubric_generation","owner":"Labelbox","description":null,"archived":false,"fork":false,"pushed_at":"2026-02-13T20:22:04.000Z","size":533,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-28T12:37:43.549Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Labelbox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-13T19:51:12.000Z","updated_at":"2026-02-13T20:22:08.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Labelbox/multichallenge_rubric_generation","commit_stats":null,"previous_names":["labelbox/multichallenge_rubric_generation"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Labelbox/multichallenge_rubric_generation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labelbox%2Fmultichallenge_rubric_generation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labelbox%2Fmultichallenge_rubric_generation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labelbox%2Fmultichallenge_rubric_generation/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labelbox%2Fmultichallenge_rubric_generation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Labelbox","download_url":"https://codeload.github.com/Labelbox/multichallenge_rubric_generation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Labelbox%2Fmultichallenge_rubric_generation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31405191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-04T16:00:26.097Z","updated_at":"2026-04-04T16:00:41.589Z","avatar_url":"https://github.com/Labelbox.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Rubric Generation Pipeline\n\nA generator-validator loop for producing high-quality synthetic rubrics for RL training.\n\n## Architecture Overview\n\n```\n┌─────────────────────────────────────────────────────────────────────────────┐\n│                              PIPELINE FLOW                                   │\n├─────────────────────────────────────────────────────────────────────────────┤\n│                                                     
                        │\n│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐          │\n│  │ GENERATE │ ──▶ │  GATE 1  │ ──▶ │  GATE 2  │ ──▶ │  GATE 3  │          │\n│  │  Opus    │     │ (Auto)   │     │  Opus    │     │  Opus    │          │\n│  │(high tmp)│     │40-60% rej│     │40-60%pass│     │60-80%pass│          │\n│  └──────────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘          │\n│                        │                │                 │                │\n│                        ▼                ▼                 ▼                │\n│                   ┌─────────┐     ┌───────────┐     ┌───────────┐         │\n│                   │ FAILED  │     │ BORDERLINE│     │ VALIDATED │         │\n│                   └─────────┘     └─────┬─────┘     └───────────┘         │\n│                                         │                                  │\n│                                         ▼                                  │\n│                                   ┌───────────┐                           │\n│                                   │  REFINE   │ ◀── both_pass failures    │\n│                                   │   Opus    │                           │\n│                                   └─────┬─────┘                           │\n│                                         │                                  │\n│                                         ▼                                  │\n│                                   ┌───────────┐                           │\n│                                   │ RECOVERED │ ──▶ Gates 2-3 again       │\n│                                   └───────────┘                           │\n│                                                                             │\n└─────────────────────────────────────────────────────────────────────────────┘\n```\n\n**Goal: Generate exactly 6 validated rubrics per example with diversity-aware generation**\n\n## Repo Workflow, Components, and End-to-End 
Flow\n\nThis repo is set up as a **generator → multi-gate validator → refinement loop** because rubric quality is harder to verify than to draft. We bias toward **high recall early** (generate many candidates) and **high precision late** (strict gates), then recover borderline cases instead of discarding useful signal.\n\n### How the Repo Works (At a Glance)\n\n```mermaid\nflowchart TD\n  A[Inputs: preference pairs\u003cbr/\u003edata/examples/] --\u003e B[Generator\u003cbr/\u003esrc/agents/generator.py]\n  B --\u003e C[Gate 1: Structural\u003cbr/\u003esrc/agents/structural.py]\n  C --\u003e|pass| D[Gate 2: Semantic\u003cbr/\u003esrc/agents/semantic.py]\n  C --\u003e|fail| Cx[\"Rejected pool\u003cbr/\u003e(in memory)\"]\n  D --\u003e|pass| E[Gate 3: Discrimination\u003cbr/\u003esrc/agents/discriminator.py]\n  D --\u003e|borderline| Dx[\"Borderline pool\u003cbr/\u003e(in memory)\"]\n  D --\u003e|fail| Dy[\"Rejected pool\u003cbr/\u003e(in memory)\"]\n  E --\u003e|pass| F[\"Validated rubrics\u003cbr/\u003edata/validated/\"]\n  E --\u003e|fail| Ex[\"Gate 3 failures\u003cbr/\u003e(in memory)\"]\n  Dx --\u003e H[Refiner\u003cbr/\u003esrc/agents/refiner.py]\n  Ex --\u003e H\n  H --\u003e D\n  H --\u003e E\n  F --\u003e M[Metrics\u003cbr/\u003edata/metrics/]\n  E --\u003e M\n  P[pipeline.log] -.-\u003e M\n```\n\n### Components and Responsibilities\n\n```mermaid\nflowchart LR\n  subgraph Entrypoint[\"run.py\"]\n    EP[CLI with subcommands:\u003cbr/\u003erun, metrics, check-imports, clean]\n  end\n  subgraph Agents[\"src/agents/\"]\n    G1[generator.py - Candidate generation]\n    V1[structural.py - Gate 1]\n    V2[semantic.py - Gate 2]\n    V3[discriminator.py - Gate 3]\n    R1[refiner.py - Refinement loop]\n    O1[orchestrator.py - Pipeline orchestration]\n  end\n  subgraph Core[\"src/\"]\n    C1[client.py - AnthropicClient]\n    C2[config.py - PipelineConfig]\n    C3[models.py - Data models]\n  end\n  subgraph Data[\"data/\"]\n    D1[examples/]\n    D3[validated/]\n    
D4[metrics/]\n  end\n  D1 --\u003e EP\n  EP --\u003e O1\n  O1 --\u003e G1\n  O1 --\u003e V1\n  O1 --\u003e V2\n  O1 --\u003e V3\n  O1 --\u003e R1\n  V3 --\u003e D3\n  EP --\u003e D4\n```\n\n### End-to-End Workflow (Step-by-Step)\n\n```mermaid\nsequenceDiagram\n  autonumber\n  participant U as User\n  participant P as run.py\n  participant O as Orchestrator\n  participant G as Generator\n  participant S as Gate 1 (Structural)\n  participant M as Gate 2 (Semantic)\n  participant D as Gate 3 (Discrimination)\n  participant R as Refiner\n  participant Out as Output\n\n  U-\u003e\u003eP: Run pipeline on examples\n  P-\u003e\u003eO: Initialize PipelineOrchestrator\n  O-\u003e\u003eG: Generate N candidate rubrics\n  G--\u003e\u003eO: candidates kept in memory\n  O-\u003e\u003eS: Structural checks (fast, deterministic)\n  S--\u003e\u003eO: pass/fail + routing\n  O-\u003e\u003eM: Semantic checks (Claude Opus)\n  M--\u003e\u003eO: pass/fail + borderline\n  O-\u003e\u003eD: Discrimination checks (Claude Opus)\n  D--\u003e\u003eO: pass/fail + both-pass cases\n  O-\u003e\u003eR: Refine borderline and both-pass failures\n  R--\u003e\u003eO: improved candidates\n  O-\u003e\u003eM: Re-check semantics\n  O-\u003e\u003eD: Re-check discrimination\n  D--\u003e\u003eOut: Save validated rubrics + metrics\n```\n\n### Why It's Set Up This Way\n\n- **Generate wide, filter hard**: High-quality rubrics are rare; generating many candidates increases odds of capturing strong discriminators.\n- **Cheap checks first**: Structural validation is deterministic and fast, preventing wasted LLM calls on malformed rubrics.\n- **Semantic before discrimination**: A rubric must be clear and verifiable before we test whether it separates preferred vs. 
rejected.\n- **Discrimination is the core signal**: The final gate ensures rubrics reflect the actual preference delta, which is what RL training needs.\n- **Refinement saves value**: Borderline candidates often become strong after targeted edits, reducing waste without lowering standards.\n\n## Agent Handoff (Comprehensive)\n\nUse this section to orient a coding agent quickly and accurately.\n\n### Source of Truth\n\n- **Primary entrypoint:** `run.py` (CLI with subcommands: `run`, `metrics`, `check-imports`, `clean`).\n- **Pipeline orchestration:** `src/agents/orchestrator.py` handles the full loop, concurrency, and I/O.\n- **API client:** `src/client.py` (`AnthropicClient`) — async Anthropic SDK with retry and structured output.\n- **Configuration:** `src/config.py` (`PipelineConfig`) — all constants in a frozen dataclass.\n\n### What Actually Runs (Code Map)\n\n- `run.py`: CLI entrypoint with subcommands.\n- `src/agents/orchestrator.py`: Pipeline orchestration, concurrency, and I/O.\n- `src/agents/generator.py`: Candidate generation (Claude Opus 4.6).\n- `src/agents/structural.py`: Deterministic structural checks (Gate 1).\n- `src/agents/semantic.py`: Semantic quality gate (Claude Opus 4.6).\n- `src/agents/discriminator.py`: Discrimination gate (Claude Opus 4.6).\n- `src/agents/refiner.py`: Refinement loop (Claude Opus 4.6).\n- `src/client.py`: Anthropic API client with retry, structured output, and concurrency control.\n- `src/config.py`: All pipeline constants (`PipelineConfig` frozen dataclass).\n- `src/models.py`: Pydantic data models.\n\n### Data Flow (Actual Outputs)\n\n- **Inputs:** `data/examples/*.json`\n- **Outputs:**\n  - `data/validated/\u003cexample_id\u003e.json` (single file containing array of 6 rubrics per example)\n  - `data/metrics/\u003cexample_id\u003e.json` and `data/metrics/summary.json`\n  - `pipeline.log` (log file in repo root)\n- **Not persisted by default:** intermediate candidates, gate pass/fail pools, and refinement queues are kept 
in memory.\n\n### Key Behaviors and Gotchas\n\n- **Normalization:** `afm_response` is auto-mapped to `rejected_response`; `metadata.rubric_criteria` becomes `rubric_hints`.\n- **Deduping:** Candidates are deduped by rubric text before gates.\n- **Gate order:** Gate 1 → Gate 2 → Gate 3; borderline + both-pass failures are refined.\n- **Randomization:** Gate 3 randomizes response order to avoid position bias.\n- **Diversity-aware generation:** After each Gate 3 round, the pipeline analyzes category coverage and targets underrepresented failure categories in subsequent generation rounds.\n- **Forced fill:** If an example is still short of 6 rubrics after max attempts, the pipeline force-fills using best failures or fallback templates/hints to reach 6 rubrics.\n- **Concurrency knobs:** `--parallel`, `--gate2-concurrency`, `--gate3-concurrency` (CLI args) + `PARALLEL_EXAMPLES`, `GATE2_CONCURRENCY`, `GATE3_CONCURRENCY` (env vars) + defaults in `src/config.py`.\n\n### Model and Configuration\n\nAll pipeline stages use **Claude Opus 4.6** (`claude-opus-4-6`), configured in `src/config.py`:\n\n```python\n@dataclass(frozen=True)\nclass PipelineConfig:\n    target_rubrics: int = 6\n    max_generation_rounds: int = 7\n    initial_candidates: int = 30\n    additional_candidates: int = 12\n    model: str = \"anthropic/claude-opus-4-6\"\n    # ... temperature/token settings per stage\n```\n\n### How to Run (CLI)\n\n```bash\nuv run python run.py run --examples data/examples/pref-0000.json\nuv run python run.py run --examples data/examples/ --parallel 12\nuv run python run.py metrics\nuv run python run.py check-imports\nuv run python run.py clean\n```\n\n## Quick Start\n\n### 1. Setup\n\n```bash\n# Navigate to project\ncd rubric_generation\n\n# Install dependencies\nuv sync\n\n# Set your API key\nexport LB_API_KEY='your-labelbox-key'\n\n# Or copy the example env file\ncp .env.example .env\n# Edit .env with your Labelbox key\n```\n\n### 2. 
Prepare Your Data\n\nPlace your examples in `data/examples/`:\n\n```json\n{\n  \"id\": \"pref-0000\",\n  \"prompt\": \"User's original prompt/conversation\",\n  \"preferred_response\": \"The better response\",\n  \"afm_response\": \"The worse response (rejected)\",\n  \"category_hint\": \"scheduling\",\n  \"metadata\": {\n    \"rubric_criteria\": [\n      \"Hint 1 from human annotators\",\n      \"Hint 2 from human annotators\"\n    ]\n  }\n}\n```\n\n**Note:** The pipeline accepts both `afm_response` and `rejected_response` fields.\n\n### 3. Run the Pipeline\n\n```bash\n# Single example\nuv run python run.py run --examples data/examples/pref-0000.json\n\n# All examples in directory\nuv run python run.py run --examples data/examples/\n\n# Custom output directory\nuv run python run.py run --examples data/examples/ --output results/\n```\n\n**High-throughput options (recommended for large datasets):**\n\n```bash\n# Shard across multiple terminals/machines\nuv run python run.py run --examples data/examples/ --shard 0/4\n\n# Resume mode (default on) skips completed examples\nuv run python run.py run --examples data/examples/ --resume\n\n# Override concurrency\nuv run python run.py run --examples data/examples/ --parallel 12 --gate2-concurrency 30 --gate3-concurrency 12\n```\n\n### 4. 
Check Results\n\n```bash\n# Validated rubrics\nls data/validated/\n\n# Pipeline metrics\nuv run python run.py metrics\n\n# Metrics for a specific example\nuv run python run.py metrics pref-0010\n```\n\n## Directory Structure\n\n```\nrubric_generation/\n├── run.py                  # CLI entrypoint (subcommands: run, metrics, check-imports, clean)\n├── pyproject.toml          # Dependencies (openai, pydantic, python-dotenv, tqdm)\n├── .env.example            # API key template\n│\n├── src/\n│   ├── client.py           # AnthropicClient (async, retry, structured output)\n│   ├── config.py           # PipelineConfig (frozen dataclass)\n│   ├── models.py           # Pydantic data models\n│   ├── agents/\n│   │   ├── orchestrator.py # Pipeline orchestration\n│   │   ├── generator.py    # Candidate generation\n│   │   ├── structural.py   # Gate 1: Structural validation\n│   │   ├── semantic.py     # Gate 2: Semantic validation\n│   │   ├── discriminator.py# Gate 3: Discrimination validation\n│   │   ├── refiner.py      # Refinement loop\n│   │   └── base.py         # Base agent class\n│   └── utils/\n│       └── json_parser.py  # JSON extraction from LLM output\n│\n├── data/\n│   ├── examples/           # Input: (prompt, preferred, rejected) tuples\n│   ├── validated/          # OUTPUT: Final validated rubrics\n│   └── metrics/            # Pipeline metrics\n│\n├── RUBRIC_STYLE_GUIDE.md   # Mined rubric principles (input to generator)\n├── pipeline.log            # Run logs (appended)\n└── README.md               # This file\n```\n\n## The Three Gates and Refinement Loop\n\n### Gate 1: Structural Validation\n- **Model**: Automated (no LLM calls)\n- **Target**: Reject 40-60%\n- **Checks**: Format, atomicity, self-containment, actionable language, anti-patterns\n\n### Gate 2: Semantic Validation\n- **Model**: Claude Opus 4.6\n- **Target**: Pass 40-60% of Gate 1 survivors\n- **Checks**: Quality checklist (self-containment, specificity 2-4, unambiguity, verifiability, meaningfulness)\n\n### Gate 
3: Discrimination Validation\n- **Model**: Claude Opus 4.6\n- **Target**: Pass 60-80% of Gate 2 survivors\n- **Checks**: Does the rubric correctly distinguish preferred from rejected?\n- **This is the most important gate**\n\n### Refinement Loop\n- **Model**: Claude Opus 4.6\n- **Input**: Borderline candidates + \"both pass\" failures\n- **Max cycles**: 2 per candidate\n- **Strategies**: Increase specificity, decrease specificity, fix ambiguity, strengthen discrimination\n\n## Key Metrics to Monitor\n\n| Metric | Target | Alert If |\n|--------|--------|----------|\n| Gate 1 rejection rate | 40-60% | \u003c30% (too clean) or \u003e70% (quality issue) |\n| Gate 2 pass rate | 40-60% | \u003c30% (semantic issues) |\n| Gate 3 pass rate | 60-80% | \u003c50% (discrimination problems) |\n| Inverted rate | \u003c5% | \u003e5% (fundamental generator issue) |\n| Refinement success | 20-40% | \u003c15% (refinement prompts need work) |\n| Avg discrimination strength | \u003e3.0 | \u003c2.5 (weak rubrics) |\n\n## Configuration\n\nAll pipeline constants are in `src/config.py` (`PipelineConfig` frozen dataclass):\n\n```python\n@dataclass(frozen=True)\nclass PipelineConfig:\n    target_rubrics: int = 6\n    max_generation_rounds: int = 7\n    initial_candidates: int = 30\n    additional_candidates: int = 12\n    max_total_candidates: int = 300\n    max_no_new_rounds: int = 3\n    max_seconds_per_example: int = 240\n\n    parallel_examples: int = 12\n    gate2_concurrency: int = 30\n    gate3_concurrency: int = 12\n\n    model: str = \"anthropic/claude-opus-4-6\"\n\n    generator_temperature: float = 0.8\n    gate2_temperature: float = 0.2\n    gate3_temperature: float = 0.1\n    refinement_temperature: float = 0.6\n\n    failure_categories: tuple[str, ...] 
= (\n        \"instruction-retention\",\n        \"inference-memory\",\n        \"version-editing\",\n        \"self-coherence\",\n    )\n```\n\n## Troubleshooting\n\n### High \"both_pass\" rate\n- Generation isn't targeting the actual contrast\n- Solution: Add more specific constraint prompts in generator\n\n### High inverted rate\n- Generator fundamentally misunderstands what makes preferred better\n- Solution: Re-analyze examples, check for ambiguous preferences\n\n### Low refinement success\n- Refinement strategies aren't effective\n- Solution: Add more targeted strategies, reduce max cycles, accept lower recovery\n\n## Output Format\n\nThe pipeline generates exactly **6 validated rubrics** per example in `data/validated/\u003cexample_id\u003e.json`:\n\n```\ndata/validated/\n├── pref-0000.json\n├── pref-0001.json\n└── ...\n```\n\nEach file contains an array of 6 rubrics:\n\n```json\n[\n  {\n    \"id\": \"rubric-uuid\",\n    \"example_id\": \"pref-0000\",\n    \"rubric_text\": \"[Objective] The response should...\",\n    \"abstraction_level\": \"category\",\n    \"category\": \"instruction-retention\",\n    \"gate3_result\": {\n      \"discrimination_strength\": 4,\n      \"high_value\": true,\n      \"discrimination_type\": \"clean\",\n      \"passed\": true\n    }\n  },\n  // ... 
5 more rubrics\n]\n```\n\n## Integration with RL Training\n\nFor RL training:\n- Use `abstraction_level: \"category\"` rubrics for generalization\n- Prioritize `high_value: true` rubrics (discriminate on subtle differences)\n- Filter by `discrimination_strength \u003e= 3` for clean gradients\n\n## References\n\n- RUBRIC_STYLE_GUIDE.md - Mined rubric principles\n- src/config.py - All pipeline constants\n- src/client.py - API client configuration\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabelbox%2Fmultichallenge_rubric_generation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flabelbox%2Fmultichallenge_rubric_generation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flabelbox%2Fmultichallenge_rubric_generation/lists"}