https://github.com/labelbox/multichallenge_rubric_generation
- Host: GitHub
- URL: https://github.com/labelbox/multichallenge_rubric_generation
- Owner: Labelbox
- Created: 2026-02-13T19:51:12.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2026-02-13T20:22:04.000Z (5 months ago)
- Last Synced: 2026-02-28T12:37:43.549Z (4 months ago)
- Language: Python
- Size: 521 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Rubric Generation Pipeline
A generator-validator loop for producing high-quality synthetic rubrics for RL training.
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                PIPELINE FLOW                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐            │
│  │ GENERATE │ ──▶ │  GATE 1  │ ──▶ │  GATE 2  │ ──▶ │  GATE 3  │            │
│  │   Opus   │     │  (Auto)  │     │   Opus   │     │   Opus   │            │
│  │(high tmp)│     │40-60% rej│     │40-60%pass│     │60-80%pass│            │
│  └──────────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘            │
│                        │                │                │                  │
│                        ▼                ▼                ▼                  │
│                   ┌─────────┐     ┌───────────┐    ┌───────────┐            │
│                   │ FAILED  │     │ BORDERLINE│    │ VALIDATED │            │
│                   └─────────┘     └─────┬─────┘    └───────────┘            │
│                                         │                                   │
│                                         ▼                                   │
│                                   ┌───────────┐                             │
│                                   │  REFINE   │ ◀── both_pass failures      │
│                                   │   Opus    │                             │
│                                   └─────┬─────┘                             │
│                                         │                                   │
│                                         ▼                                   │
│                                   ┌───────────┐                             │
│                                   │ RECOVERED │ ──▶ Gates 2-3 again         │
│                                   └───────────┘                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
**Goal: Generate exactly 6 validated rubrics per example with diversity-aware generation**
## Repo Workflow, Components, and End-to-End Flow
This repo is set up as a **generator → multi-gate validator → refinement loop** because rubric quality is harder to verify than to draft. We bias toward **high recall early** (generate many candidates) and **high precision late** (strict gates), then recover borderline cases instead of discarding useful signal.
### How the Repo Works (At a Glance)
```mermaid
flowchart TD
A["Inputs: preference pairs<br/>data/examples/"] --> B["Generator<br/>src/agents/generator.py"]
B --> C["Gate 1: Structural<br/>src/agents/structural.py"]
C -->|pass| D["Gate 2: Semantic<br/>src/agents/semantic.py"]
C -->|fail| Cx["Rejected pool<br/>(in memory)"]
D -->|pass| E["Gate 3: Discrimination<br/>src/agents/discriminator.py"]
D -->|borderline| Dx["Borderline pool<br/>(in memory)"]
D -->|fail| Dy["Rejected pool<br/>(in memory)"]
E -->|pass| F["Validated rubrics<br/>data/validated/"]
E -->|fail| Ex["Gate 3 failures<br/>(in memory)"]
Dx --> H["Refiner<br/>src/agents/refiner.py"]
Ex --> H
H --> D
H --> E
F --> M["Metrics<br/>data/metrics/"]
E --> M
P[pipeline.log] -.-> M
```
### Components and Responsibilities
```mermaid
flowchart LR
subgraph Entrypoint["run.py"]
EP["CLI with subcommands:<br/>run, metrics, check-imports, clean"]
end
subgraph Agents["src/agents/"]
G1[generator.py - Candidate generation]
V1[structural.py - Gate 1]
V2[semantic.py - Gate 2]
V3[discriminator.py - Gate 3]
R1[refiner.py - Refinement loop]
O1[orchestrator.py - Pipeline orchestration]
end
subgraph Core["src/"]
C1[client.py - AnthropicClient]
C2[config.py - PipelineConfig]
C3[models.py - Data models]
end
subgraph Data["data/"]
D1[examples/]
D3[validated/]
D4[metrics/]
end
D1 --> EP
EP --> O1
O1 --> G1
O1 --> V1
O1 --> V2
O1 --> V3
O1 --> R1
V3 --> D3
EP --> D4
```
### End-to-End Workflow (Step-by-Step)
```mermaid
sequenceDiagram
autonumber
participant U as User
participant P as run.py
participant O as Orchestrator
participant G as Generator
participant S as Gate 1 (Structural)
participant M as Gate 2 (Semantic)
participant D as Gate 3 (Discrimination)
participant R as Refiner
participant Out as Output
U->>P: Run pipeline on examples
P->>O: Initialize PipelineOrchestrator
O->>G: Generate N candidate rubrics
G-->>O: candidates kept in memory
O->>S: Structural checks (fast, deterministic)
S-->>O: pass/fail + routing
O->>M: Semantic checks (Claude Opus)
M-->>O: pass/fail + borderline
O->>D: Discrimination checks (Claude Opus)
D-->>O: pass/fail + both-pass cases
O->>R: Refine borderline and both-pass failures
R-->>O: improved candidates
O->>M: Re-check semantics
O->>D: Re-check discrimination
D-->>Out: Save validated rubrics + metrics
```
### Why It's Set Up This Way
- **Generate wide, filter hard**: High-quality rubrics are rare; generating many candidates increases odds of capturing strong discriminators.
- **Cheap checks first**: Structural validation is deterministic and fast, preventing wasted LLM calls on malformed rubrics.
- **Semantic before discrimination**: A rubric must be clear and verifiable before we test whether it separates preferred vs. rejected.
- **Discrimination is the core signal**: The final gate ensures rubrics reflect the actual preference delta, which is what RL training needs.
- **Refinement saves value**: Borderline candidates often become strong after targeted edits, reducing waste without lowering standards.
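The routing described above can be sketched as a small control loop. This is a minimal illustration, not the code in `src/agents/orchestrator.py`: the function name `run_gates`, the callable signatures, and the simplification that every Gate 3 failure is sent to refinement are all assumptions made for the example.

```python
from typing import Callable, Iterable


def run_gates(
    candidates: Iterable[str],
    gate1: Callable[[str], bool],   # deterministic structural check
    gate2: Callable[[str], str],    # returns "pass", "borderline", or "fail"
    gate3: Callable[[str], bool],   # True if the rubric discriminates
    refine: Callable[[str], str],   # returns an edited rubric
    max_refine_cycles: int = 2,
) -> list[str]:
    """Route candidates through the gates, refining recoverable failures."""
    validated: list[str] = []
    # Cheap filter first, so no model call is spent on malformed rubrics.
    queue = [(c, 0) for c in candidates if gate1(c)]
    while queue:
        rubric, cycles = queue.pop(0)
        verdict = gate2(rubric)
        if verdict == "pass" and gate3(rubric):
            validated.append(rubric)
            continue
        if verdict == "fail" or cycles >= max_refine_cycles:
            continue  # rejected outright, or refinement budget exhausted
        # Borderline at Gate 2 or failed at Gate 3: refine, then retry Gates 2-3.
        queue.append((refine(rubric), cycles + 1))
    return validated
```

Because each refined candidate carries its own cycle count, the loop always terminates even when refinement never produces a passing rubric.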
## Agent Handoff (Comprehensive)
Use this section to orient a coding agent quickly and accurately.
### Source of Truth
- **Primary entrypoint:** `run.py` (CLI with subcommands: `run`, `metrics`, `check-imports`, `clean`).
- **Pipeline orchestration:** `src/agents/orchestrator.py` handles the full loop, concurrency, and I/O.
- **API client:** `src/client.py` (`AnthropicClient`) — async Anthropic SDK with retry and structured output.
- **Configuration:** `src/config.py` (`PipelineConfig`) — all constants in a frozen dataclass.
### What Actually Runs (Code Map)
- `run.py`: CLI entrypoint with subcommands.
- `src/agents/orchestrator.py`: Pipeline orchestration, concurrency, and I/O.
- `src/agents/generator.py`: Candidate generation (Claude Opus 4.6).
- `src/agents/structural.py`: Deterministic structural checks (Gate 1).
- `src/agents/semantic.py`: Semantic quality gate (Claude Opus 4.6).
- `src/agents/discriminator.py`: Discrimination gate (Claude Opus 4.6).
- `src/agents/refiner.py`: Refinement loop (Claude Opus 4.6).
- `src/client.py`: Anthropic API client with retry, structured output, and concurrency control.
- `src/config.py`: All pipeline constants (`PipelineConfig` frozen dataclass).
- `src/models.py`: Pydantic data models.
### Data Flow (Actual Outputs)
- **Inputs:** `data/examples/*.json`
- **Outputs:**
  - `data/validated/.json` (single file containing array of 6 rubrics per example)
  - `data/metrics/.json` and `data/metrics/summary.json`
  - `pipeline.log` (log file in repo root)
- **Not persisted by default:** intermediate candidates, gate pass/fail pools, and refinement queues are kept in memory.
### Key Behaviors and Gotchas
- **Normalization:** `afm_response` is auto-mapped to `rejected_response`; `metadata.rubric_criteria` becomes `rubric_hints`.
- **Deduping:** Candidates are deduped by rubric text before gates.
- **Gate order:** Gate 1 → Gate 2 → Gate 3; borderline + both-pass failures are refined.
- **Randomization:** Gate 3 randomizes response order to avoid position bias.
- **Diversity-aware generation:** After each Gate 3 round, the pipeline analyzes category coverage and targets underrepresented failure categories in subsequent generation rounds.
- **Forced fill:** If still short after max attempts, the pipeline force-fills using best failures or fallback templates/hints to reach 6 rubrics.
- **Concurrency knobs:** `--parallel`, `--gate2-concurrency`, `--gate3-concurrency` (CLI args) + `PARALLEL_EXAMPLES`, `GATE2_CONCURRENCY`, `GATE3_CONCURRENCY` (env vars) + defaults in `src/config.py`.
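Two of the behaviors above, deduping by rubric text and finding underrepresented categories, can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, the whitespace and case normalization used as the dedupe key is a guess, and candidates are assumed to be dicts carrying `rubric_text` and `category` keys.

```python
from collections import Counter

# Category names taken from failure_categories in PipelineConfig.
FAILURE_CATEGORIES = (
    "instruction-retention",
    "inference-memory",
    "version-editing",
    "self-coherence",
)


def dedupe_by_text(candidates: list[dict]) -> list[dict]:
    """Keep the first candidate for each distinct rubric text."""
    seen: set[str] = set()
    unique = []
    for cand in candidates:
        # Assumed normalization: lowercase and collapse whitespace.
        key = " ".join(cand["rubric_text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(cand)
    return unique


def underrepresented(validated: list[dict], categories=FAILURE_CATEGORIES) -> list[str]:
    """Return the categories with the fewest validated rubrics so far."""
    counts = Counter(r["category"] for r in validated)
    fewest = min(counts.get(c, 0) for c in categories)
    return [c for c in categories if counts.get(c, 0) == fewest]
```

The list returned by `underrepresented` could then be passed to the generator as the categories to target in the next round.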
### Model and Configuration
All pipeline stages use **Claude Opus 4.6** (model ID `anthropic/claude-opus-4-6`), configured in `src/config.py`:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    model: str = "anthropic/claude-opus-4-6"
    # ... temperature/token settings per stage
```
### How to Run (CLI)
```bash
uv run python run.py run --examples data/examples/pref-0000.json
uv run python run.py run --examples data/examples/ --parallel 12
uv run python run.py metrics
uv run python run.py check-imports
uv run python run.py clean
```
## Quick Start
### 1. Setup
```bash
# Navigate to project
cd rubric_generation
# Install dependencies
uv sync
# Set your API key
export LB_API_KEY='your-labelbox-key'
# Or copy the example env file
cp .env.example .env
# Edit .env with your Labelbox key
```
### 2. Prepare Your Data
Place your examples in `data/examples/`:
```json
{
  "id": "pref-0000",
  "prompt": "User's original prompt/conversation",
  "preferred_response": "The better response",
  "afm_response": "The worse response (rejected)",
  "category_hint": "scheduling",
  "metadata": {
    "rubric_criteria": [
      "Hint 1 from human annotators",
      "Hint 2 from human annotators"
    ]
  }
}
```
**Note:** The pipeline accepts both `afm_response` and `rejected_response` fields.
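The field mapping can be pictured with a small normalization helper. This is a sketch only: `normalize_example` is a hypothetical name, and the choice to let an existing `rejected_response` take precedence over `afm_response` is an assumption.

```python
def normalize_example(raw: dict) -> dict:
    """Map the accepted input variants onto one internal shape."""
    example = dict(raw)  # shallow copy so the caller's dict is left untouched
    # Assumed precedence: an explicit rejected_response wins over afm_response.
    if "rejected_response" not in example and "afm_response" in example:
        example["rejected_response"] = example.pop("afm_response")
    # metadata.rubric_criteria is exposed to the generator as rubric_hints.
    hints = example.get("metadata", {}).get("rubric_criteria", [])
    example.setdefault("rubric_hints", list(hints))
    return example
```

Applied to the example above, the result has a `rejected_response` key instead of `afm_response`, plus a `rubric_hints` list holding the two annotator hints.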
### 3. Run the Pipeline
```bash
# Single example
uv run python run.py run --examples data/examples/pref-0000.json
# All examples in directory
uv run python run.py run --examples data/examples/
# Custom output directory
uv run python run.py run --examples data/examples/ --output results/
```
**High-throughput options (recommended for large datasets):**
```bash
# Shard across multiple terminals/machines
uv run python run.py run --examples data/examples/ --shard 0/4
# Resume mode (default on) skips completed examples
uv run python run.py run --examples data/examples/ --resume
# Override concurrency
uv run python run.py run --examples data/examples/ --parallel 12 --gate2-concurrency 30 --gate3-concurrency 12
```
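One plausible reading of `--shard 0/4` is "take shard index 0 of 4", with each shard processing a disjoint slice of the sorted example files. The sketch below illustrates that reading; the helper names and the round-robin assignment are assumptions, and the repo's actual sharding rule may differ.

```python
def parse_shard(spec: str) -> tuple[int, int]:
    """Parse 'INDEX/TOTAL' into a validated (index, total) pair."""
    index_str, total_str = spec.split("/")
    index, total = int(index_str), int(total_str)
    if total < 1 or not 0 <= index < total:
        raise ValueError(f"invalid shard spec: {spec!r}")
    return index, total


def select_shard(paths: list[str], spec: str) -> list[str]:
    """Pick every TOTAL-th path, starting at INDEX, from the sorted list."""
    index, total = parse_shard(spec)
    # Sorting first makes the assignment identical on every machine.
    return [p for i, p in enumerate(sorted(paths)) if i % total == index]
```

With this scheme, running shards `0/4` through `3/4` in four terminals covers every example exactly once.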
### 4. Check Results
```bash
# Validated rubrics
ls data/validated/
# Pipeline metrics
uv run python run.py metrics
# Metrics for a specific example
uv run python run.py metrics pref-0010
```
## Directory Structure
```
rubric_generation/
├── run.py # CLI entrypoint (subcommands: run, metrics, check-imports, clean)
├── pyproject.toml # Dependencies (openai, pydantic, python-dotenv, tqdm)
├── .env.example # API key template
│
├── src/
│ ├── client.py # AnthropicClient (async, retry, structured output)
│ ├── config.py # PipelineConfig (frozen dataclass)
│ ├── models.py # Pydantic data models
│ ├── agents/
│ │ ├── orchestrator.py # Pipeline orchestration
│ │ ├── generator.py # Candidate generation
│ │ ├── structural.py # Gate 1: Structural validation
│ │ ├── semantic.py # Gate 2: Semantic validation
│ │ ├── discriminator.py# Gate 3: Discrimination validation
│ │ ├── refiner.py # Refinement loop
│ │ └── base.py # Base agent class
│ └── utils/
│ └── json_parser.py # JSON extraction from LLM output
│
├── data/
│ ├── examples/ # Input: (prompt, preferred, rejected) tuples
│ ├── validated/ # OUTPUT: Final validated rubrics
│ └── metrics/ # Pipeline metrics
│
├── RUBRIC_STYLE_GUIDE.md # Mined rubric principles (input to generator)
├── pipeline.log # Run logs (appended)
└── README.md # This file
```
## The Three Gates and the Refinement Loop
### Gate 1: Structural Validation
- **Model**: Automated (no LLM calls)
- **Target**: Reject 40-60%
- **Checks**: Format, atomicity, self-containment, actionable language, anti-patterns
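A deterministic gate of this kind can be sketched with plain string checks. The rules below are illustrative stand-ins, not the rules in `src/agents/structural.py`: the length bounds, the term lists, and the function name `structural_issues` are all hypothetical.

```python
import re

# Hypothetical lists, chosen only to illustrate the check categories.
VAGUE_TERMS = ("good", "appropriate", "high-quality")
DANGLING_REFS = ("as mentioned", "the above", "see previous")


def structural_issues(rubric_text: str) -> list[str]:
    """Return the structural problems found; an empty list means pass."""
    text = rubric_text.strip()
    lowered = text.lower()
    issues = []
    if not 20 <= len(text) <= 300:
        issues.append("format: length out of range")
    if "should" not in lowered and "must" not in lowered:
        issues.append("actionable: no 'should' or 'must'")
    # More than one sentence terminator suggests more than one criterion.
    if len(re.findall(r"[.!?](?:\s|$)", text)) > 1:
        issues.append("atomicity: more than one sentence")
    if any(ref in lowered for ref in DANGLING_REFS):
        issues.append("self-containment: dangling reference")
    if any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in VAGUE_TERMS):
        issues.append("anti-pattern: vague term")
    return issues
```

Checks like these cost no API calls, which is why the pipeline runs them before Gates 2 and 3.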
### Gate 2: Semantic Validation
- **Model**: Claude Opus 4.6
- **Target**: Pass 40-60% of Gate 1 survivors
- **Checks**: Quality checklist (self-containment, specificity 2-4, unambiguity, verifiability, meaningfulness)
### Gate 3: Discrimination Validation
- **Model**: Claude Opus 4.6
- **Target**: Pass 60-80% of Gate 2 survivors
- **Checks**: Does the rubric correctly distinguish preferred from rejected?
- **This is the most important gate**
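The discrimination outcome can be thought of as a four-way classification over two judgments. The sketch below assumes the judge reports whether each anonymized response satisfies the rubric; the function names and the `both_fail` label are hypothetical, while `clean`, `both_pass`, and inverted are terms this README uses. It also shows one way to randomize presentation order to counter position bias.

```python
import random


def present_in_random_order(preferred: str, rejected: str, rng: random.Random):
    """Return (response_a, response_b, a_is_preferred) in shuffled order."""
    if rng.random() < 0.5:
        return preferred, rejected, True
    return rejected, preferred, False


def classify(a_satisfies: bool, b_satisfies: bool, a_is_preferred: bool) -> str:
    """Map the judge's per-response verdicts back onto preferred/rejected."""
    if a_is_preferred:
        preferred_ok, rejected_ok = a_satisfies, b_satisfies
    else:
        preferred_ok, rejected_ok = b_satisfies, a_satisfies
    if preferred_ok and not rejected_ok:
        return "clean"      # rubric separates the pair in the right direction
    if preferred_ok and rejected_ok:
        return "both_pass"  # too lenient: candidate for refinement
    if rejected_ok:
        return "inverted"   # rubric favors the rejected response
    return "both_fail"      # hypothetical label: neither response satisfies it
```

Only the first outcome counts as a pass in this sketch; `both_pass` cases are the ones routed to the refinement loop.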
### Refinement Loop
- **Model**: Claude Opus 4.6
- **Input**: Borderline candidates + "both pass" failures
- **Max cycles**: 2 per candidate
- **Strategies**: Increase specificity, decrease specificity, fix ambiguity, strengthen discrimination
## Key Metrics to Monitor
| Metric | Target | Alert If |
|--------|--------|----------|
| Gate 1 rejection rate | 40-60% | <30% (too clean) or >70% (quality issue) |
| Gate 2 pass rate | 40-60% | <30% (semantic issues) |
| Gate 3 pass rate | 60-80% | <50% (discrimination problems) |
| Inverted rate | <5% | >5% (fundamental generator issue) |
| Refinement success | 20-40% | <15% (refinement prompts need work) |
| Avg discrimination strength | >3.0 | <2.5 (weak rubrics) |
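The alert column of the table translates directly into bounds checks. In the sketch below the thresholds come from the table, while the metric key names and the function `find_alerts` are hypothetical and may not match the keys written to `data/metrics/`.

```python
# (low, high) alert bounds per metric; None means that side is unbounded.
ALERT_BOUNDS = {
    "gate1_rejection_rate": (0.30, 0.70),
    "gate2_pass_rate": (0.30, None),
    "gate3_pass_rate": (0.50, None),
    "inverted_rate": (None, 0.05),
    "refinement_success": (0.15, None),
    "avg_discrimination_strength": (2.5, None),
}


def find_alerts(metrics: dict[str, float]) -> list[str]:
    """Return one message per metric that falls outside its alert bounds."""
    alerts = []
    for name, (low, high) in ALERT_BOUNDS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported for this run
        if low is not None and value < low:
            alerts.append(f"{name}={value} is below {low}")
        elif high is not None and value > high:
            alerts.append(f"{name}={value} is above {high}")
    return alerts
```

Rates are expressed as fractions here, so a Gate 1 rejection rate of 80% is passed as `0.8`.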
## Configuration
All pipeline constants are in `src/config.py` (`PipelineConfig` frozen dataclass):
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    max_total_candidates: int = 300
    max_no_new_rounds: int = 3
    max_seconds_per_example: int = 240
    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12
    model: str = "anthropic/claude-opus-4-6"
    generator_temperature: float = 0.8
    gate2_temperature: float = 0.2
    gate3_temperature: float = 0.1
    refinement_temperature: float = 0.6
    failure_categories: tuple[str, ...] = (
        "instruction-retention",
        "inference-memory",
        "version-editing",
        "self-coherence",
    )
```
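The concurrency values can also be set through CLI arguments and environment variables. The sketch below shows one way such layering could work with a frozen dataclass. The precedence order (CLI over environment over default) is an assumption, and `ConcurrencyConfig` and `resolve` are hypothetical names covering only the three concurrency fields.

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConcurrencyConfig:  # hypothetical subset of PipelineConfig
    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12


def resolve(cli_args: dict, env=None) -> ConcurrencyConfig:
    """Layer CLI values over environment variables over the defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for name in ("parallel_examples", "gate2_concurrency", "gate3_concurrency"):
        if cli_args.get(name) is not None:
            overrides[name] = int(cli_args[name])
        elif name.upper() in env:  # e.g. GATE2_CONCURRENCY
            overrides[name] = int(env[name.upper()])
    # A frozen dataclass cannot be mutated, so build a modified copy.
    return replace(ConcurrencyConfig(), **overrides)
```

For example, `resolve({"parallel_examples": 4}, env={"GATE2_CONCURRENCY": "10"})` yields 4, 10, and the default 12.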
## Troubleshooting
### High "both_pass" rate
- Generation isn't targeting the actual contrast between the preferred and rejected responses
- Solution: Add more specific constraint prompts in generator
### High inverted rate
- Generator fundamentally misunderstands what makes preferred better
- Solution: Re-analyze examples, check for ambiguous preferences
### Low refinement success
- Refinement strategies aren't effective
- Solution: Add more targeted strategies, reduce max cycles, or accept a lower recovery rate
## Output Format
The pipeline generates exactly **6 validated rubrics** per example in `data/validated/.json`:
```
data/validated/
├── pref-0000.json
├── pref-0001.json
└── ...
```
Each file contains an array of 6 rubrics:
```json
[
  {
    "id": "rubric-uuid",
    "example_id": "pref-0000",
    "rubric_text": "[Objective] The response should...",
    "abstraction_level": "category",
    "category": "instruction-retention",
    "gate3_result": {
      "discrimination_strength": 4,
      "high_value": true,
      "discrimination_type": "clean",
      "passed": true
    }
  },
  // ... 5 more rubrics
]
```
## Integration with RL Training
For RL training:
- Use `abstraction_level: "category"` rubrics for generalization
- Prioritize `high_value: true` rubrics (discriminate on subtle differences)
- Filter by `discrimination_strength >= 3` for clean gradients
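These three recommendations can be combined into one selection step over the parsed contents of a validated file. This is a sketch assuming the field names shown in the output format above; `select_for_training` is a hypothetical helper, not part of the repo.

```python
def select_for_training(rubrics: list[dict], min_strength: int = 3) -> list[dict]:
    """Keep category-level rubrics with enough strength, high_value first."""
    kept = [
        r for r in rubrics
        if r.get("abstraction_level") == "category"
        and r.get("gate3_result", {}).get("discrimination_strength", 0) >= min_strength
    ]
    # Sort key: high_value rubrics first, then by descending strength.
    return sorted(
        kept,
        key=lambda r: (
            not r["gate3_result"].get("high_value", False),
            -r["gate3_result"]["discrimination_strength"],
        ),
    )
```

The input would typically come from `json.load` on one file in `data/validated/`.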
## References
- RUBRIC_STYLE_GUIDE.md - Mined rubric principles
- src/config.py - All pipeline constants
- src/client.py - API client configuration