https://github.com/labelbox/multichallenge_rubric_generation

Host: GitHub
URL: https://github.com/labelbox/multichallenge_rubric_generation
Owner: Labelbox
Created: 2026-02-13T19:51:12.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-02-13T20:22:04.000Z (5 months ago)
Last Synced: 2026-02-28T12:37:43.549Z (4 months ago)
Language: Python
Size: 521 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Rubric Generation Pipeline

A generator-validator loop for producing high-quality synthetic rubrics for RL training.

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                                PIPELINE FLOW                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐           │
│   │ GENERATE │ ──▶ │  GATE 1  │ ──▶ │  GATE 2  │ ──▶ │  GATE 3  │           │
│   │   Opus   │     │  (Auto)  │     │   Opus   │     │   Opus   │           │
│   │(high tmp)│     │40-60% rej│     │40-60%pass│     │60-80%pass│           │
│   └──────────┘     └────┬─────┘     └────┬─────┘     └────┬─────┘           │
│                         │                │                │                 │
│                         ▼                ▼                ▼                 │
│                    ┌─────────┐     ┌───────────┐    ┌───────────┐           │
│                    │ FAILED  │     │ BORDERLINE│    │ VALIDATED │           │
│                    └─────────┘     └─────┬─────┘    └───────────┘           │
│                                          │                                  │
│                                          ▼                                  │
│                                    ┌───────────┐                            │
│                                    │  REFINE   │ ◀── both_pass failures     │
│                                    │   Opus    │                            │
│                                    └─────┬─────┘                            │
│                                          │                                  │
│                                          ▼                                  │
│                                    ┌───────────┐                            │
│                                    │ RECOVERED │ ──▶ Gates 2-3 again        │
│                                    └───────────┘                            │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

**Goal: Generate exactly 6 validated rubrics per example with diversity-aware generation**

## Repo Workflow, Components, and End-to-End Flow

This repo is set up as a **generator → multi-gate validator → refinement loop** because rubric quality is harder to verify than to draft. We bias toward **high recall early** (generate many candidates) and **high precision late** (strict gates), then recover borderline cases instead of discarding useful signal.

### How the Repo Works (At a Glance)

```mermaid
flowchart TD
    A["Inputs: preference pairs<br/>data/examples/"] --> B[Generator<br/>src/agents/generator.py]
    B --> C[Gate 1: Structural<br/>src/agents/structural.py]
    C -->|pass| D[Gate 2: Semantic<br/>src/agents/semantic.py]
    C -->|fail| Cx["Rejected pool<br/>(in memory)"]
    D -->|pass| E[Gate 3: Discrimination<br/>src/agents/discriminator.py]
    D -->|borderline| Dx["Borderline pool<br/>(in memory)"]
    D -->|fail| Dy["Rejected pool<br/>(in memory)"]
    E -->|pass| F["Validated rubrics<br/>data/validated/"]
    E -->|fail| Ex["Gate 3 failures<br/>(in memory)"]
    Dx --> H[Refiner<br/>src/agents/refiner.py]
    Ex --> H
    H --> D
    H --> E
    F --> M["Metrics<br/>data/metrics/"]
    E --> M
    P[pipeline.log] -.-> M
```

### Components and Responsibilities

```mermaid
flowchart LR
    subgraph Entrypoint["run.py"]
        EP[CLI with subcommands:<br/>run, metrics, check-imports, clean]
    end
    subgraph Agents["src/agents/"]
        G1[generator.py - Candidate generation]
        V1[structural.py - Gate 1]
        V2[semantic.py - Gate 2]
        V3[discriminator.py - Gate 3]
        R1[refiner.py - Refinement loop]
        O1[orchestrator.py - Pipeline orchestration]
    end
    subgraph Core["src/"]
        C1[client.py - AnthropicClient]
        C2[config.py - PipelineConfig]
        C3[models.py - Data models]
    end
    subgraph Data["data/"]
        D1[examples/]
        D3[validated/]
        D4[metrics/]
    end
    D1 --> EP
    EP --> O1
    O1 --> G1
    O1 --> V1
    O1 --> V2
    O1 --> V3
    O1 --> R1
    V3 --> D3
    EP --> D4
```

### End-to-End Workflow (Step-by-Step)

```mermaid
sequenceDiagram
autonumber
participant U as User
participant P as run.py
participant O as Orchestrator
participant G as Generator
participant S as Gate 1 (Structural)
participant M as Gate 2 (Semantic)
participant D as Gate 3 (Discrimination)
participant R as Refiner
participant Out as Output

U->>P: Run pipeline on examples
P->>O: Initialize PipelineOrchestrator
O->>G: Generate N candidate rubrics
G-->>O: candidates kept in memory
O->>S: Structural checks (fast, deterministic)
S-->>O: pass/fail + routing
O->>M: Semantic checks (Claude Opus)
M-->>O: pass/fail + borderline
O->>D: Discrimination checks (Claude Opus)
D-->>O: pass/fail + both-pass cases
O->>R: Refine borderline and both-pass failures
R-->>O: improved candidates
O->>M: Re-check semantics
O->>D: Re-check discrimination
D-->>Out: Save validated rubrics + metrics
```

### Why It's Set Up This Way

- **Generate wide, filter hard**: High-quality rubrics are rare; generating many candidates increases odds of capturing strong discriminators.
- **Cheap checks first**: Structural validation is deterministic and fast, preventing wasted LLM calls on malformed rubrics.
- **Semantic before discrimination**: A rubric must be clear and verifiable before we test whether it separates preferred vs. rejected.
- **Discrimination is the core signal**: The final gate ensures rubrics reflect the actual preference delta, which is what RL training needs.
- **Refinement saves value**: Borderline candidates often become strong after targeted edits, reducing waste without lowering standards.
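
The ordering above can be sketched as a small routing function. This is a minimal illustration and not the code in `src/agents/orchestrator.py`: the `Verdict` enum, the `route` function, and the gate callables passed in are hypothetical names, and the real gates call an LLM rather than plain Python functions.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    BORDERLINE = "borderline"
    FAIL = "fail"


def route(candidates, gate1, gate2, gate3):
    """Send each candidate through the gates in cost order.

    gate1 and gate3 return bool, gate2 returns a Verdict (all hypothetical
    stand-ins). Returns (validated, to_refine, rejected).
    """
    validated, to_refine, rejected = [], [], []
    for text in candidates:
        if not gate1(text):            # Gate 1: cheap and deterministic
            rejected.append(text)
            continue
        verdict = gate2(text)          # Gate 2: semantic quality
        if verdict is Verdict.BORDERLINE:
            to_refine.append(text)     # worth a targeted edit
            continue
        if verdict is Verdict.FAIL:
            rejected.append(text)
            continue
        if gate3(text):                # Gate 3: discrimination
            validated.append(text)
        else:
            to_refine.append(text)     # e.g. both-pass failures
    return validated, to_refine, rejected
```

Because Gate 1 runs first, a malformed candidate never reaches the two gates that would cost an API call.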

## Agent Handoff (Comprehensive)

Use this section to orient a coding agent quickly and accurately.

### Source of Truth

- **Primary entrypoint:** `run.py` (CLI with subcommands: `run`, `metrics`, `check-imports`, `clean`).
- **Pipeline orchestration:** `src/agents/orchestrator.py` handles the full loop, concurrency, and I/O.
- **API client:** `src/client.py` (`AnthropicClient`) — async Anthropic SDK with retry and structured output.
- **Configuration:** `src/config.py` (`PipelineConfig`) — all constants in a frozen dataclass.
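
As a rough picture of what "async with retry and concurrency control" can mean, the sketch below bounds in-flight calls with a semaphore and retries with exponential backoff. It does not reproduce `AnthropicClient`: `call_with_retry`, `make_call`, and the default values are hypothetical, and a real client would retry only on transient errors rather than on every exception.

```python
import asyncio


async def call_with_retry(make_call, semaphore, max_attempts=3, base_delay=1.0):
    """Await make_call() under a concurrency limit, retrying on failure.

    make_call is any zero-argument coroutine function (hypothetical stand-in
    for one API request).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            async with semaphore:          # limit concurrent requests
                return await make_call()
        except Exception:                  # sketch only: narrow this in real code
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

The semaphore is released while the coroutine sleeps between attempts, so a backing-off request does not block other requests.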

### What Actually Runs (Code Map)

- `run.py`: CLI entrypoint with subcommands.
- `src/agents/orchestrator.py`: Pipeline orchestration, concurrency, and I/O.
- `src/agents/generator.py`: Candidate generation (Claude Opus 4.6).
- `src/agents/structural.py`: Deterministic structural checks (Gate 1).
- `src/agents/semantic.py`: Semantic quality gate (Claude Opus 4.6).
- `src/agents/discriminator.py`: Discrimination gate (Claude Opus 4.6).
- `src/agents/refiner.py`: Refinement loop (Claude Opus 4.6).
- `src/client.py`: Anthropic API client with retry, structured output, and concurrency control.
- `src/config.py`: All pipeline constants (`PipelineConfig` frozen dataclass).
- `src/models.py`: Pydantic data models.

### Data Flow (Actual Outputs)

- **Inputs:** `data/examples/*.json`
- **Outputs:**
  - `data/validated/.json` (a single file per example, containing an array of 6 rubrics)
  - `data/metrics/.json` and `data/metrics/summary.json`
  - `pipeline.log` (log file in repo root)
- **Not persisted by default:** intermediate candidates, gate pass/fail pools, and refinement queues are kept in memory.

### Key Behaviors and Gotchas

- **Normalization:** `afm_response` is auto-mapped to `rejected_response`; `metadata.rubric_criteria` becomes `rubric_hints`.
- **Deduping:** Candidates are deduped by rubric text before gates.
- **Gate order:** Gate 1 → Gate 2 → Gate 3; borderline + both-pass failures are refined.
- **Randomization:** Gate 3 randomizes response order to avoid position bias.
- **Diversity-aware generation:** After each Gate 3 round, the pipeline analyzes category coverage and targets underrepresented failure categories in subsequent generation rounds.
- **Forced fill:** If still short after max attempts, the pipeline force-fills using best failures or fallback templates/hints to reach 6 rubrics.
- **Concurrency knobs:** `--parallel`, `--gate2-concurrency`, `--gate3-concurrency` (CLI args) + `PARALLEL_EXAMPLES`, `GATE2_CONCURRENCY`, `GATE3_CONCURRENCY` (env vars) + defaults in `src/config.py`.
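
Two of these behaviors, deduping and diversity-aware targeting, are easy to picture in code. The sketch below is an assumption-laden illustration: `dedupe_by_text` and `underrepresented` are hypothetical helpers, the case and whitespace normalization is a guess at what "deduped by rubric text" means, and only the category names come from `failure_categories` in `src/config.py`.

```python
from collections import Counter

FAILURE_CATEGORIES = (
    "instruction-retention",
    "inference-memory",
    "version-editing",
    "self-coherence",
)


def dedupe_by_text(candidates):
    """Keep the first candidate for each rubric text.

    Texts are compared after lowercasing and collapsing whitespace
    (an assumption; the pipeline may compare them differently).
    """
    seen = set()
    unique = []
    for cand in candidates:
        key = " ".join(cand["rubric_text"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(cand)
    return unique


def underrepresented(validated, categories=FAILURE_CATEGORIES):
    """Return the categories with the fewest validated rubrics so far."""
    counts = Counter(r["category"] for r in validated)
    fewest = min(counts.get(c, 0) for c in categories)
    return [c for c in categories if counts.get(c, 0) == fewest]
```

The list returned by `underrepresented` is what a later generation round would be asked to focus on.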

### Model and Configuration

All pipeline stages use **Claude Opus 4.6** (`claude-opus-4-6`), configured in `src/config.py`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    model: str = "anthropic/claude-opus-4-6"
    # ... temperature/token settings per stage
```

### How to Run (CLI)

```bash
uv run python run.py run --examples data/examples/pref-0000.json
uv run python run.py run --examples data/examples/ --parallel 12
uv run python run.py metrics
uv run python run.py check-imports
uv run python run.py clean
```

## Quick Start

### 1. Setup

```bash
# Navigate to project
cd rubric_generation

# Install dependencies
uv sync

# Set your API key
export LB_API_KEY='your-labelbox-key'

# Or copy the example env file
cp .env.example .env
# Edit .env with your Labelbox key
```

### 2. Prepare Your Data

Place your examples in `data/examples/`:

```json
{
  "id": "pref-0000",
  "prompt": "User's original prompt/conversation",
  "preferred_response": "The better response",
  "afm_response": "The worse response (rejected)",
  "category_hint": "scheduling",
  "metadata": {
    "rubric_criteria": [
      "Hint 1 from human annotators",
      "Hint 2 from human annotators"
    ]
  }
}
```

**Note:** The pipeline accepts both `afm_response` and `rejected_response` fields.
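
A minimal sketch of that normalization, assuming it happens on a plain dict right after loading. The function name `normalize_example` is hypothetical; the field names are the ones this README documents.

```python
def normalize_example(raw: dict) -> dict:
    """Map `afm_response` to `rejected_response` and surface annotator hints."""
    example = dict(raw)  # shallow copy so the caller's dict is not modified
    if "rejected_response" not in example and "afm_response" in example:
        example["rejected_response"] = example.pop("afm_response")
    # metadata.rubric_criteria becomes rubric_hints; missing metadata gives []
    metadata = example.get("metadata") or {}
    example["rubric_hints"] = list(metadata.get("rubric_criteria", []))
    return example
```

With this shape, downstream code only ever reads `rejected_response` and `rubric_hints`, whichever field names the input file used.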

### 3. Run the Pipeline

```bash
# Single example
uv run python run.py run --examples data/examples/pref-0000.json

# All examples in directory
uv run python run.py run --examples data/examples/

# Custom output directory
uv run python run.py run --examples data/examples/ --output results/
```

**High-throughput options (recommended for large datasets):**

```bash
# Shard across multiple terminals/machines
uv run python run.py run --examples data/examples/ --shard 0/4

# Resume mode (default on) skips completed examples
uv run python run.py run --examples data/examples/ --resume

# Override concurrency
uv run python run.py run --examples data/examples/ --parallel 12 --gate2-concurrency 30 --gate3-concurrency 12
```
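
The README does not say how `--shard 0/4` partitions the examples. One common scheme, shown here purely as an assumption, is to sort the input files and give shard `i` of `n` every file whose position satisfies `position % n == i`. `parse_shard` and `select_shard` are hypothetical helpers, not functions from `run.py`.

```python
def parse_shard(spec: str) -> tuple[int, int]:
    """Parse 'index/total' (for example '0/4') into two integers."""
    index_str, total_str = spec.split("/")
    index, total = int(index_str), int(total_str)
    if total <= 0 or not 0 <= index < total:
        raise ValueError(f"invalid shard spec: {spec!r}")
    return index, total


def select_shard(paths, spec: str):
    """Return the subset of paths that belongs to the given shard."""
    index, total = parse_shard(spec)
    # Sorting first makes every worker agree on the same assignment.
    return [p for i, p in enumerate(sorted(paths)) if i % total == index]
```

Under this scheme the shards are disjoint and together cover every file, so four terminals running `0/4` through `3/4` would process each example exactly once.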

### 4. Check Results

```bash
# Validated rubrics
ls data/validated/

# Pipeline metrics
uv run python run.py metrics

# Metrics for a specific example
uv run python run.py metrics pref-0010
```

## Directory Structure

```
rubric_generation/
├── run.py                      # CLI entrypoint (subcommands: run, metrics, check-imports, clean)
├── pyproject.toml              # Dependencies (openai, pydantic, python-dotenv, tqdm)
├── .env.example                # API key template
│
├── src/
│   ├── client.py               # AnthropicClient (async, retry, structured output)
│   ├── config.py               # PipelineConfig (frozen dataclass)
│   ├── models.py               # Pydantic data models
│   ├── agents/
│   │   ├── orchestrator.py     # Pipeline orchestration
│   │   ├── generator.py        # Candidate generation
│   │   ├── structural.py       # Gate 1: Structural validation
│   │   ├── semantic.py         # Gate 2: Semantic validation
│   │   ├── discriminator.py    # Gate 3: Discrimination validation
│   │   ├── refiner.py          # Refinement loop
│   │   └── base.py             # Base agent class
│   └── utils/
│       └── json_parser.py      # JSON extraction from LLM output
│
├── data/
│   ├── examples/               # Input: (prompt, preferred, rejected) tuples
│   ├── validated/              # OUTPUT: Final validated rubrics
│   └── metrics/                # Pipeline metrics
│
├── RUBRIC_STYLE_GUIDE.md       # Mined rubric principles (input to generator)
├── pipeline.log                # Run logs (appended)
└── README.md                   # This file
```

## The Three Gates and the Refinement Loop

### Gate 1: Structural Validation
- **Model**: Automated (no LLM calls)
- **Target**: Reject 40-60%
- **Checks**: Format, atomicity, self-containment, actionable language, anti-patterns
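
To make "automated, no LLM calls" concrete, here is a sketch of what deterministic checks of this kind might look like. Every heuristic and word list below is invented for illustration; the actual rules live in `src/agents/structural.py` and may differ substantially.

```python
import re

# Hypothetical word lists, chosen only to illustrate the idea.
VAGUE_TERMS = ("good", "appropriate", "high-quality")
CONTEXT_REFERENCES = ("as above", "see above", "the previous rubric")


def structural_issues(rubric_text: str) -> list[str]:
    """Return structural problems found; an empty list means the rubric passes."""
    text = rubric_text.strip()
    lowered = text.lower()
    issues = []

    if not text.endswith("."):
        issues.append("format: expected a complete sentence ending in a period")
    if "the response should" not in lowered:
        issues.append("actionable language: expected a 'The response should' clause")
    if len(re.findall(r"\b(?:and|or)\b", lowered)) > 1:
        issues.append("atomicity: several conjunctions suggest more than one criterion")
    if any(ref in lowered for ref in CONTEXT_REFERENCES):
        issues.append("self-containment: refers to text outside the rubric")
    if any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in VAGUE_TERMS):
        issues.append("anti-pattern: vague quality word")
    return issues
```

Checks like these run in microseconds, which is why the pipeline can afford to reject a large share of candidates before any model call.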

### Gate 2: Semantic Validation
- **Model**: Claude Opus 4.6
- **Target**: Pass 40-60% of Gate 1 survivors
- **Checks**: Quality checklist (self-containment, specificity 2-4, unambiguity, verifiability, meaningfulness)

### Gate 3: Discrimination Validation
- **Model**: Claude Opus 4.6
- **Target**: Pass 60-80% of Gate 2 survivors
- **Checks**: Does the rubric correctly distinguish preferred from rejected?
- **This is the most important gate**
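
The sketch below shows one way to combine order randomization with outcome classification. `present_pair` and `classify` are hypothetical names, the A/B slot scheme is an assumption, and `both_fail` is a label introduced here for completeness; `clean`, `both_pass`, and inverted outcomes are the ones this README mentions.

```python
import random


def present_pair(preferred: str, rejected: str, rng: random.Random):
    """Randomly assign the two responses to slots A and B.

    Returns the slot mapping and the slot holding the preferred response,
    so the judge cannot learn that one position is always the better one.
    """
    if rng.random() < 0.5:
        return {"A": preferred, "B": rejected}, "A"
    return {"A": rejected, "B": preferred}, "B"


def classify(satisfies: dict, preferred_slot: str) -> str:
    """Map per-slot judgements (slot -> bool) to a discrimination outcome."""
    other_slot = "B" if preferred_slot == "A" else "A"
    preferred_ok = satisfies[preferred_slot]
    rejected_ok = satisfies[other_slot]
    if preferred_ok and not rejected_ok:
        return "clean"        # rubric separates the pair as intended
    if preferred_ok and rejected_ok:
        return "both_pass"    # rubric is too lenient; candidate for refinement
    if rejected_ok:
        return "inverted"     # rubric favors the rejected response
    return "both_fail"        # hypothetical label, not named in this README
```

Only a `clean` outcome corresponds to a rubric that reflects the preference delta.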

### Refinement Loop
- **Model**: Claude Opus 4.6
- **Input**: Borderline candidates + "both pass" failures
- **Max cycles**: 2 per candidate
- **Strategies**: Increase specificity, decrease specificity, fix ambiguity, strengthen discrimination
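
The cycle budget can be pictured as a short loop. In this sketch `refine_with_budget` is a hypothetical helper, `refine` stands in for one Opus rewrite, and `passes` stands in for re-running Gates 2-3 on the rewritten text.

```python
def refine_with_budget(text, refine, passes, max_cycles=2):
    """Rewrite a candidate up to max_cycles times.

    Returns (final_text, recovered, cycles_used).
    """
    for cycle in range(1, max_cycles + 1):
        text = refine(text)          # apply one refinement strategy
        if passes(text):             # re-check the rewritten rubric
            return text, True, cycle
    return text, False, max_cycles   # budget exhausted; candidate is dropped
```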

## Key Metrics to Monitor

| Metric | Target | Alert If |
|--------|--------|----------|
| Gate 1 rejection rate | 40-60% | <30% (too clean) or >70% (quality issue) |
| Gate 2 pass rate | 40-60% | <30% (semantic issues) |
| Gate 3 pass rate | 60-80% | <50% (discrimination problems) |
| Inverted rate | <5% | >5% (fundamental generator issue) |
| Refinement success | 20-40% | <15% (refinement prompts need work) |
| Avg discrimination strength | >3.0 | <2.5 (weak rubrics) |
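
The alert thresholds in the table translate directly into a small checker. The bounds below are copied from the "Alert If" column; the metric key names and the `find_alerts` function are hypothetical, since the layout of `data/metrics/summary.json` is not documented here.

```python
# (lower bound, upper bound); None means that side has no alert threshold.
ALERT_BOUNDS = {
    "gate1_rejection_rate": (0.30, 0.70),
    "gate2_pass_rate": (0.30, None),
    "gate3_pass_rate": (0.50, None),
    "inverted_rate": (None, 0.05),
    "refinement_success": (0.15, None),
    "avg_discrimination_strength": (2.5, None),
}


def find_alerts(metrics: dict) -> list[str]:
    """Return one message per metric that falls outside its alert bounds."""
    alerts = []
    for name, (low, high) in ALERT_BOUNDS.items():
        if name not in metrics:
            continue
        value = metrics[name]
        if low is not None and value < low:
            alerts.append(f"{name}={value} is below {low}")
        if high is not None and value > high:
            alerts.append(f"{name}={value} is above {high}")
    return alerts
```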

## Configuration

All pipeline constants are in `src/config.py` (`PipelineConfig` frozen dataclass):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    target_rubrics: int = 6
    max_generation_rounds: int = 7
    initial_candidates: int = 30
    additional_candidates: int = 12
    max_total_candidates: int = 300
    max_no_new_rounds: int = 3
    max_seconds_per_example: int = 240

    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12

    model: str = "anthropic/claude-opus-4-6"

    generator_temperature: float = 0.8
    gate2_temperature: float = 0.2
    gate3_temperature: float = 0.1
    refinement_temperature: float = 0.6

    failure_categories: tuple[str, ...] = (
        "instruction-retention",
        "inference-memory",
        "version-editing",
        "self-coherence",
    )
```
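
Because the dataclass is frozen, overrides have to produce a new instance rather than mutate the existing one. The sketch below layers the documented environment variables and CLI values on top of the defaults using `dataclasses.replace`. The precedence (CLI over environment over defaults) is an assumption, `resolve_config` is a hypothetical helper, and the class is trimmed to the three concurrency fields.

```python
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:  # trimmed copy, concurrency fields only
    parallel_examples: int = 12
    gate2_concurrency: int = 30
    gate3_concurrency: int = 12


ENV_TO_FIELD = {
    "PARALLEL_EXAMPLES": "parallel_examples",
    "GATE2_CONCURRENCY": "gate2_concurrency",
    "GATE3_CONCURRENCY": "gate3_concurrency",
}


def resolve_config(cli_overrides: dict, env=None, base=PipelineConfig()):
    """Build a config from defaults, then env vars, then CLI values.

    cli_overrides maps field names to values; None means "flag not given".
    """
    env = os.environ if env is None else env
    values = {
        field: int(env[name]) for name, field in ENV_TO_FIELD.items() if name in env
    }
    values.update({k: v for k, v in cli_overrides.items() if v is not None})
    return replace(base, **values)  # frozen dataclass: returns a new instance
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.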

## Troubleshooting

### High "both_pass" rate
- Generation isn't targeting the actual contrast
- Solution: Add more specific constraint prompts in generator

### High inverted rate
- Generator fundamentally misunderstands what makes preferred better
- Solution: Re-analyze examples, check for ambiguous preferences

### Low refinement success
- Refinement strategies aren't effective
- Solution: Add more targeted strategies, reduce max cycles, accept lower recovery

## Output Format

The pipeline generates exactly **6 validated rubrics** per example in `data/validated/.json`:

```
data/validated/
├── pref-0000.json
├── pref-0001.json
└── ...
```

Each file contains an array of 6 rubrics. One entry is shown below; the other five have the same shape:

```json
[
  {
    "id": "rubric-uuid",
    "example_id": "pref-0000",
    "rubric_text": "[Objective] The response should...",
    "abstraction_level": "category",
    "category": "instruction-retention",
    "gate3_result": {
      "discrimination_strength": 4,
      "high_value": true,
      "discrimination_type": "clean",
      "passed": true
    }
  }
]
```

## Integration with RL Training

For RL training:
- Use `abstraction_level: "category"` rubrics for generalization
- Prioritize `high_value: true` rubrics (discriminate on subtle differences)
- Filter by `discrimination_strength >= 3` for clean gradients
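
These recommendations can be applied with a short post-processing step over the validated files. The sketch assumes the JSON shape shown under Output Format; `load_validated` and `select_for_training` are hypothetical helpers, not part of the repo.

```python
import json
from pathlib import Path


def load_validated(directory: str) -> list[dict]:
    """Read every per-example file and flatten the rubrics into one list."""
    rubrics = []
    for path in sorted(Path(directory).glob("*.json")):
        rubrics.extend(json.loads(path.read_text(encoding="utf-8")))
    return rubrics


def select_for_training(rubrics: list[dict], min_strength: int = 3) -> list[dict]:
    """Drop weak rubrics, then put high-value and category-level ones first."""
    kept = [
        r for r in rubrics
        if r["gate3_result"]["discrimination_strength"] >= min_strength
    ]
    # False sorts before True, so preferred rubrics come first; sort is stable.
    return sorted(
        kept,
        key=lambda r: (
            not r["gate3_result"]["high_value"],
            r["abstraction_level"] != "category",
        ),
    )
```

A typical call would be `select_for_training(load_validated("data/validated"))`.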

## References

- RUBRIC_STYLE_GUIDE.md - Mined rubric principles
- src/config.py - All pipeline constants
- src/client.py - API client configuration
