https://github.com/HJSang/CRISP_Reasoning_Compression
https://github.com/HJSang/CRISP_Reasoning_Compression
Last synced: 15 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/HJSang/CRISP_Reasoning_Compression
- Owner: HJSang
- Created: 2026-03-07T06:41:10.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-05-26T17:20:32.000Z (about 1 month ago)
- Last Synced: 2026-05-26T19:12:43.267Z (about 1 month ago)
- Language: Python
- Size: 3.86 MB
- Stars: 54
- Watchers: 1
- Forks: 7
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesomeopd - CRISP_Reasoning_Compression - the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700" alt="Stars"> | 2026.03 | LinkedIn | [arXiv 2603.05433](https://arxiv.org/abs/2603.05433) | OPSDC / CRISP | (♻️ Self-Distillation with Privileged Context — OPSD)
README
# CRISP: Compressed Reasoning via Iterative Self-Policy Distillation (Original OPSDC On-Policy Self-Distillation for Reasoning Compression)
This repository contains the code for **CRISP** (**C**ompressed **R**easoning via **I**terative **S**elf-**P**olicy Distillation), a method that teaches reasoning models to think more concisely by distilling their own concise behavior back into themselves.
**Paper:** [CRISP: Compressed Reasoning via Iterative Self-Policy Distillation](crisp_compressed_reasoning_via_iterative_self_policy_distillation.pdf) | [arXiv](https://arxiv.org/abs/2603.05433)
**Authors:** Hejian Sang\*, Yuanda Xu\*, Zhengze Zhou\*, Ran He\*, Zhipeng Wang, Jiachen Sun
**Related write-up:** [Scorer Choice in Math Reasoning Evaluation](https://zhengzezhou.github.io/math-scorer-choice/) — a four-policy decomposition of how verifier choice (answer-extraction vs. symbolic equivalence) can swing reported MATH-500 accuracy by up to ~80 percentage points on identical generations.
## Key Idea
Reasoning models think out loud, but much of what they say is noise. CRISP uses a single, almost trivial idea: *ask the model to be concise, then teach it to do so without being asked*.
- **Teacher**: The same model conditioned on a conciseness instruction (e.g., "Solve concisely, avoid unnecessary steps")
- **Student**: The same model without the conciseness instruction
Training generates student rollouts and minimizes per-token reverse KL divergence between student and teacher distributions. No ground-truth answers, no token budgets, no difficulty estimators.
## Results
| Model | Benchmark | Token Reduction | Accuracy Change |
|-------|-----------|----------------|-----------------|
| Qwen3-8B | MATH-500 | 59% | +9 pts (77% → 86%) |
| Qwen3-14B | MATH-500 | 57% | +16 pts (70% → 86%) |
| Qwen3-14B | AIME 2024 | 41% | +10 pts |
Compression naturally adapts to problem difficulty (~1.6x more compression on easy vs. hard problems), entropy remains stable throughout training, and general capabilities (MMLU) are fully preserved.
## Repository Structure
```
OnPolicySD-open/
├── verl/ # VERL framework (forked, with minor fixes)
├── workspace/
│ ├── config/
│ │ └── prompts.json # Prompt templates (student, teacher, length prune)
│ ├── data/
│ │ ├── DAPO-Math-17k-dedup/ # Training data (17k math problems)
│ │ ├── MATH-500/ # Validation benchmark
│ │ ├── aime24/ # AIME 2024 validation
│ │ └── aime25/ # AIME 2025 validation
│ ├── src/
│ │ ├── data/
│ │ │ ├── process_eval_data.py # Process eval datasets (train/val splits)
│ │ │ ├── prepare_length_prune_data.py # Generate length pruning prompts
│ │ │ └── prepare_self_distill_data.py # Generate self-distill prompts (with teacher solutions)
│ │ └── self_distill_hybrid/
│ │ ├── main_opsd.py # OPSD entry point
│ │ ├── opsd_trainer.py # OPSD trainer (JSD/reverse-KL loss)
│ │ ├── opsd_worker.py # OPSD FSDP worker
│ │ ├── sd_worker.py # Base self-distill worker
│ │ ├── sd_dataset.py # Dataset for paired teacher/student prompts
│ │ └── sd_verifier.py # Math answer verification
│ ├── scripts/sft/
│ │ └── train_opsd.sh # Main training launch script
│ └── execution-configs/ # Hyperparameter configs for Qwen3-8B and 14B
```
## Setup
### Prerequisites
- 8x H100/H200 GPUs (80GB)
- Python 3.10+
- CUDA 12.4+
### Installation
```bash
git clone https://github.com/HJSang/OPSD_Reasoning_Compression.git
cd OPSD_Reasoning_Compression
# Install VERL and dependencies
cd verl
pip install -e .
cd ..
# Install additional dependencies
pip install sglang pandas datasets hydra-core omegaconf
```
## Quick Start
The full pipeline has 3 stages:
### Stage 1: Process Evaluation Data
Process DAPO-Math-17k-dedup into train/val splits and prepare validation benchmarks (MATH-500, AIME 2024, AIME 2025).
```bash
cd workspace/src/data
python process_eval_data.py \
--data_dir ../../data \
--output_dir ../../data/processed
```
This produces:
- `data/processed/train.parquet` — DAPO training split (95%)
- `data/processed/val_dapo.parquet` — DAPO validation split (5%)
- `data/processed/val_math500.parquet`, `val_aime24.parquet`, `val_aime25.parquet` — Evaluation benchmarks
### Stage 2: Generate Length Pruning Prompts
Create paired teacher/student prompts for OPSD training. The teacher prompt adds a conciseness instruction; the student prompt is the original DAPO-Math prompt unchanged.
```bash
# Batch mode (recommended) — generates all 4 variants with shared 80/20 split:
python prepare_length_prune_data.py batch \
--input-parquet ../../data/DAPO-Math-17k-dedup/distinct-prompts-with-rewards.parquet \
--output-root ../../data
# This creates:
# data/length_prune_concise/ — "Solve concisely" teacher prompt
# data/length_prune_20pct/ — "Use 20% fewer tokens" teacher prompt
# data/length_prune_50pct/ — "Use 50% fewer tokens" teacher prompt
# data/length_prune_80pct/ — "Use 80% fewer tokens" teacher prompt
#
# Each directory contains:
# self_distill_prompts.parquet — Training prompts
# self_distill_prompts_val.parquet — Validation prompts
```
### Stage 3: Train OPSD
Launch OPSD training using the VERL HybridEngine (sglang for generation + FSDP for training).
#### Qwen3-8B
```bash
MODEL_PATH=/path/to/Qwen3-8B \
SD_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts.parquet \
SD_VAL_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts_val.parquet \
OPSD_BETA=0.5 \
SD_TEMPERATURE=1.0 \
SD_TOP_P=1.0 \
SD_MAX_TOKENS=8192 \
SFT_MAX_LENGTH=10240 \
TOTAL_EPOCHS=1 \
TRAIN_BATCH_SIZE=32 \
MICRO_BATCH_SIZE=2 \
LEARNING_RATE=1e-6 \
TP_SIZE=2 \
GPU_MEM_UTIL=0.75 \
ULYSSES_SP_SIZE=4 \
MAX_PROMPT_LENGTH=1024 \
MAX_RESPONSE_LENGTH=30000 \
VAL_MAX_TOKENS=30000 \
CHECK_STRUCTURE=false \
USE_LIGER=true \
OPSD_LOSS_TYPE=reverse_kl \
TEACHER_UPDATE_FREQ=50 \
EXPERIMENT_NAME=opsd_length_prune_concise \
bash workspace/scripts/sft/train_opsd.sh
```
#### Qwen3-14B
```bash
MODEL_PATH=/path/to/Qwen3-14B \
SD_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts.parquet \
SD_VAL_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts_val.parquet \
OPSD_BETA=0.5 \
SD_TEMPERATURE=1.0 \
SD_TOP_P=1.0 \
SD_MAX_TOKENS=8192 \
SFT_MAX_LENGTH=10240 \
TOTAL_EPOCHS=1 \
TRAIN_BATCH_SIZE=32 \
MICRO_BATCH_SIZE=2 \
LEARNING_RATE=1e-6 \
TP_SIZE=2 \
GPU_MEM_UTIL=0.75 \
ULYSSES_SP_SIZE=4 \
MAX_PROMPT_LENGTH=1024 \
MAX_RESPONSE_LENGTH=30000 \
VAL_MAX_TOKENS=30000 \
CHECK_STRUCTURE=false \
USE_LIGER=true \
OPSD_LOSS_TYPE=reverse_kl \
TEACHER_UPDATE_FREQ=50 \
EXPERIMENT_NAME=opsd_length_prune_concise \
bash workspace/scripts/sft/train_opsd.sh
```
Pre-configured hyperparameter files for various ablations (teacher update frequency, compression strength) are available in `workspace/execution-configs/`.
## Key Hyperparameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `OPSD_LOSS_TYPE` | `reverse_kl` | Loss type: `reverse_kl` or `jsd` |
| `OPSD_BETA` | `0.5` | JSD interpolation weight (only used when `jsd`) |
| `TEACHER_UPDATE_FREQ` | `50` | Steps between teacher weight updates (0 = frozen teacher) |
| `SD_TEMPERATURE` | `1.0` | Student rollout temperature |
| `SD_MAX_TOKENS` | `8192` | Max tokens for student generation |
| `SFT_MAX_LENGTH` | `10240` | Max sequence length for training |
| `CHECK_STRUCTURE` | `false` | Whether to require `` tags in responses |
| `USE_LIGER` | `true` | Memory-efficient loss via logsumexp |
## How It Works
1. **Generate**: sglang produces student responses from question-only prompts
2. **Score**: Teacher forward pass computes logits on student-generated tokens using the conciseness-augmented prompt
3. **Train**: Minimize per-token reverse KL between student and teacher distributions on ALL responses (no correctness filtering)
4. **Sync**: Updated weights are automatically synced back to sglang for the next generation step
5. **Refresh teacher**: Every `TEACHER_UPDATE_FREQ` steps, copy student weights to teacher for progressive compression
## Acknowledgments
Built on top of [VERL](https://github.com/volcengine/verl) (HybridEngine for combined generation and training).
## Citation
```bibtex
@article{sang2025crisp,
title={CRISP: Compressed Reasoning via Iterative Self-Policy Distillation},
author={Sang, Hejian and Xu, Yuanda and Zhou, Zhengze and He, Ran and Wang, Zhipeng and Sun, Jiachen},
journal={arXiv preprint arXiv:2603.05433},
year={2026}
}
```