https://github.com/ctrl-gaurav/debate-train-evolve
[EMNLP 2025 Main] DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning
https://github.com/ctrl-gaurav/debate-train-evolve
Last synced: 2 days ago
JSON representation
[EMNLP 2025 Main] DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning
- Host: GitHub
- URL: https://github.com/ctrl-gaurav/debate-train-evolve
- Owner: ctrl-gaurav
- License: mit
- Created: 2025-09-17T23:20:14.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2026-03-24T06:33:02.000Z (3 months ago)
- Last Synced: 2026-03-25T08:08:31.608Z (3 months ago)
- Language: Python
- Homepage: https://aclanthology.org/2025.emnlp-main.1666/
- Size: 200 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Debate, Train, Evolve
[](https://aclanthology.org/2025.emnlp-main.1666/)
[](https://aclanthology.org/2025.emnlp-main.1666/)
[](https://ctrl-gaurav.github.io/debate-train-evolve.github.io/)
[](https://python.org)
[](LICENSE)
**Self-Evolution of Language Model Reasoning via Multi-Agent Debate Traces**
**[Gaurav Srivastava](mailto:gks@vt.edu)**\* • **[Zhenyu Bi](mailto:zhenyub@vt.edu)** • **[Meng Lu](mailto:menglu@vt.edu)** • **[Xuan Wang](mailto:xuanw@vt.edu)**†
[](https://cs.vt.edu/)
[](https://2025.emnlp.org/)
[](https://aclanthology.org/2025.emnlp-main.1666/)
\* Lead Author † Corresponding Author
[**Read the Paper**](https://aclanthology.org/2025.emnlp-main.1666/) | [**Website & Docs**](https://ctrl-gaurav.github.io/debate-train-evolve.github.io/) | [**GitHub**](https://github.com/ctrl-gaurav/Debate-Train-Evolve)
---
## Overview
**DTE (Debate, Train, Evolve)** is a ground-truth-free training framework that evolves language model reasoning through multi-agent debate traces. Multiple LLM copies debate using Reflect-Critique-Refine (RCR) prompting, generating high-quality training data without external supervision. The model is then fine-tuned via Group Relative Policy Optimization (GRPO) and the process repeats.
**Key results:**
- Up to **+13.92%** accuracy gain (Qwen-1.5B on GSM-Plus)
- **+5.8%** average cross-domain generalization to science tasks
- Reduces sycophancy by **50%** via RCR prompting
- Single-model inference after training (no multi-agent overhead)
## Performance
| Model | GSM8K | GSM-Plus | MATH | ARC-Challenge | Best Gain |
|-------|-------|----------|------|---------------|-----------|
| Qwen-2.5-1.5B | 62.77 → **73.09** | 42.00 → **55.92** | 45.08 → **52.20** | 69.21 → 68.36 | **+13.92%** |
| Qwen-2.5-3B | 84.08 → **86.05** | 61.75 → **69.50** | 61.36 → **67.10** | 83.53 → **83.95** | **+7.75%** |
| Qwen-2.5-7B | 90.67 → 88.32 | 68.62 → **74.71** | 73.08 → **77.20** | 87.22 → **90.89** | **+6.09%** |
| Qwen-2.5-14B | 92.80 → **93.74** | 71.79 → **78.88** | 76.18 → **80.10** | 90.27 → **93.13** | **+7.09%** |
| Llama-3.2-3B | 72.55 → **75.06** | 45.67 → **53.79** | 39.76 → **43.80** | 73.12 → **77.23** | **+8.12%** |
| Llama-3.1-8B | 81.73 → **86.81** | 55.62 → **66.17** | 46.66 → **49.40** | 77.65 → **86.53** | **+10.55%** |
*Values show Base → Evolved performance. Bold = improvement.*
## Installation
**Prerequisites:** Python 3.9+ and a CUDA GPU (for training). Debate-only mode works on CPU.
```bash
# Quick setup (conda)
git clone https://github.com/ctrl-gaurav/Debate-Train-Evolve.git
cd Debate-Train-Evolve
bash setup.sh
# Or manual install
python -m venv dte_env && source dte_env/bin/activate
pip install -r requirements.txt
pip install -e .
# Verify
python main.py info
```
## Quick Start
### Python API
```python
import dte
# One-liner debate
result = dte.debate(
"What is 15 * 24?",
model="Qwen/Qwen2.5-0.5B-Instruct",
num_agents=3,
max_rounds=3,
task_type="math",
)
print(result.final_answer) # "360"
print(result.consensus_reached) # True
```
### CLI
```bash
# Single query debate
python main.py debate --query "What is 15 * 24?" --agents 3 --rounds 3
# Dataset evaluation
python main.py debate --dataset gsm8k --samples 20 --verbose
# Full pipeline (debate -> train -> evolve)
python main.py run --config config.yaml
```
### Full Pipeline
```python
import dte
pipeline = dte.from_config("config.yaml")
results = pipeline.run_complete_pipeline()
print(f"Improvement: {results['total_improvement']:.2%}")
```
## Project Structure
```
Debate-Train-Evolve/
├── dte/ # Main package
│ ├── __init__.py # Public API: dte.debate(), dte.from_config()
│ ├── core/ # Config, pipeline, evaluator, logger
│ ├── debate/ # Multi-agent debate (agent, manager, prompts)
│ ├── training/ # GRPO trainer + reward model
│ ├── data/ # Dataset management + data generation
│ └── utils/ # Answer extraction, helpers
├── examples/ # 6 usage examples
├── tests/ # Unit + GPU integration tests
├── config.yaml # Default configuration
├── main.py # CLI entry point
└── pyproject.toml # Package metadata
```
## Documentation
Full documentation is available on the [project website](https://ctrl-gaurav.github.io/debate-train-evolve.github.io/#/docs), including:
- **Installation & Setup** -- prerequisites, GPU support, development setup
- **Quick Start** -- Python API, CLI, component-level usage
- **API Reference** -- all public classes and functions
- **Configuration** -- complete YAML config reference
- **Training Guide** -- GRPO hyperparameters, multi-GPU, expected training times
- **Reward Functions** -- the 5 shaped reward functions (total max: 4.0)
- **Dataset Reference** -- 7 benchmarks (GSM8K, GSM-Plus, MATH, ARC, GPQA, CommonsenseQA)
- **CLI Reference** -- all commands and flags
- **Troubleshooting** -- OOM, model loading, consensus issues
- **FAQ** -- common questions answered
## CLI Commands
| Command | Description |
|---------|-------------|
| `python main.py run` | Run the complete DTE pipeline |
| `python main.py debate` | Standalone multi-agent debate |
| `python main.py generate` | Generate training data from debates |
| `python main.py train` | Train model with GRPO |
| `python main.py validate` | Validate a configuration file |
| `python main.py init` | Generate default config |
| `python main.py info` | Show system & GPU information |
## Contributing
```bash
pip install -e ".[dev]"
# Tests
pytest -m "not gpu" -v # Unit tests (no GPU)
pytest tests/test_debate_integration.py -v # GPU tests
# Lint & format
ruff check dte/ tests/
ruff format dte/ tests/
```
## Acknowledgments
This work was supported by NSF NAIRR Pilot with PSC Neocortex and NCSA Delta; Amazon, Cisco Research, Commonwealth Cyber Initiative, Amazon-Virginia Tech Center for Efficient and Robust Machine Learning, and the Sanghani Center for AI and Data Analytics at Virginia Tech.
## Citation
```bibtex
@inproceedings{srivastava2025debate,
title={Debate, Train, Evolve: Self-Evolution of Language Model Reasoning},
author={Srivastava, Gaurav and Bi, Zhenyu and Lu, Meng and Wang, Xuan},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
year={2025},
url={https://aclanthology.org/2025.emnlp-main.1666/}
}
```
## License
MIT License. See [LICENSE](LICENSE) for details.
---
[**Read the Paper**](https://aclanthology.org/2025.emnlp-main.1666/) | [**Website & Docs**](https://ctrl-gaurav.github.io/debate-train-evolve.github.io/) | [**GitHub**](https://github.com/ctrl-gaurav/Debate-Train-Evolve)
Made with ❤️ by the DTE Research Team
[](https://cs.vt.edu/)
[](https://2025.emnlp.org/)
[](https://aclanthology.org/2025.emnlp-main.1666/)