An open API service indexing awesome lists of open source software.

https://github.com/LiberCoders/FeatureBench

[ICLR 2026] Official Implementation of "FeatureBench: Benchmarking Agentic Coding for Complex Feature Development"
https://github.com/LiberCoders/FeatureBench

agentic-coding benchmark large-language-model software-engineering

Last synced: 5 days ago
JSON representation

[ICLR 2026] Official Implementation of "FeatureBench: Benchmarking Agentic Coding for Complex Feature Development"

Awesome Lists containing this project

README

          


logo


arXiv
License
DockerHub
HuggingFace
Leaderboard

---

FeatureBench is a test-driven data generation and evaluation pipeline for feature-level coding benchmarks.
It provides a unified CLI to run inference, evaluation, and dataset generation.

## 📰 News

🎁 **2026.02.06**: We now support one-click inference for mainstream agent frameworks, including **OpenHands, Claude Code, Codex, Gemini CLI, and mini-swe-agent**. All supported agent frameworks can be found [here](featurebench/infer/agents/). We have also open-sourced the FeatureBench **data pipeline**.

## 🚀 Quickstart

**Prerequisites:**
- [uv](https://docs.astral.sh/uv/getting-started/installation/) for Python environment management
- [docker](https://docs.docker.com/engine/install/) for reproducible builds and evaluation

```bash
# pypi
pip install featurebench
# or uv add featurebench

# local
git clone https://github.com/LiberCoders/FeatureBench.git
cd FeatureBench
uv sync
source .venv/bin/activate
```

**Configure:**
```bash
cp config_example.toml config.toml
```
See [docs/config.md](docs/config.md) for a comprehensive reference (harness, infer, data pipeline) with examples.

**Optional: pre-pull images to reduce network variance:**
```bash
fb pull --mode lite # lite split image list (13 images)
fb pull --mode full # full split image list (24 images)
fb pull --mode /path/to/images.txt # one image name per line

# full list: featurebench/resources/constants/full_images.txt
# lite list: featurebench/resources/constants/lite_images.txt
```

**Run inference:**
```bash
fb infer \
--config-path config.toml \
--agent mini_swe_agent \
--model openai/qwen3-coder-480b-a35b-instruct \
--split lite
```

**Run evaluation:**
```bash
fb eval \
-p runs//output.jsonl \
--split lite
# use -p gold to verify the gold patches
```

## 🧭 CLI Overview

`fb` provides three core commands:
- `fb infer` runs `featurebench.infer.run_infer` (docs: [docs/infer_cli_arg.md](docs/infer_cli_arg.md))
- `fb eval` runs `featurebench.harness.run_evaluation` (docs: [docs/harness_cli_arg.md](docs/harness_cli_arg.md))
- `fb data` runs `featurebench.pipeline` (docs: [docs/pipeline.md](docs/pipeline.md))

## ✍️ Citation

If you found FeatureBench useful, please cite us as:

```bibtex
@article{zhou2026featurebench,
title={FeatureBench: Benchmarking Agentic Coding for Complex Feature Development},
author={Zhou, Qixing and Zhang, Jiacheng and Wang, Haiyang and Hao, Rui and Wang, Jiahe and Han, Minghao and Yang, Yuxue and Wu, Shuzhe and Pan, Feiyang and Fan, Lue and others},
journal={arXiv preprint arXiv:2602.10975},
year={2026}
}
```

## 📧 Contact

If you have any questions, feel free to contact [qixingzhou1125@gmail.com](mailto:qixingzhou1125@gmail.com) or [zjcheng2022@gmail.com](mailto:zjcheng2022@gmail.com).