# StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following



📃 [Paper](https://arxiv.org/abs/2502.14494) | 🤗 [Dataset](https://huggingface.co/datasets/Jinnan/StructFlowBench) | 🖥️ [Code](https://github.com/mlgroupjlu/structflowbench)

## 1. Updates
- 2025/02/26: We enhanced the code documentation on GitHub with detailed implementation guidelines.
- 2025/02/24: We submitted our paper to Hugging Face's [Daily Papers](https://huggingface.co/papers/2502.14494).
- 2025/02/23: We released the StructFlowBench dataset on [Hugging Face](https://huggingface.co/datasets/Jinnan/StructFlowBench).
- 2025/02/20: We released the first version of our [paper](https://arxiv.org/abs/2502.14494) along with the dataset and codebase.

## 2. Introduction

We introduce **StructFlowBench**, a novel instruction-following benchmark integrating a multi-turn structural flow framework.
- We propose a six-category structured taxonomy for multi-turn instruction-following evaluation, offering an interpretable framework for analyzing dialogue structural flow.
- We introduce StructFlowBench, a structurally annotated multi-turn benchmark that leverages a structure-driven generation paradigm to enhance the simulation of complex dialogue scenarios.
- We systematically evaluate 13 state-of-the-art LLMs (3 closed-source and 10 open-source), unveiling disparities in structural processing capabilities and providing empirical insights for optimizing dialogue systems.

An illustration and example of the structural flow:
![Illustration](/resources/img/structural_flow.png)

The construction pipeline of StructFlowBench
![Construction Pipeline](/resources/img/data_construction_pipeline.png)

## 3. Results
The leaderboard of StructFlowBench
![leaderboard](/resources/img/leaderboard.jpeg)

Intra-turn-categorized Performance
![intra-turn](/resources/img/intra-turn_constraint_result.jpeg)

Task-categorized Performance
![task](/resources/img/task_result.jpeg)

The radar chart
![radar](/resources/img/radar.png)

## 4. Load Data
Data can be loaded from Hugging Face as demonstrated by the following Python code:
```python
from datasets import load_dataset

dataset = load_dataset("Jinnan/StructFlowBench", data_files="StructFlowBench.json")
```
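
As a quick sanity check, you can inspect the loaded split and a single example. This is a minimal sketch: `load_dataset` with `data_files` places all examples in a `train` split by default, and the exact field names depend on the dataset schema.

```python
# Quick sanity check of the loaded benchmark.
# With data_files, the Hugging Face loader puts all examples in the "train" split.
split = dataset["train"]
print(f"Number of examples: {len(split)}")

# Print the keys of the first example; field names depend on the dataset schema.
example = split[0]
print(example.keys())
```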

## 5. Inference
### 5.1 Prepare

All model API wrappers are provided in `evaluation/models`. To evaluate a model, find its corresponding file there. For open-source models, no additional preparation is needed; for closed-source models, provide the `base_url` and API `key` for authentication.
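
For illustration only, a closed-source model wrapper might look like the sketch below, which uses an OpenAI-compatible client. The class name, file layout, and method signatures are assumptions; the actual wrappers in `evaluation/models` may differ.

```python
# Hypothetical sketch of a closed-source model wrapper; the actual classes in
# evaluation/models may use different names and interfaces.
from openai import OpenAI


class ClosedSourceModel:
    def __init__(self, key: str, base_url: str, model_name: str):
        # Authentication for closed-source APIs: supply your key and base_url here.
        self.client = OpenAI(api_key=key, base_url=base_url)
        self.model_name = model_name

    def chat(self, messages: list[dict]) -> str:
        # messages follow the standard [{"role": ..., "content": ...}] format.
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
        )
        return response.choices[0].message.content
```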

### 5.2 Inference

Run the script below to perform inference with StructFlowBench using various models and generate their responses:

```bash
python infer.py \
--infer_model \
--in_path \
--out_dir \
--max_threads
```

Arguments:

- --infer_model: Name of the model to use for inference. Ensure the corresponding model class is defined in the `evaluation/models` directory.
- --in_path: Path to the input JSON file containing conversation data (default: `evaluation/data/input.json`).
- --out_dir: Directory where the inference results will be saved.
- --max_threads: Number of threads for parallel processing to speed up inference.

Example:
```bash
python infer.py --infer_model your_model_name --in_path evaluation/data/input_data.json --out_dir evaluation/output/response --max_threads 4
```

## 6. Evaluation
### 6.1 GPT-4o Evaluation

Run the script below to evaluate model responses using the specified evaluation model:

```bash
python evaluate.py \
--key \
--base_url \
--model_name \
--response_dir \
--eval_dir \
--max_try \
--max_workers \
--eval_model
```

Arguments:

- --key: API key for the service (required if the evaluation model requires authentication).
- --base_url: Base URL for the API service (required if the evaluation model is hosted externally).
- --model_name: Name of the model whose responses will be evaluated.
- --response_dir: Directory containing the model responses to evaluate (default: `evaluation/output/response`).
- --eval_dir: Directory to save the evaluation results (default: `evaluation/output/evaluation`).
- --max_try: Maximum number of retry attempts in case of failures (default: 5).
- --max_workers: Maximum number of worker threads for parallel processing (default: 5).
- --eval_model: Name of the model used for evaluation (default: `gpt-4o`).

Example:
```bash
python evaluate.py \
--key your_api_key \
--base_url https://api.example.com \
--model_name your_model_name \
--response_dir evaluation/output/response \
--eval_dir evaluation/output/evaluation \
--max_try 3 \
--max_workers 10 \
--eval_model gpt-4o
```

### 6.2 Score
To calculate scores for the results, use the following command:
```bash
python score.py
```
All models' evaluation scores will be saved in the `output/score` directory.

## 7. Citation
```bibtex
@article{li2025structflowbench,
  title={StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following},
  author={Li, Jinnan and Li, Jinzhe and Wang, Yue and Chang, Yi and Wu, Yuan},
  journal={arXiv preprint arXiv:2502.14494},
  year={2025}
}
```
Please cite our paper if you find our research and code useful.