# FlagEval evaluation platform

![FlagEval Logo](https://github.com/flageval-baai/.github/blob/main/profile/img_v3_02ge_8b495d86-f148-473d-afbf-695dc1b88f4g.jpg)

---

FlagEval, launched by BAAI in 2023, is a comprehensive evaluation system for large models that covers more than 800 open-source and closed-source models from around the globe. It spans more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, along with five major tasks and four categories of metrics.

## 🧠 FlagEval Report
Author: [FlagEval]()

The FlagEval Report series provides in-depth insights into the evolving landscape of large-scale model evaluation. Each issue delivers a comprehensive analysis of model capabilities across diverse tasks and metrics, helping researchers and developers better understand the strengths and limitations of leading AI models.

**Issue 2 (updated 2024-12-30)** [pdf]()

**Issue 1 (updated 2024-07-13)** [pdf]()

## 🌟 FlagEval Core

| Project | Scope | GitHub |
| --- | --- | --- |
| **FlagEval** | General‑purpose evaluation **toolkit & platform** for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | |

---

## 🚀 Satellite Repositories

| Project | Description | GitHub |
| --- | --- | --- |
| **FlagEvalMM** | Flexible framework for comprehensive **multimodal model evaluation** across text, image, and video tasks | |
| **SeniorTalk** | 55 h **Mandarin speech dataset** featuring 202 elderly speakers (75‑85 yrs) with rich annotations | |
| **ChildMandarin** | 41 h **child speech dataset** covering 397 speakers (3‑5 yrs), balanced by gender & region | |
| **HalluDial** | Large‑scale **dialogue hallucination benchmark** (spontaneous + induced scenarios, 147 k turns) | |
| **CMMU** | IJCAI‑24 **Chinese Multimodal Multi‑type Question** benchmark (3 603 exam‑style Q&A) | |

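The speech corpora listed above are, in practice, consumed as audio/transcript pairs. As a rough illustration only, the sketch below loads and inspects such a corpus with the Hugging Face `datasets` library; the dataset ID `BAAI/SeniorTalk`, the split name, and the `audio`/`text` column names are assumptions, not the projects' documented layout, so consult each repository's README for the actual distribution format.

```python
# Minimal sketch: inspect an ASR-style corpus with Hugging Face `datasets`.
# The dataset ID, split, and column names below are assumptions, not the
# official SeniorTalk layout; check the project README for the real ones.
from datasets import load_dataset, Audio

ds = load_dataset("BAAI/SeniorTalk", split="train", streaming=True)  # hypothetical ID
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))            # decode audio to 16 kHz

for example in ds.take(3):
    audio = example["audio"]  # dict with "array" and "sampling_rate" after decoding
    duration = len(audio["array"]) / audio["sampling_rate"]
    print(f"{duration:.1f} s:", example.get("text"))
```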
---

## 📚 Repository Matrix

| Repo | Highlights | Why It Matters | License |
| --- | --- | --- | --- |
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for the GPT‑4o era; supports batch evaluation (see the sketch after this table) | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |

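The vLLM/SGLang adapters mentioned for FlagEvalMM point at a common pattern: serve the model behind an OpenAI-compatible endpoint and let the harness submit prompts and score the responses. The sketch below shows that pattern with the `openai` client against a locally served vLLM model; the endpoint URL, model name, toy benchmark, and substring-match scoring are placeholders for illustration, not FlagEvalMM's actual API or metrics.

```python
# Illustrative batch-evaluation loop against an OpenAI-compatible endpoint
# (e.g. one started with `vllm serve <model>`). Not FlagEvalMM's real API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM's default local port

# Hypothetical mini-benchmark: (prompt, expected answer) pairs.
benchmark = [
    ("What is 2 + 2?", "4"),
    ("Name the capital of France.", "Paris"),
]

correct = 0
for prompt, expected in benchmark:
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=64,
    )
    answer = resp.choices[0].message.content or ""
    correct += int(expected.lower() in answer.lower())  # crude substring scoring

print(f"accuracy: {correct / len(benchmark):.2%}")
```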
---

## 🔭 Roadmap (2025‑2026)

1. **Continuous Benchmarking**: nightly runs on FlagScale with automated PR badges and regression alerts (see the sketch after this list).
2. **Community Challenges**: quarterly leaderboard sprints to surface emerging research directions.

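The regression-alert idea in item 1 reduces to comparing each nightly score against a stored baseline and failing the run when the drop exceeds a tolerance, which a CI job can then surface as a badge or alert. Below is a minimal sketch of that gate; the file names, metric keys, and 2.0-point tolerance are assumptions, not part of any existing FlagEval tooling.

```python
# Hypothetical regression gate for nightly benchmark runs: compare the latest
# scores against a stored baseline and exit non-zero when a task regresses.
# File names, metric keys, and the tolerance are illustrative assumptions.
import json
import sys

TOLERANCE = 2.0  # maximum allowed drop, in absolute score points


def check_regressions(baseline_path: str, latest_path: str) -> list[str]:
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"mmlu": 71.2, "gsm8k": 84.5}
    with open(latest_path) as f:
        latest = json.load(f)

    alerts = []
    for task, base_score in baseline.items():
        new_score = latest.get(task)
        if new_score is None:
            alerts.append(f"{task}: missing from latest run")
        elif base_score - new_score > TOLERANCE:
            alerts.append(f"{task}: {base_score:.1f} -> {new_score:.1f}")
    return alerts


if __name__ == "__main__":
    problems = check_regressions("baseline_scores.json", "nightly_scores.json")
    for line in problems:
        print("REGRESSION:", line)
    sys.exit(1 if problems else 0)  # non-zero exit fails the CI job / flips the badge
```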
---

## 🤝 Contributing

We welcome issues & PRs! Please check each project’s `CONTRIBUTING.md` and adhere to its license terms.

---

## 📄 Citation

If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.

---

## 🛡️ License

This meta‑repository is released under **Apache‑2.0**. Individual projects may apply different licenses—see their respective READMEs.

---

_Maintained by the FlagEval team · Last updated: 2025‑04‑23_