https://github.com/FlagOpen/FlagEval
- Host: GitHub
- URL: https://github.com/FlagOpen/FlagEval
- Owner: FlagOpen
- Created: 2025-04-23T05:00:38.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-05-16T09:13:01.000Z (6 months ago)
- Last Synced: 2025-06-07T22:41:47.806Z (5 months ago)
- Size: 6.84 KB
- Stars: 6
- Watchers: 6
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- StarryDivineSky - FlagOpen/FlagEval
- awesome-llm-eval - FlagEval
README
# FlagEval evaluation platform

---
FlagEval, launched by BAAI in 2023, is a comprehensive large-model evaluation system that covers over 800 open-source and closed-source models from around the globe. It spans more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, along with five major tasks and four categories of metrics.
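As a rough, hypothetical sketch of how that structure can be used (the dimension names and scores below are illustrative assumptions, not FlagEval's actual schema or API), per-task scores can be grouped by capability dimension and macro-averaged into an overall figure:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical results: (model, capability dimension, score in [0, 1]).
# Dimension names are illustrative, not FlagEval's official taxonomy.
results = [
    ("model-a", "reasoning", 0.71),
    ("model-a", "mathematics", 0.64),
    ("model-a", "task_solving", 0.80),
    ("model-b", "reasoning", 0.69),
    ("model-b", "mathematics", 0.72),
    ("model-b", "task_solving", 0.75),
]

# Group scores by model and dimension, then macro-average per dimension.
by_model = defaultdict(lambda: defaultdict(list))
for model, dimension, score in results:
    by_model[model][dimension].append(score)

for model, dims in by_model.items():
    dim_means = {d: mean(s) for d, s in dims.items()}
    overall = mean(dim_means.values())  # unweighted average across dimensions
    print(f"{model}: {dim_means} overall={overall:.3f}")
```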
## 🧠 FlagEval Report
Author: [FlagEval]()
The FlagEval Report series provides in-depth insights into the evolving landscape of large-scale model evaluation. Each issue delivers a comprehensive analysis of model capabilities across diverse tasks and metrics, enabling researchers and developers to better understand the strengths and limitations of leading AI models.
**Issue 2 (2024-12-30 Updated)** [pdf]()
**Issue 1 (2024-07-13 Updated)** [pdf]()
## 🌟 FlagEval Core
| Project | Scope | GitHub |
| --- | --- | --- |
| **FlagEval** | General‑purpose evaluation **toolkit & platform** for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, and audio (a minimal scoring sketch follows this table) | |
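The toolkit itself wires benchmarks, model adapters, and metric implementations together; as an illustration of the core loop only (this is not the FlagEval API), an evaluation run boils down to iterating over benchmark items, querying the model, and scoring its outputs:

```python
from typing import Callable, Iterable

def exact_match_accuracy(
    model: Callable[[str], str],
    dataset: Iterable[tuple[str, str]],
) -> float:
    """Score a model on (prompt, reference) pairs with exact-match accuracy."""
    total = correct = 0
    for prompt, reference in dataset:
        prediction = model(prompt).strip().lower()
        correct += int(prediction == reference.strip().lower())
        total += 1
    return correct / max(total, 1)

# Toy usage: a dummy "model" and a two-item benchmark.
dummy_model = lambda prompt: "4" if "2 + 2" in prompt else "unknown"
demo_set = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
print(exact_match_accuracy(dummy_model, demo_set))  # 0.5
```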
---
## 🚀 Satellite Repositories
| Project | Description | GitHub |
| --- | --- | --- |
| **FlagEvalMM** | Flexible framework for comprehensive **multimodal model evaluation** across text, image, and video tasks | |
| **SeniorTalk** | 55 h **Mandarin speech dataset** featuring 202 elderly speakers (75‑85 yrs) with rich annotations | |
| **ChildMandarin** | 41 h **child speech dataset** covering 397 speakers (3‑5 yrs), balanced by gender & region | |
| **HalluDial** | Large‑scale **dialogue hallucination benchmark** (spontaneous + induced scenarios, 147 k turns); a toy scoring sketch follows this table | |
| **CMMU** | IJCAI‑24 **Chinese Multimodal Multi‑type Question** benchmark (3 603 exam‑style Q&A) | |
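For a benchmark like HalluDial, the model under test typically judges whether each dialogue response is hallucinated, and those binary judgments are scored against gold labels. The sketch below computes generic precision/recall/F1 for such judgments; HalluDial's official metrics and data format may differ, so treat the function and label choices as assumptions:

```python
def hallucination_detection_scores(gold: list[bool], predicted: list[bool]) -> dict:
    """Precision/recall/F1 for binary hallucination judgments (True = hallucinated)."""
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy example: one of the two hallucinated turns is caught, plus one false alarm.
print(hallucination_detection_scores(
    gold=[True, True, False, False],
    predicted=[True, False, False, True],
))
```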
---
## 📚 Repository Matrix
| Repo | Highlights | Why It Matters | License |
| --- | --- | --- | --- |
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters (see the request sketch after this matrix) | Handles GPT‑4o‑class multimodal models; supports batch evaluation | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |
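FlagEvalMM's vLLM/SGLang adapters are useful because both engines can expose an OpenAI-compatible endpoint, so a multimodal benchmark item can be submitted as an ordinary chat request. A hedged sketch of that pattern follows; the endpoint URL, model name, and image URL are placeholders, and FlagEvalMM's own adapter code will differ:

```python
from openai import OpenAI  # pip install openai

# Assumes a vLLM or SGLang server is already running with an OpenAI-compatible
# API (e.g. started via `vllm serve <model>`); host, port, and names are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="placeholder-vlm",  # must match the model the server was launched with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the chart and report the largest value."},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
    temperature=0.0,  # near-deterministic decoding is typical for evaluation
)
print(response.choices[0].message.content)
```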
---
## 🔠 Roadmap (2025‑2026)
1. **Continuous Benchmarking**: nightly runs on FlagScale with automated PR badges and regression alerts (a minimal regression‑check sketch follows this list).
2. **Community Challenges**: quarterly leaderboard sprints to surface emerging research directions.
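As a hypothetical illustration of the regression-alert idea only (file names and the threshold are assumptions, not the actual FlagScale CI configuration), a nightly job can compare fresh benchmark scores against a stored baseline and exit non-zero when any score drops too far:

```python
import json
import sys

THRESHOLD = 0.02  # assumed: flag any drop larger than 2 points of normalized score

def check_regressions(baseline_path: str, nightly_path: str) -> list[str]:
    """Return human-readable alerts for benchmarks whose score regressed."""
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"mmlu": 0.71, "gsm8k": 0.55}
    with open(nightly_path) as f:
        nightly = json.load(f)

    alerts = []
    for benchmark, base_score in baseline.items():
        new_score = nightly.get(benchmark)
        if new_score is None:
            alerts.append(f"{benchmark}: missing from nightly run")
        elif base_score - new_score > THRESHOLD:
            alerts.append(f"{benchmark}: {base_score:.3f} -> {new_score:.3f}")
    return alerts

if __name__ == "__main__":
    problems = check_regressions("baseline_scores.json", "nightly_scores.json")
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # non-zero exit lets CI mark the run (and the PR badge) as failing
```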
---
## 🤝 Contributing
We welcome issues & PRs! Please check each project’s `CONTRIBUTING.md` and adhere to its license terms.
---
## 📄 Citation
If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.
---
## 🛡️ License
This meta‑repository is released under **Apache‑2.0**. Individual projects may apply different licenses—see their respective READMEs.
---
_Maintained by the FlagEval team · Last updated: 2025‑04‑23_