# FlagEval evaluation platform

![FlagEval Logo](https://github.com/flageval-baai/.github/blob/main/profile/img_v3_02ge_8b495d86-f148-473d-afbf-695dc1b88f4g.jpg)

---

FlagEval, launched by BAAI in 2023, is a comprehensive evaluation system for large models that covers more than 800 open-source and closed-source models from around the globe. It spans more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, along with five major tasks and four categories of metrics.

## 🧠 FlagEval Report
Author: [FlagEval]()

The FlagEval Report series provides in-depth insights into the evolving landscape of large-scale model evaluation. Each issue delivers a comprehensive analysis of model capabilities across diverse tasks and metrics, helping researchers and developers better understand the strengths and limitations of leading AI models.

**Issue 2 (updated 2024-12-30)** [pdf]()

**Issue 1 (updated 2024-07-13)** [pdf]()

## 🌟 FlagEval Core

| Project | Scope | GitHub |
| --- | --- | --- |
| **FlagEval** | General‑purpose evaluation **toolkit & platform** for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | |

---

## 🚀 Satellite Repositories

| Project | Description | GitHub |
| --- | --- | --- |
| **FlagEvalMM** | Flexible framework for comprehensive **multimodal model evaluation** across text, image, and video tasks | |
| **SeniorTalk** | 55 h **Mandarin speech dataset** featuring 202 elderly speakers (75‑85 yrs) with rich annotations | |
| **ChildMandarin** | 41 h **child speech dataset** covering 397 speakers (3‑5 yrs), balanced by gender & region | |
| **HalluDial** | Large‑scale **dialogue hallucination benchmark** (spontaneous + induced scenarios, 147 k turns) | |
| **CMMU** | IJCAI‑24 **Chinese Multimodal Multi‑type Question** benchmark (3 603 exam‑style Q&A) | |

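The speech corpora listed above are, in practice, consumed as audio/transcript pairs. As a rough illustration only, the sketch below loads and inspects such a corpus with the Hugging Face `datasets` library; the dataset ID `BAAI/SeniorTalk`, the split name, and the `audio`/`text` column names are assumptions, not the projects' documented layout, so consult each repository's README for the actual distribution format.

```python
# Minimal sketch: inspect an ASR-style corpus with Hugging Face `datasets`.
# The dataset ID, split, and column names below are assumptions, not the
# official SeniorTalk layout; check the project README for the real ones.
from datasets import load_dataset, Audio

ds = load_dataset("BAAI/SeniorTalk", split="train", streaming=True)  # hypothetical ID
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))            # decode audio to 16 kHz

for example in ds.take(3):
    audio = example["audio"]  # dict with "array" and "sampling_rate" after decoding
    duration = len(audio["array"]) / audio["sampling_rate"]
    print(f"{duration:.1f} s:", example.get("text"))
```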
---

## 📚 Repository Matrix

| Repo | Highlights | Why It Matters | License |
| --- | --- | --- | --- |
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for the GPT‑4o era; supports batch evaluation (see the sketch after this table) | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |

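The vLLM/SGLang adapters mentioned for FlagEvalMM point at a common pattern: serve the model behind an OpenAI-compatible endpoint and let the harness submit prompts and score the responses. The sketch below shows that pattern with the `openai` client against a locally served vLLM model; the endpoint URL, model name, toy benchmark, and substring-match scoring are placeholders for illustration, not FlagEvalMM's actual API or metrics.

```python
# Illustrative batch-evaluation loop against an OpenAI-compatible endpoint
# (e.g. one started with `vllm serve <model>`). Not FlagEvalMM's real API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM's default local port

# Hypothetical mini-benchmark: (prompt, expected answer) pairs.
benchmark = [
    ("What is 2 + 2?", "4"),
    ("Name the capital of France.", "Paris"),
]

correct = 0
for prompt, expected in benchmark:
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
        max_tokens=64,
    )
    answer = resp.choices[0].message.content or ""
    correct += int(expected.lower() in answer.lower())  # crude substring scoring

print(f"accuracy: {correct / len(benchmark):.2%}")
```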
---

## 🔭 Roadmap (2025‑2026)

1. **Continuous Benchmarking**: nightly runs on FlagScale with automated PR badges and regression alerts (see the sketch after this list).
2. **Community Challenges**: quarterly leaderboard sprints to surface emerging research directions.

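The regression-alert idea in item 1 reduces to comparing each nightly score against a stored baseline and failing the run when the drop exceeds a tolerance, which a CI job can then surface as a badge or alert. Below is a minimal sketch of that gate; the file names, metric keys, and 2.0-point tolerance are assumptions, not part of any existing FlagEval tooling.

```python
# Hypothetical regression gate for nightly benchmark runs: compare the latest
# scores against a stored baseline and exit non-zero when a task regresses.
# File names, metric keys, and the tolerance are illustrative assumptions.
import json
import sys

TOLERANCE = 2.0  # maximum allowed drop, in absolute score points


def check_regressions(baseline_path: str, latest_path: str) -> list[str]:
    with open(baseline_path) as f:
        baseline = json.load(f)  # e.g. {"mmlu": 71.2, "gsm8k": 84.5}
    with open(latest_path) as f:
        latest = json.load(f)

    alerts = []
    for task, base_score in baseline.items():
        new_score = latest.get(task)
        if new_score is None:
            alerts.append(f"{task}: missing from latest run")
        elif base_score - new_score > TOLERANCE:
            alerts.append(f"{task}: {base_score:.1f} -> {new_score:.1f}")
    return alerts


if __name__ == "__main__":
    problems = check_regressions("baseline_scores.json", "nightly_scores.json")
    for line in problems:
        print("REGRESSION:", line)
    sys.exit(1 if problems else 0)  # non-zero exit fails the CI job / flips the badge
```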
---

## 🤝 Contributing

We welcome issues & PRs! Please check each project’s `CONTRIBUTING.md` and adhere to its license terms.

---

## 📄 Citation

If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.

---

## 🛡️ License

This meta‑repository is released under **Apache‑2.0**. Individual projects may apply different licenses—see their respective READMEs.

---

_Maintained by the FlagEval team · Last updated: 2025‑04‑23_