{"id":22206130,"url":"https://github.com/FlagOpen/FlagEval","last_synced_at":"2025-07-27T07:31:20.380Z","repository":{"id":293628961,"uuid":"971148207","full_name":"FlagOpen/FlagEval","owner":"FlagOpen","description":null,"archived":false,"fork":false,"pushed_at":"2025-05-16T09:13:01.000Z","size":7,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-06-07T22:41:47.806Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FlagOpen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-23T05:00:38.000Z","updated_at":"2025-06-07T14:47:34.000Z","dependencies_parsed_at":"2025-05-16T10:37:03.574Z","dependency_job_id":null,"html_url":"https://github.com/FlagOpen/FlagEval","commit_stats":null,"previous_names":["flagopen/flageval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FlagOpen/FlagEval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlagOpen%2FFlagEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlagOpen%2FFlagEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlagOpen%2FFlagEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlagOpen%2FFlagEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FlagOpen","download_url":"https://codeload.github.com/FlagOpen/FlagEval/tar.gz/refs/heads/mai
n","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FlagOpen%2FFlagEval/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267324369,"owners_count":24069384,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-27T02:00:11.917Z","response_time":82,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-02T18:01:49.590Z","updated_at":"2025-07-27T07:31:20.374Z","avatar_url":"https://github.com/FlagOpen.png","language":null,"funding_links":[],"categories":["A01_文本生成_文本对话","Evaluation Datasets","Tools"],"sub_categories":["大语言对话模型及数据","Multitask \u003ca id=\"multitask01\"\u003e\u003c/a\u003e"],"readme":"# FlagEval evaluation platform  \n\n![FlagEval Logo](https://github.com/flageval-baai/.github/blob/main/profile/img_v3_02ge_8b495d86-f148-473d-afbf-695dc1b88f4g.jpg)\n\n---\n\n\nFlagEval, launched by BAAI in 2023, is a comprehensive large model evaluation system that encompasses over 800 open-source and closed-source models from around the globe. It features more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, along with five major tasks and four categories of metrics.\n\n## 🧠 FlagEval Report\nAuthor: [FlagEval](\u003chttps://flageval.baai.ac.cn/\u003e) \n\nThe FlagEval Report series provides in-depth insights into the evolving landscape of large-scale model evaluation. 
Each issue delivers a comprehensive analysis of model capabilities across diverse tasks and metrics, enabling researchers and developers to better understand the strengths and limitations of leading AI models.\n\n**Issue 2 (2024-12-30 Updated)** [pdf](\u003chttps://github.com/flageval-baai/FlagEval/blob/master/AI_%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%83%BD%E5%8A%9B%E5%85%A8%E6%99%AF%E6%89%AB%E6%8F%8F%20%E7%AC%AC%E4%BA%8C%E6%9C%9F.pdf\u003e)\n\n**Issue 1 (2024-07-13 Updated)** [pdf](\u003chttps://github.com/FlagOpen/FlagEval/blob/master/AI%E5%A4%A7%E6%A8%A1%E5%9E%8B%E8%83%BD%E5%8A%9B%E5%85%A8%E6%99%AF%E6%89%AB%E6%8F%8F.pdf\u003e) \n\n\n## 🌟 FlagEval Core\n\n| Project | Scope | GitHub |\n| --- | --- | --- |\n| **FlagEval** | General‑purpose evaluation **toolkit \u0026 platform** for LLMs and multimodal foundation models; integrates \u003e20 benchmarks across NLP, CV, Audio | \u003chttps://github.com/flageval-baai/FlagEval\u003e |\n\n---\n\n## 🚀 Satellite Repositories\n\n| Project | Description | GitHub |\n| --- | --- | --- |\n| **FlagEvalMM** | Flexible framework for comprehensive **multimodal model evaluation** across text, image, and video tasks | \u003chttps://github.com/flageval-baai/FlagEvalMM\u003e |\n| **SeniorTalk** | 55 h **Mandarin speech dataset** featuring 202 elderly speakers (75‑85 yrs) with rich annotations | \u003chttps://github.com/flageval-baai/SeniorTalk\u003e |\n| **ChildMandarin** | 41 h **child speech dataset** covering 397 speakers (3‑5 yrs), balanced by gender \u0026 region | \u003chttps://github.com/flageval-baai/ChildMandarin\u003e |\n| **HalluDial** | Large‑scale **dialogue hallucination benchmark** (spontaneous + induced scenarios, 147 k turns) | \u003chttps://github.com/flageval-baai/HalluDial\u003e |\n| **CMMU** | IJCAI‑24 **Chinese Multimodal Multi‑type Question** benchmark (3 603 exam‑style Q\u0026A) | \u003chttps://github.com/flageval-baai/CMMU\u003e |\n\n---\n\n## 📚 Repository Matrix\n\n| Repo | Highlights | Why It Matters | License |\n| 
--- | --- | --- | --- |\n| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model \u0026 algorithm benchmarking | Apache‑2.0 |\n| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for GPT‑4o era, supports batch eval | Apache‑2.0 |\n| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |\n| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |\n| HalluDial | Dialogue hallucination dataset \u0026 metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |\n| CMMU | Multimodal Q\u0026A exam | Stress‑tests domain knowledge \u0026 reasoning | MIT |\n\n---\n\n## 🔭 Roadmap (2025‑2026)\n\n1. **Continuous Benchmarking**: nightly runs on FlagScale with automated PR badges and regression alerts.\n2. **Community Challenges**: quarterly leaderboard sprints to surface emerging research directions.\n\n---\n\n## 🤝 Contributing\n\nWe welcome issues \u0026 PRs! Please check each project’s `CONTRIBUTING.md` and adhere to its license terms.\n\n---\n\n## 📄 Citation\n\nIf you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.\n\n---\n\n## 🛡️ License\n\nThis meta‑repository is released under **Apache‑2.0**. Individual projects may apply different licenses—see their respective READMEs.\n\n---\n\n_Maintained by the FlagEval team · Last updated: 2025‑04‑23_\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFlagOpen%2FFlagEval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFlagOpen%2FFlagEval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFlagOpen%2FFlagEval/lists"}