{"id":28385520,"url":"https://github.com/x-plug/writingbench","last_synced_at":"2025-09-01T03:34:09.618Z","repository":{"id":281621285,"uuid":"945834834","full_name":"X-PLUG/WritingBench","owner":"X-PLUG","description":"WritingBench: A Comprehensive Benchmark for Generative Writing","archived":false,"fork":false,"pushed_at":"2025-08-21T10:34:18.000Z","size":35413,"stargazers_count":107,"open_issues_count":2,"forks_count":10,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-21T11:44:40.624Z","etag":null,"topics":["ai","benchmark","evaluation-framework","huggingface","llm","long-context","long-text","nlp","text-generation","writing"],"latest_commit_sha":null,"homepage":"https://arxiv.org/pdf/2503.05244","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/X-PLUG.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-10T07:48:18.000Z","updated_at":"2025-08-21T10:34:22.000Z","dependencies_parsed_at":"2025-03-10T09:35:40.290Z","dependency_job_id":"a0a58149-6b71-494f-98c9-2255da471645","html_url":"https://github.com/X-PLUG/WritingBench","commit_stats":null,"previous_names":["x-plug/writingbench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/X-PLUG/WritingBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FWritingBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FWritingBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FWritingBe
nch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FWritingBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/X-PLUG","download_url":"https://codeload.github.com/X-PLUG/WritingBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/X-PLUG%2FWritingBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273069704,"owners_count":25040103,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-01T02:00:09.058Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","benchmark","evaluation-framework","huggingface","llm","long-context","long-text","nlp","text-generation","writing"],"created_at":"2025-05-30T10:40:30.913Z","updated_at":"2025-09-01T03:34:09.607Z","avatar_url":"https://github.com/X-PLUG.png","language":"Python","readme":"# WritingBench: A Comprehensive Benchmark for Generative Writing\n\u003cp align=\"center\"\u003e\n  📃 \u003ca href=\"https://arxiv.org/abs/2503.05244\" target=\"_blank\"\u003e[Paper]\u003c/a\u003e • 🚀 \u003ca href=\"https://github.com/X-PLUG/WritingBench\" target=\"_blank\"\u003e[Github Repo]\u003c/a\u003e • 🏆 \u003ca href=\"https://huggingface.co/spaces/WritingBench/WritingBench\" target=\"_blank\"\u003e[Leaderboard]\u003c/a\u003e • 📏 \u003ca 
href=\"https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B\" target=\"_blank\"\u003e[Critic Model]\u003c/a\u003e • ✍️ \u003ca href=\"https://huggingface.co/AQuarterMile/Writing-Model-Qwen-7B\" target=\"_blank\"\u003e[Writing Model]\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"pics/construction.png\" \n       alt=\"Benchmark Construction Pipeline\"\n       width=\"900\"\u003e\n\u003c/div\u003e\n\n## 🚀 What's New\n\n#### ```2025-04-29```\n- **🏆 Leaderboard Launch**: Explore evaluation results on the [Hugging Face Leaderboard](https://huggingface.co/spaces/WritingBench/WritingBench) and the [ModelScope Leaderboard](https://modelscope.cn/studios/WritingBench/WritingBench). Updated with the latest LLM evaluations (Claude-3-7-Sonnet, o3, grok-3, etc.).\n  - Parameters for response generation: `top_p: 0.8`; `top_k: 20`; `temperature: 0.7`; `max_length: 16000` (or the maximum allowed if less than 16000)\n  - Parameters for scoring: `top_p: 0.95`; `top_k: (empty)`; `temperature: 1.0`; `max_length: 2048`\n  - Leaderboard scores are scaled from a 10-point to a 100-point scale (multiplied by 10) for easier viewing.\n- ‼️ Updated the [benchmark queries \u0026 criteria](https://github.com/X-PLUG/WritingBench/blob/main/benchmark_query/benchmark_all.jsonl) for improved assessment, including **1,000** queries and requirement-dimension subsets.\n- ‼️ Updated the [evaluation prompt](https://github.com/X-PLUG/WritingBench/blob/main/prompt.py) for better scoring, and switched to **Claude-3-7-Sonnet** for evaluation.\n  \n#### ```2025-03-10```\n- We released the first version of WritingBench, including **1,239** writing queries and style/format/length dimension subsets.\n\n## 📖 Overview\nWritingBench is a comprehensive benchmark for evaluating LLMs' writing capabilities across **1,000 real-world queries**, spanning:\n- 6 primary domains\n- 100 fine-grained subdomains\n- 1,500+ avg. tokens per query\n\nWritingBench integrates diverse sources of materials. 
Each query is paired with **5 instance-specific criteria**, scored either by LLM evaluators or by a fine-tuned critic model.\n\n\n## 🏗️ Benchmark Construction\n\nWritingBench is built through a hybrid pipeline combining **Model-Augmented Query Generation** and **Human-in-the-Loop Refinement**, ensuring both diversity and real-world applicability. The construction process involves two key phases:\n\n### 🤖 Model-Augmented Query Generation\n\n#### Phase 1: Initial Query Generation\nLeverage LLMs to generate queries from a two-tiered domain pool grounded in real-world writing scenarios, consisting of 6 primary domains and 100 secondary subdomains, covering:\n   - 🔬 Academic \u0026 Engineering\n   - 💼 Finance \u0026 Business\n   - ⚖️ Politics \u0026 Law\n   - 🎨 Literature \u0026 Art\n   - 🎓 Education\n   - 📢 Advertising \u0026 Marketing\n\n#### Phase 2: Query Diversification\nEnhance the diversity and practical applicability of queries by applying randomly selected strategies from the **Query Refinement Guidance Pool**, covering:\n- Style Adjustments (e.g., kid-friendly tone)\n- Format Specifications (e.g., IEEE template)\n- Length Constraints (e.g., 500-word summary)\n- Personalization (e.g., educator's perspective)\n- Content Specificity (e.g., 2023 Q3 metrics)\n- Expression Optimization (query rewriting)\n\n### ✍️ Human-in-the-Loop Refinement\n\n#### Phase 1: Material Collection\n30 trained annotators collect the necessary open-source materials (e.g., public financial statements or legal templates), guided by material requirements generated by LLMs.\n\n#### Phase 2: Expert Screening \u0026 Optimization\n5 experts conduct a meticulous two-stage filtering process:\n- Query adaptation: ambiguous or unrealistic queries are revised to better align with the provided materials and practical scenarios\n- Material pruning: redundant or irrelevant content is eliminated from the collected materials\n\n## 📈 Evaluation Framework\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg 
src=\"pics/criteria.png\" alt=\"Instance-Specific Criteria\" width=\"800\"\u003e\n\u003c/div\u003e\n\n### Phase 1: Dynamic Criteria Generation\nGiven a query $q$ in WritingBench, the LLM is prompted to automatically generate a set of five evaluation criteria, $C_q = \\{c_1, \\ldots, c_5\\}$. Each criterion comprises three components: a concise name summarizing the criterion, an extended description elaborating on the evaluation focus, and detailed scoring rubrics.\n\n### Phase 2: Rubric-based Scoring\nFor each criterion $c_i \\in C_q$, the evaluator independently scores a response $r$ on a 10-point scale, providing both a score and a justification.\n\n\n## 🛠 Installation\n```bash\ngit clone https://github.com/X-PLUG/WritingBench.git\n```\n\n## 📂 Repository Structure\n```bash\n.\n├── generate_response.py      # Generation script\n├── evaluate_benchmark.py     # Evaluation script\n├── calculate_scores.py       # Scoring script\n├── prompt.py                 # Prompt templates\n├── evaluator/\n│   ├── __init__.py\n│   ├── critic.py             # Critic model evaluation interface\n│   └── llm.py                # LLM evaluation interface\n└── benchmark_query/\n    ├── benchmark_all.jsonl   # Full dataset (1,000 queries)\n    └── requirement/\n        ├── style/\n        │   ├── style_subset.jsonl    # requirement-involved subset for style\n        │   └── style_subset_C.jsonl  # category-specific subset for style\n        ├── format/\n        │   ├── format_subset.jsonl    # requirement-involved subset for format\n        │   └── format_subset_C.jsonl  # category-specific subset for format\n        └── length/\n            ├── length_subset.jsonl    # requirement-involved subset for length\n            └── length_subset_C.jsonl  # category-specific subset for length\n```\n\n## 🚀 Quick Start\n\n### 1. 
Generation\nGenerate responses for the evaluation queries (you need to complete the API call in the `writer` function in `generate_response.py`), or obtain the answers through any other method. Whatever approach you take, write the results to a file that follows exactly the same JSONL structure as the example `response_model.jsonl`.\n```bash\n# --query_file: use files under benchmark_query/\n# --output_file: path to save the response file\npython generate_response.py \\\n  --query_file query_set.jsonl \\\n  --output_file ./responses/response_model.jsonl\n```\n\n### 2. Scoring\nFirst, add your API credentials:\n- For LLM-as-a-Judge, see `evaluator/llm.py`. We recommend using `Claude-3-7-Sonnet` for evaluation.\n```python\n  self.api_key = \"your_api_key_here\"\n  self.url = \"Your API endpoint\"\n  self.model = \"Choose your model name\"\n```\n- For the critic model, see `evaluator/critic.py`\n```python\n  self.model = LLM(\n      model=\"\", # Your local path. Please download the critic model from https://huggingface.co/AQuarterMile/WritingBench-Critic-Model-Qwen-7B.\n      tensor_parallel_size=1, # Your tensor parallel size setting. Defaults to 1, indicating no parallelism\n  )\n```\n\nThen choose the appropriate evaluation sets from `benchmark_query/`.\n```bash\n# --evaluator: critic or claude\n# --query_criteria_file: use files under benchmark_query/\npython evaluate_benchmark.py \\\n  --evaluator critic \\\n  --query_criteria_file query_set.jsonl \\\n  --input_file ./responses/response_model.jsonl \\\n  --output_file ./score_dir/score_model.jsonl\n```\n\nAn example line of `response_model.jsonl`, which stores the responses generated by the evaluated LLMs:\n```json\n{\"index\": i, \"response\": \"xxx\"}\n```\n\n### 3. 
Calculation\nStore all scoring results from the previous step in a folder, then aggregate the scores.\n```bash\n# --score_dir: directory with the JSONL score files produced in the previous stage\n# --benchmark_file: full benchmark file provided in the repository\n# --output_excel: path to the aggregated-result Excel file to be generated\n# --requirement_dir: requirement folder included in the repository\npython calculate_scores.py \\\n  --score_dir ./score_dir \\\n  --benchmark_file ./benchmark_query/benchmark_all.jsonl \\\n  --output_excel ./scores.xlsx \\\n  --requirement_dir ./requirement\n```\n\n\n## 📝 Citation\n\n```\n@misc{wu2025writingbench,\n      title={WritingBench: A Comprehensive Benchmark for Generative Writing}, \n      author={Yuning Wu and Jiahao Mei and Ming Yan and Chenliang Li and Shaopeng Lai and Yuran Ren and Zijia Wang and Ji Zhang and Mengyue Wu and Qin Jin and Fei Huang},\n      year={2025},\n      url={https://arxiv.org/abs/2503.05244}, \n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-plug%2Fwritingbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx-plug%2Fwritingbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx-plug%2Fwritingbench/lists"}