{"id":21773392,"url":"https://github.com/thunlp/LLMxMapReduce","last_synced_at":"2025-07-19T10:30:55.996Z","repository":{"id":258129794,"uuid":"861632234","full_name":"thunlp/LLMxMapReduce","owner":"thunlp","description":null,"archived":false,"fork":false,"pushed_at":"2024-10-16T04:37:52.000Z","size":2241,"stargazers_count":124,"open_issues_count":3,"forks_count":8,"subscribers_count":5,"default_branch":"main","last_synced_at":"2024-11-19T18:58:43.906Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thunlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-23T08:55:07.000Z","updated_at":"2024-11-19T06:51:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"e27b31f4-fec1-4850-b749-c58526471076","html_url":"https://github.com/thunlp/LLMxMapReduce","commit_stats":null,"previous_names":["thunlp/llmxmapreduce"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FLLMxMapReduce","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FLLMxMapReduce/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FLLMxMapReduce/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thunlp%2FLLMxMapReduce/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thunlp","download_url":"https://codeload.github.com/thunlp/LLMxMapReduce/tar.gz/refs/heads/main","host":
{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226584451,"owners_count":17655036,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-26T17:01:10.857Z","updated_at":"2025-07-19T10:30:55.955Z","avatar_url":"https://github.com/thunlp.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"# LLMxMapReduce: Simplified Long-Sequence Processing using Large Language Models\n\n\u003ca href='https://surveygo.modelbest.cn/'\u003e\u003cimg src='https://img.shields.io/badge/Demo-Page-pink'\u003e\u003c/a\u003e \u003ca href='https://arxiv.org/abs/2410.09342'\u003e\u003cimg src='https://img.shields.io/badge/V1-Paper-Green'\u003e\u003c/a\u003e \u003ca href='https://arxiv.org/abs/2504.05732'\u003e\u003cimg src='https://img.shields.io/badge/V2-Paper-blue'\u003e\u003c/a\u003e \u003ca href='https://huggingface.co/datasets/R0k1e/SurveyEval'\u003e\u003cimg src='https://img.shields.io/badge/SurveyEval-Benchmark-yellow'\u003e\u003c/a\u003e \u003ca href='README_zh.md'\u003e\u003cimg src='https://img.shields.io/badge/Chinese-Readme-red'\u003e\u003c/a\u003e\n\n# 🎉 News\n- [x] **`2025.04.22`** Release [SurveyGO](https://surveygo.modelbest.cn/), an online writing system driven by LLMxMapReduce-V2.\n- [x] **`2025.04.09`** Release the paper of LLMxMapReduce-V2 in [arXiv](https://arxiv.org/abs/2504.05732).\n- [x] **`2025.02.21`** Add support for both OpenAI API and OpenAI-compatible APIs (e.g., vLLM).\n- [x] **`2024.10.12`** Release the paper of LLMxMapReduce-V1 in 
[arXiv](https://arxiv.org/abs/2410.09342).\n- [x] **`2024.09.12`** Release the code for LLMxMapReduce-V1.\n\n# 📚 Overview\n**LLMxMapReduce** is a divide-and-conquer framework designed to enhance modern large language models (LLMs) in understanding and generating long sequences. Developed collaboratively by **AI9STARS**, **OpenBMB**, and **THUNLP**, this framework draws inspiration from the classic MapReduce algorithm introduced in the field of big data. Our goal is to build an LLM-driven distributed computing system capable of efficiently processing long sequences. Here are the main versions of LLMxMapReduce:\n\n* [**LLMxMapReduce-V1**](https://github.com/thunlp/LLMxMapReduce/blob/main/LLMxMapReduce_V1): Utilizes a structured information protocol and in-context confidence calibration to enhance long-sequence understanding, enabling [MiniCPM3-4B](https://github.com/OpenBMB/MiniCPM) to outperform 70B-scale models in long-context evaluations.\n* [**LLMxMapReduce-V2**](https://github.com/thunlp/LLMxMapReduce/tree/main/LLMxMapReduce_V2): Introduces an entropy-driven convolutional test-time scaling mechanism to improve the integration of extremely large volumes of information, powering the online [SurveyGO](https://surveygo.modelbest.cn/) system.\n\n# 📖 Introduction\n\n\nLong-form generation is crucial for a wide range of practical applications, typically categorized into short-to-long and long-to-long generation. While short-to-long generations have received considerable attention, generating long texts from extremely long resources remains relatively underexplored. The primary challenge in long-to-long generation lies in effectively integrating and analyzing relevant information from extensive inputs, which remains difficult for current large language models (LLMs). In this paper, we propose LLMxMapReduce-V2, a novel test-time scaling strategy designed to enhance the ability of LLMs to process extremely long inputs. 
Drawing inspiration from convolutional neural networks, which iteratively integrate local features into higher-level global representations, LLMxMapReduce-V2 utilizes stacked convolutional scaling layers to progressively expand the understanding of input materials. Both quantitative and qualitative experimental results demonstrate that our approach substantially enhances the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"assets/main_pic.jpg\" alt=\"$\\text{LLM}\\times \\text{MapReduce}$-V2 framework\"\u003e\n\u003c/div\u003e\n\n# ⚡️ Getting Started\nThe following steps are for LLMxMapReduce-V2. To use LLMxMapReduce-V1, see [its README](LLMxMapReduce_V1/README.md).\n\nTo get started, install all dependencies listed in requirements.txt by running:\n```bash\ncd LLMxMapReduce_V2\nconda create -n llm_mr_v2 python=3.11\nconda activate llm_mr_v2\npip install -r requirements.txt\npython -m playwright install --with-deps chromium\n```\n\nBefore evaluation, download the NLTK `punkt_tab` resource first.\n```python\nimport nltk\nnltk.download('punkt_tab')\n```\n## Env config\nPlease set OPENAI_API_KEY and OPENAI_API_BASE in your environment variables before starting the pipeline. If you use miniconda, replace `anaconda3` in `LD_LIBRARY_PATH` with `miniconda3`.\n```bash\nexport LD_LIBRARY_PATH=${HOME}/anaconda3/envs/llm_mr_v2/lib/python3.11/site-packages/nvidia/nvjitlink/lib:${LD_LIBRARY_PATH}\nexport PYTHONPATH=$(pwd):${PYTHONPATH}\n# Your OpenAI key. Required when you choose OpenAI as the infer type.\nexport OPENAI_API_KEY=\"your-openai-key\"\n# Your OpenAI base URL.\nexport OPENAI_API_BASE=\"your-openai-base-url\"\n# Your Google Cloud key. Required when you choose 
Google as the infer type.\nexport GOOGLE_API_KEY=\"your-google-cloud-key\"\n# Get a SERP API key from https://serpapi.com\nexport SERP_API_KEY=\"your-serp-api-key\"\n```\n\nWe provide both English and Chinese versions of the prompts. The default is English. To use the Chinese version, set this environment variable:\n```bash\nexport PROMPT_LANGUAGE=\"zh\"\n```\n\n## Model Set\n⚠️ We strongly recommend using the Gemini Flash models. Other models may produce unexpected errors. This project makes a large number of API calls at high concurrency, so locally deployed models are not recommended.\n\nThe models used in the generation process are configured in the `./LLMxMapReduce_V2/config/model_config.json` file. Currently, we support both the OpenAI API and the Google API. You can specify the API to be used in the `infer_type` key. Additionally, you need to specify the model name in the `model` key.\n\nMoreover, the crawling process also requires large language model (LLM) inference. You can change its model in a similar manner in the `./LLMxMapReduce_V2/src/start_pipeline.py` file.\n\n## Start LLMxMapReduce_V2 pipeline\nFollow the instructions to generate a report. The generated Markdown file is written to `./output/md`.\n```bash\ncd LLMxMapReduce_V2\nbash scripts/pipeline_start.sh TOPIC output_file_path.jsonl\n```\n\nTo use your own data, set `--input_file` and do not set `--topic` in the script.\n\nThe input data should contain at least the following fields:\n```json\n{\n  \"title\": \"The article title you wish to write\",\n  \"papers\": [\n    {\n      \"title\": \"The material title\",\n      \"abstract\": \"The abstract of the material. Optional; if omitted, part of the full text will be excerpted\",\n      \"txt\": \"The reference material full content\"\n    }\n  ]\n}\n```\n\nYou can use [this script](LLMxMapReduce_V2/scripts/output_to_md.py) to convert data from `.jsonl` to multiple `.md` files.\n\n# 📃 Evaluation\nThe following steps are about LLMxMapReduce-V2. 
If you want to use LLMxMapReduce-V1, refer to [its README](LLMxMapReduce_V1/README.md).\n\nFollow the steps below to set up the evaluation:\n## 1. Download the Dataset\nBefore running the evaluation, download the `test` split of the [SurveyEval dataset](https://huggingface.co/datasets/R0k1e/SurveyEval). After downloading, store it in a `.jsonl` file.\n\n## 2. Run the Evaluation\nRun the [evaluation script](LLMxMapReduce_V2/scripts/eval_all.sh) to evaluate the generated results.\n```bash\ncd LLMxMapReduce_V2\nbash scripts/eval_all.sh output_data_file_path.jsonl\n```\nNote that the evaluation process consumes a large number of tokens, so make sure your API account has sufficient balance.\n\n# 📊 Experiment Results\nOur experiments demonstrate the improved performance of LLMs using the LLMxMapReduce-V2 framework on SurveyEval. Detailed results are provided below.\n\n| **Methods**           | **Struct.** | **Fait.** | **Rele.** | **Lang.** | **Crit.** | **Num.** | **Dens.** | **Prec.** | **Recall** |\n|-----------------------|-------------|-----------|-----------|-----------|-----------|----------|-----------|-----------|------------|\n| Vanilla               | 94.44       | 96.43     | **100.00**| **96.50** | 37.11     | 78.75    | **74.64** | 25.48     | 26.46      |\n| + Skeleton            | **98.95**   | **97.03** | **100.00**| 95.95     | **41.01** | **135.15**| 72.96     | **62.60** | **65.11**  |\n| AutoSurvey            | 86.00       | 93.10     | **100.00**| 92.90     | 68.39     | 423.35   | 31.97     | 50.12     | 51.73      |\n| LLMxMapReduce_V2       | **95.00**   | **97.22** | **100.00**| **94.34** | **71.99** | **474.90**| **52.23** | **95.50** | **95.80**  |\n\n# 📝 Citation\nIf you use the content of this repository, please cite the papers and leave a star :).\n\n```\n@misc{wang2025llmtimesmapreducev2entropydrivenconvolutionaltesttime,\n      title={LLM$\\times$MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from 
Extremely Long Resources}, \n      author={Haoyu Wang and Yujia Fu and Zhu Zhang and Shuo Wang and Zirui Ren and Xiaorong Wang and Zhili Li and Chaoqun He and Bo An and Zhiyuan Liu and Maosong Sun},\n      year={2025},\n      eprint={2504.05732},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2504.05732}, \n}\n\n@misc{zhou2024llmtimesmapreducesimplifiedlongsequenceprocessing,\n      title={LLM$\\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models}, \n      author={Zihan Zhou and Chong Li and Xinyi Chen and Shuo Wang and Yu Chao and Zhili Li and Haoyu Wang and Rongqiao An and Qi Shi and Zhixing Tan and Xu Han and Xiaodong Shi and Zhiyuan Liu and Maosong Sun},\n      year={2024},\n      eprint={2410.09342},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2410.09342}, \n}\n```\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp%2FLLMxMapReduce","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthunlp%2FLLMxMapReduce","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthunlp%2FLLMxMapReduce/lists"}