{"id":15829106,"url":"https://github.com/THUDM/LongAlign","last_synced_at":"2025-10-16T21:31:22.782Z","repository":{"id":220200373,"uuid":"749113197","full_name":"THUDM/LongAlign","owner":"THUDM","description":"[EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs","archived":false,"fork":false,"pushed_at":"2024-12-16T05:32:30.000Z","size":6374,"stargazers_count":239,"open_issues_count":9,"forks_count":16,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-01-29T09:09:31.335Z","etag":null,"topics":["alignment","llm","long-context","longtext"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-27T16:10:20.000Z","updated_at":"2025-01-28T16:06:22.000Z","dependencies_parsed_at":"2024-02-08T01:29:43.749Z","dependency_job_id":"09cc00bd-a848-4f4d-8e9a-d05d9dbc2abf","html_url":"https://github.com/THUDM/LongAlign","commit_stats":null,"previous_names":["thudm/longalign"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FLongAlign","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FLongAlign/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FLongAlign/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FLongAlign/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUD
M","download_url":"https://codeload.github.com/THUDM/LongAlign/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":236749064,"owners_count":19198617,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","llm","long-context","longtext"],"created_at":"2024-10-05T11:00:37.150Z","updated_at":"2025-10-16T21:31:21.906Z","avatar_url":"https://github.com/THUDM.png","language":"Python","readme":"![](assets/LongAlign-logo.gif)\n# LongAlign: A Recipe for Long Context Alignment of LLMs\n\n\u003cp align=\"center\"\u003e\n    🤗 \u003ca href=\"https://huggingface.co/datasets/THUDM/LongAlign-10k\" target=\"_blank\"\u003eHF Repo\u003c/a\u003e • 📃 \u003ca href=\"https://arxiv.org/abs/2401.18058\" target=\"_blank\"\u003ePaper\u003c/a\u003e\n\u003c/p\u003e\n\nRead this README in [Chinese](README_zh.md)\n\n**LongAlign** is the first full recipe for LLM alignment on long context. We propose the **LongAlign-10k** dataset, containing 10,000 pieces of long instruction data ranging from 8k to 64k tokens in length. We investigate two training strategies, namely **packing (with loss weighting) and sorted batching**, both of which are implemented in our code. 
For real-world long context evaluation, we introduce **LongBench-Chat**, which evaluates instruction-following capability on queries of 10k-100k tokens in length.\n\n## 🔍 Table of Contents\n- [⚙️ Data Preparation](#data-preparation)\n- [🖥️ LongAlign Training](#longalign-training)\n- [📊 Evaluation](#longbench-chat-evaluation)\n- [📝 Citation](#citation)\n\n\u003ca name=\"data-preparation\"\u003e\u003c/a\u003e\n## ⚙️ Data Preparation\n\nYou can download and save the **LongAlign-10k** data through the Hugging Face `datasets` library ([🤗 HF Repo](https://huggingface.co/datasets/THUDM/LongAlign-10k)):\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset('THUDM/LongAlign-10k')\nfor split, split_dataset in dataset.items():\n    split_dataset.to_json(\"data/raw/long.jsonl\")\n```\nThe ShareGPT data can be downloaded from [here](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main/HTML_cleaned_raw_dataset). We refer to the [open-instruct](https://github.com/allenai/open-instruct) repository for the preprocessing of the ShareGPT data. Please save the data file at `data/raw/sharegpt.jsonl`. You can use other data as a source of general instruction data, but please format your data as follows:\n```json\n{\n    \"messages\": [{\"role\": \"user\", \"content\": \"...\"},\n                 {\"role\": \"assistant\", \"content\": \"...\"}, ...]\n}\n```\n\n\u003ca name=\"longalign-training\"\u003e\u003c/a\u003e\n## 🖥️ LongAlign Training\n\n### Environment Setup\nInstall the requirements with pip: `pip install -r requirements.txt`. For Llama-based models, we recommend using FlashAttention 2 to speed up training and save GPU memory. The relevant dependencies can be installed according to the code base of [FlashAttention](https://github.com/Dao-AILab/flash-attention).\n\n### Data preprocessing\n\nFirst, tokenize the raw text data using the tokenizer of the model. 
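If you want to check how much of your data actually qualifies as long (8k+ tokens) before mixing, you can count token lengths with the model's tokenizer (a minimal sketch, not part of this repo; the model id here is only illustrative):\n```python\nimport json\nfrom transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(\"THUDM/chatglm3-6b\", trust_remote_code=True)\nwith open(\"data/raw/long.jsonl\") as f:\n    # Concatenate all turns of each example and count its tokens\n    lengths = [len(tokenizer(\"\".join(m[\"content\"] for m in json.loads(line)[\"messages\"]))[\"input_ids\"]) for line in f]\nprint(sum(n \u003e= 8192 for n in lengths), \"examples with 8k+ tokens\")\n```\n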
For example, when training ChatGLM:\n```bash\npython pre_tokenize.py --model chatglm --datanum 10k\n```\nThe `--datanum` parameter here refers to the amount of long data you want in your mixed training dataset (our paper investigates 0k, 5k, and 10k). The tokenized data will be saved under `./data/chatglm/10k`.\n\nFor the packing and sorted batching strategies, we then organize the tokenized data for training:\n```bash\npython sort_and_group.py --group_size 8 --train_file ./data/chatglm/10k\n```\nYou should set the `--group_size` parameter to the number of GPUs used during training. We recommend using at least eight 80GB GPUs for model training; otherwise, the 64k sequence length may cause out-of-memory errors.\n\n### Model training\n\nWe provide training scripts under `scripts/` for the ChatGLM3 and Llama-2 model series. Make sure to adjust `--model_name_or_path`, `--train_file`, and `--output_dir` to match your model path, data path, and output path. You should consider using a base model with a context window length of at least 64k. We release three **base models** with extended context windows of 64k: [LongAlign-6B-64k-base](https://huggingface.co/THUDM/LongAlign-6B-64k-base), [LongAlign-7B-64k-base](https://huggingface.co/THUDM/LongAlign-7B-64k-base), and [LongAlign-13B-64k-base](https://huggingface.co/THUDM/LongAlign-13B-64k-base).\n\nFor packing training, please modify the *attention calculation* to support the 1D attention mask that marks the start and end positions of each sequence in the pack, and the *model forward* function to support loss weighting during packing training. An example of such modifications for the ChatGLM3 model is provided in [modeling_chatglm.py](https://github.com/THUDM/LongAlign/blob/main/modeling_chatglm.py), in `CoreAttention.forward` and `ChatGLMForConditionalGeneration.forward`. You can directly use this file as the modeling file for ChatGLM packing training. We also provide the training code for Llama. 
To reproduce our results, please use [modeling_llama.py](https://github.com/THUDM/LongAlign/blob/main/modeling_llama.py) as the modeling file. As suggested by the results in our paper, we recommend *packing+loss weighting* for ChatGLM training and *sorted batching* for Llama.\n\n### Model deployment\nWe have released four **chat models** trained using LongAlign: [LongAlign-6B-64k](https://huggingface.co/THUDM/LongAlign-6B-64k) (based on *ChatGLM3-6B*), [LongAlign-7B-64k](https://huggingface.co/THUDM/LongAlign-7B-64k) (based on *Llama-2-7B*), [LongAlign-13B-64k](https://huggingface.co/THUDM/LongAlign-13B-64k) (based on *Llama-2-13B*), and [ChatGLM3-6B-128k](https://huggingface.co/THUDM/chatglm3-6b-128k). Try the model to summarize our paper, or ask it anything about the paper:\n```python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nimport torch\ntokenizer = AutoTokenizer.from_pretrained(\"THUDM/LongAlign-6B-64k\", trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\"THUDM/LongAlign-6B-64k\", torch_dtype=torch.bfloat16, trust_remote_code=True, device_map=\"auto\")\nmodel = model.eval()\nquery = open(\"assets/paper.txt\").read() + \"\\n\\nPlease summarize the paper.\"\nresponse, history = model.chat(tokenizer, query, history=[], max_new_tokens=512, temperature=1)\nprint(response)\n```\nFor Llama-based models, we also provide [llama_flash_attn_monkey_patch.py](https://github.com/THUDM/LongAlign/blob/main/LongBench_Chat/llama_flash_attn_monkey_patch.py), which utilizes FlashAttention-2 to save memory during inference on long sequences.\n\n### All available models\n\nHere is the full list of models we released:\n\n|Model|HF Repo|Description|\n|---|---|---|\n|**LongAlign-6B-64k-base**| [🤗 HF Repo](https://huggingface.co/THUDM/LongAlign-6B-64k-base) | **ChatGLM3-6B** with an extended 64k context window |\n|**LongAlign-6B-64k**| [🤗 HF Repo](https://huggingface.co/THUDM/LongAlign-6B-64k) | Chat model trained with LongAlign on 
LongAlign-6B-64k-base|\n|**LongAlign-7B-64k-base**| [🤗 HF Repo](https://huggingface.co/THUDM/LongAlign-7B-64k-base) | **Llama-2-7B** with an extended 64k context window |\n|**LongAlign-7B-64k**| [🤗 HF Repo](https://huggingface.co/THUDM/LongAlign-7B-64k) | Chat model trained with LongAlign on LongAlign-7B-64k-base|\n|**LongAlign-13B-64k-base**| [🤗 HF Repo](https://huggingface.co/THUDM/LongAlign-13B-64k-base) | **Llama-2-13B** with an extended 64k context window |\n|**LongAlign-13B-64k**| [🤗 HF Repo](https://huggingface.co/THUDM/LongAlign-13B-64k) | Chat model trained with LongAlign on LongAlign-13B-64k-base|\n|**ChatGLM3-6B-128k**| [🤗 HF Repo](https://huggingface.co/THUDM/chatglm3-6b-128k) | **ChatGLM3-6B** with a 128k context window|\n\n\u003ca name=\"longbench-chat-evaluation\"\u003e\u003c/a\u003e\n## 📊 Evaluation\n\n### LongBench-Chat evaluation\nLongBench-Chat is the first benchmark for assessing long context alignment, featuring real user queries of 10k-100k tokens in length. The dataset and evaluation code are available under `LongBench_Chat/`. Remember to configure your OpenAI API key in `eval.py`, since we adopt GPT-4 as the evaluator. Run:\n```bash\npython eval.py --model {model_path} --max_length {max_length}\n```\n`model_path` can be either a local model path or a Hugging Face model path. Here is the leaderboard on LongBench-Chat:\n\n![](assets/leaderboard.png)\n\nYou are also welcome to submit your model's test predictions or results to us. We plan to release a more formal leaderboard.\n\n### Needle-test evaluation\nWe also provide the code for evaluating Hugging Face models on the \"Needle In A Haystack\" test under `Needle_test/`. 
See its [README.md](https://github.com/THUDM/LongAlign/blob/main/Needle_test/README.md) for more information.\n\n*To reproduce our results on other benchmarks, we refer to the code in [LongBench](https://github.com/THUDM/LongBench), [FastChat](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), and [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) for evaluating on LongBench, MT-Bench, and Open LLM Leaderboard tasks.*\n\n\u003ca name=\"citation\"\u003e\u003c/a\u003e\n## 📝 Citation\n\nIf you find our work useful, please consider citing LongAlign:\n\n```\n@inproceedings{bai2024longalign,\n    title = \"{L}ong{A}lign: A Recipe for Long Context Alignment of Large Language Models\",\n    author = \"Bai, Yushi and Lv, Xin and Zhang, Jiajie and He, Yuze and Qi, Ji and Hou, Lei and Tang, Jie and Dong, Yuxiao and Li, Juanzi\",\n    booktitle = \"Findings of the Association for Computational Linguistics: EMNLP 2024\",\n    month = nov,\n    year = \"2024\",\n    address = \"Miami, Florida, USA\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https://aclanthology.org/2024.findings-emnlp.74\",\n    doi = \"10.18653/v1/2024.findings-emnlp.74\",\n    pages = \"1376--1395\",\n}\n```\n","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTHUDM%2FLongAlign","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FTHUDM%2FLongAlign","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FTHUDM%2FLongAlign/lists"}