{"id":13819363,"url":"https://github.com/Glaciohound/LM-Infinite","last_synced_at":"2025-05-16T04:33:23.920Z","repository":{"id":195377777,"uuid":"647892539","full_name":"Glaciohound/LM-Infinite","owner":"Glaciohound","description":"Implementation of paper \"LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models\"","archived":false,"fork":false,"pushed_at":"2024-08-16T23:37:59.000Z","size":2217,"stargazers_count":104,"open_issues_count":3,"forks_count":12,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-08-17T00:44:01.633Z","etag":null,"topics":["language-model","long-context","model-diagnostics"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2308.16137","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Glaciohound.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-31T18:48:18.000Z","updated_at":"2024-08-16T23:38:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"6137bdaa-a7b3-4b5a-8325-1cdf48348c3f","html_url":"https://github.com/Glaciohound/LM-Infinite","commit_stats":null,"previous_names":["glaciohound/lm-infinite"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glaciohound%2FLM-Infinite","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glaciohound%2FLM-Infinite/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Glaciohound%2FLM-Infinite/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/Glaciohound%2FLM-Infinite/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Glaciohound","download_url":"https://codeload.github.com/Glaciohound/LM-Infinite/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225405597,"owners_count":17469374,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-model","long-context","model-diagnostics"],"created_at":"2024-08-04T08:00:45.980Z","updated_at":"2024-11-19T18:31:46.402Z","avatar_url":"https://github.com/Glaciohound.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\n# LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models\n\n[![arXiv](https://img.shields.io/badge/arXiv-2308.16137-b31b1b.svg)](https://arxiv.org/abs/2308.16137)\n[![NAACL 2024 Outstanding Paper Award](https://img.shields.io/badge/NAACL%202024-Outstanding%20Paper%20Award-ffcc00.svg)](https://2024.naacl.org/awards/)\n\nThis is the PyTorch code for the paper\n[LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models](https://arxiv.org/abs/2308.16137)\n**(NAACL 2024 Outstanding Paper Award)**.\nThe work was done by [Chi Han](https://glaciohound.github.io), Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang.\n\n## Table of Contents\n\n- [Introduction](#introduction)\n- [:tada::tada::tada: Now A Drop-in Replacement for HuggingFace Transformers!](#tada-tada-tada-now-a-drop-in-replacement-for-huggingface-transformers)\n- [Requirements](#requirements)\n- 
[Directory Structure](#directory-structure)\n- [Usage](#usage)\n  - [Data Preparation](#data-preparation)\n  - [Model Preparation](#model-preparation)\n  - [Evaluation](#evaluation)\n    - [Perplexity](#perplexity)\n    - [Evaluating Perplexity at Extreme Lengths](#evaluating-perplexity-at-extreme-lengths)\n    - [Generation](#generation)\n    - [Evaluation Downstream Tasks](#evaluation-downstream-tasks)\n      - [Passkey Retrieval](#passkey-retrieval)\n      - [Qasper](#qasper)\n- [Citation](#citation)\n\n\n## Introduction\n\n\n\nIn this paper, the authors propose a simple method, called LM-Infinite, to improve the length generalization of large language models to an extreme length of **200M** tokens, without any additional training or parameter updates.\n\n![](assets/diagnosis.jpg)\n\nWe are motivated by first identifying three factors underlying the length generalization failure in LLMs: **(a)** Factor 1: Unseen distances between tokens cause attention logits to explode. **(b)** Factor 2: An unseen number of tokens can cause attention entropy to increase beyond the training range as the length increases. 
**(c)** Factor 3: The first few tokens occupy a distinct feature region and should not be discarded.\n\n![](assets/overview.jpg)\n\nThe key idea is to use (1) a $\\Lambda$-shaped attention pattern, so that each token only attends to the nearest $L_{pretrain}$ tokens as well as a few starting tokens, and (2) a distance limit $L_{pretrain}$, so that the attention distance is capped at $L_{pretrain}$.\nThe proposed method is compatible with multiple state-of-the-art language models, including but not limited to the LLaMA, Llama-2, GPT-J, and MPT-7B series.\nLM-Infinite is also computationally efficient, with only $O(n)$ time complexity.\n\n![](assets/perplexity_128k.jpg)\n\n\n## :tada::tada::tada: Now A Drop-in Replacement for HuggingFace Transformers!\n\n\nWe have implemented the LM-Infinite method as a drop-in replacement for HuggingFace Transformers.\nAfter you load a Transformers model, if it is a Llama, MPT, or GPT-J model, you can run the following code to enable LM-Infinite.\n\n\nFor a Llama model:\n```\nfrom models.llama import convert_llama_model\nmodel = convert_llama_model(model, 4096, 10)\n```\n\nFor an MPT model:\n```\nfrom models.mpt_7b import convert_mpt_model\nmodel = convert_mpt_model(model, 4096, 10)\n```\n\nFor a GPT-J model:\n```\nfrom models.gpt_j import convert_gpt_j_model\nmodel = convert_gpt_j_model(model, 4096, 10)\n```\n\nThen, you can use the model as usual!\n\n\n\n## Requirements\n\n- Python 3.11\n- PyTorch 2.0.1\n- Datasets 2.14.4\n- Tokenizers 0.13.3\n- Transformers 4.32.1\n- SentencePiece 0.1.99\n- Evaluate 0.4.0\n- Rouge-Score 0.1.2\n- Protobuf 3.20.3\n- Accelerate 0.22.0\n- DeepSpeed 0.10.2\n- Tqdm 4.66.1\n- Einops 0.6.1\n\nA detailed list of Python packages from an Anaconda perspective can be found in `requirements.txt`.\nSome packages were installed by `conda` and some by `pip`.\nMy commands to install the requirements in an Anaconda \u0026 Pip environment are as follows:\n\n```\nconda install pytorch torchvision torchaudio 
pytorch-cuda=11.8 -c pytorch -c nvidia\nconda install -c conda-forge sentencepiece einops cudatoolkit-dev tqdm ipython datasets evaluate rouge-score protobuf accelerate langchain openai\npip install transformers deepspeed\n```\n\n\n\n## Directory Structure\n\n```\n├── LICENSE\n├── README.md\n├── requirements.txt\n├── configs\n│   └── zero3_efficient_config.json         # config for deepspeed acceleration\n├── data\n│   ├── generation_metrics.py\n│   ├── get_data.py                         # dataset loading and preprocessing\n│   ├── passkey_retrieval\n│   │   ├── create_passkey_data.py\n│   │   ├── create_passkey_data.sh\n│   │   └── passkey_retrieval_accuracy.py\n│   └── split_pile_file.py                  # split the Pile dataset into task-specific files\n├── models\n│   ├── constant.py                         # a constant function model\n│   ├── get_llama2\n│   │   ├── convert_llama_weights_to_hf.py  # convert llama-2 weights to huggingface format\n│   │   └── download_llama2.sh\n│   ├── get_model.py\n│   ├── gpt_j.py\n│   ├── lambda_attention.py                 # efficient implementation of lambda attention\n│   ├── llama.py\n│   ├── model_base.py\n│   └── mpt_7b.py\n├── scripts\n│   ├── combine_evaluate_generation.py\n│   ├── combine_results.py\n│   ├── eval_downstream_tasks.py            # evaluate on passkey retrieval task\n│   ├── eval_generation.py                  # evaluate generation metrics\n│   └── eval_ppl_deepspeed.py               # evaluate perplexity\n├── utils\n│   ├── arguments.py\n│   └── utils.py\n└── visualization\n    ├── plot_nll.py\n    ├── position_pca.py\n    └── relative_attention_explosion.py\n```\n\n\n## Usage\n\n\n\n### Data Preparation\n\n\nFor datasets, you need to prepare a corpus dataset.\nIf you download the original Pile source (https://pile.eleuther.ai) to `${PILE_PATH}/test.jsonl.zst` and `${PILE_PATH}/val.jsonl.zst`, run the following commands to extract the compressed dataset.\n```\ncd ${PILE_PATH}\nzstd -d ./ 
test.jsonl.zst\nzstd -d ./ val.jsonl.zst\n```\nThen run the following commands to split the dataset into task-specific files.\n```\ncd ${REPOSITORY_ROOT}\nmkdir -p ${PILE_PATH}/val\nmkdir -p ${PILE_PATH}/test\npython data/split_pile_file.py ${PILE_PATH}/val.jsonl ${PILE_PATH}/val\npython data/split_pile_file.py ${PILE_PATH}/test.jsonl ${PILE_PATH}/test\n```\n\nHowever, the official Pile no longer seems to be available for download, so you will probably need to find another source (e.g., https://huggingface.co/datasets/arxiv_dataset or https://openwebtext2.readthedocs.io/en/latest/).\nAlternatively, you can also use your own corpus.\nBoth options require you to edit [data/get_data.py](data/get_data.py).\n\n\n\n\n\n\n\n### Model Preparation\n\nFor backbone models, the paper uses Llama-2, LLaMA, GPT-J, and MPT-7B.\nThe last three models are available directly from the HuggingFace model hub, so no action is needed beforehand.\nThe Llama-2 download key needs to be requested from the [Meta AI request form](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).\nThen run the following command\n```\nbash models/get_llama2/download_llama2.sh\n```\nand follow the prompts to download the checkpoints to `${PATH_TO_LLAMA2_CHECKPOINTS}`.\nThen run \n```\npython models/get_llama2/convert_llama_weights_to_hf.py \\\n    --input_dir ${PATH_TO_LLAMA2_CHECKPOINTS} \\\n    --model_size 7B \\\n    --output_dir ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf\n```\nto convert the llama-2-7b checkpoints to HuggingFace format.\n\n\n\n\n\n\n## Evaluation\n\nThe code requires a `${LOG_DIR}` to store the logs and results.\nPlease select a directory with enough space.\n\n\n### Perplexity\n\nEvaluating the perplexity of the Llama-2 model on the ArXiv test set.\n\n```\nTRIAL=llama2-infinite-ArXiv\nmkdir -p $LOG_DIR/$TRIAL\nCUDA_VISIBLE_DEVICES=0\nMASTER_PORT=$(shuf -i 29500-65535 -n 1)\nDS_SKIP_CUDA_CHECK=1 PYTHONPATH=. 
deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_ppl_deepspeed.py \\\n    --deepspeed_config configs/zero3_efficient_config.json \\\n    --model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \\\n    --use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 \\\n    --dataset the_pile --dataset_group ArXiv --split test --dataset_dir ${PILE_PATH} \\\n    --max_length 32770 \\\n    --log_dir $LOG_DIR/$TRIAL\n```\n\nA brief explanation of the arguments:\n- `--model`: the path or name of the model. Pass `decapoda-research/llama-7b-hf` to use LLaMA, `mosaicml/mpt-7b` to use MPT-7B, and `EleutherAI/gpt-j-6b` to use GPT-J-6B.\n- `--tokenizer_path`: the path to the tokenizer. Remove this argument if not using Llama-2.\n- `--use_lambda_attention`: use lambda attention. (Required for LM-Infinite)\n- `--local_branch`: the local branch size. Use 2048 for LLaMA, MPT-7B and GPT-J. (Required for LM-Infinite)\n- `--global_branch`: the global branch size. Values in the range 10-100 give generally similar results. (Required for LM-Infinite)\n- `--limit_distance`: the distance limit. Use 2048 for LLaMA, MPT-7B and GPT-J. (Required for LM-Infinite)\n- `--dataset`: the dataset name. 
See [data/get_data.py](data/get_data.py) for how to use custom datasets.\n\n\nIf you want to evaluate on vanilla models without LM-Infinite, simply remove the \n`--use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 `\nargument set.\n\nIf you only want to evaluate on a subset of the test set, you can use the `--start_data_from` argument to specify the starting index of the test set, and/or `--max_data_num` to specify the number of examples after that index.\n\n\n### Evaluating Perplexity at Extreme Lengths\n\n\n```\n\nTRIAL=llama2-infinite-ArXiv-extreme\nCUDA_VISIBLE_DEVICES=0\nMASTER_PORT=$(shuf -i 29500-65535 -n 1)\necho port: $MASTER_PORT\nmkdir -p $LOG_DIR/$TRIAL\nDS_SKIP_CUDA_CHECK=1 PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_infinite_ppl.py \\\n    --deepspeed_config configs/zero3_efficient_config.json \\\n    --model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \\\n    --use_lambda_attention --local_branch 4096 --global_branch 10 --limit_distance 4096 \\\n    --dataset the_pile --dataset_group ArXiv --split test --dataset_dir ${PILE_PATH} \\\n    --streaming_length 200000000 --max_length 128000 --start_data_from 2300 \\\n    --log_dir $LOG_DIR/$TRIAL\n\n```\n\n\n### Generation\n\n\nEvaluating generation with the Llama-2 model on the ArXiv test set.\n\n```\n\nTRIAL=llama2-infinite-generate-ArXiv\nmkdir -p $LOG_DIR/$TRIAL\nCUDA_VISIBLE_DEVICES=0\nMASTER_PORT=$(shuf -i 29500-65535 -n 1)\nDS_SKIP_CUDA_CHECK=1 PYTHONPATH=. 
deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_generation.py \\\n    --deepspeed_config configs/zero3_efficient_config.json \\\n    --model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \\\n    --use_lambda_attention --local_branch 4096 --global_branch 100 --limit_distance 4096 \\\n    --dataset the_pile --dataset_group ArXiv --split test --dataset_dir ${PILE_PATH} \\\n    --max_length 33000 \\\n    --max_generation_length 100 --evaluate_metrics --evaluate_positions 4096 8192 12288 16384 \\\n    --log_dir $LOG_DIR/$TRIAL\n\n```\n\n\n### Evaluation Downstream Tasks\n\n#### Passkey Retrieval\n\nFirst, we need to prepare the passkey retrieval dataset.\n```\nfor MAX_LENGTH in 2048 3072 4096 5120 6144 7168 8192 10240 12288 14335 16384; do\n    echo $MAX_LENGTH\n    python data/passkey_retrieval/create_passkey_data.py \\\n        --token-length $MAX_LENGTH \\\n        --dump-file-path ${PASSKEY_DATA}/${MAX_LENGTH} \\\n        --tokenizer-path ${PATH_TO_LLAMA2_CHECKPOINTS} \\\n        --num-samples 1000\ndone\n\n```\n\nThen, let us evaluate the passkey retrieval task.\n```\n\nCUDA_VISIBLE_DEVICES=0\nfor MAX_LENGTH in 6144 8192 10240 12288 16384; do\n    TRIAL=llama2-infinite-passkey-$MAX_LENGTH\n    mkdir -p $LOG_DIR/$TRIAL\n    MASTER_PORT=$(shuf -i 29500-65535 -n 1)\n    DS_SKIP_CUDA_CHECK=1 PYTHONPATH=. 
deepspeed --master_port $MASTER_PORT --include localhost:$CUDA_VISIBLE_DEVICES scripts/eval_downstream_tasks.py \\\n        --deepspeed_config configs/zero3_efficient_config.json \\\n        --model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \\\n        --use_lambda_attention --local_branch 4096 --global_branch 10 --limit_distance 4096 --triangle_offset 0 \\\n        --top_k_attention 5 --top_k_from_layer 4 \\\n        --dataset passkey_retrieval --dataset_dir ${PASSKEY_DATA} --dataset_group ${MAX_LENGTH} \\\n        --max_generation_length 7 --evaluate_metrics \\\n        --log_dir $LOG_DIR/$TRIAL\ndone\n\n```\n\n\n#### Qasper\n\n\nRunning the Qasper task:\n```\n\nCUDA_VISIBLE_DEVICES=0\nDATASET=qasper\nTRIAL=llama2-infinite-$DATASET\nmkdir -p $LOG_DIR/$TRIAL\nMASTER_PORT=$(shuf -i 29500-65535 -n 1)\necho port: $MASTER_PORT\nDS_SKIP_CUDA_CHECK=1 PYTHONPATH=. deepspeed --include localhost:$CUDA_VISIBLE_DEVICES --master_port $MASTER_PORT scripts/eval_downstream_tasks.py \\\n    --deepspeed_config configs/zero3_efficient_config_large.json \\\n    --model ${PATH_TO_LLAMA2_CHECKPOINTS}/llama-2-7b-hf --tokenizer_path ${PATH_TO_LLAMA2_CHECKPOINTS} \\\n    --use_lambda_attention --local_branch 4096 --global_branch 10 --limit_distance 4096 --triangle_offset 0 \\\n    --top_k_attention 5 --top_k_from_layer 4 \\\n    --dataset $DATASET --split test --evaluate_metrics \\\n    --max_length 6144 --truncation_side center \\\n    --log_dir $LOG_DIR/$TRIAL\n\n```\n\n## Citation\n\n```\n@inproceedings{han2024lm,\n  title={LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models},\n  author={Han, Chi and Wang, Qifan and Peng, Hao and Xiong, Wenhan and Chen, Yu and Ji, Heng and Wang, Sinong},\n  booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},\n  pages={3991--4008},\n  
year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGlaciohound%2FLM-Infinite","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FGlaciohound%2FLM-Infinite","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FGlaciohound%2FLM-Infinite/lists"}