{"id":13754090,"url":"https://github.com/FranxYao/Long-Context-Data-Engineering","last_synced_at":"2025-05-09T22:30:46.172Z","repository":{"id":222621703,"uuid":"750748239","full_name":"FranxYao/Long-Context-Data-Engineering","owner":"FranxYao","description":"Implementation of paper Data Engineering for Scaling Language Models to 128K Context","archived":false,"fork":false,"pushed_at":"2024-03-19T03:57:10.000Z","size":4529,"stargazers_count":457,"open_issues_count":12,"forks_count":30,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-04-05T08:08:49.846Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FranxYao.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-31T08:39:03.000Z","updated_at":"2025-04-04T10:13:48.000Z","dependencies_parsed_at":"2024-08-03T09:17:17.154Z","dependency_job_id":null,"html_url":"https://github.com/FranxYao/Long-Context-Data-Engineering","commit_stats":null,"previous_names":["franxyao/long-context-data-engineering"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FranxYao%2FLong-Context-Data-Engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FranxYao%2FLong-Context-Data-Engineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FranxYao%2FLong-Context-Data-Engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Fr
anxYao%2FLong-Context-Data-Engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FranxYao","download_url":"https://codeload.github.com/FranxYao/Long-Context-Data-Engineering/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253335247,"owners_count":21892634,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:39.780Z","updated_at":"2025-05-09T22:30:45.582Z","avatar_url":"https://github.com/FranxYao.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话"],"sub_categories":["大语言对话模型及数据"],"readme":"\n# Long-Context Data Engineering\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca \u003e\u003cimg src=\"assets/logo.jpg\" alt=\"logo\" style=\"width: 60%; min-width: 300px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nChatGPT-4 Dalle-3 Prompt: \"Draw a carton style logo showing a very very long paper\"\n\u003cp align=\"center\"\u003e\n    🤗 \u003ca href=\"https://huggingface.co/yaofu/llama-2-7b-80k\" target=\"_blank\"\u003eHF Repo\u003c/a\u003e • 📃 \u003ca href=\"https://arxiv.org/abs/2402.10171\" target=\"_blank\"\u003ePaper\u003c/a\u003e • 💿 \u003ca href=\"https://huggingface.co/datasets/yaofu/slimpajama-per-source-length-upsample\" target=\"_blank\"\u003eData\u003c/a\u003e\n\u003c/p\u003e\n\nImplementation of paper:\n* Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim and Hao Peng. Feb 2024. 
_Data Engineering for Scaling Language Models to 128K Context_\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca \u003e\u003cimg src=\"assets/needle.jpg\" alt=\"logo\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\nOur model is the first public work showing how to achieve GPT-4 level long-context retrieval performance. \n\n\n## Table of Contents\n- [x] Loading and playing with the following continue pretrained checkpoints:\n    - [x] LLaMA-2 7B 80K: continue pretrained on 80K, tested on 128K\n    - [x] LLaMA-2 13B 64K: continue pretrained on 64K, tested on 128K\n- [x] Evaluating the pretrained checkpoint on Needle-in-a-HayStack\n- [x] Loading the preprocessed data\n- [x] Processing the long-context data\n- [ ] Continue pretraining the model on processed long-context data\n\n\n## Download the model to local \nCreate local folders to hold the downloaded models. \n```bash \npip install -r requirements.txt # pytorch is not included here because we assume you have already installed pytorch\nmkdir ../llama-2-7b-80k\nmkdir ../llama-2-13b-64k\n```\n\nDownload the continue pretrained checkpoints to the local folders: \n```python \nfrom huggingface_hub import snapshot_download\n\nsnapshot_download(repo_id='yaofu/llama-2-7b-80k',\n                  local_dir='../llama-2-7b-80k',\n                  repo_type='model',\n                  local_dir_use_symlinks=False,\n                  resume_download=True)\n\nsnapshot_download(repo_id='yaofu/llama-2-13b-64k',\n                  local_dir='../llama-2-13b-64k',\n                  repo_type='model',\n                  local_dir_use_symlinks=False,\n                  resume_download=True)\n```\n\nWe recommend downloading the checkpoint to a local folder first rather than loading it directly from HF as shown here:\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM\n# Below is slow and hard to control in a cluster\n# Unless you insist, **we recommend you download the model to local 
first**\nmodel = AutoModelForCausalLM.from_pretrained(\"yaofu/llama-2-7b-80k\", \n                                             use_flash_attention_2=\"flash_attention_2\", \n                                             torch_dtype=torch.bfloat16\n                                             ) \n```\n\n## Load the continue pretrained checkpoint and play with it \nThe following code requires at least 8x4090 to support 80K context. \nWith 4x80G A100 you can reach at least 128K.\n\nWe use `tensor_parallel` from [this repo](https://github.com/BlackSamorez/tensor_parallel) because it is much faster than huggingface's `device_map` and more lightweight than vLLM. It has one small bug: if your GPU memory is not large enough, it hangs instead of raising an out-of-memory error. So make sure you have enough GPU memory.\n```python \nimport torch \nimport tensor_parallel as tp\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom eval.needle.utils import load_context, insert_needle\n\n# This is the continue pretrained LLaMA 2 7B model with modified rope\ndef reset_rope(model, model_max_train_len, scaling_factor):\n    for l in model.model.layers:\n        l.self_attn.rotary_emb.scaling_factor = scaling_factor\n        l.self_attn.rotary_emb._set_cos_sin_cache(seq_len=model_max_train_len, device=\"cpu\", dtype=torch.float32)\n    return\nmodel = AutoModelForCausalLM.from_pretrained(\"../llama-2-7b-80k\",\n                                             use_flash_attention_2=\"flash_attention_2\", \n                                             torch_dtype=torch.bfloat16\n                                             ) # requires about 14G disk size in $HF_HOME\nscaling_factor = 10 # hardcode here\nreset_rope(model, model_max_train_len=81920, scaling_factor=scaling_factor)\nmodel = tp.tensor_parallel(model, sharded=True)\n\n# Construct the Needle-in-a-HayStack Prompt\nneedle = \"\\nThe best thing to do in San Francisco is eat a 
sandwich and sit in Dolores Park on a sunny day.\\n\"\nctx_len = 100000 # need at least 8*4090 to run this length\ndepth = 0.5\ncontext = load_context(fpath=\"eval/needle/PaulGrahamEssays/*.txt\", ctx_len=ctx_len)\ncontext = insert_needle(context, needle, depth=depth)\nneedle_idx = context.find(\"The best thing to do in San Francisco is\")\nprint(\"Context has %d chars, needle inserted at %d char location:\\n\" % (len(context), needle_idx))\nprint(context[needle_idx - 150: needle_idx + 150]) # look at how the needle is inserted \n\nprompt = \"\\n\u003c|im_start|\u003e This is a very long story book: \u003cbook\u003e %s \u003c/book\u003e.\\n\" % context\nquestion = \"What is the best thing to do in San Francisco?\"\nprompt += \"Based on the content of the book, Question: %s\\nAnswer:\" % question\nprint(prompt) # feel the length of 100K\n\n# Check how the model performs\ntokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Llama-2-7b-hf\")\nprompt = tokenizer(prompt, return_tensors=\"pt\")\ninput_ids = prompt['input_ids'].to(model.device)\nprint(\"After tokenization, there are %d tokens\" % len(input_ids[0]))\nwith torch.no_grad():\n    output_ids = model.generate(input_ids, max_new_tokens=50)\n    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()\nprint(\"Response:\", response.split(\"\\n\")[0])\n```\n\n## Evaluate the pretrained checkpoint on the Needle-in-a-Haystack test\nThe evaluation requires 4*80G A100 and takes about 24 hours or less to finish. \nThe inference code can be further optimized by speeding up tokenization (tokenizing a document of 100K tokens takes a long time); we leave this to future work. 
\n```bash\ncd eval/needle\nmkdir logs img results\n\n(\npython -u needle_in_haystack.py --s_len 0 --e_len 128000\\\n    --model_provider LLaMA\\\n    --model_path ../../../llama-2-7b-80k\n) 2\u003e\u00261  | tee logs/eval_llama-2-7b-80k.log\n\npython visualize.py \n```\n\n## Evaluate the pretrained checkpoint on the BookQA dataset from InfiniteBench\nCode and data are adapted from the original [InfiniteBench](https://github.com/OpenBMB/InfiniteBench/tree/main) authors.\n\n```bash\ncd eval/book\nmkdir data\n```\nThen download `longbook_qa_eng.json` from [here](https://drive.google.com/drive/folders/1IkfRudRr180CbqOpa5PtSHYW4__XGUpH?usp=sharing) and put it under the `data` folder. \n\n```bash\n(\npython -u eval_book.py --task longbook_qa_eng\\\n    --verbose\\\n    --model_path ../../../llama-2-7b-80k\\\n    --data_dir data\\\n    --model_name llama\\\n    --truncate 128000\n) 2\u003e\u00261  | tee logs/eval_llama_7b_80k_test_to_128k.log\n```\nCaveat: there are two versions of `longbook_qa_eng`:\n* The original version was uploaded by the InfiniteBench author at [this commit](https://huggingface.co/datasets/xinrongzhang2022/InfiniteBench/commit/c583fe67832c26f6094515dbe6c3c26c28d840ee).\n* Recently the author updated the data at [this commit](https://huggingface.co/datasets/xinrongzhang2022/InfiniteBench/commit/f2fd8f04ea3af8304b88de2c58bd33887bcccdb8). Consequently, if you download InfiniteBench from HF directly, you will be using different data from what we used.\n* Here we upload the version we used for the paper under our `data` folder. This will increase the risk of this dataset being exposed to future LLM training. Hopefully by then we will already have a better long-context eval :) \n\n## Load the preprocessed data \nThe following code requires 60G of disk space in the `$HF_CACHE` folder. The data is processed from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) using the per-source length upsampling described in Section 3 of our paper. 
We have already tokenized and chunked the data in the following format:\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca \u003e\u003cimg src=\"assets/chunking.jpg\" alt=\"logo\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n```python \nimport datasets\nfrom transformers import AutoTokenizer\ndataset = datasets.load_dataset(\"yaofu/slimpajama-per-source-length-upsample\")\ntokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Llama-2-7b-hf\")\n\nd = dataset[\"train\"][0]\nprint(d.keys())\nprint(d[\"source\"])\nprint(len(d[\"input_ids\"])) ## all input_ids are chunks of length 131072\n\ndoc_id = 0\ndoc_start, doc_end = d[\"source\"][doc_id][\"start\"], d[\"source\"][doc_id][\"end\"]\nprint(tokenizer.decode(d[\"input_ids\"][doc_start: doc_end]))\n\ndoc_id = 1\ndoc_start, doc_end = d[\"source\"][doc_id][\"start\"], d[\"source\"][doc_id][\"end\"]\nprint(tokenizer.decode(d[\"input_ids\"][doc_start: doc_end]))\n```\n\nAlternatively, you may use the `streaming=True` mode to avoid the long download time. \nBut we do recommend downloading the dataset first because it saves a lot of time when you load the dataset again. 
\n```python \nimport datasets\nfrom transformers import AutoTokenizer\ndataset = datasets.load_dataset(\"yaofu/slimpajama-per-source-length-upsample\", streaming=True)\nit = iter(dataset[\"train\"])\ntokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Llama-2-7b-hf\")\n\nd = next(it)\nprint(d.keys())\nprint(d[\"source\"])\nprint(len(d[\"input_ids\"])) ## all input_ids are chunks of length 131072\n\ndoc_id = 0\ndoc_start, doc_end = d[\"source\"][doc_id][\"start\"], d[\"source\"][doc_id][\"end\"]\nprint(tokenizer.decode(d[\"input_ids\"][doc_start: doc_end]))\n\ndoc_id = 1\ndoc_start, doc_end = d[\"source\"][doc_id][\"start\"], d[\"source\"][doc_id][\"end\"]\nprint(tokenizer.decode(d[\"input_ids\"][doc_start: doc_end]))\n```\n\n## Generate the per-source length upsampled data\nWe recommend first downloading the SlimPajama data to a local folder. First make a folder: \n```bash\nmkdir ../SlimPajama-627B\n```\n\nThen download. This requires about 1.8T of disk space and takes quite a while to download. Remember that this is not finetuning, so be patient. \n```python\nfrom huggingface_hub import snapshot_download\n\nsnapshot_download(repo_id='cerebras/SlimPajama-627B',\n                  local_dir='../SlimPajama-627B',\n                  repo_type='dataset',\n                  local_dir_use_symlinks=False,\n                  resume_download=True)\n```\n\nThen generate the per-source length upsampled data. In practice we down-sample sequences shorter than 4K. \nNote that this is equivalent to upsampling sequences longer than 4K. \nWe use multi-processing: there are 200 tokenizer processes, a reader process (which is also the main process) and a writer process. \nThe main process streams the data, then asks which tokenizer process is free. \nIf there is a free tokenizer process, it assigns the current document to that process; otherwise it waits and keeps asking. 
\nA tokenizer process receives the document from the main process, tokenizes it, then sends the tokens to the writer process. \nThe writer process continuously receives the tokenized data from all tokenizer processes and writes it to a .jsonl file. \nThe following code requires about 200 CPU cores and 50G of CPU memory. Tokenizing 5B tokens takes about 1 hour. \nIf you do not use multi-processing like we do, you will need about two days for tokenization. \n```bash\nmkdir logs\nmkdir data\nmkdir data/slimpajama\nmkdir data/slimpajama/per_source_downsample\ncd data_engineering\n\nPATH_TO_SLIMPAJAMA=../SlimPajama-627B\nnohup python -u slimpajama_packing.py\\\n    --dataset_size=100m\\\n    --print_interval=100 --num_process=200\\\n    --dataset_path=$PATH_TO_SLIMPAJAMA\\\n    --output_path=../data/slimpajama/per_source_downsample/ --down_sample_ratio=0.1 --down_sample_mode=per_source\\\n    \u003e ../logs/slimpajama_packing_dist_per_source_downsample_0.1.log 2\u003e\u00261 \u0026\ntail -f ../logs/slimpajama_packing_dist_per_source_downsample_0.1.log\n```\nThe `--dataset_size=100m` setting is for a quick demo. Change it to `--dataset_size=5B` to reproduce our training data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFranxYao%2FLong-Context-Data-Engineering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFranxYao%2FLong-Context-Data-Engineering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFranxYao%2FLong-Context-Data-Engineering/lists"}