{"id":13441872,"url":"https://github.com/CStanKonrad/long_llama","last_synced_at":"2025-03-20T13:31:09.128Z","repository":{"id":179230449,"uuid":"663101722","full_name":"CStanKonrad/long_llama","owner":"CStanKonrad","description":"LongLLaMA is a large language model capable of handling long contexts. It is based on OpenLLaMA and fine-tuned with the Focused Transformer (FoT) method.","archived":false,"fork":false,"pushed_at":"2023-11-07T18:50:31.000Z","size":1585,"stargazers_count":1450,"open_issues_count":18,"forks_count":87,"subscribers_count":26,"default_branch":"main","last_synced_at":"2025-03-09T16:38:52.878Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CStanKonrad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-07-06T14:54:15.000Z","updated_at":"2025-02-27T22:30:04.000Z","dependencies_parsed_at":"2023-09-22T02:30:56.401Z","dependency_job_id":"ba0f4212-f9ac-43b3-8ff1-72ea3707f286","html_url":"https://github.com/CStanKonrad/long_llama","commit_stats":null,"previous_names":["cstankonrad/long_llama"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CStanKonrad%2Flong_llama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CStanKonrad%2Flong_llama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CStanKonrad%2Flong_llama/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CStanKonrad%2Flong_llama/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CStanKonrad","download_url":"https://codeload.github.com/CStanKonrad/long_llama/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244188991,"owners_count":20412981,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:01:39.082Z","updated_at":"2025-03-20T13:31:08.697Z","avatar_url":"https://github.com/CStanKonrad.png","language":"Python","funding_links":[],"categories":["Python","Uncategorized","A01_文本生成_文本对话","Summary"],"sub_categories":["Uncategorized","大语言对话模型及数据"],"readme":"\n\u003cp align=\"center\" width=\"100%\"\u003e\u003cimg src=\"assets/longllama.png\" alt=\"LongLLaMA\" style=\"width: 50%;  display: block; margin: auto;\"\u003e\u003c/p\u003e\n\n# LongLLaMA: Focused Transformer Training for Context Scaling\n\n\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth style=\"font-size: 120%\"\u003e \u003e_ 🎓 \u003ca href=\"https://huggingface.co/syzymon/long_llama_code_7b_instruct\"\u003eLongLLaMA-Code 7B Instruct\u003c/a\u003e 📑🗨 \u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n    \u003ca  href=\"https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_code_instruct_colab.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e\u003c/a\u003e \u0026nbsp \u003ca href=\"instruction_fine_tuning/LongLLamaCode7BInstruct.md\"\u003eLearn more\u003c/a\u003e\n    \u003c/td\u003e\n    \n 
\u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c/div\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ctable\u003e\n\n  \u003ctr\u003e\n  \u003ctd align=\"center\"\u003e\n    \u003cspan style=\"font-size:200%\"\u003e⇧\u003c/span\u003e\n  \u003c/td\u003e\n \u003c/tr\u003e\n \n\u003c/table\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ctable\u003e\n\n  \u003ctr\u003e\n  \u003ctd align=\"center\"\u003e\n    \u003cspan style=\"font-size:150%\"\u003e{\u003c/span\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n    \u003cspan style=\"font-size:110%\"\u003e\n    \u003cb\u003e\n    \u003ca href=\"https://huggingface.co/syzymon/long_llama_code_7b\" style=\"margin-bottom:30px\"\u003eLongLLaMA-Code 7B\u003c/a\u003e\n    \u003c/b\u003e\n    \u003c/span\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n    \u003cspan style=\"font-size:150%\"\u003e}\u003c/span\u003e\n    \u003c/td\u003e\n\n \u003c/tr\u003e\n\u003c/table\u003e\n\n\u003chr\u003e\n\n\u003c/div\u003e\n\n\n\u003cdiv align=\"center\"\u003e\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003cth\u003e \u003ca href=\"https://huggingface.co/syzymon/long_llama_3b_instruct\"\u003eLongLLaMA-Instruct-3Bv1.1\u003c/a\u003e \u003c/th\u003e\n    \u003cth\u003e \u003ca href=\"https://huggingface.co/syzymon/long_llama_3b_v1_1\"\u003eLongLLaMA-3Bv1.1\u003c/a\u003e\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\n    \u003ca  href=\"https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e\u003c/a\u003e\n    \u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\n    \u003ca href=\"https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_colab.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e\u003c/a\u003e\n    \u003c/td\u003e\n 
\u003c/tr\u003e\n\u003c/table\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n [TLDR](#TLDR) | [Overview](#Overview) | [Usage](#Usage) | [LongLLaMA performance](#LongLLaMA-performance) | [Authors](#Authors) | [Citation](#Citation) | [License](#License) | [Acknowledgments](#Acknowledgments)\n \n [FoT continued pretraining](fot_continued_pretraining) | [Instruction tuning](instruction_fine_tuning)\n\n\u003c/div\u003e\n\n## TLDR\nThis repository contains the research preview of **LongLLaMA, a large language model capable of handling long contexts of 256k tokens or even more**. \n\nLongLLaMA is built upon the foundation of [OpenLLaMA](https://github.com/openlm-research/open_llama) and fine-tuned using the [Focused Transformer (FoT)](https://arxiv.org/abs/2307.03170) method.\nLongLLaMA Code is built upon the foundation of [Code Llama](https://huggingface.co/codellama/CodeLlama-7b-hf).\nWe release a smaller 3B base variant (not instruction tuned) of the LongLLaMA model under a permissive license (Apache 2.0), along with inference code supporting longer contexts, on [Hugging Face](https://huggingface.co/syzymon/long_llama_3b). Our model weights can serve as a drop-in replacement for LLaMA in existing implementations (for short contexts of up to 2048 tokens). Additionally, we provide evaluation results and comparisons against the original OpenLLaMA models.  \nWe also release code for [instruction tuning (PyTorch)](instruction_fine_tuning/) and [FoT continued pretraining (JAX)](fot_continued_pretraining/).\n\n\n## Overview\n\n### Base models\n[Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) (FoT) presents a simple method for endowing language models with the ability to handle contexts possibly consisting of millions of tokens while training on significantly shorter inputs. FoT permits a subset of attention layers to access a memory cache of (key, value) pairs to extend the context length. 
The distinctive aspect of FoT is its training procedure, which draws on contrastive learning. Specifically, we deliberately expose the memory attention layers to both relevant and irrelevant keys (like negative samples from unrelated documents). This strategy incentivizes the model to differentiate keys connected with semantically diverse values, thereby enhancing their structure. This, in turn, makes it possible to extrapolate the effective context length far beyond what is seen in training. \n\n\n**LongLLaMA** is an [OpenLLaMA](https://github.com/openlm-research/open_llama) model finetuned with the FoT method,\nwith three layers used for context extension. **Crucially, LongLLaMA is able to extrapolate far beyond the context length seen in training: $8k$. E.g., in the passkey retrieval task, it can handle inputs of length $256k$**.  \n**LongLLaMA Code** is a [Code Llama](https://huggingface.co/codellama/CodeLlama-7b-hf) model finetuned with the FoT method.\n\n\n\u003cdiv align=\"center\"\u003e\n\n|  | [LongLLaMA-3B](https://huggingface.co/syzymon/long_llama_3b) | [LongLLaMA-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_v1_1) | [LongLLaMA-Code 7B](https://huggingface.co/syzymon/long_llama_code_7b) |\n|----------------|----------|----------|-----------|\n| Source model         | [OpenLLaMA-3B](https://huggingface.co/openlm-research/open_llama_3b_easylm)      | [OpenLLaMA-3Bv2](https://huggingface.co/openlm-research/open_llama_3b_v2_easylm) | [CodeLLaMA-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)       |\n| Source model tokens     | 1T      |  1T |  2T + 0.5T       |\n| Fine-tuning tokens  | 10B     | 5B | 35B     |\n| Memory layers         |  6, 12, 18        |   6, 12, 18        |  8, 16, 24        |\n\n\u003c/div\u003e\n\n\n### FoT continued pretraining\nIn the [fot_continued_pretraining](fot_continued_pretraining/) subfolder, we provide the code that can be used to tune LLaMA models with FoT.  
\nThis code is written in [JAX](https://jax.readthedocs.io/en/latest/notebooks/quickstart.html) \u0026 [Flax](https://flax.readthedocs.io/en/latest/guides/flax_basics.html) and is based on [EasyLM](https://github.com/young-geng/EasyLM).\n\n### Instruction/Chat tuning\n\nIn the [instruction_fine_tuning](instruction_fine_tuning/) subfolder, we provide the code that was used to create [LongLLaMA-Instruct-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_instruct), an instruction-tuned version of [LongLLaMA-3Bv1.1](https://huggingface.co/syzymon/long_llama_3b_v1_1). We used the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) (instructions) and [zetavg/ShareGPT-Processed](https://huggingface.co/datasets/zetavg/ShareGPT-Processed) (chat) datasets for tuning.  \nThis code uses [PyTorch](https://pytorch.org/) and the [Hugging Face trainer](https://huggingface.co/docs/transformers/v4.30.0/en/main_classes/trainer).\n\n### Inference code\nIn the [src](src/) subfolder, we provide inference code for FoT models.  \nThe code is written in [PyTorch](https://pytorch.org/) and is based on the [Hugging Face implementation of LLaMA](https://huggingface.co/docs/transformers/main/model_doc/llama).  \nThe code should support the standard Hugging Face API. 
For more details, see the [Usage](#Usage) section.\n\n\n## Usage\n\nSee also: \n* [Colab with LongLLaMA-Instruct-3Bv1.1](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_instruct_colab.ipynb).\n* [Colab with an example usage of base LongLLaMA](https://colab.research.google.com/github/CStanKonrad/long_llama/blob/main/long_llama_colab.ipynb).\n\n### Requirements\n```\npip install --upgrade pip\npip install transformers==4.33.2 sentencepiece accelerate\n```\n\n### Loading model\n```python\nimport torch\nfrom transformers import LlamaTokenizer, AutoModelForCausalLM\n\ntokenizer = LlamaTokenizer.from_pretrained(\"syzymon/long_llama_3b_v1_1\")\nmodel = AutoModelForCausalLM.from_pretrained(\"syzymon/long_llama_3b_v1_1\", \n                                            torch_dtype=torch.float32, \n                                            trust_remote_code=True)\n```\n\n### Input handling and generation\nLongLLaMA uses the Hugging Face interface. A long input given to the model is \nsplit into context windows and loaded into the memory cache.\n```python\nprompt = \"My name is Julien and I like to\"\ninput_ids = tokenizer(prompt, return_tensors=\"pt\").input_ids\noutputs = model(input_ids=input_ids)\n```\nDuring the model call, one can provide the parameter `last_context_length` (default $1024$), which specifies the number of tokens left in the last context window. Tuning this parameter can improve generation, as the first layers do not have access to memory. 
See details in [How LongLLaMA handles long inputs](#How-LongLLaMA-handles-long-inputs).\n\n```python\ngeneration_output = model.generate(\n    input_ids=input_ids,\n    max_new_tokens=256,\n    num_beams=1,\n    last_context_length=1792,\n    do_sample=True,\n    temperature=1.0,\n)\nprint(tokenizer.decode(generation_output[0]))\n```\n\n### Additional configuration\nLongLLaMA has several other parameters:\n* `mem_layers` specifies layers endowed with memory (should be either an empty list or a list of all memory layers specified in the description of the checkpoint).\n* `mem_dtype` allows changing the data type of the memory cache.\n* `mem_attention_grouping` can trade off speed for reduced memory usage. \n  When equal to `(4, 2048)`, the memory layers will process at most $4*2048$ queries at once ($4$ heads and $2048$ queries for each head).\n\n```python\nimport torch\nfrom transformers import LlamaTokenizer, AutoModelForCausalLM\n\ntokenizer = LlamaTokenizer.from_pretrained(\"syzymon/long_llama_3b_v1_1\")\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"syzymon/long_llama_3b_v1_1\", torch_dtype=torch.float32, \n    mem_layers=[], \n    mem_dtype='bfloat16',\n    trust_remote_code=True,\n    mem_attention_grouping=(4, 2048),\n)\n```\n\n\n### Drop-in use with LLaMA code\nLongLLaMA checkpoints can also be used as a drop-in replacement for LLaMA checkpoints in the [Hugging Face implementation of LLaMA](https://huggingface.co/docs/transformers/main/model_doc/llama), but in this case, they will be limited to the original context length of $2048$.\n\n```python\nfrom transformers import LlamaTokenizer, LlamaForCausalLM\nimport torch\n\ntokenizer = LlamaTokenizer.from_pretrained(\"syzymon/long_llama_3b_v1_1\")\nmodel = LlamaForCausalLM.from_pretrained(\"syzymon/long_llama_3b_v1_1\", torch_dtype=torch.float32)\n```\n\n\n### How LongLLaMA handles long inputs\nInputs over $lctx=2048$ ($lctx=4096$ for LongLLaMA Code) tokens are automatically split into windows $w_1, \\ldots, w_m$. 
The first $m-2$ windows contain $lctx$ tokens each, $w_{m-1}$ has no more than $lctx$ tokens, and $w_m$ contains the number of tokens specified by `last_context_length`. The model processes the windows one by one, extending the memory cache after each. If `use_cache` is `True`, then the last window will not be loaded to the memory cache but to the local (generation) cache.\n\nThe memory cache stores $(key, value)$ pairs for each head of the specified memory layers `mem_layers`. In addition, it stores attention masks. \n\nIf `use_cache=True` (which is the case in generation), LongLLaMA will use two caches: the memory cache for the specified layers and the local (generation) cache for all layers. When the local cache exceeds $lctx$ elements, its content is moved to the memory cache for the memory layers.\n\nFor simplicity, context extension is realized with a memory cache and full attention in this repo. Replacing this simple mechanism with a KNN search over an external database is possible with systems like [Faiss](https://github.com/facebookresearch/faiss). This could potentially enable further context length scaling. We leave this as future work.\n\n\n## LongLLaMA performance\nWe present some illustrative examples of LongLLaMA results. Refer to our paper [Focused Transformer: Contrastive Training for Context Scaling](https://arxiv.org/abs/2307.03170) for more details.\n\nLongLLaMA achieves good performance on the passkey retrieval task from [Landmark Attention: Random-Access Infinite Context Length for Transformers](https://arxiv.org/abs/2305.16300). The code for generating the prompt and running the model is located in `examples/passkey.py`. 
\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"assets/plot_passkey.png\" alt=\"LongLLaMA\" style=\"width: 70%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\nOur LongLLaMA 3B model also shows improvements when using long context on two downstream tasks, TREC question classification and WebQS question answering. \n\u003cdiv align=\"center\"\u003e\n\n\n| Context/Dataset | TREC  | WebQS |\n| --- | --- | --- |\n| $2K$ | 67.0 |  21.2 |\n| $4K$ | 71.6 | 21.4 |\n| $6K$ | 72.9 | 22.2 |\n| $8K$ | **73.3** | **22.4** |\n\n\u003c/div\u003e\n\nLongLLaMA retains performance on tasks that do not require long context. \n\nIn particular, LongLLaMA-Code 7B improves reasoning (GSM8K) and knowledge (MMLU) due to code fine-tuning:\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003cimg src=\"assets/full_results.png\" alt=\"LongLLaMA\" style=\"width: 70%; min-width: 300px; display: block; margin: auto;\"\u003e\n\u003c/p\u003e\n\nWe provide a comparison with OpenLLaMA\non [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) in the zero-shot setting. 
\n\u003cdiv align=\"center\"\u003e\n\n| Task/Metric | OpenLLaMA-3B | LongLLaMA-3B |\n|----------------|----------|-----------|\n| anli_r1/acc | 0.33 | 0.32 |\n| anli_r2/acc | 0.32 | 0.33 |\n| anli_r3/acc | 0.35 | 0.35 |\n| arc_challenge/acc | 0.34 | 0.34 |\n| arc_challenge/acc_norm | 0.37 | 0.37 |\n| arc_easy/acc | 0.69 | 0.68 |\n| arc_easy/acc_norm | 0.65 | 0.63 |\n| boolq/acc | 0.68 | 0.68 |\n| hellaswag/acc | 0.49 | 0.48 |\n| hellaswag/acc_norm | 0.67 | 0.65 |\n| openbookqa/acc | 0.27 | 0.28 |\n| openbookqa/acc_norm | 0.40 | 0.38 |\n| piqa/acc | 0.75 | 0.73 |\n| piqa/acc_norm | 0.76 | 0.75 |\n| record/em | 0.88 | 0.87 |\n| record/f1 | 0.89 | 0.87 |\n| rte/acc | 0.58 | 0.60 |\n| truthfulqa_mc/mc1 | 0.22 | 0.24 |\n| truthfulqa_mc/mc2 | 0.35 | 0.38 |\n| wic/acc | 0.48 | 0.50 |\n| winogrande/acc | 0.62 | 0.60 |\n| Avg score | 0.53 | 0.53 |\n\n\u003c/div\u003e\n\nStarting with the v1.1 models, we use the [EleutherAI](https://github.com/EleutherAI) implementation of [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with a slight modification that adds the `\u003cbos\u003e` token at the beginning of the input sequence. 
The results are provided in the table below.\n\n\u003cdiv align=\"center\"\u003e\n\n| description            | LongLLaMA-3B | OpenLLaMA-3Bv2 | LongLLaMA-3Bv1.1 | LongLLaMA-Instruct-3Bv1.1 |\n|:-----------------------|:--------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------|:--------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------|\n| anli_r1/acc            | 0.32                                                                                  | 0.33                                                                               | 0.31                                                                      | 0.33                                                                                         |\n| anli_r2/acc            | 0.33                                                                                  | 0.35                                                                               | 0.33                                                                      | 0.35                                                                                         |\n| anli_r3/acc            | 0.35                                                                                  | 0.38                                                                               | 0.35                                                                      | 0.38                                                                                         |\n| arc_challenge/acc      | 0.34                                                                                  | 0.33                                                                               | 0.32                                                                      | 0.36                                         
                                                |\n| arc_challenge/acc_norm | 0.37                                                                                  | 0.36                                                                               | 0.36                                                                      | 0.37                                                                                         |\n| arc_easy/acc           | 0.67                                                                                  | 0.68                                                                               | 0.68                                                                      | 0.7                                                                                          |\n| arc_easy/acc_norm      | 0.63                                                                                  | 0.63                                                                               | 0.63                                                                      | 0.63                                                                                         |\n| boolq/acc              | 0.68                                                                                  | 0.67                                                                               | 0.66                                                                      | 0.77                                                                                         |\n| hellaswag/acc          | 0.48                                                                                  | 0.53                                                                               | 0.52                                                                      | 0.52                                                                                         |\n| hellaswag/acc_norm     | 0.65                                                          
                        | 0.7                                                                                | 0.69                                                                      | 0.68                                                                                         |\n| openbookqa/acc         | 0.28                                                                                  | 0.28                                                                               | 0.28                                                                      | 0.28                                                                                         |\n| openbookqa/acc_norm    | 0.38                                                                                  | 0.39                                                                               | 0.37                                                                      | 0.41                                                                                         |\n| piqa/acc               | 0.73                                                                                  | 0.77                                                                               | 0.77                                                                      | 0.78                                                                                         |\n| piqa/acc_norm          | 0.75                                                                                  | 0.78                                                                               | 0.77                                                                      | 0.77                                                                                         |\n| record/em              | 0.87                                                                                  | 0.87                                                                               | 0.86                         
                                             | 0.85                                                                                         |\n| record/f1              | 0.88                                                                                  | 0.88                                                                               | 0.87                                                                      | 0.86                                                                                         |\n| rte/acc                | 0.6                                                                                   | 0.53                                                                               | 0.62                                                                      | 0.7                                                                                          |\n| truthfulqa_mc/mc1      | 0.24                                                                                  | 0.22                                                                               | 0.21                                                                      | 0.25                                                                                         |\n| truthfulqa_mc/mc2      | 0.38                                                                                  | 0.35                                                                               | 0.35                                                                      | 0.4                                                                                          |\n| wic/acc                | 0.5                                                                                   | 0.5                                                                                | 0.5                                                                       | 0.54                                                                                         
|\n| winogrande/acc         | 0.6                                                                                   | 0.66                                                                               | 0.63                                                                      | 0.65                                                                                         |\n| Avg score                   | 0.53                                                                                  | 0.53                                                                               | 0.53                                                                      | 0.55                                                                                         |\n\n\u003c/div\u003e\n\n\nWe also provide the results on human-eval. We cut the generated text after either\n*  `\"\\ndef \"`\n*  `\"\\nclass \"`\n*  `\"\\nif __name__\"`\n\n\u003cdiv align=\"center\"\u003e\n\n|  | OpenLLaMA-3Bv2 | LongLLaMA-3Bv1.1 | LongLLaMA-Instruct-3Bv1.1 |\n| - | - | - | - |\n| pass@1| 0.09| 0.12 |  0.12 |\n\n\u003c/div\u003e\n\n## Authors\n- [Szymon Tworkowski](https://scholar.google.com/citations?user=1V8AeXYAAAAJ\u0026hl=en)\n- [Konrad Staniszewski](https://scholar.google.com/citations?user=CM6PCBYAAAAJ)\n- [Mikołaj Pacek](https://scholar.google.com/citations?user=eh6iEbQAAAAJ\u0026hl=en\u0026oi=ao)\n- [Henryk Michalewski](https://scholar.google.com/citations?user=YdHW1ycAAAAJ\u0026hl=en)\n- [Yuhuai Wu](https://scholar.google.com/citations?user=bOQGfFIAAAAJ\u0026hl=en)\n- [Piotr Miłoś](https://scholar.google.pl/citations?user=Se68XecAAAAJ\u0026hl=pl\u0026oi=ao)\n\n\n## Citation\nTo cite this work please use\n```bibtex\n@misc{tworkowski2023focused,\n      title={Focused Transformer: Contrastive Training for Context Scaling}, \n      author={Szymon Tworkowski and Konrad Staniszewski and Mikołaj Pacek and Yuhuai Wu and Henryk Michalewski and Piotr Miłoś},\n      year={2023},\n      eprint={2307.03170},\n     
 archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n\n## License\nThe source code and base LongLLaMA 3B model checkpoints are licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).  \nThe instruction/chat tuned models are for research purposes only.  \nFor LongLLaMA-Code 7B, see the [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/LICENSE) license.  \nLongLLaMA-Code 7B Instruct is LongLLaMA-Code 7B tuned on the [TIGER-Lab/MathInstruct](https://huggingface.co/datasets/TIGER-Lab/MathInstruct), [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) and [ShareGPT-Processed](https://huggingface.co/datasets/zetavg/ShareGPT-Processed) datasets.  \nSome of the examples use external code (see headers of files for copyright notices and licenses).\n\n## Acknowledgments\nWe gratefully acknowledge the TPU Research Cloud program, which was instrumental to our research by providing significant computational resources. We are also grateful to Xinyang Geng and Hao Liu for releasing [OpenLLaMA](https://github.com/openlm-research/open_llama) checkpoints and the [EasyLM](https://github.com/young-geng/EasyLM) library.\n\nSpecial thanks to [Keiran Paster](https://twitter.com/keirp1) for providing immensely valuable suggestions about the pre-training data for LongLLaMA-Code.\n\nWe would like to thank [Xiaosong He](https://github.com/hxs91) for suggestions on how to improve the explanations of cross-batch code.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCStanKonrad%2Flong_llama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCStanKonrad%2Flong_llama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCStanKonrad%2Flong_llama/lists"}