Unofficial PyTorch/🤗 Transformers (Gemma/Llama 3) implementation of "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention".
- Host: GitHub
- URL: https://github.com/beomi/infinitransformer
- Owner: Beomi
- License: mit
- Created: 2024-04-12T08:55:57.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-04-23T07:25:49.000Z (10 months ago)
- Last Synced: 2024-12-22T11:46:00.541Z (about 2 months ago)
- Topics: gemma, huggingface, infinitransformer, llama, llama3, pytorch, transformers
- Language: Python
- Homepage: https://arxiv.org/abs/2404.07143
- Size: 119 KB
- Stars: 350
- Watchers: 7
- Forks: 31
- Open Issues: 11
- Metadata Files:
- Readme: README.md
- License: LICENSE
README
# InfiniTransformer
Unofficial PyTorch/🤗 Transformers implementation of "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention", with Llama 3 and Gemma models supported (Llama 2 and 1 are also supported).

- Paper link: https://arxiv.org/abs/2404.07143
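
For orientation, the mechanism described in the paper keeps a per-head compressive memory that is read before, and updated after, local attention on each segment. The following is a minimal single-head PyTorch sketch of those update rules as stated in the paper; names and shapes are illustrative, and this is not the code in this repository.

```python
import torch
import torch.nn.functional as F


def elu_plus_one(x):
    # sigma(x) = ELU(x) + 1, the non-negative kernel used in the paper
    return F.elu(x) + 1.0


def infini_attention_segment(q, k, v, memory, z, beta):
    """One segment of Infini-attention for a single head (sketch).

    q, k, v : (seg_len, d_head) projections for the current segment
    memory  : (d_head, d_head) compressive memory from earlier segments
    z       : (d_head,) normalization term from earlier segments
    beta    : learned scalar gate mixing the memory read-out and local attention
    """
    sq, sk = elu_plus_one(q), elu_plus_one(k)

    # 1) Memory retrieval: A_mem = sigma(Q) M / (sigma(Q) z)
    a_mem = (sq @ memory) / (sq @ z).clamp(min=1e-6).unsqueeze(-1)

    # 2) Local causal scaled-dot-product attention over the current segment
    scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    causal = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    a_dot = scores.masked_fill(causal, float("-inf")).softmax(dim=-1) @ v

    # 3) Gated mix: A = sigmoid(beta) * A_mem + (1 - sigmoid(beta)) * A_dot
    g = torch.sigmoid(beta)
    out = g * a_mem + (1.0 - g) * a_dot

    # 4) Memory update (linear rule): M <- M + sigma(K)^T V,  z <- z + sum_t sigma(K_t)
    memory = memory + sk.transpose(-2, -1) @ v
    z = z + sk.sum(dim=-2)
    return out, memory, z
```

Across a long input, this step runs once per segment while `memory` and `z` are carried forward, which is why the usable context length is bounded by compute rather than by attention memory.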
## Two types of Implementation for Infini-Attention
**Type I. Infini-attention model-wise and trainer-wise**
- Overrides the modeling and config Python files.
- Full edit; not compatible with the stock HF `Trainer`.
- Needs custom training code (a sketch of what such a loop could look like follows this list).
- Memory usage is **much lower** than with SDPA (default) attention.
- Can train Gemma-2B with a 32,768-token sequence length (2048×16) on 2× H100 80GB (AdamW optimizer, no gradient checkpointing).
- Can train Llama-3-8B with a 1M-token sequence length (2048×512) on 2× H100 80GB (Adafactor optimizer, no gradient checkpointing).
- Can train with an 'infinite' context; see `train.gemma.infini.noclm.1Mseq.sh` on 1× H100 80GB (AdamW optimizer, no gradient checkpointing).
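
One plausible shape for that custom training code is a loop that feeds each long sequence to the model segment by segment, carries the compressive memory across segments, and backpropagates per segment so activations never span the full sequence. The sketch below is a hypothetical illustration only: the `memory_states` keyword and output attribute are assumed names, not this repository's actual interface.

```python
import torch

SEGMENT_LEN = 2048  # segment size used by the training scripts in this repo


def train_on_long_sequence(model, optimizer, input_ids):
    """Hypothetical segment-wise step over one long batch of shape (batch, seq_len)."""
    optimizer.zero_grad()
    memory_states = None  # compressive memory carried across segments (assumed name)
    for start in range(0, input_ids.size(1), SEGMENT_LEN):
        segment = input_ids[:, start : start + SEGMENT_LEN]
        outputs = model(input_ids=segment, labels=segment,
                        memory_states=memory_states)
        memory_states = outputs.memory_states  # assumed detached inside the model
        outputs.loss.backward()  # per-segment backward keeps activation memory bounded
    optimizer.step()
```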

**Type II. Infini-attention in the attention layer only**
- Overrides the modeling Python file only, specifically the attention layer.
- Minimal edit; fully compatible with HF tooling (`Trainer`, etc.; see the sketch under step 5 of the Type II how-to below).
- Memory usage is roughly equal to SDPA (default) attention.
- Can train Gemma-2B with an 8,192-token sequence length (128×64) on 2× H100 80GB (Adafactor optimizer + gradient checkpointing).

## How to use Type I: Infini-attention model-wise and trainer-wise
### 1. Clone this repository
```bash
git clone https://github.com/Beomi/InfiniTransformer
```

### 2. Install dependencies
> We need to install 🤗 Transformers from source at commit `b109257f4f` (the latest version at the time of writing).
```bash
pip install -r requirements.txt
pip install -e git+https://github.com/huggingface/transformers.git@b109257f4f#egg=transformers
# or just pip install transformers
```

### 3. Run the example (inference, simple forward/backward test)
```bash
python test_basic.infini.py
```

### 4. Train with your data
Train Llama-3 with a 1M-token sequence length and a 2K segment size on the [MiniPile dataset](https://huggingface.co/datasets/JeanKaddour/minipile):
```bash
./train.llama.infini.noclm.1Mseq.sh
```

or

Train Gemma-2B with a 32K-token sequence length and a 2K segment size on the [WikiText-2 dataset](https://huggingface.co/datasets/wikitext):
```bash
./train.gemma.infini.noclm.sh
```

or

Train Gemma-2B with a 1M-token sequence length and a 2K segment size on the [MiniPile dataset](https://huggingface.co/datasets/JeanKaddour/minipile):
```bash
./train.gemma.infini.noclm.1Mseq.sh
```

## How to use Type II: Infini-attention in the attention layer only
### 1. Clone this repository
```bash
git clone https://github.com/Beomi/InfiniTransformer
```

### 2. Install dependencies
> We need to install 🤗 Transformers from source at commit `b109257f4f` (the latest version at the time of writing).
```bash
pip install -r requirements.txt
pip install -e git+https://github.com/huggingface/transformers.git@b109257f4f#egg=transformers
```

### 3. Remove the original `modeling_gemma.py` and create a symbolic link to the new `modeling_gemma.py`
```bash
python test_basic.infini.py
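# Note: the command above is the forward/backward test; the symlink step itself
# would look roughly like the following sketch. The paths in angle brackets are
# placeholders (assumptions), depending on where the editable transformers
# checkout and this repository's replacement file live on your machine.
# rm <transformers-checkout>/src/transformers/models/gemma/modeling_gemma.py
# ln -s $(pwd)/<new-modeling-dir>/modeling_gemma.py \
#       <transformers-checkout>/src/transformers/models/gemma/modeling_gemma.py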
```

### 4. Run the example (inference, simple forward/backward test)
```bash
python test_basic.py
```

### 5. Train with your data
```bash
python test_train.small.gemma.py
```
The example code uses `wikitext-2-raw-v1` from https://huggingface.co/datasets/wikitext.
A test wandb log is available here: https://api.wandb.ai/links/beomi2/1rsqrkfn
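
For reference, a generic run of this kind (the small Gemma model fine-tuned on `wikitext-2-raw-v1` with the stock HF `Trainer`) looks roughly like the sketch below. This is not the repository's `test_train.small.gemma.py`; the model name, sequence length, and hyperparameters are placeholders, and it assumes the patched `modeling_gemma.py` from step 3 is in place.

```python
# Generic HF Trainer run on wikitext-2-raw-v1 (a sketch, not the repo's script).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "google/gemma-2b"  # placeholder: any Gemma checkpoint the patch targets
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # picks up the patched attention

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
train_ds = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=raw.column_names,
).filter(lambda ex: len(ex["input_ids"]) > 1)  # drop empty lines

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./out",
        per_device_train_batch_size=1,
        gradient_checkpointing=True,
        num_train_epochs=1,
        report_to="wandb",  # optional; produces logs like the one linked above
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```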
### 6. Inference
```bash
python test_basic.trained.py
```
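
To run generation directly instead of through the script above, the usual 🤗 Transformers pattern looks like the following sketch; the checkpoint path and generation settings are placeholders, not what `test_basic.trained.py` uses.

```python
# Generic generation from a locally saved checkpoint (a sketch, not the repo's script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "./out"  # placeholder: wherever the trained model was saved
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
model.eval()

prompt = "This work introduces an efficient method to scale Transformer-based"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```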

**Sample generation with a 1-epoch trained model on WikiText-2**

Input:
> This work introduces an efficient method to scale Transformer-based
Output1:
> models for denoising , denoising denoising , and deep denoising of images of the U2 EPK model , using a coefficient that is a function of the depth of the image resolution . The paper experiments with image denoising by Turbo @-@ based filtering , denoising by generative adversarial networks , and video denoising by denoising each of the three elements of the video ( color of the pixels / frames ) . The results are considered fair . The video is not discussed . The paper is not considering an actual application in an industrial context , line is probably a 1 . It is built in the Nohmi…
Output2:
> vision models across platforms using a custom architecture optimized for both vision ( 3D / 2D ) and vision and language . In other words , a single model can run on different types of devices , a feature that is critical for the development of general @-@ purpose and large-scale AI ( see also : The One @-@ Model @-@ for @-@ All @-@ Things @-@ AI Problem ) . The model is the first to reach a global scale ( 200 GPU + ) on a single GPU using the Transformer and its variants . The model can run at the end of 1967 . He had his family relocated to a house in a nearby neighborhood , where they lived for five years , before returning to their primary residence in St. Petersburg . Later comments of 1968 made by his fellow musician Bruce Hornsby made it clear that he had gone through a lot , both personally and professionally .