https://github.com/OpenNLPLab/lightning-attention

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models

Host: GitHub
URL: https://github.com/OpenNLPLab/lightning-attention
Owner: OpenNLPLab
License: MIT
Created: 2024-01-09T14:28:40.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-04-24T07:45:07.000Z (about 2 years ago)
Last Synced: 2024-08-04T08:02:02.898Z (almost 2 years ago)
Language: Python
Homepage:
Size: 56.6 KB
Stars: 175
Watchers: 11
Forks: 15
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

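StarryDivineSky - OpenNLPLab/lightning-attention - Lightning Attention-2 is an open-source project aimed at handling unlimited sequence lengths in large language models. It offers a free solution that significantly extends a model's context window without additional training or fine-tuning. The project builds on the observation that the attention mechanism contains redundant computation, and it speeds up inference by reducing the amount of computation. The core idea is to identify and remove unimportant key-value pairs, thereby lowering computational complexity. Lightning Attention-2 applies to a variety of Transformer architectures and is easy to integrate into existing models. It focuses mainly on inference acceleration and provides an efficient attention implementation that lets models process longer sequences without a significant increase in computational cost. By cutting unnecessary computation, the project achieves higher throughput and lower latency. It supports multiple hardware platforms and provides detailed documentation and examples so that users can use and customize it easily. The project's goal is to help large language models better understand and generate long text, improving performance on a range of natural language processing tasks. It is a community-driven project and welcomes contributions and feedback. (Transformer libraries and optimization / Large language dialogue models and data)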

README

# Lightning Attention






## Introduction

This repository provides the official implementation of the Lightning Attention-1 and Lightning Attention-2 algorithms.

- [Lightning Attention-1](https://arxiv.org/abs/2307.14995)

- [Lightning Attention-2](https://arxiv.org/abs/2401.04658)

## Installation

```
pip install lightning_attn
```

The code has been tested in the following environment:

```
triton                   2.0.0
triton-nightly           2.1.0.dev20230728172942
```

You can install these versions with the following commands:

```
pip install triton==2.0.0
pip install triton-nightly==2.1.0.dev20230728172942 --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/
```
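To confirm which builds are active after installation, a quick check such as the following may help. It is a minimal sketch that uses only the Python standard library. The helper name `installed_version` is illustrative, and the distribution names are taken from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names as they appear in the pip commands above.
for name in ("triton", "triton-nightly", "lightning_attn"):
    print(name, installed_version(name))
```

A value of `None` means the distribution is not installed in the current environment.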

## How to use lightning attention

```
import torch
from lightning_attn.ops import lightning_attn_func
from lightning_attn.utils import _build_slope_tensor

dtype = torch.bfloat16
device = torch.device("cuda")

# b: batch size, h: heads, n: sequence length, d: q/k feature dim, e: v feature dim
b, h, n, d, e = 2, 12, 2048, 192, 192

q = torch.randn((b, h, n, d), dtype=dtype, device=device).requires_grad_()
k = torch.randn((b, h, n, d), dtype=dtype, device=device).requires_grad_()
v = torch.randn((b, h, n, e), dtype=dtype, device=device).requires_grad_()

# Per-head decay slopes, kept in float32
s = _build_slope_tensor(h).to(q.device).to(torch.float32)

o = lightning_attn_func(q, k, v, s)
print(o.shape)

# Backward pass through the attention output
loss = o.sum()
loss.backward()
```
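For intuition about what a call like this computes, here is a minimal pure-Python sketch of causal linear attention with exponential decay, written for a single head. It assumes the decayed form o_i = Σ_{j≤i} exp(-s·(i-j)) (q_i·k_j) v_j, with the slope `s` entering as a per-position decay factor exp(-s). That reading of the slope is an assumption, not something this README states. The function names are illustrative and are not part of `lightning_attn`. The sketch does not reproduce the tiled Triton kernels; it only shows that the quadratic form and a linear-time recurrence give the same output.

```python
import math


def decayed_attention_quadratic(q, k, v, s):
    """O(n^2) form: o_i = sum_{j<=i} exp(-s*(i-j)) * (q_i . k_j) * v_j."""
    n, e = len(q), len(v[0])
    out = [[0.0] * e for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions j <= i contribute
            score = sum(a * b for a, b in zip(q[i], k[j])) * math.exp(-s * (i - j))
            for c in range(e):
                out[i][c] += score * v[j][c]
    return out


def decayed_attention_recurrent(q, k, v, s):
    """O(n) form: kv_i = exp(-s) * kv_{i-1} + k_i^T v_i, then o_i = q_i kv_i."""
    d, e = len(k[0]), len(v[0])
    decay = math.exp(-s)
    kv = [[0.0] * e for _ in range(d)]  # running d x e state
    out = []
    for q_i, k_i, v_i in zip(q, k, v):
        for a in range(d):
            for c in range(e):
                kv[a][c] = decay * kv[a][c] + k_i[a] * v_i[c]
        out.append([sum(q_i[a] * kv[a][c] for a in range(d)) for c in range(e)])
    return out


# Tiny hand-written inputs: n = 3 positions, d = 2, e = 2 (values are arbitrary).
q = [[1.0, 0.0], [0.5, 1.0], [-1.0, 2.0]]
k = [[0.2, 0.1], [1.0, -0.5], [0.3, 0.7]]
v = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
s = 0.25  # hypothetical decay slope for this head

quad = decayed_attention_quadratic(q, k, v, s)
rec = decayed_attention_recurrent(q, k, v, s)
max_diff = max(abs(x - y) for r1, r2 in zip(quad, rec) for x, y in zip(r1, r2))
print(max_diff < 1e-9)
```

The recurrent form keeps only a d × e state per head, so its cost grows linearly with the sequence length rather than quadratically.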

## Benchmark

```
lightning2-speed_fwd-batch4-head32-qk_dim128-v_dim128-dtype_bf16:
         n  Lightning2      Flash2    Xformers
0    512.0    0.351540    0.094412    0.127568
1   1024.0    0.585876    0.232286    0.375690
2   2048.0    1.134238    0.754831    1.297325
3   4096.0    2.240815    2.740033    4.804503
4   8192.0    4.414397   10.392551   18.329409
5  16384.0    8.832678   40.573997   71.699486
6  32768.0   17.661427  162.895615  286.869446

lightning2-speed_bwd-batch4-head32-qk_dim128-v_dim128-dtype_bf16:
         n  Lightning2      Flash2     Xformers
0    512.0    1.169621    0.397422     0.797627
1   1024.0    2.334296    0.957989     2.027344
2   2048.0    4.657026    2.739919     5.976820
3   4096.0    9.307817    8.891191    19.931032
4   8192.0   18.617611   31.986572    72.536194
5  16384.0   37.212578  121.685730   276.402618
6  32768.0   74.594788  470.666473  1075.611450

lightning2-memory_fwd-batch4-head32-qk_dim128-v_dim128-dtype_bf16:
         n   Lightning2       Flash2     Xformers
0    512.0    64.000488    64.250977    64.250488
1   1024.0   128.000488   128.500977   128.500488
2   2048.0   256.000488   257.000977   257.000488
3   4096.0   512.000488   514.000977   514.000488
4   8192.0  1024.000488  1028.000977  1028.000488
5  16384.0  2048.000488  2056.000977  2056.000488
6  32768.0  4096.000488  4112.000977  4112.000488

lightning2-memory_bwd-batch4-head32-qk_dim128-v_dim128-dtype_bf16:
         n    Lightning2        Flash2      Xformers
0    512.0    173.600488    206.100977    270.100977
1   1024.0    347.200488    412.200977    540.200977
2   2048.0    694.400488    824.400977   1080.400977
3   4096.0   1388.800488   1648.800977   2160.800977
4   8192.0   2777.600488   3297.600977   4321.600977
5  16384.0   5555.200488   6595.200977   8643.200977
6  32768.0  11110.400488  13190.400977  17286.400977
```
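One way to read the forward-speed table is to look at how each column grows when n doubles. The sketch below copies the last two rows of the speed_fwd table above and computes the growth factor for each implementation. The values are exactly as reported, and the README does not state their unit. The variable names are illustrative.

```python
# Forward-pass values copied from the speed_fwd table (n = 16384 and n = 32768).
fwd = {
    "Lightning2": (8.832678, 17.661427),
    "Flash2": (40.573997, 162.895615),
    "Xformers": (71.699486, 286.869446),
}

# Growth factor when the sequence length doubles: about 2 suggests linear
# scaling in n, about 4 suggests quadratic scaling.
growth = {name: at_32k / at_16k for name, (at_16k, at_32k) in fwd.items()}

for name, factor in growth.items():
    print(f"{name}: x{factor:.2f}")
```

For these rows the factor is close to 2 for Lightning2 and close to 4 for Flash2 and Xformers.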

## Todo

- [ ] Add support for the parallel version of lightning attention.
- [x] Add support for linear attention with no decay.
- [ ] Add support for linear attention with data-dependent decay.
- [ ] Add a block size for the 3090.
- [ ] Add an efficient version that handles feature dimensions that are not a power of 2.

## Citation

If you find our work useful, please cite the following papers:

```
@misc{qin2024transnormerllm,
      title={TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer},
      author={Zhen Qin and Dong Li and Weigao Sun and Weixuan Sun and Xuyang Shen and Xiaodong Han and Yunshen Wei and Baohong Lv and Xiao Luo and Yu Qiao and Yiran Zhong},
      year={2024},
      eprint={2307.14995},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{qin2024lightning,
      title={Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models},
      author={Zhen Qin and Weigao Sun and Dong Li and Xuyang Shen and Weixuan Sun and Yiran Zhong},
      year={2024},
      eprint={2401.04658},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## Acknowledgment

Thanks to [sustcsonglin](https://github.com/sustcsonglin) and [yzhangcs](https://github.com/yzhangcs) for the helpful discussions. You may also find [flash-linear-attention](https://github.com/sustcsonglin/flash-linear-attention) useful.
