https://github.com/shreyansh26/attention-mask-patterns

# Attention Mask Patterns

Using FlexAttention to compute attention with different masking patterns.

The speedup over F.scaled_dot_product_attention/xFormers and FlashAttention-2 tends to increase with sequence length. Timing plots are shown for different sequence lengths; the sequence length used is noted in each plot's title.
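FlexAttention expresses each pattern below as a `mask_mod` predicate over `(batch, head, q_idx, kv_idx)` index pairs, which `create_block_mask` compiles into a sparse block mask. As a torch-free illustration (the `materialize_mask` helper is hypothetical, not part of the FlexAttention API), such a predicate defines a dense boolean mask like this:

```python
# Illustrative only: FlexAttention compiles mask_mod predicates into sparse
# block masks; this hypothetical helper materializes one densely instead.

def materialize_mask(mask_mod, q_len, kv_len, b=0, h=0):
    """Dense q_len x kv_len boolean mask from a mask_mod-style predicate."""
    return [[bool(mask_mod(b, h, q, k)) for k in range(kv_len)]
            for q in range(q_len)]

def full_attention(b, h, q_idx, kv_idx):
    # Trivial predicate: every query attends to every key.
    return True
```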

### Causal mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/causal/mask.png) | ![](plots/causal/timing.png)
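A minimal sketch of the corresponding predicate, in plain Python but following the FlexAttention `mask_mod(b, h, q_idx, kv_idx)` convention:

```python
def causal_mask(b, h, q_idx, kv_idx):
    # Each query attends only to keys at the same or earlier positions.
    return q_idx >= kv_idx
```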

### Causal sliding window mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/causal_sliding_window/mask.png) | ![](plots/causal_sliding_window/timing.png)
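A sketch of this pattern as a predicate; `WINDOW` is an illustrative parameter (the repo's actual window size may differ):

```python
WINDOW = 4  # illustrative window size

def causal_sliding_window_mask(b, h, q_idx, kv_idx):
    # Causal, and within the previous WINDOW positions (inclusive of self).
    return 0 <= q_idx - kv_idx <= WINDOW
```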

### Bidirectional sliding window mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/bidirectional_sliding_window/mask.png) | ![](plots/bidirectional_sliding_window/timing.png)
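The bidirectional variant simply drops the causal constraint; `WINDOW` is again illustrative:

```python
WINDOW = 4  # illustrative window size

def bidirectional_sliding_window_mask(b, h, q_idx, kv_idx):
    # Attend to any key within WINDOW positions in either direction.
    return abs(q_idx - kv_idx) <= WINDOW
```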

### Bidirectional dilated sliding window mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/bidirectional_dilated_sliding_window/mask.png) | ![](plots/bidirectional_dilated_sliding_window/timing.png)

### Bidirectional global + local sliding window attention mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/bidirectional_local_sliding_window_global_attention/mask.png) | ![](plots/bidirectional_local_sliding_window_global_attention/timing.png)
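One common form of this pattern (Longformer-style) designates a few global tokens that attend everywhere and are attended by everyone, with a local window for the rest. A sketch under that assumption, with illustrative parameters:

```python
WINDOW = 4      # illustrative local window size
NUM_GLOBAL = 2  # illustrative: first NUM_GLOBAL tokens are global

def global_local_mask(b, h, q_idx, kv_idx):
    # Global tokens see and are seen by everything; others are local.
    is_global = q_idx < NUM_GLOBAL or kv_idx < NUM_GLOBAL
    is_local = abs(q_idx - kv_idx) <= WINDOW
    return is_global or is_local
```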

### PrefixLM mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/prefix_lm/mask.png) | ![](plots/prefix_lm/timing.png)
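PrefixLM attention is bidirectional over a prefix and causal afterwards. A sketch, with `PREFIX_LEN` as an illustrative parameter:

```python
PREFIX_LEN = 3  # illustrative prefix length

def prefix_lm_mask(b, h, q_idx, kv_idx):
    # Any query may see prefix keys; beyond the prefix, attention is causal.
    return kv_idx < PREFIX_LEN or q_idx >= kv_idx
```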

### Multi-document bidirectional mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/multi_document_bidirectional_mask/mask.png) | ![](plots/multi_document_bidirectional_mask/timing.png)
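For multi-document masks, attention is restricted to tokens of the same packed document. In FlexAttention the document ids would typically be a tensor captured by the `mask_mod` closure; here an illustrative Python list stands in:

```python
# Illustrative layout: tokens 0-2 belong to doc 0, tokens 3-5 to doc 1.
DOC_ID = [0, 0, 0, 1, 1, 1]

def multi_doc_bidirectional_mask(b, h, q_idx, kv_idx):
    # Attend freely within a document, never across documents.
    return DOC_ID[q_idx] == DOC_ID[kv_idx]
```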

### Multi-document causal mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/multi_document_causal_mask/mask.png) | ![](plots/multi_document_causal_mask/timing.png)
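The causal variant adds the usual ordering constraint on top of the same-document check (illustrative document layout as before):

```python
# Illustrative layout: tokens 0-2 belong to doc 0, tokens 3-5 to doc 1.
DOC_ID = [0, 0, 0, 1, 1, 1]

def multi_doc_causal_mask(b, h, q_idx, kv_idx):
    # Causal attention, confined to the query's own document.
    return DOC_ID[q_idx] == DOC_ID[kv_idx] and q_idx >= kv_idx
```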

### Multi-document PrefixLM mask
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/multi_document_prefix_lm_mask/mask.png) | ![](plots/multi_document_prefix_lm_mask/timing.png)
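A sketch combining both ideas: PrefixLM applied per packed document. The layout, start offsets, and prefix lengths below are all illustrative stand-ins for what would be tensors in a real `mask_mod`:

```python
# Illustrative layout: two packed documents with per-document prefixes.
DOC_ID = [0, 0, 0, 1, 1, 1]
DOC_START = {0: 0, 1: 3}   # first token index of each document
PREFIX_LEN = {0: 2, 1: 1}  # prefix length of each document

def multi_doc_prefix_lm_mask(b, h, q_idx, kv_idx):
    d = DOC_ID[q_idx]
    if DOC_ID[kv_idx] != d:
        return False  # never attend across documents
    # Bidirectional over the document's prefix, causal afterwards.
    in_prefix = kv_idx < DOC_START[d] + PREFIX_LEN[d]
    return in_prefix or q_idx >= kv_idx
```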

### Stand-alone Self-Attention mask
(Reference - [attention-gym repo](https://github.com/pytorch-labs/attention-gym/blob/75867424a1d4391bff49527029d3612a09dd67e2/examples/flex_attn.ipynb))
Mask | Execution Time
:-------------------------:|:-------------------------:
![](plots/standalone_self_attention/mask.png) | ![](plots/standalone_self_attention/timing.png)
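Stand-alone self-attention operates over 2D feature maps, with each pixel attending within a local spatial neighborhood. A sketch under the assumption that tokens are a row-major flattening of an image; `IMG_W` and `RADIUS` are illustrative:

```python
IMG_W = 8   # illustrative feature-map width (row-major flattening)
RADIUS = 1  # illustrative spatial neighborhood radius

def standalone_self_attention_mask(b, h, q_idx, kv_idx):
    # Recover 2D pixel coordinates, then test 2D locality.
    q_row, q_col = divmod(q_idx, IMG_W)
    k_row, k_col = divmod(kv_idx, IMG_W)
    return abs(q_row - k_row) <= RADIUS and abs(q_col - k_col) <= RADIUS
```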

## Requirements
* PyTorch Nightly (for FlexAttention, to be released with PyTorch 2.5)
* Refer to `requirements.txt` for other requirements