https://github.com/sid3503/sparse-attention
PyTorch-style strided sparse attention with configurable strides, local+global token support, and memory-efficient masking.
- Host: GitHub
- URL: https://github.com/sid3503/sparse-attention
- Owner: Sid3503
- Created: 2025-05-05T05:24:51.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-06T07:19:14.000Z (about 1 year ago)
- Last Synced: 2025-06-21T08:08:58.627Z (11 months ago)
- Topics: llm, sparsity, text-processing
- Language: Jupyter Notebook
- Homepage:
- Size: 712 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# README
## What is Attention?
In a Transformer, **attention** helps a word (or token) focus on *other important words* when trying to understand the meaning of a sentence.
For example:
> *"The cat sat on the mat."*
When processing `"sat"`, the model might look at `"cat"` to understand who is sitting. That's attention.
---
## What is Sparse Attention?
**Sparse Attention** means:
> *"Don't look at **every** word; just look at a **few** important ones."*
Regular attention (called **full attention**) looks at **all** the tokens in a sequence. If the sentence is long, that becomes very expensive (slow and memory-heavy).
So, sparse attention is like:
> "Let's save time and memory by only attending to nearby words or a few important global words."
---
## Analogy: Ordering Pizza
Imagine you're throwing a party and want to ask your **friends** what pizza to order.
* **Full Attention**: You ask **everyone** at the party, even people you don't know well. That takes time.
* **Sparse Attention**: You only ask the people **near you** or the ones who **always have good taste** (like your foodie friend). That's faster and usually good enough.
---
## Types of Sparse Attention (Examples)
1. **Local Attention**:
* Only look at nearby tokens.
* Example: In the sentence `"The cat sat on the mat"`, when looking at `"sat"`, only check `"cat"` and `"on"`.
2. **Global Tokens**:
* Some special tokens (like headings or keywords) can attend to *everything*, and everything can attend to them.
* Example: In a document, the word `"Title:"` might be a global token. All other tokens can look at it, no matter where it is.
3. **Strided Attention**:
* Look at every 2nd or 3rd token.
* Example: Look at token 0, 2, 4, 6, etc.
---
## Why Use Sparse Attention?
Because attention across **long sequences** is **slow** and uses **a lot of memory**.
| Attention Type | Speed   | Memory Use | Accuracy                    |
| -------------- | ------- | ---------- | --------------------------- |
| Full           | ❌ Slow | ❌ High    | ✅ High                     |
| Sparse         | ✅ Fast | ✅ Low     | ✅ Good (if designed well)  |
That's why models like **Longformer**, **BigBird**, and your **VerticalSlashAttention** use sparse attention to scale better.
---
## Vertical Slash Attention (your example)
Your code does something like this:
1. Each token looks at a **window** of nearby tokens (local attention).
2. Also lets every token look at some **global tokens**.
So it's like saying:
> "I'll mostly look around me, but I'll also glance at the important headers."
---
## Example Sentence:
> `"The quick brown fox jumps over the lazy dog"`
Let's say we convert each word into a **token**. So we have:
```
[0] The
[1] quick
[2] brown
[3] fox
[4] jumps
[5] over
[6] the
[7] lazy
[8] dog
```
Let's imagine we're applying **sparse attention with a local window size of 2**:
* Each token can only "see" **itself** and **2 tokens before and after** it.
* This is **local attention**.
---
### Full Attention (Just for comparison)
In full attention, every token attends to **all tokens**:
|     | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| --- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| 1   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| 2   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| ... |    |    |    |    |    |    |    |    |    |
| 8   | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
Too much for long sequences!
---
### Sparse Attention (Window = 2)
Let's define a simple rule:
* A token at position `i` can attend to: `i-2, i-1, i, i+1, i+2` (within bounds).
Example:
For token `fox [3]`, it can attend to `[1] quick`, `[2] brown`, `[3] fox`, `[4] jumps`, `[5] over`.
Now let's build the matrix:
|   | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |
| - | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 0 | ✅ | ✅ | ✅ |    |    |    |    |    |    |
| 1 | ✅ | ✅ | ✅ | ✅ |    |    |    |    |    |
| 2 | ✅ | ✅ | ✅ | ✅ | ✅ |    |    |    |    |
| 3 |    | ✅ | ✅ | ✅ | ✅ | ✅ |    |    |    |
| 4 |    |    | ✅ | ✅ | ✅ | ✅ | ✅ |    |    |
| 5 |    |    |    | ✅ | ✅ | ✅ | ✅ | ✅ |    |
| 6 |    |    |    |    | ✅ | ✅ | ✅ | ✅ | ✅ |
| 7 |    |    |    |    |    | ✅ | ✅ | ✅ | ✅ |
| 8 |    |    |    |    |    |    | ✅ | ✅ | ✅ |
Much sparser and faster for long texts!
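The window rule above can be sketched as a boolean mask in plain Python (the function name is illustrative, not taken from the repo):

```python
def local_window_mask(n, window=2):
    """True where |i - j| <= window: each token sees itself and
    `window` neighbours on each side, matching the table above."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

mask = local_window_mask(9, window=2)
# Token 3 ("fox") attends to positions 1..5 (quick, brown, fox, jumps, over).
print([j for j in range(9) if mask[3][j]])  # [1, 2, 3, 4, 5]
```

In a real model this mask would be applied to the attention score matrix (e.g. by setting masked positions to -inf before the softmax).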
---
### Add Global Attention (Optional)
Let's say the word `"The"` at position `0` is a **global token**.
* Every token can look at token 0.
* Token 0 can look at all tokens.
Update:
|   | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  |          |
| - | -- | -- | -- | -- | -- | -- | -- | -- | -- | -------- |
| 0 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ← Global |
| 1 | ✅ | ✅ | ✅ | ✅ |    |    |    |    |    |          |
| 2 | ✅ | ✅ | ✅ | ✅ | ✅ |    |    |    |    |          |
| 3 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |    |    |    |          |
| 4 | ✅ |    | ✅ | ✅ | ✅ | ✅ | ✅ |    |    |          |
| 5 | ✅ |    |    | ✅ | ✅ | ✅ | ✅ | ✅ |    |          |
| 6 | ✅ |    |    |    | ✅ | ✅ | ✅ | ✅ | ✅ |          |
| 7 | ✅ |    |    |    |    | ✅ | ✅ | ✅ | ✅ |          |
| 8 | ✅ |    |    |    |    |    | ✅ | ✅ | ✅ |          |
Global token adds **long-distance awareness**!
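This local-plus-global combination (the pattern the repo's description calls local+global token support) can be sketched as follows; the helper name and signature are illustrative:

```python
def local_plus_global_mask(n, window=2, global_tokens=(0,)):
    """Local band of +/- `window`, plus full rows/columns for global
    tokens: everyone can see a global token, and it can see everyone."""
    mask = [[abs(i - j) <= window for j in range(n)] for i in range(n)]
    for g in global_tokens:
        for t in range(n):
            mask[g][t] = mask[t][g] = True
    return mask

mask = local_plus_global_mask(9, window=2, global_tokens=(0,))
# Token 8 ("dog") now also sees token 0 ("The"), far outside its window.
print([j for j in range(9) if mask[8][j]])  # [0, 6, 7, 8]
```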
---
## Techniques for Sparse Attention Implementation
### 1. **Unit of Sparsification** (Where to apply attention)
This is about the **pattern** or **shape** of attention: who can "talk" to whom.
#### A. Local Window
**Idea**: Each token can only see its neighbors.
**Example**: Sentence = `A B C D E`
If we use a window of size 3, then:
```
C → [B C D]   (C can see B, C, D)
E → [D E]     (E can only see D and itself)
```
Like how you talk to your friends sitting next to you in class.
---
#### B. Global Tokens
**Idea**: Some special tokens (like `[CLS]`) can see all others, and all others can see them.
**Example**:
```
Tokens: [CLS] A B C D
[CLS] → [A B C D]
A     → [CLS A]
```
Useful when you need a summary token (like a team leader who listens to everyone).
---
#### C. Diagonals / Strided
**Idea**: Each token looks at past tokens with a fixed gap.
**Example**:
```
T0 T1 T2 T3 T4 T5
T5 → [T1, T3, T5]  (every 2nd token, counting back from T5)
```
Like checking every 2nd page in a book.
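One common reading of a strided pattern is "every `stride`-th earlier position, plus the token itself"; a minimal sketch under that assumption (the helper name is illustrative):

```python
def strided_indices(i, stride=2):
    """Positions token i attends to under a causal strided pattern:
    every `stride`-th position walking back from i, plus i itself."""
    return list(range(i % stride, i + 1, stride))

print(strided_indices(5, stride=2))  # [1, 3, 5]
print(strided_indices(6, stride=2))  # [0, 2, 4, 6]
```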
---
#### D. Block Attention
**Idea**: Group tokens into chunks (like 4 words at a time) and attend between groups.
**Example**:
Tokens = `[A B C D] [E F G H]` (two blocks)
```
Block 1 → Block 1 and Block 2
```
Instead of person-to-person, it's **group-to-group** talk.
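Splitting positions into blocks like this can be sketched in a few lines (a toy helper, not the repo's code); block-to-block attention then operates on these index groups instead of individual tokens:

```python
def make_blocks(num_tokens, block_size):
    """Split token positions into contiguous blocks of `block_size`;
    the last block may be shorter if num_tokens isn't divisible."""
    return [list(range(start, min(start + block_size, num_tokens)))
            for start in range(0, num_tokens, block_size)]

print(make_blocks(8, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```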
---
### 2. **Importance Estimation** (Which tokens are important?)
This decides **which tokens to keep** based on rules.
#### A. Fixed Importance
**Idea**: Always use the same rule.
**Example**: Always keep the first 2 tokens.
```
Keep: T0, T1 (no matter the sentence)
```
Simple but not smart, like always picking the first 2 students in line.
---
#### B. Dynamic Importance
**Idea**: Use token properties, like attention scores or embedding norms, to decide.
**Example**:
Tokens = T0, T1, T2, T3
Embedding norms: `[1.2, 2.0, 0.5, 2.5]`
Keep the tokens with the top 2 norms → T1 and T3
```
Keep: T1 (2.0), T3 (2.5)
```
Like picking top scorers in a test dynamically.
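A top-k selection like the one above can be sketched as follows (using embedding norms as an illustrative importance proxy; real systems often use attention scores instead):

```python
def top_k_by_norm(norms, k=2):
    """Return the indices of the k tokens with the largest norms,
    in original order."""
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:k])

print(top_k_by_norm([1.2, 2.0, 0.5, 2.5], k=2))  # [1, 3] -> keep T1 and T3
```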
---
### 3. **Budget Allocation** (How many tokens to keep)
This is about **how many** tokens each part of the model gets to use.
#### A. Uniform Budget
**Idea**: Each head or layer gets the same number of important tokens.
**Example**: Every attention head gets to keep 4 tokens.
```
Head 1 → 4 tokens
Head 2 → 4 tokens
```
Fair, but not always efficient.
---
#### B. Adaptive Budget
**Idea**: More important layers or heads get more budget.
**Example**:
```
Layer 1 → 8 tokens
Layer 6 → 3 tokens
```
Like giving senior employees more resources.
---
### 4. **KV Cache Management** (Which keys/values to keep during decoding)
This is for **text generation**, where you can't keep all history due to memory limits.
#### A. Sliding Window
**Idea**: Only keep the last N tokens.
**Example**: N = 3, and the sentence so far = A B C D E
```
Keep: C D E (drop A, B)
```
Like only remembering the last few lines of a conversation.
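A toy sliding-window cache can be sketched with a bounded deque (class name and shape are illustrative; a real KV cache would store key/value tensors per layer):

```python
from collections import deque

class SlidingWindowCache:
    """Keeps only the last `n` entries; older ones are dropped
    automatically by deque's maxlen."""
    def __init__(self, n):
        self.entries = deque(maxlen=n)

    def append(self, token):
        self.entries.append(token)

cache = SlidingWindowCache(3)
for tok in ["A", "B", "C", "D", "E"]:
    cache.append(tok)
print(list(cache.entries))  # ['C', 'D', 'E'] (A and B were dropped)
```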
---
#### B. Attention Score Based
**Idea**: Keep tokens that were most useful in the past (high attention).
**Example**:
Tokens: A B C D
Attention scores: `[0.1, 0.9, 0.3, 0.8]`
Keep top-2: B and D
```
Drop: A and C
```
Like keeping people you talked to the most at a party.
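Score-based eviction as described above can be sketched as a top-k filter (the function name is illustrative):

```python
def keep_by_score(tokens, scores, keep=2):
    """Keep only the `keep` tokens with the highest past attention
    scores, preserving their original order; the rest are evicted."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return [tokens[i] for i in kept]

print(keep_by_score(["A", "B", "C", "D"], [0.1, 0.9, 0.3, 0.8]))
# ['B', 'D'] -- A and C are dropped
```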
---
#### C. Combine Strategies
**Idea**: Keep recent + most useful tokens.
**Example**:
Keep: the last 2 tokens + any token with a high score
```
Result: D E + B (if B had high attention)
```
Best of both worlds.
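The combined rule can be sketched as follows; the threshold of 0.8 is an illustrative choice, not a fixed standard:

```python
def combined_keep(tokens, scores, recent=2, score_threshold=0.8):
    """Keep the last `recent` tokens plus any earlier token whose
    past attention score exceeds `score_threshold`."""
    keep = set(range(len(tokens) - recent, len(tokens)))  # recency rule
    keep |= {i for i, s in enumerate(scores) if s > score_threshold}
    return [tokens[i] for i in sorted(keep)]

print(combined_keep(["A", "B", "C", "D", "E"],
                    [0.1, 0.9, 0.3, 0.2, 0.4]))  # ['B', 'D', 'E']
```

D and E survive because they are recent; B survives because its score is high, matching the "D E + B" example above.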
---
## Summary (TL;DR)
| Concept | Meaning |
| ---------------- | ---------------------------------------------------------------------- |
| Attention | Helps tokens focus on others to understand meaning. |
| Full Attention | Looks at **all** tokens (slow for long sequences). |
| Sparse Attention | Looks at a **few** nearby or important tokens (faster). |
---