{"id":28899270,"url":"https://github.com/sid3503/sparse-attention","last_synced_at":"2026-01-31T20:03:01.657Z","repository":{"id":291528965,"uuid":"977871093","full_name":"Sid3503/sparse-attention","owner":"Sid3503","description":"PyTorch-style strided sparse attention with configurable strides, local+global token support, and memory-efficient masking.","archived":false,"fork":false,"pushed_at":"2025-05-06T07:19:14.000Z","size":729,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-21T08:08:58.627Z","etag":null,"topics":["llm","sparsity","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sid3503.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-05T05:24:51.000Z","updated_at":"2025-05-06T18:55:24.000Z","dependencies_parsed_at":"2025-05-05T07:36:48.129Z","dependency_job_id":"667040b2-70d4-4a3c-bc1b-957ed7c92187","html_url":"https://github.com/Sid3503/sparse-attention","commit_stats":null,"previous_names":["sid3503/sparse-attention"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Sid3503/sparse-attention","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sid3503%2Fsparse-attention","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sid3503%2Fsparse-attention/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sid3503%2Fsparse-attention/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sid3503%2Fsparse-attention/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sid3503","download_url":"https://codeload.github.com/Sid3503/sparse-attention/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sid3503%2Fsparse-attention/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28952578,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-31T18:30:42.805Z","status":"ssl_error","status_checked_at":"2026-01-31T18:30:19.593Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","sparsity","text-processing"],"created_at":"2025-06-21T08:08:58.190Z","updated_at":"2026-01-31T20:03:01.649Z","avatar_url":"https://github.com/Sid3503.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## 🔍 What is Attention?\n\nIn a Transformer, **attention** helps a word (or token) focus on *other important words* when trying to understand the meaning of a sentence.\n\nFor example:\n\n\u003e *\"The cat sat on the mat.\"*\n\nWhen processing “sat”, the model might look at “cat” to understand who is sitting. 
That’s attention.

---

## 💡 What is Sparse Attention?

**Sparse Attention** means:

> *“Don’t look at **every** word—just look at a **few** important ones.”*

Regular attention (called **full attention**) looks at **all** the tokens in a sequence. If the sequence is long, that becomes very expensive (slow and memory-heavy).

So, sparse attention is like:

> “Let’s save time and memory by only attending to nearby words or a few important global words.”

---

## 🍕 Analogy: Ordering Pizza

Imagine you’re throwing a party and want to ask your **friends** what pizza to order.

* **Full Attention**: You ask **everyone** at the party, even people you don’t know well. That takes time.
* **Sparse Attention**: You only ask the people **near you** or the ones who **always have good taste** (like your foodie friend). That’s faster and usually good enough.

---

## ✅ Types of Sparse Attention (Examples)

1. **Local Attention**:

   * Only look at nearby tokens.
   * Example: In the sentence `"The cat sat on the mat"`, when looking at `"sat"`, only check `"cat"` and `"on"`.

2. **Global Tokens**:

   * Some special tokens (like headings or keywords) can attend to *everything*, and everything can attend to them.
   * Example: In a document, the word `"Title:"` might be a global token. All other tokens can look at it, no matter where it is.

3. **Strided Attention**:

   * Only look at every 2nd or 3rd token.
   * Example: Look at tokens 0, 2, 4, 6, etc.

---

## 🧠 Why Use Sparse Attention?

Because attention across **long sequences** is **slow** and uses **a lot of memory**.

| Attention Type | Speed  | Memory Use | Accuracy                  |
| -------------- | ------ | ---------- | ------------------------- |
| Full           | ❌ Slow | ❌ High     | ✅ High                    |
| Sparse         | ✅ Fast | ✅ Low      | ✅ Good (if designed well) |

That’s why models like **Longformer**, **BigBird**, and this repository’s **VerticalSlashAttention** use sparse attention to scale better.

---

## 👀 Vertical Slash Attention (this repo’s example)

The code in this repository does something like this:

1. Each token looks at a **window** of nearby tokens (local attention).
2. It also lets every token look at some **global tokens**.

So it’s like saying:

> “I’ll mostly look around me, but I’ll also glance at the important headers.”

---

## 🧸 Sentence:

> `"The quick brown fox jumps over the lazy dog"`

Let’s say we convert each word into a **token**. So we have:

```
[0] The
[1] quick
[2] brown
[3] fox
[4] jumps
[5] over
[6] the
[7] lazy
[8] dog
```

Let’s imagine we’re applying **sparse attention with a local window size of 2**:

* Each token can only “see” **itself** and the **2 tokens before and after** it.
* This is **local attention**.

---

### 🧠 Full Attention (Just for comparison)

In full attention, every token attends to **all tokens**:

|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| --- | - | - | - | - | - | - | - | - | - |
| 0   | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| 1   | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| 2   | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| ... |   |   |   |   |   |   |   |   |   |
| 8   | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |

Too much for long sequences! ❌
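To make that cost concrete, here is a small plain-Python sketch (illustrative only, not code from this repository) that counts how many query–key pairs full attention evaluates versus the window-of-2 rule used in the next section:

```python
# Count how many query-key pairs each scheme evaluates for the
# 9-token sentence above. Illustrative sketch, not repo code.

n = 9        # tokens in "The quick brown fox jumps over the lazy dog"
window = 2   # each token sees itself plus 2 neighbors on each side

full_pairs = n * n   # full attention: every token attends to every token
sparse_pairs = sum(
    1
    for i in range(n)
    for j in range(n)
    if abs(i - j) <= window   # local-window rule: |i - j| <= 2
)

print(full_pairs)    # 81
print(sparse_pairs)  # 39
```

Even at 9 tokens the sparse pattern evaluates less than half the pairs, and the gap grows quadratically with sequence length.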
---

### ✅ Sparse Attention (Window = 2)

Let’s define a simple rule:

* A token at position `i` can attend to: `i-2, i-1, i, i+1, i+2` (within bounds).

Example:
For token `fox [3]`, it can attend to `[1] quick`, `[2] brown`, `[3] fox`, `[4] jumps`, `[5] over`.

Now let’s build the matrix:

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| - | - | - | - | - | - | - | - | - | - |
| 0 | ✔ | ✔ | ✔ |   |   |   |   |   |   |
| 1 | ✔ | ✔ | ✔ | ✔ |   |   |   |   |   |
| 2 | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |   |   |
| 3 |   | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |   |
| 4 |   |   | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |
| 5 |   |   |   | ✔ | ✔ | ✔ | ✔ | ✔ |   |
| 6 |   |   |   |   | ✔ | ✔ | ✔ | ✔ | ✔ |
| 7 |   |   |   |   |   | ✔ | ✔ | ✔ | ✔ |
| 8 |   |   |   |   |   |   | ✔ | ✔ | ✔ |

✅ Much sparser, and faster for long texts!

---

### 🚀 Add Global Attention (Optional)

Let’s say the word `"The"` at position `0` is a **global token**.

* Every token can look at token 0.
* Token 0 can look at all tokens.

Updated matrix (row 0 is the global token):

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| - | - | - | - | - | - | - | - | - | - |
| 0 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| 1 | ✔ | ✔ | ✔ | ✔ |   |   |   |   |   |
| 2 | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |   |   |
| 3 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |   |
| 4 | ✔ |   | ✔ | ✔ | ✔ | ✔ | ✔ |   |   |
| 5 | ✔ |   |   | ✔ | ✔ | ✔ | ✔ | ✔ |   |
| 6 | ✔ |   |   |   | ✔ | ✔ | ✔ | ✔ | ✔ |
| 7 | ✔ |   |   |   |   | ✔ | ✔ | ✔ | ✔ |
| 8 | ✔ |   |   |   |   |   | ✔ | ✔ | ✔ |

The global token adds **long-distance awareness**!

---

## Techniques for Sparse Attention Implementation

### 🧩 1. **Unit of Sparsification** (Where to apply attention)

This is about the **pattern** or **shape** of attention — who can "talk" to whom.

#### A. Local Window

📦 **Idea**: Each token can only see its neighbors.

👶 **Example**: Sentence = `A B C D E`

If we use a window of size 3, then:

```
C → [B C D]  (C can see B, C, D)
E → [D E]    (E can only see D and itself)
```

Like how you talk to the friends sitting next to you in class.

---

#### B. Global Tokens

📦 **Idea**: Some special tokens (like [CLS]) can see all others, and all others can see them.

👶 **Example**:

```
Tokens: [CLS] A B C D
[CLS] → [A B C D]
A → [CLS A]
```

Useful when you need a summary token (like a team leader who listens to everyone).

---

#### C. Diagonals / Strided

📦 **Idea**: Each token looks at past tokens with a fixed gap.

👶 **Example** (stride of 2):

```
T0 T1 T2 T3 T4 T5

T5 → [T1, T3, T5]   (every 2nd token, looking back)
```

Like checking every 2nd page in a book.

---

#### D. Block Attention

📦 **Idea**: Group tokens into chunks (like 4 words at a time) and attend between groups.

👶 **Example**:

Tokens = [A B C D] [E F G H] (two blocks)

```
Block 1 → Block 1 and Block 2
```

Instead of person-to-person, it’s **group-to-group** talk.

---

### 🧩 2. **Importance Estimation** (Which tokens are important?)

This decides **which tokens to keep**, based on rules.

#### A. Fixed Importance

📦 **Idea**: Always use the same rule.

👶 **Example**: Always keep the first 2 tokens.

```
Keep: T0, T1 (no matter the sentence)
```

Simple but not smart — like always picking the first 2 students in line.

---

#### B. Dynamic Importance

📦 **Idea**: Use token properties, like attention scores or embedding norms, to decide.

👶 **Example**:

Tokens = T0, T1, T2, T3
Embedding norms: [1.2, 2.0, 0.5, 2.5]

Keep the tokens with the top 2 norms → T1 and T3

```
Keep: T1 (2.0), T3 (2.5)
```

Like dynamically picking the top scorers on a test.

---

### 🧩 3. **Budget Allocation** (How many tokens to keep)

This is about **how many** tokens each part of the model gets to use.

#### A. Uniform Budget

📦 **Idea**: Each head or layer gets the same number of important tokens.

👶 **Example**: Every attention head gets to keep 4 tokens.

```
Head 1 → 4 tokens
Head 2 → 4 tokens
```

Fair, but not always efficient.

---

#### B. Adaptive Budget

📦 **Idea**: More important layers or heads get a bigger budget.

👶 **Example**:

```
Layer 1 → 8 tokens
Layer 6 → 3 tokens
```

Like giving senior employees more resources.

---

### 🧩 4. **KV Cache Management** (Which keys/values to keep during decoding)

This matters for **text generation**, where you can’t keep the entire history due to memory limits.

#### A. Sliding Window

📦 **Idea**: Only keep the last N tokens.

👶 **Example**: N = 3, and the sentence so far = A B C D E

```
Keep: C D E (drop A, B)
```

Like only remembering the last few lines of a conversation.

---

#### B. Attention Score Based

📦 **Idea**: Keep tokens that were most useful in the past (high attention scores).

👶 **Example**:

Tokens: A B C D
Attention scores: [0.1, 0.9, 0.3, 0.8]

Keep the top 2: B and D

```
Drop: A and C
```

Like keeping track of the people you talked to the most at a party.

---

#### C. Combine Strategies

📦 **Idea**: Keep the most recent tokens *plus* the most useful ones.

👶 **Example**:

Keep: the last 2 tokens + any token with a high score

```
Result: D E + B (if B had high attention)
```

Best of both worlds.

---

## Summary (TL;DR)

| Concept          | Meaning                                                 |
| ---------------- | ------------------------------------------------------- |
| Attention        | Helps tokens focus on others to understand meaning.     |
| Full Attention   | Looks at **all** tokens (slow for long sequences).      |
| Sparse Attention | Looks at a **few** nearby or important tokens (faster). |

---
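The local-window and global-token matrices shown earlier can be built in a few lines of plain Python. This is a sketch to make the masking rules concrete, not this repository's actual implementation:

```python
# Build the window-of-2 attention mask from this README, then add a
# global token as in the "Add Global Attention" section. Sketch only.

def local_window_mask(n, window):
    """mask[i][j] is True when token i may attend to token j."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def add_global_token(mask, g):
    """Token g attends to everything, and everything attends to g."""
    n = len(mask)
    for j in range(n):
        mask[g][j] = True   # the global token sees all tokens
    for i in range(n):
        mask[i][g] = True   # all tokens see the global token
    return mask

mask = local_window_mask(9, 2)
# "fox" at position 3 sees positions 1..5, matching the window-2 table
print([j for j in range(9) if mask[3][j]])   # [1, 2, 3, 4, 5]

mask = add_global_token(mask, 0)
# "dog" at position 8 now also sees the global token at position 0
print([j for j in range(9) if mask[8][j]])   # [0, 6, 7, 8]
```

In practice such a boolean mask is passed to the attention score computation so that disallowed positions are set to `-inf` before the softmax.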
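The strided pattern from the "Diagonals / Strided" section can be sketched the same way. The `stride` parameter here is an assumption for illustration; the repository exposes its own stride configuration:

```python
# Strided (causal) attention sketch: token i attends to earlier tokens
# at a fixed gap, e.g. every 2nd token back from itself.

def strided_mask(n, stride):
    """mask[i][j] is True when j <= i and (i - j) is a multiple of stride."""
    return [[j <= i and (i - j) % stride == 0 for j in range(n)]
            for i in range(n)]

mask = strided_mask(6, 2)
# T5 with stride 2 sees T1, T3, T5, matching the example above
print([j for j in range(6) if mask[5][j]])   # [1, 3, 5]
```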
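Finally, the "Combine Strategies" idea for KV cache management (keep recent tokens plus high-scoring older ones) can be sketched as below. `keep_tokens`, `recent`, and `top_k` are hypothetical names for illustration, not an API of this repository:

```python
# KV-cache eviction sketch: always keep the last `recent` tokens, plus
# the `top_k` highest-scoring older tokens. Hypothetical helper.

def keep_tokens(tokens, scores, recent=2, top_k=1):
    recent_ids = set(range(max(0, len(tokens) - recent), len(tokens)))
    older = [i for i in range(len(tokens)) if i not in recent_ids]
    # among older tokens, keep the top_k by past attention score
    best_older = sorted(older, key=lambda i: scores[i], reverse=True)[:top_k]
    keep = sorted(recent_ids | set(best_older))
    return [tokens[i] for i in keep]

tokens = ["A", "B", "C", "D", "E"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
# matches the README example: recent D, E plus high-scoring B survive
print(keep_tokens(tokens, scores))   # ['B', 'D', 'E']
```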