https://github.com/kyegomez/differentialtransformer
An open source community implementation of the model from the "DIFFERENTIAL TRANSFORMER" paper by Microsoft.
- Host: GitHub
- URL: https://github.com/kyegomez/differentialtransformer
- Owner: kyegomez
- License: MIT
- Created: 2024-10-12T20:16:59.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-04-19T12:55:12.000Z (6 months ago)
- Last Synced: 2025-04-19T20:16:52.669Z (6 months ago)
- Topics: ai, attention, ml, rnns, ssm, transformers, transformers-library, transformers-models
- Language: Python
- Homepage: https://discord.com/servers/agora-999382051935506503
- Size: 2.16 MB
- Stars: 24
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
# Differential Transformer
An open source community implementation of the model from the "DIFFERENTIAL TRANSFORMER" paper by Microsoft. [Paper Link](https://arxiv.org/abs/2410.05258). "Differential attention takes the difference between two softmax attention functions to eliminate attention noise. The idea is analogous to differential amplifiers [19] proposed in electrical engineering, where the difference between two signals is used as output, so that we can null out the common-mode noise of the input. In addition, the design of noise-canceling headphones is based on a similar idea. We can directly reuse FlashAttention [8] as described in Appendix A, which significantly improves model efficiency."
[Discord](https://discord.gg/agora-999382051935506503) [YouTube](https://www.youtube.com/@kyegomez3242) [LinkedIn](https://www.linkedin.com/in/kye-g-38759a207/) [X](https://x.com/kyegomezb)
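
As a quick illustration of the mechanism described in the quote above, here is a minimal sketch of differential attention applied to already-projected queries, keys, and values. The function name, tensor shapes, and the scalar `lam` are illustrative assumptions for this sketch only; they are not this package's API (the actual model lives in `differential_transformer.main`).

```python
import torch
import torch.nn.functional as F


def differential_attention(q1, k1, q2, k2, v, lam):
    """Difference of two softmax attention maps (illustrative sketch).

    q1, k1, q2, k2: (batch, seq_len, d) query/key projections
    v:              (batch, seq_len, d_v) values
    lam:            scalar weight on the second attention map
    """
    d = q1.shape[-1]
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / d**0.5, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / d**0.5, dim=-1)
    # Subtracting the second map is what cancels the "common-mode" attention noise.
    return (a1 - lam * a2) @ v


# Tiny smoke test with random tensors
q1, k1, q2, k2 = (torch.randn(2, 16, 32) for _ in range(4))
v = torch.randn(2, 16, 32)
out = differential_attention(q1, k1, q2, k2, v, lam=0.8)
print(out.shape)  # torch.Size([2, 16, 32])
```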
## Install
```bash
$ pip3 install differential-transformers
```

## Usage: Transformer
```python
import torch
from differential_transformer.main import DifferentialTransformer
from loguru import logger

# Example dimensions (batch_size and seq_len shown for reference)
batch_size = 32
seq_len = 128
embedding_dim = 64
h = 8
λ = 0.1
λinit = 0.05

# Create a random input tensor of token ids
x = torch.randint(0, 256, (1, 1024))

# Instantiate and run the multi-head attention
multi_head = DifferentialTransformer(heads=h, dim=embedding_dim, λinit=λinit)
output = multi_head(x, λ=λ)

logger.info(f"Output shape: {output.shape}")
```
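
The `λ` and `λinit` arguments weight the second softmax map that gets subtracted. In the paper, λ is re-parameterized from learnable vectors and its initialization follows a fixed per-layer schedule; the snippet below is a sketch of that schedule only, and the `lambda_init` helper is hypothetical rather than part of this package.

```python
import math


def lambda_init(layer_index: int) -> float:
    """Per-layer λ_init schedule from the paper: 0.8 - 0.6 * exp(-0.3 * (l - 1))."""
    return 0.8 - 0.6 * math.exp(-0.3 * (layer_index - 1))


# First four layers: ~[0.2, 0.356, 0.471, 0.556]
print([round(lambda_init(l), 3) for l in range(1, 5)])
```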
# License
MIT

## Citation
```bibtex
@misc{ye2024differentialtransformer,
    title={Differential Transformer},
    author={Tianzhu Ye and Li Dong and Yuqing Xia and Yutao Sun and Yi Zhu and Gao Huang and Furu Wei},
    year={2024},
    eprint={2410.05258},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2410.05258},
}
```