{"id":13564367,"url":"https://github.com/lucidrains/reformer-pytorch","last_synced_at":"2025-05-14T12:08:41.020Z","repository":{"id":38058155,"uuid":"232901618","full_name":"lucidrains/reformer-pytorch","owner":"lucidrains","description":"Reformer, the efficient Transformer, in Pytorch","archived":false,"fork":false,"pushed_at":"2023-06-21T14:17:49.000Z","size":36190,"stargazers_count":2163,"open_issues_count":17,"forks_count":256,"subscribers_count":52,"default_branch":"master","last_synced_at":"2025-04-11T04:57:38.754Z","etag":null,"topics":["artificial-intelligence","attention-mechanism","machine-learning","pytorch","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2020-01-09T20:42:37.000Z","updated_at":"2025-04-02T12:23:53.000Z","dependencies_parsed_at":"2022-07-12T00:30:44.297Z","dependency_job_id":"9b8b088c-0c13-4354-80cf-d8b840d75864","html_url":"https://github.com/lucidrains/reformer-pytorch","commit_stats":{"total_commits":221,"total_committers":11,"mean_commits":20.09090909090909,"dds":0.08597285067873306,"last_synced_commit":"66a19b6caf481a880c97d331306aba55d2ae4b9c"},"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Freformer-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Freformer-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Freformer-pytorch/relea
ses","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Freformer-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/reformer-pytorch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248345273,"owners_count":21088244,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","attention-mechanism","machine-learning","pytorch","transformers"],"created_at":"2024-08-01T13:01:30.274Z","updated_at":"2025-04-11T04:57:48.095Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":["Python","Pytorch \u0026 related libraries｜Pytorch \u0026 相关库","Pytorch \u0026 related libraries","Pytorch实用程序","Language Model"],"sub_categories":["NLP \u0026 Speech Processing｜自然语言处理 \u0026 语音处理:","NLP \u0026 Speech Processing:"],"readme":"## Reformer, the Efficient Transformer, in Pytorch\n[![PyPI version](https://badge.fury.io/py/reformer-pytorch.svg)](https://badge.fury.io/py/reformer-pytorch)\n\n\u003cimg src=\"./lsh_attention.png\" width=\"500\"\u003e\n\nThis is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB\n\nIt includes LSH attention, reversible network, and chunking. 
It has been validated with an auto-regressive task (enwik8).\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1am1DRl80Kd3o6n_4u3MomPzYS0NfdHAC) 32k tokens\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1awNgXYtjvUeXl1gS-v1iyDXTJJ-fyJIK) 81k tokens with half precision\n\n## Install\n\n```bash\n$ pip install reformer_pytorch\n```\n\n## Usage\n\nA simple Reformer language model\n\n```python\n# should fit in ~ 5gb - 8k tokens\n\nimport torch\nfrom reformer_pytorch import ReformerLM\n\nmodel = ReformerLM(\n    num_tokens= 20000,\n    dim = 1024,\n    depth = 12,\n    max_seq_len = 8192,\n    heads = 8,\n    lsh_dropout = 0.1,\n    ff_dropout = 0.1,\n    post_attn_dropout = 0.1,\n    layer_dropout = 0.1,  # layer dropout from 'Reducing Transformer Depth on Demand' paper\n    causal = True,        # auto-regressive or not\n    bucket_size = 64,     # average size of qk per bucket, 64 was recommended in paper\n    n_hashes = 4,         # 4 is permissible per author, 8 is the best but slower\n    emb_dim = 128,        # embedding factorization for further memory savings\n    dim_head = 64,        # be able to fix the dimension of each head, making it independent of the embedding dimension and the number of heads\n    ff_chunks = 200,      # number of chunks for feedforward layer, make higher if there are memory issues\n    attn_chunks = 8,      # process lsh attention in chunks, only way for memory to fit when scaling to 16k tokens\n    num_mem_kv = 128,       # persistent learned memory key values, from all-attention paper\n    full_attn_thres = 1024, # use full attention if context length is less than set value\n    reverse_thres = 1024,   # turn off reversibility for 2x speed for sequence lengths shorter or equal to the designated value\n    use_scale_norm = False,  # use scale norm from 'Transformers without tears' paper\n    
use_rezero = False,      # remove normalization and use rezero from 'ReZero is All You Need'\n    one_value_head = False,  # use one set of values for all heads from 'One Write-Head Is All You Need'\n    weight_tie = False,           # tie parameters of each layer for no memory per additional depth\n    weight_tie_embedding = False, # use token embedding for projection of output, some papers report better results\n    n_local_attn_heads = 2,       # many papers suggest mixing local attention heads aids specialization and improves on certain tasks\n    pkm_layers = (4,7),           # specify layers to use product key memory. paper shows 1 or 2 modules near the middle of the transformer is best\n    pkm_num_keys = 128,           # defaults to 128, but can be increased to 256 or 512 as memory allows\n    use_full_attn = False    # only turn on this flag to override and turn on full attention for all sequence lengths. for comparison with LSH to show that it is working\n).cuda()\n\nx = torch.randint(0, 20000, (1, 8192)).long().cuda()\ny = model(x) # (1, 8192, 20000)\n```\n\nThe Reformer (just a stack of reversible LSH attention)\n\n```python\n# should fit in ~ 5gb - 8k embeddings\n\nimport torch\nfrom reformer_pytorch import Reformer\n\nmodel = Reformer(\n    dim = 512,\n    depth = 12,\n    heads = 8,\n    lsh_dropout = 0.1,\n    causal = True\n).cuda()\n\nx = torch.randn(1, 8192, 512).cuda()\ny = model(x) # (1, 8192, 512)\n```\n\nSelf Attention with LSH\n\n```python\nimport torch\nfrom reformer_pytorch import LSHSelfAttention\n\nattn = LSHSelfAttention(\n    dim = 128,\n    heads = 8,\n    bucket_size = 64,\n    n_hashes = 8,\n    causal = False\n)\n\nx = torch.randn(10, 1024, 128)\ny = attn(x) # (10, 1024, 128)\n```\n\nLSH (locality sensitive hashing) Attention\n\n```python\nimport torch\nfrom reformer_pytorch import LSHAttention\n\nattn = LSHAttention(\n    bucket_size = 64,\n    n_hashes = 16,\n    causal = True\n)\n\nqk = torch.randn(10, 1024, 128)\nv = 
torch.randn(10, 1024, 128)\n\nout, attn, buckets = attn(qk, v) # (10, 1024, 128)\n# attn contains the unsorted attention weights, provided return_attn is set to True (costly otherwise)\n# buckets will contain the bucket number (post-argmax) of each token of each batch\n```\n\n## Masking\n\nThis repository supports masks on the input sequence `input_mask (b x i_seq)`, the context sequence `context_mask (b x c_seq)`, as well as the rarely used full attention matrix itself `input_attn_mask (b x i_seq x i_seq)`, all made compatible with LSH attention. Masks are made of booleans where `False` denotes masking out prior to the softmax.\n\nThe causal triangular mask is all taken care of for you if you set `causal = True`.\n\n```python\nimport torch\nfrom reformer_pytorch import ReformerLM\n\nCONTEXT_LEN = 512\nSEQ_LEN = 8192\n\nmodel = ReformerLM(\n    num_tokens= 20000,\n    dim = 1024,\n    depth = 1,\n    max_seq_len = SEQ_LEN,\n    ff_chunks = 8,\n    causal = True\n)\n\nc = torch.randn(1, CONTEXT_LEN, 1024)\nx = torch.randint(0, 20000, (1, SEQ_LEN)).long()\n\ni_mask = torch.ones(1, SEQ_LEN).bool()\nc_mask = torch.ones(1, CONTEXT_LEN).bool()\n\ny = model(x, keys = c, input_mask = i_mask, context_mask = c_mask)\n# masking done correctly in LSH attention\n```\n\n## Positional Embeddings\n\nThe default positional embedding uses \u003ca href=\"https://arxiv.org/abs/2104.09864\"\u003erotary embeddings\u003c/a\u003e.\n\nHowever, \u003ca href=\"https://github.com/AranKomat\"\u003eAran\u003c/a\u003e has informed me that the Reformer team used axial position embeddings with great results on longer sequences.\n\nYou can turn on axial positional embedding and adjust the shape and dimension of the axial embeddings by following the instructions below.\n\n```python\nimport torch\nfrom reformer_pytorch import ReformerLM\n\nmodel = ReformerLM(\n    num_tokens= 20000,\n    dim = 1024,\n    depth = 12,\n    max_seq_len = 8192,\n    ff_chunks = 8,\n    attn_chunks = 2,\n    causal = 
True,\n    axial_position_emb = True,         # set this to True\n    axial_position_shape = (128, 64),  # the shape must multiply up to the max_seq_len (128 x 64 = 8192)\n)\n\nx = torch.randint(0, 20000, (1, 8192)).long()\ny = model(x) # (1, 8192, 20000)\n```\n\nIf you would rather use absolute positional embeddings, you can turn it on with `absolute_position_emb = True` flag on initialization.\n\n## Training\n\nSince version `0.17.0`, and some corrections to the reversible network, Reformer Pytorch is compatible with Microsoft's Deepspeed! If you have multiple local GPUs, you can follow the instructions / example \u003ca href=\"https://github.com/lucidrains/reformer-pytorch/tree/master/examples/enwik8_deepspeed\"\u003ehere\u003c/a\u003e.\n\n## Examples\n\nA full Reformer sequence → sequence, say translation\n\n```python\nimport torch\nfrom reformer_pytorch import ReformerLM\n\nDE_SEQ_LEN = 4096\nEN_SEQ_LEN = 4096\n\nencoder = ReformerLM(\n    num_tokens = 20000,\n    emb_dim = 128,\n    dim = 1024,\n    depth = 12,\n    heads = 8,\n    max_seq_len = DE_SEQ_LEN,\n    fixed_position_emb = True,\n    return_embeddings = True # return output of last attention layer\n).cuda()\n\ndecoder = ReformerLM(\n    num_tokens = 20000,\n    emb_dim = 128,\n    dim = 1024,\n    depth = 12,\n    heads = 8,\n    max_seq_len = EN_SEQ_LEN,\n    fixed_position_emb = True,\n    causal = True\n).cuda()\n\nx  = torch.randint(0, 20000, (1, DE_SEQ_LEN)).long().cuda()\nyi = torch.randint(0, 20000, (1, EN_SEQ_LEN)).long().cuda()\n\nenc_keys = encoder(x)               # (1, 4096, 1024)\nyo = decoder(yi, keys = enc_keys)   # (1, 4096, 20000)\n```\n\nA full Reformer image → caption\n\n```python\nimport torch\nfrom torch.nn import Sequential\nfrom torchvision import models\nfrom reformer_pytorch import Reformer, ReformerLM\n\nresnet = models.resnet50(pretrained=True)\nresnet = Sequential(*list(resnet.children())[:-4])\n\nSEQ_LEN = 4096\n\nencoder = Reformer(\n    dim = 512,\n    depth = 6,\n    
heads = 8,\n    max_seq_len = 4096\n)\n\ndecoder = ReformerLM(\n    num_tokens = 20000,\n    dim = 512,\n    depth = 6,\n    heads = 8,\n    max_seq_len = SEQ_LEN,\n    causal = True\n)\n\nx  = torch.randn(1, 3, 512, 512)\nyi = torch.randint(0, 20000, (1, SEQ_LEN)).long()\n\nvisual_emb = resnet(x)\nb, c, h, w = visual_emb.shape\nvisual_emb = visual_emb.view(1, c, h * w).transpose(1, 2) # nchw to nte\n\nenc_keys = encoder(visual_emb)\nyo = decoder(yi, keys = enc_keys) # (1, 4096, 20000)\n```\n\n## Reformer Encoder Decoder Architecture\n\n**There is a bug in versions \u003c `0.21.0`. Please upgrade to at least the version specified for the working encoder / decoder Reformer.**\n\nBy popular demand, I have coded up a wrapper that removes a lot of the manual work in writing up a generic Reformer encoder / decoder architecture. To use it, import the `ReformerEncDec` class. Encoder keyword arguments are passed with an `enc_` prefix and decoder keyword arguments with a `dec_` prefix. The model dimension (`dim`) takes no prefix and is shared between encoder and decoder. 
The framework will also take care of passing the encoder input mask to the decoder context mask, unless explicitly overridden.\n\n```python\nimport torch\nfrom reformer_pytorch import ReformerEncDec\n\nDE_SEQ_LEN = 4096\nEN_SEQ_LEN = 4096\n\nenc_dec = ReformerEncDec(\n    dim = 512,\n    enc_num_tokens = 20000,\n    enc_depth = 6,\n    enc_max_seq_len = DE_SEQ_LEN,\n    dec_num_tokens = 20000,\n    dec_depth = 6,\n    dec_max_seq_len = EN_SEQ_LEN\n).cuda()\n\ntrain_seq_in = torch.randint(0, 20000, (1, DE_SEQ_LEN)).long().cuda()\ntrain_seq_out = torch.randint(0, 20000, (1, EN_SEQ_LEN)).long().cuda()\ninput_mask = torch.ones(1, DE_SEQ_LEN).bool().cuda()\n\nloss = enc_dec(train_seq_in, train_seq_out, return_loss = True, enc_input_mask = input_mask)\nloss.backward()\n# learn\n\n# evaluate with the following\neval_seq_in = torch.randint(0, 20000, (1, DE_SEQ_LEN)).long().cuda()\neval_seq_out_start = torch.tensor([[0.]]).long().cuda() # assume 0 is id of start token\nsamples = enc_dec.generate(eval_seq_in, eval_seq_out_start, seq_len = EN_SEQ_LEN, eos_token = 1) # assume 1 is id of stop token\nprint(samples.shape) # (1, \u003c= 4096) decode the tokens\n```\n\n## Product Key Memory\n\nTo see the benefits of using PKM, the learning rate of the values must be set higher than that of the rest of the parameters. (The recommended value is `1e-2`.)\n\nYou can follow the instructions here to set it correctly: https://github.com/lucidrains/product-key-memory#learning-rates\n\n## Customizing Feedforward\n\nBy default, the activation function is `GELU`. 
If you would like an alternative activation function, you can pass in the class to the keyword `ff_activation`.\n\n```python\nimport torch\nfrom reformer_pytorch import ReformerLM\nfrom torch import nn\n\nmodel = ReformerLM(\n    num_tokens= 20000,\n    dim = 512,\n    depth = 6,\n    max_seq_len = 8192,\n    ff_chunks = 8,\n    ff_dropout = 0.1,\n    ff_mult = 6,\n    ff_activation = nn.LeakyReLU,\n    ff_glu = True # use GLU in feedforward, from paper 'GLU Variants Improve Transformer'\n)\n\nx = torch.randint(0, 20000, (1, 8192)).long()\ny = model(x) # (1, 8192, 20000)\n```\n\n## Research\n\nTo access the attention weights and bucket distribution, simply wrap the instantiated model with the `Recorder` wrapper class.\n\n```python\nimport torch\nfrom reformer_pytorch import Reformer, Recorder\n\nmodel = Reformer(\n    dim = 512,\n    depth = 12,\n    max_seq_len = 8192,\n    heads = 8,\n    lsh_dropout = 0.1,\n    causal = True\n).cuda()\n\nmodel = Recorder(model)\n\nx = torch.randn(1, 8192, 512).cuda()\ny = model(x)\n\nmodel.recordings[0] # a list of attention weights and buckets for the first forward pass\n\nmodel.turn_off() # stop recording\nmodel.turn_on() # start recording\nmodel.clear() # clear the recordings\n\nmodel = model.eject() # recover the original model and remove all listeners\n```\n\n## Additional Helpers\n\nReformer comes with a slight drawback that the sequence must be neatly divisible by the bucket size * 2. 
I have provided a small helper tool that can help you auto-round the sequence length to the next valid multiple.\n\n```python\nimport torch\nfrom reformer_pytorch import ReformerLM, Autopadder\n\nmodel = ReformerLM(\n    num_tokens= 20000,\n    dim = 1024,\n    depth = 12,\n    max_seq_len = 8192,\n    heads = 8,\n    lsh_dropout = 0.1,\n    causal = True,\n    bucket_size = 63,   # odd bucket size\n    num_mem_kv = 77     # odd memory key length\n).cuda()\n\nmodel = Autopadder(model)\n\nSEQ_LEN = 7777 # odd sequence length\nkeys = torch.randn(1, 137, 1024).cuda() # odd keys length, on the same device as the model\n\nx = torch.randint(0, 20000, (1, SEQ_LEN)).long().cuda()\ny = model(x, keys = keys) # (1, 7777, 20000)\n```\n\n## Helpers for training auto-regressive models\n\nA lot of users are only interested in an auto-regressive language model (like GPT-2). Here is a training wrapper to make it easy to both train and evaluate on sequences of encoded tokens of arbitrary length. You will have to take care of the encoding and decoding yourself.\n\n```python\nimport torch\nfrom torch import randint\n\nfrom reformer_pytorch import ReformerLM\nfrom reformer_pytorch.generative_tools import TrainingWrapper\n\nmodel = ReformerLM(\n    num_tokens= 20000,\n    dim = 1024,\n    depth = 12,\n    max_seq_len = 4096,\n    lsh_dropout = 0.1,\n    causal = True,\n    full_attn_thres = 1024\n)\n\n# 0 is used for padding, and no loss is calculated on it\nmodel = TrainingWrapper(model, ignore_index = 0, pad_value = 0)\n\n# the wrapper can handle evenly packed sequences\nx_train = randint(0, 20000, (3, 357))\n\n# or if you have a list of uneven sequences, it will be padded for you\nx_train = [\n    randint(0, 20000, (120,)),\n    randint(0, 20000, (253,)),\n    randint(0, 20000, (846,))\n]\n\n# when training, set return_loss equal to True\nmodel.train()\nloss = model(x_train, return_loss = True)\nloss.backward()\n\n# when evaluating, just use the generate function, which will default to top_k sampling with temperature of 
1.\ninitial = torch.tensor([[0]]).long() # assume 0 is start token\nsample = model.generate(initial, 100, temperature=1., filter_thres = 0.9, eos_token = 1) # assume end token is 1, or omit and it will sample up to 100\nprint(sample.shape) # (1, \u003c=100) token ids\n```\n\n\n## Issues\n\n\u003ca href=\"https://github.com/andreabac3\"\u003eAndrea\u003c/a\u003e has uncovered that using O2 optimization level when training with mixed precision can lead to instability. Please use O1 instead, which can be set with the `amp_level` in Pytorch Lightning, or `opt_level` in Nvidia's Apex library.\n\n## Alternatives\n\n1. Routing Transformer - https://github.com/lucidrains/routing-transformer\n2. Sinkhorn Transformer - https://github.com/lucidrains/sinkhorn-transformer\n3. Performer - https://github.com/lucidrains/performer-pytorch\n4. Linear Transformer - https://github.com/lucidrains/linear-attention-transformer/\n5. Compressive Transformer - https://github.com/lucidrains/compressive-transformer-pytorch\n\n## Citations\n```bibtex\n@inproceedings{kitaev2020reformer,\n    title       = {Reformer: The Efficient Transformer},\n    author      = {Nikita Kitaev and Lukasz Kaiser and Anselm Levskaya},\n    booktitle   = {International Conference on Learning Representations},\n    year        = {2020},\n    url         = {https://openreview.net/forum?id=rkgNKkHtvB}\n}\n```\n\n```bibtex\n@article{DBLP:journals/corr/abs-1907-01470,\n    author    = {Sainbayar Sukhbaatar and\n               Edouard Grave and\n               Guillaume Lample and\n               Herv{\\'{e}} J{\\'{e}}gou and\n               Armand Joulin},\n    title     = {Augmenting Self-attention with Persistent Memory},\n    journal   = {CoRR},\n    volume    = {abs/1907.01470},\n    year      = {2019},\n    url       = {http://arxiv.org/abs/1907.01470}\n}\n```\n\n```bibtex\n@article{1910.05895,\n    author  = {Toan Q. 
Nguyen and Julian Salazar},\n    title   = {Transformers without Tears: Improving the Normalization of Self-Attention},\n    year    = {2019},\n    eprint  = {arXiv:1910.05895},\n    doi     = {10.5281/zenodo.3525484},\n}\n```\n\n```bibtex\n@inproceedings{fan2020reducing,\n    title     = {Reducing Transformer Depth on Demand with Structured Dropout},\n    author    = {Angela Fan and Edouard Grave and Armand Joulin},\n    booktitle = {International Conference on Learning Representations},\n    year      = {2020},\n    url       = {https://openreview.net/forum?id=SylO2yStDr}\n}\n```\n\n```bibtex\n@article{Shazeer2019FastTD,\n    title   = {Fast Transformer Decoding: One Write-Head is All You Need},\n    author  = {Noam Shazeer},\n    journal = {ArXiv},\n    year    = {2019},\n    volume  = {abs/1911.02150}\n}\n```\n\n```bibtex\n@misc{shazeer2020glu,\n    title   = {GLU Variants Improve Transformer},\n    author  = {Noam Shazeer},\n    year    = {2020},\n    url     = {https://arxiv.org/abs/2002.05202}    \n}\n```\n\n```bibtex\n@misc{roy*2020efficient,\n    title   = {Efficient Content-Based Sparse Attention with Routing Transformers},\n    author  = {Aurko Roy* and Mohammad Taghi Saffar* and David Grangier and Ashish Vaswani},\n    year    = {2020},\n    url     = {https://openreview.net/forum?id=B1gjs6EtDr}\n}\n```\n\n```bibtex\n@misc{bachlechner2020rezero,\n    title   = {ReZero is All You Need: Fast Convergence at Large Depth},\n    author  = {Thomas Bachlechner and Bodhisattwa Prasad Majumder and Huanru Henry Mao and Garrison W. 
Cottrell and Julian McAuley},\n    year    = {2020},\n    url     = {https://arxiv.org/abs/2003.04887}\n}\n```\n\n```bibtex\n@misc{lample2019large,\n    title   = {Large Memory Layers with Product Keys},\n    author  = {Guillaume Lample and Alexandre Sablayrolles and Marc'Aurelio Ranzato and Ludovic Denoyer and Hervé Jégou},\n    year    = {2019},\n    eprint  = {1907.05242},\n    archivePrefix = {arXiv}\n}\n```\n\n```bibtex\n@misc{bhojanapalli2020lowrank,\n    title   = {Low-Rank Bottleneck in Multi-head Attention Models},\n    author  = {Srinadh Bhojanapalli and Chulhee Yun and Ankit Singh Rawat and Sashank J. Reddi and Sanjiv Kumar},\n    year    = {2020},\n    eprint  = {2002.07028}\n}\n```\n\n```bibtex\n@misc{dong2021attention,\n    title   = {Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth}, \n    author  = {Yihe Dong and Jean-Baptiste Cordonnier and Andreas Loukas},\n    year    = {2021},\n    eprint  = {2103.03404}\n}\n```\n\n```bibtex\n@misc{su2021roformer,\n    title   = {RoFormer: Enhanced Transformer with Rotary Position Embedding},\n    author  = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},\n    year    = {2021},\n    eprint  = {2104.09864},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n```bibtex\n@misc{vaswani2017attention,\n    title   = {Attention Is All You Need},\n    author  = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. 
Gomez and Lukasz Kaiser and Illia Polosukhin},\n    year    = {2017},\n    eprint  = {1706.03762},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL}\n}\n```\n\n[♥](https://www.youtube.com/watch?v=GUo2XuqMcCU)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Freformer-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Freformer-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Freformer-pytorch/lists"}