{"id":15601063,"url":"https://github.com/lucidrains/compressive-transformer-pytorch","last_synced_at":"2025-04-09T23:15:04.215Z","repository":{"id":62564258,"uuid":"274760510","full_name":"lucidrains/compressive-transformer-pytorch","owner":"lucidrains","description":"Pytorch implementation of Compressive Transformers, from Deepmind","archived":false,"fork":false,"pushed_at":"2021-10-04T05:28:35.000Z","size":35788,"stargazers_count":157,"open_issues_count":3,"forks_count":21,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-04-09T23:14:57.829Z","etag":null,"topics":["artificial-intelligence","attention-mechanism","deep-learning","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucidrains.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-06-24T20:13:02.000Z","updated_at":"2025-04-02T16:07:09.000Z","dependencies_parsed_at":"2022-11-03T16:32:26.632Z","dependency_job_id":null,"html_url":"https://github.com/lucidrains/compressive-transformer-pytorch","commit_stats":null,"previous_names":[],"tags_count":36,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcompressive-transformer-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcompressive-transformer-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcompressive-transformer-pytorch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucidrains%2Fcompressive-transformer-pytorc
h/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucidrains","download_url":"https://codeload.github.com/lucidrains/compressive-transformer-pytorch/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248125591,"owners_count":21051770,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","attention-mechanism","deep-learning","transformer"],"created_at":"2024-10-03T02:13:33.408Z","updated_at":"2025-04-09T23:15:04.092Z","avatar_url":"https://github.com/lucidrains.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"./memory.png\"\u003e\u003c/img\u003e\n\n## Compressive Transformer in Pytorch\n\nPytorch implementation of \u003ca href=\"https://openreview.net/forum?id=SylKikSYDH\"\u003eCompressive Transformers\u003c/a\u003e, a variant of Transformer-XL with compressed memory for long-range language modelling. I will also combine this with an idea from \u003ca href=\"https://arxiv.org/abs/1910.06764\"\u003eanother paper\u003c/a\u003e that adds gating at the residual intersection. 
The memory and the gating may be synergistic, and lead to further improvements in both language modeling as well as reinforcement learning.\n\n[![PyPI version](https://badge.fury.io/py/compressive-transformer-pytorch.svg)](https://badge.fury.io/py/compressive-transformer-pytorch)\n\n## Install\n\n```bash\n$ pip install compressive_transformer_pytorch\n```\n\n## Usage\n\n```python\nimport torch\nfrom compressive_transformer_pytorch import CompressiveTransformer\n\nmodel = CompressiveTransformer(\n    num_tokens = 20000,\n    emb_dim = 128,                 # embedding dimensions, embedding factorization from Albert paper\n    dim = 512,\n    depth = 12,\n    seq_len = 1024,\n    mem_len = 1024,                # memory length\n    cmem_len = 1024 // 4,          # compressed memory buffer length\n    cmem_ratio = 4,                # compressed memory ratio, 4 was recommended in paper\n    reconstruction_loss_weight = 1,# weight to place on compressed memory reconstruction loss\n    attn_dropout = 0.1,            # dropout post-attention\n    ff_dropout = 0.1,              # dropout in feedforward\n    attn_layer_dropout = 0.1,      # dropout for attention layer output\n    gru_gated_residual = True,     # whether to gate the residual intersection, from 'Stabilizing Transformer for RL' paper\n    mogrify_gru = False,           # experimental feature that adds a mogrifier for the update and residual before gating by the GRU\n    memory_layers = range(6, 13),  # specify which layers to use long-range memory, from 'Do Transformers Need LR Memory' paper\n    ff_glu = True                  # use GLU variant for feedforward\n)\n\ninputs = torch.randint(0, 256, (1, 2048))\nmasks = torch.ones_like(inputs).bool()\n\nsegments = inputs.reshape(1, -1, 1024).transpose(0, 1)\nmasks = masks.reshape(1, -1, 1024).transpose(0, 1)\n\nlogits, memories, aux_loss = model(segments[0], mask = masks[0])\nlogits,        _, aux_loss = model(segments[1], mask = masks[1], memories = memories)\n\n# 
memories is a named tuple that contains the memory (mem) and the compressed memory (cmem)\n```\n\nWhen training, you can use the `AutoregressiveWrapper` to have memory management across segments taken care of for you. As easy as it gets.\n\n```python\nimport torch\nfrom compressive_transformer_pytorch import CompressiveTransformer\nfrom compressive_transformer_pytorch import AutoregressiveWrapper\n\nmodel = CompressiveTransformer(\n    num_tokens = 20000,\n    dim = 512,\n    depth = 6,\n    seq_len = 1024,\n    mem_len = 1024,\n    cmem_len = 256,\n    cmem_ratio = 4,\n    memory_layers = [5,6]\n).cuda()\n\nmodel = AutoregressiveWrapper(model)\n\ninputs = torch.randint(0, 20000, (1, 2048 + 1)).cuda()\n\nfor loss, aux_loss, _ in model(inputs, return_loss = True):\n    (loss + aux_loss).backward()\n    # optimizer step and zero grad\n\n# ... after much training ...\n\n# generation is also greatly simplified and automated away\n# just pass in the prime, which can be 1 start token or any length\n# all is taken care of for you\n\nprime = torch.ones(1, 1).long().cuda()  # assume 1 is start token; token ids must be an integer (long) tensor\nsample = model.generate(prime, 4096)\n```\n\n\n## Citations\n\n```bibtex\n@misc{rae2019compressive,\n    title   = {Compressive Transformers for Long-Range Sequence Modelling},\n    author  = {Jack W. Rae and Anna Potapenko and Siddhant M. Jayakumar and Timothy P. Lillicrap},\n    year    = {2019},\n    eprint  = {1911.05507},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@misc{parisotto2019stabilizing,\n    title   = {Stabilizing Transformers for Reinforcement Learning},\n    author  = {Emilio Parisotto and H. Francis Song and Jack W. Rae and Razvan Pascanu and Caglar Gulcehre and Siddhant M. Jayakumar and Max Jaderberg and Raphael Lopez Kaufman and Aidan Clark and Seb Noury and Matthew M. 
Botvinick and Nicolas Heess and Raia Hadsell},\n    year    = {2019},\n    eprint  = {1910.06764},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG}\n}\n```\n\n```bibtex\n@inproceedings{rae-razavi-2020-transformers,\n    title   = \"Do Transformers Need Deep Long-Range Memory?\",\n    author  = \"Rae, Jack  and\n      Razavi, Ali\",\n    booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics\",\n    month   = jul,\n    year    = \"2020\",\n    address = \"Online\",\n    publisher = \"Association for Computational Linguistics\",\n    url     = \"https://www.aclweb.org/anthology/2020.acl-main.672\"\n}\n```\n\n```bibtex\n@article{Shazeer2019FastTD,\n    title   = {Fast Transformer Decoding: One Write-Head is All You Need},\n    author  = {Noam Shazeer},\n    journal = {ArXiv},\n    year    = {2019},\n    volume  = {abs/1911.02150}\n}\n```\n\n```bibtex\n@misc{shazeer2020glu,\n    title   = {GLU Variants Improve Transformer},\n    author  = {Noam Shazeer},\n    year    = {2020},\n    url     = {https://arxiv.org/abs/2002.05202}\n}\n```\n\n```bibtex\n@misc{lan2019albert,\n    title       = {ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},\n    author      = {Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},\n    year        = {2019},\n    url         = {https://arxiv.org/abs/1909.11942}\n}\n```\n\n```bibtex\n@misc{ding2021erniedoc,\n    title   = {ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},\n    author  = {Siyu Ding and Junyuan Shang and Shuohuan Wang and Yu Sun and Hao Tian and Hua Wu and Haifeng Wang},\n    year    = {2021},\n    eprint  = {2012.15688},\n    archivePrefix = {arXiv},\n    primaryClass = 
{cs.CL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fcompressive-transformer-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucidrains%2Fcompressive-transformer-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucidrains%2Fcompressive-transformer-pytorch/lists"}