{"id":16677644,"url":"https://github.com/shjwudp/megabyte","last_synced_at":"2025-07-15T18:39:52.918Z","repository":{"id":171669493,"uuid":"648239428","full_name":"shjwudp/megabyte","owner":"shjwudp","description":"A PyTorch implementation of MEGABYTE. This multi-scale transformer architecture has the excellent features of tokenization-free and sub-quadratic attention. The paper link: https://arxiv.org/abs/2305.07185","archived":false,"fork":false,"pushed_at":"2024-02-06T06:36:30.000Z","size":63,"stargazers_count":4,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-19T01:54:04.001Z","etag":null,"topics":["deep-learning","language-model","sub-quadratic-attention","tokenization-free"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shjwudp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-01T14:08:51.000Z","updated_at":"2024-11-01T11:29:42.000Z","dependencies_parsed_at":"2024-02-06T07:44:35.813Z","dependency_job_id":null,"html_url":"https://github.com/shjwudp/megabyte","commit_stats":null,"previous_names":["shjwudp/megabyte"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fmegabyte","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fmegabyte/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fmegabyte/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/shjwudp%2Fmegabyte/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shjwudp","download_url":"https://codeload.github.com/shjwudp/megabyte/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227615436,"owners_count":17794076,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","language-model","sub-quadratic-attention","tokenization-free"],"created_at":"2024-10-12T13:27:07.126Z","updated_at":"2024-12-02T18:45:44.252Z","avatar_url":"https://github.com/shjwudp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Megabyte\r\n\r\nThis repository implements [MEGABYTE](https://arxiv.org/abs/2305.07185) in PyTorch and explores best practices for the Megabyte architecture. 
The original architecture described in the paper is implemented in [megabyte.py](./model/megabyte.py), and the best practices are implemented in [megabyte_in_action.py](./model/megabyte_in_action.py).\r\n\r\nMegabyte is an architecture that overcomes the performance drawbacks of end-to-end training on raw bytes and makes tokenization-free autoregressive sequence modeling possible.\r\n\r\n## Megabyte in autoregressive training\r\n\r\n```python\r\nimport torch\r\nimport torch.nn.functional as F\r\nfrom einops import rearrange\r\nfrom model import MegabyteConfig, Megabyte\r\n\r\nV = 512         # vocabulary size: 256 byte values plus 256 ids reserved for special tokens\r\nP = 4           # patch size\r\nD_G = 512       # global model dimension\r\nD_L = 128       # local model dimension\r\nT = 1024        # sequence length\r\nB = 2           # batch size\r\nK = T//P        # number of patches\r\nPAD_ID = 257    # padding token id\r\nEOS_ID = 258    # end of sequence token id\r\n\r\nconfig = MegabyteConfig(\r\n    V=V,\r\n    P=P,\r\n    D_G=D_G,\r\n    D_L=D_L,\r\n    T_MAX=T,\r\n    initializer_range=0.02, # parameter initialization value range\r\n    g_nlayers=4,            # number of global model layers\r\n    g_nheads=32,            # number of global model attention heads\r\n    l_nlayers=2,            # number of local model attention layers\r\n    l_nheads=2,             # number of local model attention heads\r\n    pad_id=PAD_ID,\r\n    eos_id=EOS_ID,\r\n)\r\nmegabyte = Megabyte(config)\r\n# Random byte ids in [0, 255]; the upper bound of torch.randint is exclusive.\r\ninput_ids = torch.randint(0, 256, (B, T))\r\n# Autoregressive training: Megabyte learns to predict the next token, with input_ids[:, :-1] as inputs and input_ids[:, :] as labels.\r\nloss = megabyte(input_ids, return_loss=True).loss\r\nloss.backward()\r\n\r\nprint(loss.norm())\r\n```\r\n\r\n## Megabyte in generation\r\n\r\n```python\r\n...\r\nfrom model.megabyte_transformers import MegabyteLMHeadModel, MegabyteTokenizer\r\nlm_head_megabyte = 
MegabyteLMHeadModel.from_native_megabyte(megabyte)\r\ntokenizer = MegabyteTokenizer(\r\n    eos_token_id=lm_head_megabyte.config.eos_token_id,\r\n)\r\n\r\ninputs = tokenizer(\"Today is\", return_tensors=\"pt\")\r\noutputs = lm_head_megabyte.generate(\r\n    **inputs,\r\n    max_new_tokens=5,\r\n    return_dict_in_generate=True,\r\n    output_scores=True,\r\n)\r\n\r\ntexts = tokenizer.decode(outputs.sequences)\r\nprint(texts)\r\n```\r\n\r\n## Benchmark\r\n\r\nUse the [benchmark.py](https://github.com/shjwudp/megabyte/blob/main/benchmark.py) script to measure Megabyte's performance. The following table compares training Megabyte and GPT2 on wikitext-103-v1 at a comparable parameter scale.\r\n\r\n| model                   | # of parameters (M) | training speed (KB/s) | GPU Memory Allocated % | eval loss ↓ | eval loss bpc ↓ |\r\n| :---------------------- | :-------------- | :-------------------- | :--------------------- | :-------- | :------------ |\r\n| gpt2                    | 119       | 143.68                | 42.97                  | 5.06      | 1.10          |\r\n| megabyte(P=8)           | 126       | 189.13                | 17.62                  | 1.13      | 1.13          |\r\n| megabyte_in_action(P=8) | 126       | 197.47                | 18.69                  | 1.09      | 1.09          |\r\n\r\n## Citation\r\n\r\n```text\r\n@misc{yu2023megabyte,\r\n      title={MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers}, \r\n      author={Lili Yu and Dániel Simig and Colin Flaherty and Armen Aghajanyan and Luke Zettlemoyer and Mike Lewis},\r\n      year={2023},\r\n      eprint={2305.07185},\r\n      archivePrefix={arXiv},\r\n      
primaryClass={cs.LG}\r\n}\r\n```\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshjwudp%2Fmegabyte","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshjwudp%2Fmegabyte","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshjwudp%2Fmegabyte/lists"}