{"id":22114291,"url":"https://github.com/erogol/blagpt","last_synced_at":"2025-07-07T00:06:21.658Z","repository":{"id":264720504,"uuid":"875833958","full_name":"erogol/BlaGPT","owner":"erogol","description":"Experimental playground for benchmarking language model (LM) architectures, layers, and tricks on smaller datasets. Designed for flexible experimentation and exploration.","archived":false,"fork":false,"pushed_at":"2025-06-24T01:46:25.000Z","size":638,"stargazers_count":59,"open_issues_count":2,"forks_count":5,"subscribers_count":7,"default_branch":"main","last_synced_at":"2025-07-03T12:59:58.468Z","etag":null,"topics":["attention-mechanisms","deep-learning","diffusion-llm","dllm","gpt","gpt-2","hymba","large-language-models","llm","llm-training","machine-learning","position-embedding","pytorch","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/erogol.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-20T23:51:34.000Z","updated_at":"2025-07-01T00:32:12.000Z","dependencies_parsed_at":null,"dependency_job_id":"a53d9dea-d1ce-4fb6-a098-a00a89555a07","html_url":"https://github.com/erogol/BlaGPT","commit_stats":null,"previous_names":["erogol/blagpt"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/erogol/BlaGPT","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erogol%2FBlaGPT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erogol%2FBlaGPT/tags","releases_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erogol%2FBlaGPT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erogol%2FBlaGPT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/erogol","download_url":"https://codeload.github.com/erogol/BlaGPT/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/erogol%2FBlaGPT/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263991444,"owners_count":23540665,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-mechanisms","deep-learning","diffusion-llm","dllm","gpt","gpt-2","hymba","large-language-models","llm","llm-training","machine-learning","position-embedding","pytorch","transformers"],"created_at":"2024-12-01T11:10:41.098Z","updated_at":"2025-07-07T00:06:21.652Z","avatar_url":"https://github.com/erogol.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BlaGPT\n\nExperimental playground for benchmarking language model (LM) architectures, layers, and tricks on smaller datasets. 
Designed for flexible experimentation and exploration.\n\n## BlaGPT Model\nBlaGPT is a flexible Transformer implementation in which the following features can be turned on or off in the config.\n\nMulti-token prediction - [link](https://arxiv.org/pdf/2404.19737)\n\nWeight tying - [link](https://arxiv.org/abs/1608.05859v3)\n\nGrouped query attention - [link](https://arxiv.org/pdf/2305.13245)\n\nCapping logits - [link](https://arxiv.org/pdf/2408.00118)\n\nQKV bias - [link](https://arxiv.org/abs/2407.10671)\n\nZero-init projection layer - [link](https://arxiv.org/abs/2407.10671)\n\nPost and pre-RMSNorm - [link](https://arxiv.org/pdf/2408.00118)\n\nSetting base theta to 1_000_000 - [llama3](https://github.com/meta-llama/llama3/blob/main/llama/model.py#L49) - increased the final validation loss - best `3.3324`\n\nZ-loss regularization - [link](https://arxiv.org/pdf/2309.14322) - increased the final validation loss by 0.02 - loss: `3.3527`\n\nKV-Shifting attention - [link](https://arxiv.org/abs/2411.19574) - seems to improve performance - loss: `3.3310` -\u003e `3.3138` - peak memory consumption: `42858 MiB`\n\nDilated Attention (LongNet) - [link](https://arxiv.org/pdf/2307.02486)\n\nMulti-Head Latent Attention - [link](https://arxiv.org/abs/2502.07864) - loss: `3.3479` - peak memory consumption: `42192 MiB`\n\nPer token output bias - [link]() - loss: `3.3257` - peak memory consumption: `42120 MiB`\n\nDyT Norm - [link](https://arxiv.org/html/2503.10622v1) - didn't really work. 
The loss stayed too high\n\nForgetting Transformer (Vanilla and Pro versions) - [link](https://openreview.net/pdf?id=q2Lnyegkr8) - vanilla loss: `3.3243`, pro loss: `OOM`\n\nMulti-Token Attention - [link](https://arxiv.org/pdf/2504.00927) - loss: `3.3357` - peak memory: `42136 MiB`\n\nDifferential Attention - [link](https://arxiv.org/abs/2410.05258) - loss: `3.3352` - peak memory: `41521 MiB`\n\nSoftpick - [link](https://arxiv.org/abs/2504.20966) - loss: `3.3446` - peak memory: `59417 MiB`\n\nCanon Layer - [link](https://physics.allen-zhu.com/part-4-architecture-design/part-4-1) - loss: `3.3217` - peak memory: `43199 MiB`\n\nParallel Transformer Block - [link](https://arxiv.org/abs/2204.02311) - loss: `3.3473` - peak memory: `40302 MiB`\n\nPer Layer Token Embedding - [link](https://blog.google/technology/developers/gemma-3/) - loss: `3.2411` - peak memory: `40916 MiB`\n\n## Other Models\nMegaByte - [link](https://arxiv.org/abs/2305.07185) - loss: `3.810`\n\nFTP (heavily modified) - [link](https://arxiv.org/pdf/2410.18160) - loss: `3.901`\n\nRene - [link](https://huggingface.co/cartesia-ai/Rene-v0.1-1.3b-pytorch) - loss: `3.340`\n\nRwkv7 - [link](https://github.com/BlinkDL/RWKV-LM) - loss: `4.450`\n\nZamba2 - [link](https://huggingface.co/Zyphra/Zamba2-2.7B) - Zamba2 \u003e Rene \u003e Rwkv7\n\nHourglass Transformer (modified) - [link](https://arxiv.org/abs/2110.13711) - Hourglass \u003e MegaByte \u003e FTP - loss: `3.710`\n\nHymba - [link](https://arxiv.org/html/2411.13676v1) - train step time is significantly slower than that of the transformers. 
Best validation loss so far: `4.7505`\n\nTokenformer (in BlaGPT model) - [link](https://github.com/Haiyang-W/TokenFormer) - loss: `3.390`\n\nLLaDa (dLLM) - [link](https://arxiv.org/abs/2502.09992) - val-loss: `8.6930`, xentropy-loss: `4.2891` (comparable to other models and estimated by `llada_validation_cross_entropy.py`)\n\nAvey - [link](https://arxiv.org/pdf/2506.11305v1) - loss: `3.323`, peak memory: `51962 MiB` (batch size 8), step_time: `2871ms` (very slow to train and uses \u003e3x more memory than other models)\n\n## Optimizers\nPaLMForeachSOAP - [link](https://github.com/ClashLuke/HeavyBall) - almost 2 times slower than Adam but gives the best results\n\nAdemamix - [link](https://github.com/nanowell/AdEMAMix-Optimizer-Pytorch/blob/main/AdEMAMix.py) - unstable even after trying different learning rates\n\nAdopt - [link](https://github.com/iShohei220/adopt) - loss goes straight to NaN\n\nCAdamW - [link](https://github.com/kyleliang919/C-Optim/blob/main/c_adamw.py) - loss: `3.3517`\n\nAdamW with independent weight decay - [link](https://arxiv.org/pdf/2309.14322) - loss: `3.320`\n\nAdam - loss: `3.3224`\n\nAdamW - loss: `3.3310`, peak VRAM: `42053 MiB`, step_time: `533ms`\n\nDeMo - [link](https://arxiv.org/abs/2411.19870) - Saves 7 GB per GPU, loss is higher than baseline, step time is slower than Adam - loss: `3.4676`, peak VRAM: `41534 MiB`, step_time: `820ms`\n\nAdam-Mini - [link]() - loss is higher than Adam and AdamW and, unexpectedly, it is also slower; saves a bit of VRAM - loss: `3.3324`, peak VRAM: `41534 MiB`, step_time: `610ms`\n\nMARS - [link](https://github.com/AGI-Arena/MARS) - loss: `3.3459`, peak VRAM: `40953 MiB`, step_time: `628ms`\n\nMuon - [link](https://kellerjordan.github.io/posts/muon/) - loss: `3.2923`, peak VRAM: `40332MB`, step_time: `620.24ms`\n\nBiClip - [link](https://arxiv.org/pdf/2502.04164) - (not working well) loss: `7.2292`, peak VRAM: `39751 MiB`, step_time: `510ms`\n\n## Adding a New Model\n\n- Implement the model\n- Return the loss in the forward 
function\n- Add the model to `model_registry.py`\n- Start training\n\nSee one of the existing implementations for details.\n\n\n## Training\n\n- Get the data by running `data/fineweb10B_cached.py`\n\n- Start training with:\n\n```bash\ntorchrun --standalone --nproc_per_node=8 train.py --run_name pre_post_norm --model_name blagpt\n```\n\n- (Optional) Run the learning rate finder before training\n\n```bash\ntorchrun --standalone --nproc_per_node=8 find_lr.py --model_name blagpt\n\n# Output\nResults:\nSteepest gradient learning rate: 3.31e-06\nElbow point learning rate: 1.20e-01\nPlot saved to: logs/lr_finder_blagpt/lr_finder_plot.png\nResults saved to: logs/lr_finder_blagpt/lr_finder_results.pt\n```\n\n## Best Model So Far\n\n- Check `best_model_config.py` for the best model configuration so far.\n\n- Train with the best model config by running:\n\n```bash\ntorchrun --standalone --nproc_per_node=8 train.py --run_name best_model --model_name best\n```\n\n## Acknowledgements\n\nThe initial code is based on:\n\nNano GPT - [link](https://github.com/karpathy/nanoGPT)\n\nModded NanoGPT - [link](https://github.com/KellerJordan/modded-nanogpt)\n\nThanks to @xumingyu2021 for the memory-friendly implementation of Differential Attention\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferogol%2Fblagpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ferogol%2Fblagpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ferogol%2Fblagpt/lists"}