{"id":46486342,"url":"https://github.com/nguyenphuminh/planckgpt","last_synced_at":"2026-05-16T11:33:03.387Z","repository":{"id":311469724,"uuid":"1043799893","full_name":"nguyenphuminh/planckgpt","owner":"nguyenphuminh","description":"Train a GPT from scratch on your laptop","archived":false,"fork":false,"pushed_at":"2026-05-13T12:42:03.000Z","size":324,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2026-05-13T13:36:37.631Z","etag":null,"topics":["ai","attention","cuda","deep-learning","dl","gpt","gpu","language-model","llm","machine-learning","ml","nlp","torch","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nguyenphuminh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-24T16:40:41.000Z","updated_at":"2026-05-13T11:47:28.000Z","dependencies_parsed_at":"2025-08-24T21:04:00.087Z","dependency_job_id":"17e8559f-9c39-4592-8c4f-4d7c23e3f25e","html_url":"https://github.com/nguyenphuminh/planckgpt","commit_stats":null,"previous_names":["nguyenphuminh/smallm","nguyenphuminh/planckgpt"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/nguyenphuminh/planckgpt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenphuminh%2Fplanckgpt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenphuminh%2Fplanc
kgpt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenphuminh%2Fplanckgpt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenphuminh%2Fplanckgpt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nguyenphuminh","download_url":"https://codeload.github.com/nguyenphuminh/planckgpt/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nguyenphuminh%2Fplanckgpt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33100861,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","attention","cuda","deep-learning","dl","gpt","gpu","language-model","llm","machine-learning","ml","nlp","torch","transformer"],"created_at":"2026-03-06T09:32:19.206Z","updated_at":"2026-05-16T11:33:03.381Z","avatar_url":"https://github.com/nguyenphuminh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PlanckGPT\r\n\r\nPlanckGPT (planck length reference :D) is my attempt to make a tiny language model from scratch mostly for fun and educational purposes, but also to see how far a 
consumer-level computer can go in AI development **from scratch**. It has about 150m parameters and is pretrained on roughly 3 billion tokens of the Fineweb-edu dataset. This is small by modern LLM standards, and it only does next-token prediction, but you can definitely train this on a mid-range card in just 1-2 days. Its performance should match that of GPT-2 small, with ~3.1 val loss on Fineweb-edu.\r\n\r\n## Setup\r\n\r\nSet up a venv and install the necessary packages:\r\n```sh\r\n# Create and activate venv\r\npython -m venv venv\r\n# Run this every time you start\r\nsource venv/bin/activate\r\n# or \"./venv/Scripts/activate\" if you are on Windows\r\n\r\n# Install packages (once)\r\npip install torch torchvision --index-url https://download.pytorch.org/whl/cu130\r\npip install tiktoken datasets bitsandbytes\r\n```\r\n\r\nYou should already have compatible CUDA and Python versions installed; I currently use Python 3.14 and CUDA 13.\r\n\r\n## Running PlanckGPT\r\n\r\n1. Download the latest model (`chatbot.pth`) from the releases page.\r\n2. 
Simply run:\r\n```sh\r\npython inference.py\r\n```\r\n\r\nA prompt will appear for you to chat with the model.\r\n\r\n## Pretraining\r\n\r\nTo pretrain the model from scratch, run:\r\n```sh\r\npython train.py\r\n```\r\n\r\nThe model will train on ~3b tokens, split into 20 segments of 150m tokens each (an estimated 40 hours on my laptop's RTX 5070 Mobile), and after each epoch it will save the current model to `./chatbot.pth`.\r\n\r\nFor more control, you can check out `model.py`.\r\n\r\n## Architecture\r\n\r\nCurrently it uses:\r\n\r\n* Tokenizer: Tiktoken with GPT-2 encoding (50,257 vocab size).\r\n* Embedding: 768-dimensional token embedding.\r\n* Rotary positional embedding.\r\n* Transformer: 12 decoder layers, 6 query heads, 3072 ffn dim, 768 embedding dim.\r\n* Multi-Query Attention.\r\n* Squared ReLU for activation.\r\n* RMSNorm without learnable params, notably used on QK, embedding, and output logits.\r\n* Output: Linear projection with softcap logits (-15, 15).\r\n\r\nand is pretrained with:\r\n\r\n* Dataset: Fineweb-edu (~3b tokens).\r\n* Context Window: 1024 tokens.\r\n* Batch Size: 4 (effective batch size: 512 with gradient accumulation).\r\n* NorMuon optimizer for transformer weights, 8-bit Adam optimizer for embedding and output projection.\r\n* Stable LR for the first 55% of the steps, LinearLR decay to 10% of base LR for the rest.\r\n* BF16 mixed precision training and other Blackwell-specific features.\r\n* Training with torch.compile in \"max-autotune\" mode and `dynamic=False`.\r\n* Gradient checkpointing in 1/3 of the transformer layers.\r\n\r\nand generates text with:\r\n\r\n* Top-k sampling (k=50), followed by top-p sampling (p=0.95).\r\n* Temperature: 1.0.\r\n* Context Window: 1024 tokens.\r\n* Repetition penalty: 1.1 on full context window.\r\n* Stopping: EOS token or fixed limit (1024 by default).\r\n* KV cache for faster inference.\r\n\r\nThe current configuration is designed to squeeze the best possible performance out of an 8GB 5070 Mobile, but 
you can change the configs to match your card.\r\n\r\n## Potential todos\r\n\r\nThese are things I might implement in the future:\r\n\r\n* Training improvements:\r\n  * Try out different pretraining datasets, e.g. ClimbMix.\r\n  * Consider adding LR warmup.\r\n  * Try Gram Newton-Schulz to improve Muon's speed.\r\n  * Use an up-to-date Flash Attention implementation.\r\n  * Support FP8 and potentially NVFP4 training.\r\n  * Tune hyperparameters further.\r\n* Architecture improvements:\r\n  * An interesting idea to try out: an overwhelmingly large vocab like Gemma-3-270m's, which might help with small models.\r\n  * Custom tokenizer.\r\n  * Value embeddings.\r\n  * Dynamic scales for some layers.\r\n  * Mamba? RWKV? MoE?\r\n  * Sliding window attention.\r\n  * Smear.\r\n  * Backout.\r\n* Potential issues to look into:\r\n  * The current setup uses a 20:1 data-to-params ratio, which is not optimal for Muon, whose optimum is closer to 10:1.\r\n  * Embedding might be unstable currently due to AdamW8bit.\r\n* Finetuning for multiple purposes.\r\n* Try different datasets for both pretraining and finetuning.\r\n* Export to multiple formats for inference.\r\n* Code refactoring.\r\n\r\n## Acknowledgements\r\n\r\nPlanckGPT is inspired by [`modded-nanogpt`](https://github.com/KellerJordan/modded-nanogpt) and [`nanochat`](https://github.com/karpathy/nanochat).\r\n\r\n## Cite PlanckGPT\r\n\r\n```bibtex\r\n@misc{planckgpt,\r\n  author = {Phu Minh Nguyen},\r\n  title = {PlanckGPT: Train a GPT from scratch on your laptop},\r\n  year = {2025},\r\n  publisher = {GitHub},\r\n  url = {https://github.com/nguyenphuminh/planckgpt}\r\n}\r\n```\r\n\r\n## Copyright and License\r\n\r\nCopyright © 2025 Nguyen Phu Minh.\r\n\r\nThis project is licensed under the Apache 2.0 
License.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnguyenphuminh%2Fplanckgpt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnguyenphuminh%2Fplanckgpt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnguyenphuminh%2Fplanckgpt/lists"}