{"id":50307818,"url":"https://github.com/eniompw/llm-history","last_synced_at":"2026-05-28T18:01:20.069Z","repository":{"id":354405389,"uuid":"1223484616","full_name":"eniompw/llm-history","owner":"eniompw","description":"LLM Architecture Timeline","archived":false,"fork":false,"pushed_at":"2026-04-28T12:23:13.000Z","size":17,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-28T13:30:06.311Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eniompw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-28T11:18:29.000Z","updated_at":"2026-04-28T12:23:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/eniompw/llm-history","commit_stats":null,"previous_names":["eniompw/llm-history"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/eniompw/llm-history","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eniompw%2Fllm-history","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eniompw%2Fllm-history/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eniompw%2Fllm-history/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eniompw%2Fllm-history/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/eniompw","download_url":"https://codeload.github.com/eniompw/llm-history/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eniompw%2Fllm-history/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33619972,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-28T18:01:14.925Z","updated_at":"2026-05-28T18:01:20.046Z","avatar_url":"https://github.com/eniompw.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLM Architecture Timeline\n\nA comprehensive reference for the key architectural innovations that shaped modern large language models from the Transformer (2017) through 1T-parameter models (2026), plus practical guides for building LLMs from scratch.\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Architectural Milestones](#architectural-milestones)\n- [Building Your Own LLM](#building-your-own-llm)\n  - [Simplified Hybrid GPT/LLaMA Example](#simplified-hybrid-gptllama-example-microgpt-style)\n  - [Key Architectural Ideas](#key-architectural-ideas)\n  - [Quick-Start Guide](#quick-start-llama-2-in-4-lines)\n  - [Performance Enhancements](#enhancing-microgpt-for-modern-performance)\n\n## Overview\n\nThis 
timeline documents key architectural milestones from 2017 to April 2026, tracking the evolution from the foundational Transformer architecture through the emergence of sparse mixture-of-experts (MoE) models, scaling laws, alignment techniques (RLHF), and optimization innovations like FlashAttention. Each entry highlights the core innovation and its impact on subsequent model development.\n\n## Architectural Milestones\n\nKey architectural innovations, 2017 to Apr 2026.\n\n| Date | Model / Org | Params / Architecture | Key Innovation |\n| --- | --- | --- | --- |\n| Jun 2017 | [Transformer (Google Brain)](https://arxiv.org/abs/1706.03762) | —; Encoder-Decoder Attention | \"Attention Is All You Need\"; foundational architecture every LLM descends from. |\n| Jun 2018 | [GPT-1 (OpenAI)](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf) | 117M; Decoder-only Transformer | Introduced the decoder-only pretrain-then-finetune paradigm; showed unsupervised pretraining on text transfers well to downstream tasks. |\n| Feb 2019 | [GPT-2 (OpenAI)](https://cdn.openai.com/better-language-models/language-models.pdf) | 1.5B; Decoder-only Transformer | Scaled GPT-1's recipe; sparked mainstream AI awareness through its staged release over safety concerns; open-sourced in full on GitHub. |\n| Oct 2019 | [RMSNorm (Zhang \u0026 Sennrich)](https://arxiv.org/abs/1910.07467) | —; Normalization layer | Dropped LayerNorm's re-centering step while preserving scaling stability; later became standard in LLaMA-style decoder LLMs. |\n| Feb 2020 | [SwiGLU (Shazeer, Google)](https://arxiv.org/abs/2002.05202) | —; Activation function | Replaced ReLU/GeLU in FFN layers with a gated variant; better gradient flow; later adopted by PaLM, LLaMA, and most modern decoder models. 
|\n| May 2020 | [GPT-3 (OpenAI)](https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html) | 175B; Dense decoder + pre-norm | Demonstrated emergent few-shot in-context learning at scale; kept the pre-normalization layout adopted in GPT-2 for training stability. |\n| Jun 2020 | [GShard (Google)](https://arxiv.org/abs/2006.16668) | 600B; Coarse sparse MoE, top-k routing | First to scale MoE to 600B parameters via top-k routing; established a blueprint later refined by DeepSeek. _Note: Shazeer et al. Jan 2017 Sparse MoE introduced the expert routing concept this builds on._ |\n| Apr 2021 | [RoPE (Su et al.)](https://arxiv.org/abs/2104.09864) | —; Positional encoding | Rotary positional embeddings; hybrid absolute/relative encoding that generalizes better to longer sequences than learned absolute positions. |\n| Mar 2022 | [Chinchilla (DeepMind)](https://arxiv.org/abs/2203.15556) | 70B; Dense Transformer | Compute-optimal scaling: more tokens on a smaller model beats a bigger undertrained model at the same compute budget. |\n| Mar 2022 | [InstructGPT / RLHF (OpenAI)](https://arxiv.org/abs/2203.02155) | 1.3B-175B; RLHF (SFT + RM + PPO) | Introduced the three-stage RLHF pipeline that turned raw pretrained LLMs into usable instruction-following assistants. |\n| May 2022 | [FlashAttention (Dao et al.)](https://openreview.net/forum?id=H4DqfPSibmx) | —; IO-aware exact attention | Rewrote attention to be memory-efficient and fast via tiled IO; unlocked practical long-context training and became the de-facto attention kernel. |\n| Nov 2022 | [ChatGPT (OpenAI)](https://openai.com/blog/chatgpt) | GPT-3.5 class; RLHF chat assistant | Wrapped an RLHF-tuned LLM in a conversational web interface; proved commercial viability of chat-aligned models and triggered the global generative AI boom. 
|\n| Feb 2023 | [LLaMA 1 (Meta)](https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/) | 7B-65B; Dense + RoPE / SwiGLU / RMSNorm / FlashAttention | Synthesized the modern decoder recipe and sparked the open-source LLM explosion. |\n| Mar 2023 | [GPT-4 (OpenAI, leak)](https://arxiv.org/abs/2303.08774) | ~1.8T total; ~280B active; reported coarse MoE top-2 of 16 | Major capability milestone; architecture figures are leak-reported (Jul 2023), not official. |\n| Jul 2023 | [LLaMA 2 (Meta)](https://arxiv.org/abs/2307.09288) | 7B-70B; Dense + GQA | Added RLHF chat alignment; introduced Grouped-Query Attention for faster inference at larger scales. |\n| Dec 2023 | [Mixtral 8x7B (Mistral AI, open)](https://arxiv.org/abs/2401.04088) | 46.7B total; 12.9B active; coarse MoE top-2 of 8 | First open-weight MoE with broad adoption; showed sparse routing could beat dense models at equal active-parameter cost. |\n| May 2024 | [DeepSeek V2 (DeepSeek)](https://arxiv.org/abs/2405.04434) | 236B total; 21B active; fine-grained MoE + MLA | Introduced MLA (compressed KV cache) and DeepSeekMoE with fine-grained experts and shared expert isolation. |\n| Jul 2024 | [LLaMA 3.1 (Meta)](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/) | 405B; Dense + 128K context | First open model widely seen as competitive with top closed models at frontier scale. |\n| Sep 2024 | [OpenAI o1 (OpenAI)](https://openai.com/o1/) | Undisclosed; RL-trained reasoning model | Shifted emphasis from pre-training scaling to test-time compute scaling; used RL for extended internal reasoning before answers. |\n| Dec 2024 | [DeepSeek V3 (DeepSeek, open)](https://arxiv.org/abs/2412.19437) | 671B total; 37B active; MoE + FP8 training + MTP | Auxiliary-loss-free load balancing, FP8 training, and multi-token prediction; redefined efficiency expectations. 
|\n| Jan 2025 | [DeepSeek R1 (DeepSeek, open)](https://arxiv.org/abs/2501.12948) | 671B total; 37B active; V3 arch + RL-centric post-training | First open model reported to match o1-level reasoning; its R1-Zero variant showed that reasoning can emerge from pure RL without SFT warm-up, while R1 itself adds a small cold-start SFT stage before RL. |\n| Jul 2025 | [Kimi K2 (Moonshot AI, open)](https://arxiv.org/abs/2507.20534) | 1T total; 32B active; MoE + MLA + MuonClip | 1T-scale open MoE from Moonshot AI; used MuonClip for training stability at trillion-parameter scale. |\n| Apr 2026 | [Kimi K2.6 (Moonshot AI, open)](https://huggingface.co/moonshotai/Kimi-K2.6) | 1T total; 32B active; MoE + MLA + MuonClip, 384 experts, 256K context | Expanded to 384 experts and 256K context; highlighted agent-swarm workflows with up to 300 parallel sub-agents across 4,000 coordinated steps. |\n\n## Building Your Own LLM\n\n### Simplified Hybrid GPT/LLaMA Example (MicroGPT-style)\n\nThis is a compact educational architecture that is GPT-2-inspired but includes a few modern LLaMA-like choices.\n\nReference implementation: [microgpt.py](https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95)\n\n### Key Architectural Ideas\n\n- Decoder-only Transformer stack with causal self-attention.\n- Token + position embeddings at input.\n- Residual connections around attention and MLP sublayers.\n- RMSNorm (LLaMA-style) instead of LayerNorm.\n- Bias-free linear layers (LLaMA-style simplification).\n- ReLU in the MLP for simplicity (instead of GPT-2's GeLU).\n\n### Quick-Start: LLaMA 2 in 4 Lines\n\nFor end-to-end training and inference in minimal code:\n\nReference implementation: [modded-llama2.c](https://github.com/eniompw/modded-llama2.c)\n\n```bash\ngit clone https://github.com/eniompw/modded-llama2.c\n. 
./modded-llama2.c/download_tinystories.sh\ncd modded-llama2.c \u0026\u0026 python train.py --max_iters=1\n./run out/model.bin -i \"Once upon a time \"\n```\n\nThis approach:\n- Uses pure C for fast inference with no dependencies.\n- Trains on the TinyStories dataset for rapid iteration.\n- Demonstrates the full LLaMA 2 recipe (RoPE, SwiGLU, RMSNorm, grouped-query attention).\n- Works well for learning how modern LLMs train and run inference end-to-end.\n\n### Enhancing MicroGPT for Modern Performance\n\nReference implementation: [modded MicroGPT](https://github.com/eniompw/microgpt)\n\nTo make MicroGPT more performant and aligned with contemporary best practices:\n\n- **FlashAttention 2**: Replace standard attention with tiled IO-aware kernels for 2–4× wall-clock speedup.\n- **Grouped-Query Attention (GQA)**: Share KV heads across query heads to shrink the KV cache and cut memory-bandwidth pressure during inference.\n- **Multi-Token Prediction (MTP)**: Train the model to predict multiple tokens per forward pass, increasing sample efficiency.\n- **FP8 Training**: Use low-precision compute during training with proper scaling and accumulation for stability.\n- **Rotary Embeddings (RoPE)**: Replace learned absolute positions with rotary positional encodings for better length generalization.\n\nThese enhancements keep the educational clarity of MicroGPT while moving it closer to production-grade LLaMA 2 / DeepSeek-style architectures.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feniompw%2Fllm-history","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feniompw%2Fllm-history","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feniompw%2Fllm-history/lists"}