{"id":50511830,"url":"https://github.com/RightNow-AI/picolm","last_synced_at":"2026-06-19T15:00:37.471Z","repository":{"id":339450267,"uuid":"1161318197","full_name":"RightNow-AI/picolm","owner":"RightNow-AI","description":"Run a 1-billion parameter LLM on a $10 board with 256MB RAM","archived":false,"fork":false,"pushed_at":"2026-02-22T22:05:30.000Z","size":297,"stargazers_count":1051,"open_issues_count":10,"forks_count":111,"subscribers_count":19,"default_branch":"main","last_synced_at":"2026-02-28T02:39:36.969Z","etag":null,"topics":["arm","embedded","inference","llm","openclaw","picoclaw","quantization","raspberry-pi","risc-v"],"latest_commit_sha":null,"homepage":"https://www.rightnowai.co/forge","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RightNow-AI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-19T01:07:10.000Z","updated_at":"2026-02-28T02:35:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/RightNow-AI/picolm","commit_stats":null,"previous_names":["rightnow-ai/picolm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/RightNow-AI/picolm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2Fpicolm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2Fpicolm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-A
I%2Fpicolm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2Fpicolm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RightNow-AI","download_url":"https://codeload.github.com/RightNow-AI/picolm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RightNow-AI%2Fpicolm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34536283,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-19T02:00:06.005Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arm","embedded","inference","llm","openclaw","picoclaw","quantization","raspberry-pi","risc-v"],"created_at":"2026-06-02T21:00:23.809Z","updated_at":"2026-06-19T15:00:37.460Z","avatar_url":"https://github.com/RightNow-AI.png","language":"C","funding_links":[],"categories":["Edge And Embedded Directions"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Language-C11-blue?style=flat-square\" alt=\"C11\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Binary_Size-~80KB-brightgreen?style=flat-square\" alt=\"Binary Size\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/Runtime_RAM-45MB-orange?style=flat-square\" alt=\"RAM\"\u003e\n  \u003cimg 
src=\"https://img.shields.io/badge/Dependencies-Zero-success?style=flat-square\" alt=\"Zero Dependencies\"\u003e\n  \u003cimg src=\"https://img.shields.io/badge/License-MIT-yellow?style=flat-square\" alt=\"MIT License\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003ePicoLM\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eRun a 1-billion parameter LLM on a $10 board with 256MB RAM.\u003c/strong\u003e\u003cbr\u003e\n  Pure C. Zero dependencies. One binary. No Python. No cloud.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ccode\u003eecho \"Explain gravity\" | ./picolm model.gguf -n 100 -j 4\u003c/code\u003e\n\u003c/p\u003e\n\n---\n\n## The Perfect Match: PicoLM + PicoClaw\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"picolm.jpg\" alt=\"PicoLM — Run a 1-billion parameter LLM on a $10 board\" width=\"640\"\u003e\n  \u003cbr\u003e\u003cbr\u003e\n\u003c/div\u003e\n\nPicoLM was built as the **local brain** for [PicoClaw](https://github.com/sipeed/picoclaw) — an ultra-lightweight AI assistant in Go that runs on $10 hardware. Together, they form a **fully offline AI agent** — no cloud, no API keys, no internet, no monthly bills.\n\n\u003e **Every other LLM provider needs the internet. 
PicoLM doesn't.**\n\n\u003ctable align=\"center\"\u003e\n  \u003ctr align=\"center\"\u003e\n    \u003ctd\u003e\u003cb\u003eThe Hardware\u003c/b\u003e\u003c/td\u003e\n    \u003ctd\u003e\u003cb\u003eThe Architecture\u003c/b\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/sipeed/picoclaw/main/assets/licheervnano.png\" alt=\"$9.90 LicheeRV Nano\" width=\"360\"\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/sipeed/picoclaw/main/assets/arch.jpg\" alt=\"PicoClaw architecture — PicoLM sits in the LLM box\" width=\"420\"\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd align=\"center\"\u003e\u003cem\u003e$9.90 — that's the entire server\u003c/em\u003e\u003c/td\u003e\n    \u003ctd align=\"center\"\u003e\u003cem\u003ePicoLM powers the LLM box in PicoClaw's agent loop\u003c/em\u003e\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n### Why they're a perfect fit\n\n| | Cloud Provider (OpenAI, etc.) | PicoLM (Local) |\n|---|---|---|\n| **Cost** | Pay per token, forever | Free forever |\n| **Privacy** | Your data sent to servers | Everything stays on-device |\n| **Internet** | Required for every request | Not needed at all |\n| **Latency** | Network round-trip + inference | Inference only |\n| **Hardware** | Needs a $599 Mac Mini | Runs on a $10 board |\n| **Binary** | N/A | ~80KB single file |\n| **RAM** | N/A | 45 MB total |\n\n### How it works\n\nPicoClaw's agent loop spawns PicoLM as a subprocess. Messages come in from Telegram, Discord, or CLI — PicoClaw formats them into a chat template, pipes the prompt to `picolm` via stdin, and reads the response from stdout. 
When tools are needed, `--json` grammar mode guarantees valid JSON even from a 1B model.\n\n```\nTelegram / Discord / CLI\n        │\n        ▼\n   ┌──────────┐    stdin: prompt     ┌───────────┐\n   │ PicoClaw │ ──────────────────►  │  picolm   │\n   │   (Go)   │ ◄──────────────────  │   (C)     │\n   └──────────┘    stdout: response  │ + model   │\n        │                            └───────────┘\n        ▼                            45 MB RAM\n   User gets reply                   No internet\n```\n\n### Quick setup\n\n```bash\n# 1. Build PicoLM\ncd picolm \u0026\u0026 make native    # or: make pi (Raspberry Pi)\n\n# 2. Download model (one-time, 638 MB)\nmake model\n\n# 3. Build PicoClaw\ncd ../picoclaw \u0026\u0026 make deps \u0026\u0026 make build\n\n# 4. Configure (~/.picoclaw/config.json)\n```\n\n```json\n{\n  \"agents\": {\n    \"defaults\": {\n      \"provider\": \"picolm\",\n      \"model\": \"picolm-local\"\n    }\n  },\n  \"providers\": {\n    \"picolm\": {\n      \"binary\": \"~/.picolm/bin/picolm\",\n      \"model\": \"~/.picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf\",\n      \"max_tokens\": 256,\n      \"threads\": 4,\n      \"template\": \"chatml\"\n    }\n  }\n}\n```\n\n```bash\n# 5. Chat — fully offline!\npicoclaw agent -m \"What is photosynthesis?\"\n```\n\n### Or install everything in one line\n\n```bash\ncurl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash\n```\n\n### Performance on real hardware\n\n| Device | Price | Generation Speed | RAM Used |\n|--------|-------|-----------------|----------|\n| **Pi 5** (4-core) | $60 | ~10 tok/s | 45 MB |\n| **Pi 4** (4-core) | $35 | ~8 tok/s | 45 MB |\n| **Pi 3B+** | $25 | ~4 tok/s | 45 MB |\n| **Pi Zero 2W** | $15 | ~2 tok/s | 45 MB |\n| **LicheeRV Nano** | $10 | ~1 tok/s | 45 MB |\n\n### JSON tool calling\n\nPicoClaw automatically activates `--json` grammar mode when it needs structured output. 
This **guarantees syntactically valid JSON** even from a 1B parameter model — essential for reliable tool calling on tiny hardware:\n\n```bash\npicoclaw agent -m \"Search for weather in Tokyo\"\n# → PicoLM generates: {\"tool_calls\": [{\"function\": {\"name\": \"web_search\", \"arguments\": \"{\\\"query\\\": \\\"weather Tokyo\\\"}\"}}]}\n```\n\n\u003e For the full PicoClaw documentation, see the [PicoClaw README](https://github.com/sipeed/picoclaw).\n\n---\n\n## What is PicoLM?\n\nPicoLM is a **minimal, from-scratch LLM inference engine** written in ~2,500 lines of C11. It runs [TinyLlama 1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) (and other LLaMA-architecture models in GGUF format) on hardware that most inference frameworks won't even consider:\n\n- **Raspberry Pi Zero 2W** ($15, 512MB RAM, ARM Cortex-A53)\n- **Sipeed LicheeRV** ($12, 512MB RAM, RISC-V)\n- **Raspberry Pi 3/4/5** (1-8GB RAM, ARM NEON SIMD)\n- Any Linux/Windows/macOS x86-64 machine\n\nThe model file (638MB) stays on disk. PicoLM **memory-maps** it and streams one layer at a time through RAM. 
Total runtime memory: **~45MB** including the FP16 KV cache.\n\n```\n                    ┌──────────────────────────────────────────┐\n   What goes        │         45 MB Runtime RAM                │\n   in RAM           │  ┌─────────┐ ┌──────────┐ ┌───────────┐  │\n                    │  │ Buffers │ │ FP16 KV  │ │ Tokenizer │  │\n                    │  │  1.2 MB │ │ Cache    │ │   4.5 MB  │  │\n                    │  │         │ │  ~40 MB  │ │           │  │\n                    │  └─────────┘ └──────────┘ └───────────┘  │\n                    └──────────────────────────────────────────┘\n\n                    ┌──────────────────────────────────────────┐\n   What stays       │        638 MB Model on Disk              │\n   on disk          │       (mmap — OS pages in layers         │\n   (via mmap)       │        as needed, ~1 at a time)          │\n                    └──────────────────────────────────────────┘\n```\n\n---\n\n## Features\n\n| Feature | Description |\n|---------|-------------|\n| **GGUF Native** | Reads GGUF v2/v3 files directly — no conversion needed |\n| **K-Quant Support** | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32 |\n| **mmap Layer Streaming** | Model weights stay on disk; OS pages in one layer at a time |\n| **FP16 KV Cache** | Halves KV cache memory (44MB vs 88MB for 2048 context) |\n| **Flash Attention** | Online softmax — no O(seq_len) attention buffer needed |\n| **Pre-computed RoPE** | cos/sin lookup tables eliminate transcendentals from hot loop |\n| **SIMD Acceleration** | ARM NEON (Pi 3/4/5) and x86 SSE2 (Intel/AMD) auto-detected |\n| **Fused Dot Products** | Dequantize + dot-product in one pass — no intermediate buffer |\n| **Multi-threaded matmul** | Parallel matrix-vector multiply across CPU cores |\n| **Grammar-Constrained JSON** | `--json` flag forces valid JSON output (for tool calling) |\n| **KV Cache Persistence** | `--cache` saves/loads prompt state — skip prefill on re-runs |\n| **BPE Tokenizer** | Score-based 
byte-pair encoding, loaded from GGUF metadata |\n| **Top-p Sampling** | Temperature + nucleus sampling with configurable seed |\n| **Pipe-friendly** | Reads prompts from stdin: `echo \"Hello\" \\| ./picolm model.gguf` |\n| **Zero Dependencies** | Only libc, libm, libpthread. No external libraries. |\n| **Cross-platform** | Linux, Windows (MSVC), macOS. ARM, x86-64, RISC-V. |\n\n---\n\n## Quick Start\n\n### One-liner install (Raspberry Pi / Linux)\n\n```bash\ncurl -sSL https://raw.githubusercontent.com/RightNow-AI/picolm/main/install.sh | bash\n```\n\nThis will:\n1. Detect your platform (ARM64, ARMv7, x86-64)\n2. Install build dependencies (`gcc`, `make`, `curl`)\n3. Build PicoLM with optimal SIMD flags for your CPU\n4. Download TinyLlama 1.1B Q4_K_M (638 MB)\n5. Run a quick test\n6. Generate PicoClaw config\n7. Add `picolm` to your PATH\n\n### Build from source\n\n```bash\ngit clone https://github.com/rightnow-ai/picolm.git\ncd picolm/picolm\n\n# Auto-detect CPU (enables SSE2/AVX on x86, NEON on ARM)\nmake native\n\n# Download a model\nmake model\n\n# Run it\n./picolm /opt/picolm/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \\\n    -p \"The meaning of life is\" -n 100\n```\n\n### Build on Windows (MSVC)\n\n```cmd\ncd picolm\nbuild.bat\npicolm.exe model.gguf -p \"Hello world\" -n 50\n```\n\n### Platform-specific builds\n\n```bash\nmake native      # x86/ARM auto-detect (recommended for local machine)\nmake pi          # Raspberry Pi 3/4/5 (64-bit ARM + NEON SIMD)\nmake pi-arm32    # Pi Zero / Pi 1 (32-bit ARM)\nmake cross-pi    # Cross-compile for Pi from x86 (static binary)\nmake riscv       # RISC-V (Sipeed LicheeRV, etc.)\nmake static      # Static binary for single-file deployment\nmake debug       # Debug build with symbols, no optimization\n```\n\n---\n\n## Usage\n\n```\nPicoLM — ultra-lightweight LLM inference engine\n\nUsage: picolm \u003cmodel.gguf\u003e [options]\n\nGeneration options:\n  -p \u003cprompt\u003e    Input prompt (or pipe via stdin)\n  -n 
\u003cint\u003e       Max tokens to generate (default: 256)\n  -t \u003cfloat\u003e     Temperature (default: 0.8, 0=greedy)\n  -k \u003cfloat\u003e     Top-p / nucleus sampling (default: 0.9)\n  -s \u003cint\u003e       RNG seed (default: 42)\n  -c \u003cint\u003e       Context length override\n  -j \u003cint\u003e       Number of threads (default: 4)\n\nAdvanced options:\n  --json         Grammar-constrained JSON output mode\n  --cache \u003cfile\u003e KV cache file (saves/loads prompt state)\n```\n\n### Examples\n\n**Basic generation:**\n```bash\n./picolm model.gguf -p \"Once upon a time\" -n 200\n```\n\n**Greedy decoding (deterministic, temperature=0):**\n```bash\n./picolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# Output: Paris. It is the largest city in France and...\n```\n\n**Chat with TinyLlama (Zephyr-style chat format):**\n```bash\n./picolm model.gguf -n 200 -t 0.7 -p \"\u003c|user|\u003e\nWhat is photosynthesis?\u003c/s\u003e\n\u003c|assistant|\u003e\n\"\n```\n\n**Force JSON output (for tool calling / structured data):**\n```bash\n./picolm model.gguf --json -t 0.3 -n 100 -p \"\u003c|user|\u003e\nReturn the current time as JSON.\u003c/s\u003e\n\u003c|assistant|\u003e\n\"\n# Output: {\"time\": \"12:00 PM\"}\n```\n\n**Pipe from stdin:**\n```bash\necho \"Explain quantum computing in one sentence\" | ./picolm model.gguf -n 50\n```\n\n**KV cache — skip repeated prefill:**\n```bash\n# First run: processes prompt + saves cache\n./picolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n\n# Second run: loads cache, skips prompt prefill (74% faster)\n./picolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n# Output: \"Skipping 25 cached prompt tokens\"\n```\n\n**Multi-threaded on a Pi 4 (4 cores):**\n```bash\n./picolm model.gguf -p \"Hello\" -n 100 -j 4\n```\n\n---\n\n## Performance\n\nMeasured on TinyLlama 1.1B Q4_K_M (638 MB model):\n\n| Metric | x86-64 (8 threads) | Pi 4 (4 cores, NEON) | Pi Zero 2W 
|\n|--------|--------------------|-----------------------|------------|\n| **Prefill** | ~11 tok/s | ~6 tok/s | ~1.5 tok/s |\n| **Generation** | ~13 tok/s | ~8 tok/s* | ~2 tok/s* |\n| **Runtime RAM** | 45 MB | 45 MB | 45 MB |\n| **First token** | ~2.3s | ~4s | ~16s |\n| **Binary size** | ~80 KB | ~70 KB | ~65 KB |\n\n*\\*Estimated with NEON SIMD enabled. Actual numbers depend on SD card speed and thermal throttling.*\n\n### What makes it fast\n\n```\n Raw C inference          ████████████░░░░░░░░  13.5 tok/s  (baseline: 1.6)\n + Fused dot products     ████████████████░░░░  (eliminate dequant buffer)\n + Multi-threaded matmul  █████████████████░░░  (4-8 cores in parallel)\n + FP16 KV cache          █████████████████░░░  (halve memory bandwidth)\n + Pre-computed RoPE      ██████████████████░░  (no sin/cos in hot loop)\n + Flash attention        ██████████████████░░  (no O(n) attention alloc)\n + NEON/SSE2 SIMD         ███████████████████░  (4-wide vector ops)\n + KV cache persistence   ████████████████████  (skip prefill entirely)\n```\n\n---\n\n## Architecture\n\n```\n                          ┌─────────────────────────────────┐\n                          │           picolm.c              │\n                          │     CLI + Generation Loop       │\n                          └──────┬──────────────┬───────────┘\n                                 │              │\n                    ┌────────────┘              └────────────┐\n                    │                                        │\n           ┌────────┴────────┐                    ┌──────────┴──────────┐\n           │    model.h/c    │                    │    sampler.h/c      │\n           │  GGUF Parser    │                    │  Temperature +      │\n           │  mmap Layer     │                    │  Top-p Sampling     │\n           │  Streaming      │                    └──────────┬──────────┘\n           │  Forward Pass   │                               │\n           │  KV Cache I/O   │                
    ┌──────────┴──────────┐\n           └───┬────────┬────┘                    │    grammar.h/c      │\n               │        │                         │  JSON Constraint    │\n      ┌────────┘        └───────┐                 │  Logit Masking      │\n      │                         │                 └─────────────────────┘\n┌─────┴──────┐          ┌───────┴────────┐\n│ tensor.h/c │          │ tokenizer.h/c  │\n│ matmul     │          │ BPE Encode     │\n│ rmsnorm    │          │ Decode         │\n│ softmax    │          │ Vocab Lookup   │\n│ rope       │          └────────────────┘\n│ silu       │\n│ threading  │\n└─────┬──────┘\n      │\n┌─────┴──────┐\n│  quant.h/c │\n│ Q4_K, Q6_K │\n│ Q3_K, Q2_K │\n│ FP16, F32  │\n│ NEON + SSE │\n│ Fused Dots │\n└────────────┘\n```\n\n### The LLaMA Forward Pass (what happens for each token)\n\n```\nInput Token\n    │\n    ▼\n┌───────────────┐\n│ Embedding     │  Dequantize row from token_embd → x[2048]\n│ Lookup        │\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐  ×22 layers\n│ RMSNorm       │─────────────────────────────────────────┐\n│               │                                         │\n│ Q = xb @ Wq   │  Matrix-vector multiply (quantized)     │\n│ K = xb @ Wk   │  Store K,V in FP16 KV cache             │\n│ V = xb @ Wv   │                                         │\n│               │                                         │\n│ RoPE(Q, K)    │  Rotary position encoding (table lookup)│\n│               │                                         │\n│ Attention     │  Flash attention with online softmax    │\n│ (GQA 32→4)    │  Grouped-query: 32 Q heads, 4 KV heads  │\n│               │                                         │\n│ x += Out@Wo   │  Output projection + residual           │\n│               │                                         │\n│ RMSNorm       │                                         │\n│               │                                         │\n│ SwiGLU FFN    │  gate=SiLU(xb@Wg), 
up=xb@Wu             │\n│               │  x += (gate*up) @ Wd                    │\n└───────┬───────┘─────────────────────────────────────────┘\n        │\n        ▼\n┌───────────────┐\n│ Final RMSNorm │\n│ x @ W_output  │─→ logits[32000]\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐\n│ Grammar Mask  │  (if --json: force valid JSON structure)\n│ Sample Token  │  temperature → softmax → top-p → pick\n└───────────────┘\n```\n\n---\n\n## Memory Budget\n\nFor TinyLlama 1.1B Q4_K_M with 2048 context length:\n\n| Component | Size | Notes |\n|-----------|------|-------|\n| FP16 KV cache | ~40 MB | 22 layers x 2 x 2048 x 256 x 2 bytes |\n| Tokenizer | ~4.5 MB | 32K vocab strings + scores + sorted index |\n| Activation buffers | ~0.14 MB | x, xb, xb2, q, hb, hb2 |\n| Logits buffer | ~0.12 MB | 32000 x 4 bytes |\n| Dequant scratch | ~0.02 MB | Max(n_embd, n_ffn) floats |\n| Norm weights (pre-dequant) | ~0.35 MB | 45 norm vectors x 2048 x 4 bytes |\n| RoPE tables | ~0.03 MB | cos + sin x 2048 x 32 entries |\n| **Total runtime** | **~45 MB** | |\n| | | |\n| Model file (on disk) | 638 MB | Memory-mapped, ~1 layer in RAM at a time |\n\nWith 512 context (for constrained devices):\n\n| Component | Size |\n|-----------|------|\n| FP16 KV cache | ~10 MB |\n| Everything else | ~5 MB |\n| **Total** | **~15 MB** |\n\n---\n\n## Optimizations Deep-Dive\n\nPicoLM implements 9 optimizations that brought generation speed from **1.6 tok/s to 13.5 tok/s** on x86, with even larger gains expected on ARM with NEON:\n\n### 1. ARM NEON SIMD\n\n4-wide float vector operations for all hot paths. Example: dequantizing Q4_K nibbles with `vmovl_u8` → `vmovl_u16` → `vcvtq_f32_u32`, and RoPE with interleaved `vld2q_f32` / `vst2q_f32`.\n\n### 2. x86 SSE2 SIMD\n\nAuto-detected on Intel/AMD. 4-wide `__m128` operations for dot products, RMSNorm, and vector operations.\n\n### 3. FP16 KV Cache\n\nKey and value vectors stored as 16-bit floats instead of 32-bit. 
Halves KV cache memory from ~88MB to ~44MB. Conversion uses software `fp32_to_fp16()` / `fp16_to_fp32()` — no hardware FP16 support required.\n\n### 4. Pre-computed RoPE Tables\n\nSine and cosine values for all positions computed once at model load. The forward pass does a table lookup instead of calling `sinf()` / `cosf()` / `powf()` 64 times per token.\n\n### 5. Flash Attention (Online Softmax)\n\nSingle-pass attention with running maximum rescaling. Eliminates the `O(seq_len)` attention score buffer — critical for long contexts on memory-constrained devices.\n\n### 6. Fused Dequantize + Dot Product\n\n`vec_dot_q4_K_f32()` dequantizes and accumulates in one pass. No intermediate float buffer for the weight row. Reduces memory traffic by ~50% for matmul.\n\n### 7. Multi-threaded Matrix Multiply\n\n`matmul()` distributes output rows across threads using pthreads. Each thread processes its chunk independently with fused dot products. Scales linearly up to ~8 cores.\n\n### 8. Grammar-Constrained JSON\n\nThe `--json` mode pre-analyzes every token in the vocabulary at load time (brace delta, bracket delta, quote parity). During generation, it masks logits to guarantee syntactically valid JSON — essential for tool-calling with small models.\n\n### 9. KV Cache Persistence\n\n`--cache file.kvc` saves the FP16 KV cache state after prompt processing. On the next run with the same prompt, it loads the cache and skips prefill entirely. 
**74% latency reduction** for repeated system prompts.\n\n---\n\n## Supported Models\n\nPicoLM supports any LLaMA-architecture model in GGUF format:\n\n| Model | Parameters | GGUF Size (Q4_K_M) | RAM Needed |\n|-------|-----------|---------------------|------------|\n| **TinyLlama 1.1B** | 1.1B | 638 MB | ~45 MB |\n| **Llama 2 7B** | 7B | 4.1 GB | ~200 MB |\n| **Phi-2** | 2.7B | 1.6 GB | ~90 MB |\n\n\u003e **Recommended for embedded:** TinyLlama 1.1B Q4_K_M — fits comfortably on devices with 256MB+ RAM.\n\n### Supported quantization formats\n\n`Q2_K` `Q3_K` `Q4_K` `Q4_0` `Q5_K` `Q6_K` `Q8_0` `F16` `F32`\n\n---\n\n## File Structure\n\n```\nPicoLM/\n├── README.md              ← you are here\n├── BLOG.md                ← technical deep-dive blog post\n├── install.sh             ← one-liner Pi installer\n│\n├── picolm/                ← the inference engine (pure C)\n│   ├── picolm.c           ← CLI entry point, generation loop (273 lines)\n│   ├── model.h/c          ← GGUF parser, mmap, forward pass (146 + 833 lines)\n│   ├── tensor.h/c         ← matmul, rmsnorm, softmax, rope (44 + 298 lines)\n│   ├── quant.h/c          ← dequantization, SIMD kernels (140 + 534 lines)\n│   ├── tokenizer.h/c      ← BPE tokenizer (32 + ~200 lines)\n│   ├── sampler.h/c        ← temperature + top-p sampling (19 + ~100 lines)\n│   ├── grammar.h/c        ← JSON grammar constraints (64 + 175 lines)\n│   ├── Makefile           ← build targets for all platforms\n│   └── build.bat          ← Windows MSVC build script\n│\n└── tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf  ← model file (638 MB, not in git)\n```\n\n**Total C source: ~2,500 lines.** That's the entire inference engine — GGUF parsing, mmap, dequantization, matrix math, attention, tokenization, sampling, and grammar constraints.\n\n---\n\n## How It Works\n\n### The mmap trick\n\nTraditional inference engines load the entire model into RAM. PicoLM doesn't. Instead:\n\n1. 
The model file is **memory-mapped** (`mmap` on Linux/macOS, `MapViewOfFile` on Windows)\n2. Weight pointers point directly into the mapped file — no copying\n3. During the forward pass, each layer's weights are accessed sequentially\n4. The OS automatically pages in the needed weights and evicts old ones\n5. `madvise(MADV_SEQUENTIAL)` hints the access pattern to the kernel\n\n**Result:** A 638MB model runs on a device with 256MB RAM. Only ~30MB of the model is in physical memory at any time.\n\n### Quantization\n\nWeights are stored in 4-bit quantized format (Q4_K_M). For TinyLlama:\n- **Original:** 1.1B parameters x 4 bytes = 4.4 GB\n- **Q4_K:** 1.1B parameters x ~0.56 bytes = 638 MB\n- **Quality loss:** Minimal — Q4_K preserves 6-bit scales per 32-weight sub-block\n\n### Grouped-Query Attention (GQA)\n\nTinyLlama uses 32 query heads but only 4 key/value heads. Each KV head is shared by 8 query heads. This reduces KV cache size by 8x compared to full multi-head attention.\n\n---\n\n## Building \u0026 Testing\n\n### Prerequisites\n\n| Platform | Requirements |\n|----------|-------------|\n| **Linux/Pi** | `gcc`, `make` (install via `apt install build-essential`) |\n| **macOS** | Xcode Command Line Tools (`xcode-select --install`) |\n| **Windows** | Visual Studio Build Tools (cl.exe) |\n\n### Verify your build\n\n```bash\n# Build\nmake native\n\n# Test with greedy decoding (deterministic output)\n./picolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# Expected: \"Paris. 
It is the largest city in France...\"\n\n# Test JSON mode\n./picolm model.gguf --json -p \"Return JSON with name and age\" -n 50 -t 0.3\n# Expected: valid JSON like {\"name\": \"...\", \"age\": ...}\n\n# Test KV cache\n./picolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n./picolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n# Second run should say \"Skipping N cached prompt tokens\"\n```\n\n### Memory verification\n\nPicoLM prints memory stats to stderr:\n\n```\nMemory: 1.17 MB runtime state (FP16 KV cache separate)\n```\n\nTotal = runtime state + FP16 KV cache. For TinyLlama with 2048 context: ~45 MB.\n\n---\n\n## FAQ\n\n**Q: Can this run Llama 2 7B?**\nA: Yes, if you have enough RAM for the KV cache (~1.4 GB for 7B with 4096 context). The model file stays on disk via mmap. On a Pi 4 with 4GB RAM, it works but is slow (~1-2 tok/s).\n\n**Q: Why not use llama.cpp?**\nA: llama.cpp is excellent but requires ~200MB+ for the runtime on small models, has complex build dependencies, and targets desktop/server use cases. PicoLM is purpose-built for embedded: 45MB RAM, 80KB binary, zero dependencies.\n\n**Q: Is the output quality good?**\nA: TinyLlama 1.1B is a small model — it handles simple tasks (Q\u0026A, summarization, basic reasoning, JSON generation) well. It won't match GPT-4, but it runs on a $10 board with no internet. For structured output, the `--json` grammar mode guarantees valid JSON regardless of model quality.\n\n**Q: What about GPU acceleration?**\nA: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86/ARM CPUs, SIMD (NEON/SSE2) provides meaningful speedup.\n\n**Q: Can I use a different model?**\nA: Any LLaMA-architecture GGUF model works. Download from [HuggingFace](https://huggingface.co/models?search=gguf) and point PicoLM at it. 
Recommended quantizations: Q4_K_M (best quality/size balance) or Q2_K (smallest, lower quality).\n\n---\n\n## Roadmap\n\n- [ ] AVX2/AVX-512 kernels for x86 (2-4x generation speed on modern CPUs)\n- [ ] Speculative decoding with a draft model\n- [ ] Context sliding window (infinite generation beyond max_seq_len)\n- [ ] Weight pruning for further memory reduction\n- [ ] Continuous batching for server mode\n- [ ] Mistral / Phi architecture support\n\n---\n\n## Technical Blog\n\nFor a detailed writeup of the optimization journey (with code snippets and war stories), see [**BLOG.md**](BLOG.md).\n\n---\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003ePicoLM\u003c/strong\u003e — because intelligence shouldn't require a data center.\n\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRightNow-AI%2Fpicolm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRightNow-AI%2Fpicolm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRightNow-AI%2Fpicolm/lists"}