{"id":50854513,"url":"https://github.com/kaiser-data/jetson-headless-inference","last_synced_at":"2026-06-14T17:07:02.678Z","repository":{"id":359828226,"uuid":"1245144816","full_name":"kaiser-data/jetson-headless-inference","owner":"kaiser-data","description":"Jetson Orin 8GB headless AI API — fast LLM inference with model switching, power management, and desktop restore","archived":false,"fork":false,"pushed_at":"2026-05-23T16:06:24.000Z","size":22,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-23T18:07:44.664Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kaiser-data.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-21T00:38:25.000Z","updated_at":"2026-05-23T16:06:27.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kaiser-data/jetson-headless-inference","commit_stats":null,"previous_names":["kaiser-data/jetson-headless-inference"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/kaiser-data/jetson-headless-inference","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaiser-data%2Fjetson-headless-inference","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaiser-data%2Fjetson-headless-inference/tags","releases_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/kaiser-data%2Fjetson-headless-inference/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaiser-data%2Fjetson-headless-inference/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kaiser-data","download_url":"https://codeload.github.com/kaiser-data/jetson-headless-inference/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kaiser-data%2Fjetson-headless-inference/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34329791,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-14T02:00:07.365Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-14T17:07:01.980Z","updated_at":"2026-06-14T17:07:02.668Z","avatar_url":"https://github.com/kaiser-data.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Jetson Orin 8GB — Headless AI API\n\nRun local LLMs as a **LAN API endpoint** on NVIDIA Jetson Orin 8GB.  
\nSwitch between headless AI mode and normal Ubuntu desktop with one command.\n\n**Hardware:** Jetson Orin Nano 8GB · LPDDR5 68 GB/s · CUDA 12.6 · JetPack 6.x  \n**Backend:** [Ollama](https://ollama.com) · OpenAI-compatible REST API\n\n---\n\n## Table of Contents\n- [Architecture](#architecture)\n- [Memory Layout](#memory-layout)\n- [Mode Flow](#mode-flow)\n- [Quick Start](#quick-start)\n- [Model Guide](#model-guide)\n- [Optimization Stack](#optimization-stack)\n- [API Usage](#api-usage)\n- [Boot Menu](#boot-menu)\n- [Test Suite](#test-suite)\n- [Troubleshooting](#troubleshooting)\n\n---\n\n## Architecture\n\n```\n╔═══════════════════════════════════════════════════════════════════╗\n║                    YOUR HOME / LAB NETWORK                        ║\n║                                                                   ║\n║  ┌────────────────────────────────────────────────────────────┐  ║\n║  │              NVIDIA JETSON ORIN  8GB                       │  ║\n║  │                                                            │  ║\n║  │  ┌─────────────────────────────────────────────────────┐  │  ║\n║  │  │  Ampere GPU (1024 CUDA cores)                       │  │  ║\n║  │  │  ◄──────────────────────────────────────────────►   │  │  ║\n║  │  │  ARM Cortex-A78AE  6-core CPU   MAXN_SUPER mode     │  │  ║\n║  │  └───────────────────────┬─────────────────────────────┘  │  ║\n║  │                          │                                 │  ║\n║  │  ╔═══════════════════════▼═══════════════════════════╗    │  ║\n║  │  ║      Unified LPDDR5 RAM — 8 GB  @  68 GB/s        ║    │  ║\n║  │  ║  ┌────────┬──────────────────────┬────────────┐   ║    │  ║\n║  │  ║  │OS 0.5G │  Model weights 2–7GB │  KV cache  │   ║    │  ║\n║  │  ║  └────────┴──────────────────────┴────────────┘   ║    │  ║\n║  │  ╚═══════════════════════════════════════════════════╝    │  ║\n║  │                                                            │  ║\n║  │  
┌──────────────────────────────────────────────────┐     │  ║\n║  │  │  ollama serve  →  0.0.0.0:11434                  │     │  ║\n║  │  │  ├── /api/generate          ollama native         │     │  ║\n║  │  │  ├── /v1/chat/completions   OpenAI-compatible     │     │  ║\n║  │  │  └── /api/ps                model status          │     │  ║\n║  │  └──────────────────────────────────────────────────┘     │  ║\n║  │                     192.168.0.115:11434                    │  ║\n║  └────────────────────────────────────────────────────────────┘  ║\n║                              │                                    ║\n║          ┌───────────────────┼──────────────────┐                ║\n║          ▼                   ▼                  ▼                ║\n║   ┌─────────────┐    ┌─────────────┐    ┌────────────┐          ║\n║   │Raspberry Pi │    │   Laptop    │    │ Any Device │          ║\n║   │192.168.0.148│    │             │    │            │          ║\n║   │ curl/Python │    │ OpenAI SDK  │    │HTTP client │          ║\n║   └─────────────┘    └─────────────┘    └────────────┘          ║\n╚═══════════════════════════════════════════════════════════════════╝\n```\n\n---\n\n## Memory Layout\n\nThe Jetson uses **unified memory** — CPU and GPU share the same physical pool.  
\nStopping the desktop frees 1.5 GB, allowing larger models to run fully on GPU.\n\n```\n  ┌─────────────────────────────────────────────────────────────┐\n  │                  8 GB Unified RAM                           │\n  ├─────────────────────────────────────────────────────────────┤\n  │                                                             │\n  │  NORMAL MODE  (desktop running)                             │\n  │  ┌──────┬──────────┬──────────────────────────────────┐    │\n  │  │  OS  │  GNOME   │       Model space                │    │\n  │  │ 0.5G │  1.5 GB  │         ~ 6.0 GB free            │    │\n  │  └──────┴──────────┴──────────────────────────────────┘    │\n  │                                                             │\n  │  HEADLESS MODE  (./jetson-ai.sh start)                      │\n  │  ┌──────┬──────────────────────────────────────────────┐    │\n  │  │  OS  │             Model space                      │    │\n  │  │ 0.5G │               ~ 7.1 GB free                  │    │\n  │  └──────┴──────────────────────────────────────────────┘    │\n  │          ▲ +1.5 GB gained by stopping desktop               │\n  └─────────────────────────────────────────────────────────────┘\n\n  ⚠  Silent CPU Fallback Trap:\n     Model too big for GPU  →  Ollama silently uses CPU\n     GPU inference:  13–35 tok/s  ✓\n     CPU inference:   0.3 tok/s  ✗  (100× slower, unusable)\n     The bench command detects and warns about this automatically.\n```\n\n---\n\n## Mode Flow\n\n```mermaid\nflowchart TD\n    A([🔐 SSH / TTY Login]) --\u003e B\n\n    subgraph MENU [\"Boot Menu — 10s timeout\"]\n        B{Choice?}\n        B --\u003e|1 - default| C[🖥️ Start Desktop]\n        B --\u003e|2 / 3 / 4| D[⚡ AI API Mode]\n        B --\u003e|5| Z[💻 Shell only]\n    end\n\n    subgraph START [\"./jetson-ai.sh start\"]\n        D --\u003e E[🔋 MAXN_SUPER power\\nnvpmodel -m 2\\njetson_clocks]\n        E --\u003e F[🖥️ Stop GNOME\\n+1.5 GB RAM freed]\n        F --\u003e G{🧠 
GPU fit\\ncheck}\n        G --\u003e|✓ fits| H[Load model\\ninto GPU RAM]\n        G --\u003e|⚠ tight| H\n        G --\u003e|✗ too big| X[⚠️ CPU fallback\\n0.3 tok/s — use smaller model]\n    end\n\n    subgraph API [\"API Ready\"]\n        H --\u003e I[🌐 LAN endpoint\\n192.168.0.115:11434\\nOpenAI-compatible]\n    end\n\n    subgraph LIVE [\"While Running\"]\n        I --\u003e J{Command}\n        J --\u003e|switch| K[🔄 Hot-swap model\\n~20 seconds\\nauto-pull if missing]\n        K --\u003e I\n        J --\u003e|bench| L[📊 3-run benchmark\\ntok/s + GPU% check]\n        L --\u003e I\n    end\n\n    subgraph STOP [\"./jetson-ai.sh stop\"]\n        J --\u003e|stop| M[💾 Unload model\\nfrom RAM]\n        M --\u003e N[🔋 Restore power\\nmode 15W]\n        N --\u003e O[🖥️ Start GNOME\\nDesktop back]\n    end\n\n    style X fill:#ff6b6b,color:#fff\n    style H fill:#51cf66,color:#fff\n    style I fill:#339af0,color:#fff\n```\n\n---\n\n## Quick Start\n\n```bash\n# 1. First-time setup — run once (needs sudo password)\n./jetson-ai.sh setup\n\n# 2. Start headless AI API\n./jetson-ai.sh start               # default: qwen3.5:4b\n./jetson-ai.sh start reasoning     # phi4-mini (task alias)\n./jetson-ai.sh start quality       # llama3.1:8b (headless only)\n\n# 3. Swap model on the fly — no restart needed\n./jetson-ai.sh switch code         # → qwen3.5:4b\n./jetson-ai.sh switch fast         # → phi4-mini\n./jetson-ai.sh switch vision       # → gemma4:e2b\n\n# 4. Benchmark\n./jetson-ai.sh bench\n\n# 5. 
Restore Ubuntu desktop\n./jetson-ai.sh stop\n```\n\n---\n\n## Model Guide\n\n```\n  MODEL SELECTION — RAM vs 8 GB LIMIT\n  ════════════════════════════════════════════════════════\n                                              headless\n                                    desktop   only\n  qwen3.5:0.8b  ██░░░░░░░░░░░░░░  1.0GB  ~35 tok/s  ✓\n  qwen2.5:3b    █████░░░░░░░░░░░  1.9GB  ~22 tok/s  ✓\n  llama3.2:3b   █████░░░░░░░░░░░  2.0GB  ~20 tok/s  ✓\n  phi4-mini   ★ ██████░░░░░░░░░░  2.5GB  ~18 tok/s  ✓\n  gemma3        █████████░░░░░░░  3.3GB  ~12 tok/s  ✓\n  qwen3.5:4b  ★ █████████░░░░░░░  3.4GB  ~13 tok/s  ✓\n  llama3.1:8b   █████████████░░░  4.9GB  ~ 8 tok/s  ○\n  gemma4:e2b    ████████████████  7.2GB  ~ 5 tok/s  ○\n  gemma4:e4b    ██████████████████████  9.6GB  ✗ too large\n                ├────────┬───────┼───────────────────┤\n                0       2GB    4GB                  8GB\n                         ▲ desktop  ▲ headless limit\n                         6GB free   7.1GB free\n\n  ✓ fits always   ○ headless only   ✗ avoid (CPU fallback)\n  ★ recommended\n```\n\n### Task Aliases\n\n| Alias | Model | Why |\n|---|---|---|\n| `default` | qwen3.5:4b | Best quality/speed balance |\n| `fast` | phi4-mini | Lowest latency |\n| `reasoning` | phi4-mini | Math, logic, step-by-step |\n| `code` | qwen3.5:4b | Coding \u0026 debugging |\n| `vision` | gemma4:e2b | Image understanding |\n| `german` | cas/discolm-german | German language |\n| `tiny` | qwen2.5:3b | Minimal RAM, fast |\n| `quality` | llama3.1:8b | Best output (headless only) |\n\n---\n\n## Optimization Stack\n\n```\n  ╔══════════════════════════════════════════════════════════════╗\n  ║               SPEED OPTIMIZATION STACK                       ║\n  ╠══════════════════════════════════════════════════════════════╣\n  ║                                                              ║\n  ║  BEFORE  (defaults)                                          ║\n  ║  
┌──────────────────────────────────────────────────────┐   ║\n  ║  │ 15W mode · GNOME running · 5-min evict · no flash    │   ║\n  ║  │ ~6 GB free · ~4–6 tok/s · CPU fallback risk: HIGH   │   ║\n  ║  └──────────────────────────────────────────────────────┘   ║\n  ║                           ▼                                  ║\n  ║  ➊  MAXN_SUPER power      nvpmodel -m 2 + jetson_clocks      ║\n  ║     └─ 2× faster GPU/CPU clocks, uncapped power budget       ║\n  ║                           ▼                                  ║\n  ║  ➋  Stop GNOME desktop    systemctl stop gdm3                ║\n  ║     └─ +1.5 GB RAM freed for model weights                   ║\n  ║                           ▼                                  ║\n  ║  ➌  Flash Attention       OLLAMA_FLASH_ATTENTION=1           ║\n  ║     └─ −30 to 50% KV cache memory (CUDA optimized)          ║\n  ║                           ▼                                  ║\n  ║  ➍  KV cache quant        OLLAMA_KV_CACHE_TYPE=q8_0          ║\n  ║     └─ halves KV cache RAM, negligible quality loss          ║\n  ║                           ▼                                  ║\n  ║  ➎  Model pinned          OLLAMA_KEEP_ALIVE=-1               ║\n  ║     └─ 0 s reload delay between requests                     ║\n  ║                           ▼                                  ║\n  ║  ➏  Systemd drop-in     /etc/systemd/system/ollama.service.d/║\n  ║     └─ settings persist across reboots \u0026 service restarts    ║\n  ║                           ▼                                  ║\n  ║  AFTER   (./jetson-ai.sh start)                              ║\n  ║  ┌──────────────────────────────────────────────────────┐   ║\n  ║  │ MAXN mode · headless · model pinned · flash attn on  │   ║\n  ║  │ ~7 GB free · 12–35 tok/s · CPU fallback risk: LOW   │   ║\n  ║  └──────────────────────────────────────────────────────┘   ║\n  ╚══════════════════════════════════════════════════════════════╝\n```\n\n---\n\n## API Usage\n\nThe API 
is OpenAI-compatible for the endpoints Ollama implements, such as `/v1/chat/completions`, so most apps built on the OpenAI SDK work after changing only the base URL.\n\n### Python — OpenAI SDK\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http://192.168.0.115:11434/v1\",\n    api_key=\"ollama\"          # any string, not validated\n)\nresponse = client.chat.completions.create(\n    model=\"qwen3.5:4b\",\n    messages=[{\"role\": \"user\", \"content\": \"Explain edge AI in 2 sentences.\"}]\n)\nprint(response.choices[0].message.content)\n```\n\n### Python — requests (streaming)\n```python\nimport requests, json\n\nwith requests.post(\"http://192.168.0.115:11434/api/generate\",\n    json={\"model\": \"qwen3.5:4b\", \"prompt\": \"Count to 5\", \"stream\": True},\n    stream=True) as r:\n    for line in r.iter_lines():\n        if line:\n            print(json.loads(line).get(\"response\", \"\"), end=\"\", flush=True)\n```\n\n### curl\n```bash\ncurl http://192.168.0.115:11434/api/generate \\\n  -d '{\"model\":\"qwen3.5:4b\",\"prompt\":\"Hello!\",\"stream\":false}'\n```\n\n### From Raspberry Pi (192.168.0.148)\n```bash\ncurl http://192.168.0.115:11434/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"qwen3.5:4b\",\"messages\":[{\"role\":\"user\",\"content\":\"Hi!\"}]}'\n```\n\n---\n\n## Boot Menu\n\nAdd to `~/.bashrc` to get a mode selector on every login:\n\n```bash\necho 'source ~/gamma4_models/boot-choice.sh' \u003e\u003e ~/.bashrc\n```\n\n```\n  ╔══════════════════════════════════════════════╗\n  ║         JETSON ORIN — BOOT MODE              ║\n  ╠══════════════════════════════════════════════╣\n  ║  [1] Ubuntu Desktop              ← last      ║\n  ║  [2] AI API  — qwen3.5:4b                    ║\n  ║  [3] AI API  — phi4-mini (fast)              ║\n  ║  [4] AI API  — choose model                  ║\n  ║  [5] Shell only (no desktop/AI)              ║\n  ╚══════════════════════════════════════════════╝\n\n  Auto-starting [1] in 10s ... 
(press 1-5 to change)\n```\n\n- Remembers your last choice\n- Skip for one session: `JETSON_AI_SKIP_MENU=1 bash`\n- Skipped automatically inside desktop sessions\n\n---\n\n## Test Suite\n\nAuto-detects all installed models, runs 2-prompt benchmark each, checks GPU placement:\n\n```bash\n./test-models.sh            # test all installed models\n./test-models.sh phi4-mini  # test one specific model\n```\n\n```\n  Jetson AI — Model Test Suite\n  Date  : 2026-05-21 12:00\n  Power : MAXN_SUPER\n  RAM   : 7.0G free\n  Models: 5 to test\n  ────────────────────────────────────────────────────────────\n  Model                               Result\n  ────────────────────────────────────────────────────────────\n  qwen2.5:3b                          ✓ PASS  22.1 tok/s  GPU:94%  1.9GB\n  phi4-mini:latest                    ✓ PASS  18.3 tok/s  GPU:91%  2.5GB\n  qwen3.5:4b                          ✓ PASS  13.1 tok/s  GPU:88%  3.4GB\n  gemma4:e2b                          ⚠ WARN  slow (4.8 tok/s, GPU:42%)\n  gemma4:e4b                          ✗ FAIL  CPU fallback (0.3 tok/s, GPU:0%)\n  ────────────────────────────────────────────────────────────\n  ✓ 3 passed    ⚠ 1 warning    ✗ 1 failed\n```\n\n---\n\n## Troubleshooting\n\n| Problem | Symptom | Fix |\n|---|---|---|\n| `sudo: password required` | Scripts pause/fail | Run `./jetson-ai.sh setup` first |\n| CPU fallback | `bench` shows \u003c 2 tok/s | Run `stop` then `start` (stops desktop) |\n| Model load timeout | `start` hangs \u003e90s | Try smaller model: `phi4-mini` |\n| Desktop doesn't restore | Black screen after `stop` | `sudo systemctl start gdm3` |\n| API not reachable from LAN | Connection refused on other device | Check: `systemctl show ollama \\| grep OLLAMA_HOST` |\n| Port already in use | Error on start | `sudo systemctl restart ollama` |\n| No GPU detected | `nvpmodel` not found | JetPack not fully installed |\n\n---\n\n## Comparison vs Alternatives\n\n| | **This setup** | **NanoLLM** | **llama.cpp** 
|\n|---|---|---|---|\n| Speed | Good (MAXN + flash attn) | Best (TensorRT-LLM) | ~10% faster than ollama |\n| OpenAI API compat | ✓ native | ✗ | needs wrapper |\n| Model switching | 1 command, ~20s | manual | manual |\n| Desktop restore | automatic | manual | manual |\n| Vision / multimodal | ✓ gemma4 | ✓ | partial |\n| Install effort | Low (done) | High (Docker + CUDA builds) | Medium |\n| LAN API server | ✓ built-in | ✗ | needs extra server |\n| Persists across reboots | ✓ systemd | manual | manual |\n\n---\n\n## Files\n\n| File | Purpose |\n|---|---|\n| `jetson-ai.sh` | Main controller — all commands |\n| `boot-choice.sh` | Login menu (desktop ↔ AI API) |\n| `test-models.sh` | Automated model test suite |\n\nState and logs saved to `~/.local/share/jetson-ai/`\n\n---\n\n**Hardware tested:** NVIDIA Jetson Orin Nano 8GB · JetPack 6.4.7 (R36) · CUDA 12.6 · Ollama 0.21+\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaiser-data%2Fjetson-headless-inference","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkaiser-data%2Fjetson-headless-inference","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkaiser-data%2Fjetson-headless-inference/lists"}