{"id":50103490,"url":"https://github.com/Andyyyy64/whichllm","last_synced_at":"2026-06-09T00:01:30.350Z","repository":{"id":342059334,"uuid":"1172574522","full_name":"Andyyyy64/whichllm","owner":"Andyyyy64","description":"Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.","archived":false,"fork":false,"pushed_at":"2026-06-03T13:20:47.000Z","size":3159,"stargazers_count":2614,"open_issues_count":19,"forks_count":141,"subscribers_count":8,"default_branch":"main","last_synced_at":"2026-06-03T13:22:21.634Z","etag":null,"topics":["ai","apple-silicon","benchmarks","cli","command-line-tool","gguf","gpu","huggingface","inference","llm","local-llm","ollama","python","vram"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Andyyyy64.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["Andyyyy64"]}},"created_at":"2026-03-04T13:16:00.000Z","updated_at":"2026-06-03T12:59:42.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Andyyyy64/whichllm","commit_stats":null,"previous_names":["andyyyy64/local-llm-checker","andyyyy64/whatllm"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/Andyyyy64/whichllm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Andyyyy64%2Fwhichllm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Andyyyy64%2Fwhichllm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Andyyyy64%2Fwhichllm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Andyyyy64%2Fwhichllm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Andyyyy64","download_url":"https://codeload.github.com/Andyyyy64/whichllm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Andyyyy64%2Fwhichllm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34085321,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","apple-silicon","benchmarks","cli","command-line-tool","gguf","gpu","huggingface","inference","llm","local-llm","ollama","python","vram"],"created_at":"2026-05-23T09:00:34.683Z","updated_at":"2026-06-09T00:01:30.327Z","avatar_url":"https://github.com/Andyyyy64.png","language":"Python","funding_links":["https://github.com/sponsors/Andyyyy64"],"categories":["Python"],"sub_categories":[],"readme":"# whichllm\n\n[![PyPI version](https://img.shields.io/pypi/v/whichllm)](https://pypi.org/project/whichllm/)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Tests](https://github.com/Andyyyy64/whichllm/actions/workflows/test.yml/badge.svg)](https://github.com/Andyyyy64/whichllm/actions/workflows/test.yml)\n[![Sponsor](https://img.shields.io/badge/Sponsor-GitHub%20Sponsors-EA4AAA?logo=githubsponsors)](https://github.com/sponsors/Andyyyy64)\n\n**Find the best local LLM that actually runs on your hardware.**\n\nAuto-detects your GPU/CPU/RAM and ranks the top models from HuggingFace that fit your system.\n\n[日本語版はこちら](docs/README.ja.md)\n\n## Quick start\n\nRun the recommendation command once, with no project setup.\n\n```bash\nuvx whichllm@latest\n```\n\nSimulate a GPU before you buy hardware.\n\n```bash\nuvx whichllm@latest --gpu \"RTX 4090\"\n```\n\nInstall it when you use it often.\n\n```bash\nuv tool install whichllm\nuv tool upgrade whichllm  # update an existing install\n```\n\nOther install paths.\n\n```bash\nbrew install andyyyy64/whichllm/whichllm\npip install whichllm\n```\n\n## Common workflows\n\nAfter install, run `whichllm` directly. For one-off runs, replace `whichllm`\nwith `uvx whichllm@latest`.\n\n```bash\n# Best models for this machine\nwhichllm\n\n# Pretend you have a specific GPU\nwhichllm --gpu \"RTX 4090\"\n\n# Compare upgrade candidates\nwhichllm upgrade \"RTX 4090\" \"RTX 5090\" \"H100\"\n\n# Find the GPU needed for a model\nwhichllm plan \"llama 3 70b\"\n\n# Start a chat with a model\nwhichllm run \"qwen 2.5 1.5b gguf\"\n\n# Print copy-paste Python\nwhichllm snippet \"qwen 7b\"\n\n# Return JSON for scripts\nwhichllm --top 1 --json\n```\n\n![demo](assets/demo.gif)\n\n## See it\n\n```text\n$ whichllm --gpu \"RTX 4090\"\n\n#1  Qwen/Qwen3.6-27B     27.8B  Q5_K_M   score 92.8    27 t/s\n#2  Qwen/Qwen3-32B       32.0B  Q4_K_M   score 83.0    31 t/s\n#3  Qwen/Qwen3-30B-A3B   30.0B  Q5_K_M   score 82.7   102 t/s\n```\n\nThe 32B model **fits your card fine** — whichllm still ranks the 27B #1,\nbecause it scores higher on real benchmarks and is a newer generation.\nA size-only \"what fits?\" tool would hand you the bigger one. That gap is\nthe whole point of whichllm. (Note #3: a MoE model at 102 t/s — speed is\nranked on *active* params, quality on *total*.)\n\n## What can I run?\n\nReal top picks (snapshot 2026-05 — your results track **live** HuggingFace\ndata, this is not a static list):\n\n| Hardware | VRAM | Top pick | Speed |\n|---|---|---|---|\n| RTX 5090 | 32 GB | `Qwen3.6-27B` · Q6_K · score 94.7 | ~40 t/s |\n| RTX 4090 / 3090 | 24 GB | `Qwen3.6-27B` · Q5_K_M · score 92.8 | ~27 t/s |\n| RTX 4060 | 8 GB | `Qwen3-14B` · Q3_K_M · score 71.0 | ~22 t/s |\n| Apple M3 Max | 36 GB | `Qwen3.6-27B` · Q5_K_M · score 89.4 | ~9 t/s |\n| CPU only | — | `gpt-oss-20b` (MoE) · Q4_K_M · score 45.2 | ~6 t/s |\n\n`whichllm --gpu \"\u003cyour card\u003e\"` simulates any of these before you buy.\n\n## Why whichllm?\n\nFitting a model into your VRAM is the easy part. The hard part is knowing\n**which of the models that fit is actually the best** — and that is what\nwhichllm is built to get right.\n\n- **Evidence-based ranking, not a size heuristic** — The top pick is\n  chosen from merged real benchmarks (LiveBench, Artificial Analysis,\n  Aider, multimodal/vision, Chatbot Arena ELO, Open LLM Leaderboard) —\n  never \"the biggest model that happens to fit.\"\n- **Recency-aware** — Stale leaderboards are demoted along each model's\n  lineage, so a 2024 model can't outrank a current-generation one on an\n  outdated score. The benchmark snapshot date is printed under every\n  ranking, so a stale recommendation is self-evident instead of silently\n  trusted.\n- **Evidence-graded and guarded** — Every score is tagged\n  `direct` / `variant` / `base` / `interpolated` / `self-reported` and\n  discounted by confidence. Fabricated uploader claims and cross-family\n  inheritance (a small fork borrowing its much larger base's score) are\n  actively rejected.\n- **Architecture-aware estimates** — VRAM = weights + GQA KV cache +\n  activation + overhead; speed is bandwidth-bound with per-quant\n  efficiency, per-backend factors, MoE active-vs-total split, and\n  unified-memory vs discrete-PCIe partial-offload modeling.\n- **One command, scriptable** — `whichllm` prints the answer; add\n  `--json | jq` for pipelines. No TUI, no keybindings to memorize.\n- **Live data** — Models fetched directly from the HuggingFace API, with\n  curated frozen fallbacks for offline or rate-limited use.\n\n## Features\n\n- **Auto-detect hardware** — NVIDIA, AMD, Apple Silicon, CPU-only\n- **Smart ranking** — Scores models by VRAM fit, speed, and benchmark quality\n- **One-command chat** — `whichllm run` downloads and starts a chat session instantly\n- **Code snippets** — `whichllm snippet` prints ready-to-run Python for any model\n- **Live data** — Fetches models directly from HuggingFace (cached for performance)\n- **Benchmark-aware** — Integrates real eval scores with confidence-based dampening\n- **Task profiles** — Filter by general, coding, vision, or math use cases\n- **GPU simulation** — Test with any GPU: `whichllm --gpu \"RTX 4090\"`\n- **Hardware planning** — Reverse lookup: `whichllm plan \"llama 3 70b\"`\n- **Upgrade planning** — Compare your current machine with candidate GPUs\n- **JSON output** — Pipe-friendly: `whichllm --json`\n\n## Run \u0026 Snippet\n\nTry any model with a single command. No manual installs needed — whichllm\ncreates an isolated environment via `uv`, installs dependencies, downloads the\nmodel, and starts an interactive chat.\n\n![run demo](assets/demo-run.gif)\n\n```bash\n# Chat with a model (auto-picks the best GGUF variant)\nwhichllm run \"qwen 2.5 1.5b gguf\"\n\n# Auto-pick the best model for your hardware and chat\nwhichllm run\n\n# CPU-only mode\nwhichllm run \"phi 3 mini gguf\" --cpu-only\n```\n\nWorks with **all model formats**:\n- **GGUF** — via `llama-cpp-python` (lightweight, fast)\n- **AWQ / GPTQ** — via `transformers` + `autoawq` / `auto-gptq`\n- **FP16 / BF16** — via `transformers`\n\nGet a **copy-paste Python snippet** instead:\n\n```bash\nwhichllm snippet \"qwen 7b\"\n```\n\n```python\nfrom llama_cpp import Llama\n\nllm = Llama.from_pretrained(\n    repo_id=\"Qwen/Qwen2.5-7B-Instruct-GGUF\",\n    filename=\"qwen2.5-7b-instruct-q4_k_m.gguf\",\n    n_ctx=4096,\n    n_gpu_layers=-1,\n    verbose=False,\n)\n\noutput = llm.create_chat_completion(\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n)\nprint(output[\"choices\"][0][\"message\"][\"content\"])\n```\n\n## Usage\n\n```bash\n# Auto-detect hardware and show best models\nwhichllm\n\n# Simulate a GPU (e.g. planning a purchase)\nwhichllm --gpu \"RTX 4090\"\nwhichllm --gpu \"RTX 5090\"\n# Specify variant\nwhichllm --gpu \"RTX 5060 16\"\n\n# CPU-only mode\nwhichllm --cpu-only\n\n# More results / filters\nwhichllm --top 20\nwhichllm --quant Q4_K_M\nwhichllm --min-speed 30\nwhichllm --evidence base   # allow id/base-model matches\nwhichllm --evidence strict # id-exact only (same as --direct)\nwhichllm --direct\n\n# JSON output\nwhichllm --json\n\n# Force refresh (ignore cache)\nwhichllm --refresh\n\n# Show hardware info only\nwhichllm hardware\n\n# Plan: what GPU do I need for a specific model?\nwhichllm plan \"llama 3 70b\"\nwhichllm plan \"Qwen2.5-72B\" --quant Q8_0\nwhichllm plan \"mistral 7b\" --context-length 32768\n\n# Upgrade: compare your current machine against candidate GPUs\nwhichllm upgrade \"RTX 4090\" \"RTX 5090\" \"H100\"\nwhichllm upgrade \"Apple M4 Max\" --top 5\n\n# Run: download and chat with a model instantly\nwhichllm run \"qwen 2.5 1.5b gguf\"\nwhichllm run                       # auto-pick best for your hardware\n\n# Snippet: print ready-to-run Python code\nwhichllm snippet \"qwen 7b\"\nwhichllm snippet \"llama 3 8b gguf\" --quant Q5_K_M\n```\n\nJSON model rows include `estimated_tok_per_sec`, `speed_confidence`,\n`speed_range_tok_per_sec`, and `speed_notes`. The speed range is a planning\nrange, not a live benchmark.\n\n## Integrations\n\n### Ollama\n\nUse JSON output to feed scripts that map HuggingFace IDs to your local Ollama\nmodel names:\n\n```bash\n# Pick the top HuggingFace model ID\nwhichllm --top 1 --json | jq -r '.models[0].model_id'\n\n# Find the best coding model ID\nwhichllm --profile coding --top 1 --json | jq -r '.models[0].model_id'\n```\n\nOllama model names do not always match HuggingFace repo IDs, so a small mapping\nstep is usually needed before `ollama run`.\n\n### Shell alias\n\nAdd to your `.bashrc` / `.zshrc`:\n\n```bash\nalias bestllm='whichllm --top 1 --json | jq -r \".models[0].model_id\"'\n# Usage: ollama run $(bestllm)\n```\n\n## Scoring\n\nEach model gets a 0-100 score. Benchmark quality and size form the core;\nevidence confidence and runtime fit then scale it, with speed, source\ntrust, and popularity as adjustments.\n\n| Factor | Effect | Description |\n|--------|--------|-------------|\n| Benchmark quality | core | Merged LiveBench / Artificial Analysis / Aider / Vision / Arena ELO / Open LLM Leaderboard, weighted by source confidence |\n| Model size | up to 35 | `log2`-scaled world-knowledge proxy (MoE uses total params) |\n| Quantization | × penalty | Lower-bit quants discounted multiplicatively |\n| Evidence confidence | ×0.55–1.0 | none / self-reported ×0.55, inherited ×0.78, direct full |\n| Runtime fit | ×0.50–1.0 | partial-offload ×0.72, CPU-only ×0.50 |\n| Speed | -8 to +8 | Usability gate vs a fit-dependent tok/s floor; reported with confidence and range metadata |\n| Source trust | -5 to +5 | Official-org bonus, known-repackager penalty |\n| Popularity | tie-breaker | Downloads/likes; weight shrinks as evidence strengthens |\n\nScore markers:\n- **`~`** (yellow) — No direct benchmark; score inherited/interpolated from the model family\n- **`!sr`** (bright yellow) — Uploader-reported benchmark only, not independently verified\n- **`?`** (red) — No benchmark data available\n\nSpeed markers in `--status`:\n- **`~`** (yellow) — Estimated tok/s range is available\n- **`?`** (red) — Low-confidence speed estimate; backend/runtime sensitivity is high\n\n## Documentation\n\n- [CLI reference](docs/cli.md)\n- [How it works](docs/how-it-works.md)\n- [Scoring](docs/scoring.md)\n- [Hardware detection and simulation](docs/hardware.md)\n- [Run and snippet](docs/run-snippet.md)\n- [Troubleshooting](docs/troubleshooting.md)\n\n## How it works\n\n### Data pipeline\n\n1. **Model fetching** — Fetches popular models from HuggingFace API:\n   - Text-generation (downloads + recently updated)\n   - GGUF-filtered (separate query for coverage)\n   - Vision models (`image-text-to-text`) when `--profile vision` or `any`\n2. **Benchmark sources** — *Current tier* (LiveBench, Artificial Analysis\n   Index, Aider) merged live when reachable, plus a curated multimodal /\n   vision index; *frozen tier* (Open LLM Leaderboard v2, Chatbot Arena\n   ELO). Tiers have separate caps and lineage-aware recency demotion so\n   stale leaderboards stop over-rewarding older generations.\n3. **Benchmark evidence** — Five resolution levels, increasingly discounted:\n   - `direct` — Exact model ID match\n   - `variant` — Suffix-stripped or -Instruct variant\n   - `base_model` — Base model from cardData\n   - `line_interp` — Size-aware interpolation within model family\n   - `self_reported` — Uploader-claimed eval (heavily discounted)\n\n   Inheritance is rejected when a model's params diverge more than 2× from\n   its family's dominant member, catching draft / MTP / abliterated forks\n   that share a `family_id` with a much larger base.\n4. **Cache** — `~/.cache/whichllm/`:\n   - `models.json` — 6h TTL\n   - `benchmark.json` — 24h TTL\n\n### Ranking engine\n\n1. **Hardware detection** — NVIDIA (nvidia-ml-py), AMD (dbgpu/ROCm), Apple Silicon (Metal), CPU cores, RAM, disk\n2. **VRAM estimation** — Weights + KV cache + activation + framework overhead (~500MB)\n3. **Compatibility** — Full GPU / Partial Offload / CPU-only; compute capability and OS checks\n4. **Speed** — tok/s from GPU memory bandwidth, quantization, backend, fit type, and MoE active parameters\n5. **Scoring** — Benchmark (with confidence dampening), size, quantization penalty, fit type, speed, popularity, source trust (official vs repackager)\n6. **Backend filter** — Apple Silicon and CPU-only restrict to GGUF for stability; Linux+NVIDIA allows AWQ/GPTQ\n\n### Project structure\n\n```\nsrc/whichllm/\n├── cli.py              # Typer CLI: main, plan, run, snippet, hardware\n├── constants.py        # GPU bandwidth, quantization bytes, compute capability\n├── hardware/\n│   ├── detector.py     # Orchestrates GPU/CPU/RAM detection\n│   ├── nvidia.py       # NVIDIA GPU via nvidia-ml-py\n│   ├── amd.py          # AMD GPU (Linux)\n│   ├── apple.py        # Apple Silicon (Metal)\n│   ├── cpu.py          # CPU name, cores, AVX support\n│   ├── memory.py       # RAM and disk free\n│   ├── gpu_simulator.py # --gpu flag: synthetic GPU from name\n│   └── types.py        # GPUInfo, HardwareInfo\n├── models/\n│   ├── fetcher.py      # HuggingFace API, model parsing, evalResults\n│   ├── benchmark.py    # Arena ELO, Leaderboard (parquet/rows API)\n│   ├── grouper.py      # Family grouping by base_model and name\n│   ├── cache.py        # JSON cache with TTL\n│   └── types.py        # ModelInfo, GGUFVariant, ModelFamily\n├── engine/\n│   ├── vram.py         # VRAM = weights + KV cache + activation + overhead\n│   ├── compatibility.py# Fit type, disk check, compute/OS warnings\n│   ├── performance.py  # tok/s from bandwidth\n│   ├── quantization.py # Bytes per weight, quality penalty, non-GGUF inference\n│   ├── ranker.py       # Scoring, evidence filter, profile/match\n│   └── types.py        # CompatibilityResult\n└── output/\n    └── display.py      # Rich table, JSON output, hardware/plan displays\n```\n\n## Development\n\n```bash\ngit clone https://github.com/Andyyyy64/whichllm.git\ncd whichllm\nuv sync --dev\nuv run whichllm\nuv run pytest\n```\n\n## Contributing\n\nContributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## Support\n\nIf whichllm helped you find a model or avoid a bad hardware guess,\nsponsoring is appreciated. It helps keep the project maintained: hardware\nreports, packaging, test fixtures, benchmark updates, and support for more\nmachines.\n\nwhichllm will stay open-source either way. Issues and PRs are always welcome.\n\nUseful? A GitHub star helps other people find it, and I'd genuinely like to\nknow what it picked for your rig. Drop it in [Issues](https://github.com/Andyyyy64/whichllm/issues).\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=Andyyyy64/whichllm\u0026type=Date)](https://www.star-history.com/#Andyyyy64/whichllm\u0026Date)\n\n## Requirements\n\n- Python 3.11+\n- NVIDIA GPU detection via `nvidia-ml-py` (included by default)\n- AMD / Apple Silicon detected automatically\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAndyyyy64%2Fwhichllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAndyyyy64%2Fwhichllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAndyyyy64%2Fwhichllm/lists"}