{"id":44384050,"url":"https://github.com/kmesiab/qwenvert","last_synced_at":"2026-02-21T15:20:12.214Z","repository":{"id":337469548,"uuid":"1153774982","full_name":"kmesiab/qwenvert","owner":"kmesiab","description":"One click to configure Claude Code to work with local Qwen models","archived":false,"fork":false,"pushed_at":"2026-02-15T01:08:33.000Z","size":499,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-15T01:46:49.620Z","etag":null,"topics":["claude","claude-code","coding-agents","copilot","hugging-face","llm","llm-inference"],"latest_commit_sha":null,"homepage":"https://kmesiab.github.io/qwenvert/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kmesiab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-09T16:53:02.000Z","updated_at":"2026-02-15T00:19:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/kmesiab/qwenvert","commit_stats":null,"previous_names":["kmesiab/qwenvert"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/kmesiab/qwenvert","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmesiab%2Fqwenvert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmesiab%2Fqwenvert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmesiab%2Fqwenvert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmesiab%2Fqwenvert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kmesiab","download_url":"https://codeload.github.com/kmesiab/qwenvert/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kmesiab%2Fqwenvert/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29684395,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-21T14:31:22.911Z","status":"ssl_error","status_checked_at":"2026-02-21T14:31:22.570Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["claude","claude-code","coding-agents","copilot","hugging-face","llm","llm-inference"],"created_at":"2026-02-12T00:21:04.913Z","updated_at":"2026-02-21T15:20:12.205Z","avatar_url":"https://github.com/kmesiab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Qwenvert\n\n**Run Claude Code with a local LLM on your Mac. Keep your code private.**\n\n[![PyPI version](https://badge.fury.io/py/qwenvert.svg)](https://badge.fury.io/py/qwenvert)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Python Version](https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-blue)](https://www.python.org/downloads/)\n\nQwenvert lets you use Claude Code CLI with a completely local LLM (Qwen2.5-Coder) instead of Anthropic's API. Your code never leaves your machine.\n\n```text\n┌─────────────┐     ┌──────────────┐     ┌─────────────┐\n│ Claude Code │ --\u003e │   Qwenvert   │ --\u003e │ Local Qwen  │\n│     CLI     │     │   (adapter)  │     │    Model    │\n└─────────────┘     └──────────────┘     └─────────────┘\n                         :8088              (via Ollama)\n```\n\n**Why?** Privacy. Security. Compliance. Zero inference costs. No internet required.\n\n---\n\n## ⚡ 5-Minute Quick Start\n\n### 1. Install\n\n**Requirements:**\n- Mac with M1/M2/M3 chip (8GB RAM minimum)\n- Python 3.9-3.12 (check: `python3 --version`)\n- [Ollama](https://ollama.ai) or llama.cpp\n\n**Install from PyPI:**\n```bash\npip install qwenvert\n```\n\n**Or install from source:**\n```bash\ngit clone https://github.com/kmesiab/qwenvert.git\ncd qwenvert\npip install -e .\n```\n\n\u003e **macOS Users (Python 3.11+):** If you see an \"externally-managed-environment\" error, you have two options:\n\u003e\n\u003e **Option 1 (Recommended for development):**\n\u003e ```bash\n\u003e git clone https://github.com/kmesiab/qwenvert.git\n\u003e cd qwenvert\n\u003e make venv           # Creates .venv virtual environment\n\u003e source .venv/bin/activate\n\u003e make install-dev    # Installs qwenvert + dev dependencies\n\u003e ```\n\u003e\n\u003e **Option 2 (Recommended for end users):**\n\u003e ```bash\n\u003e pipx install qwenvert  # Installs in isolated environment\n\u003e # Install pipx first if needed: brew install pipx\n\u003e ```\n\u003e\n\u003e This is due to [PEP 668](https://peps.python.org/pep-0668/) which protects system Python on modern macOS.\n\n### 2. Setup (One Command - Zero Friction!)\n\n```bash\nqwenvert init\n```\n\nThis will **automatically** (no prompts!):\n- ✅ Detect your hardware (chip, RAM, cooling)\n- ✅ Install llama-server binary if needed (~50MB)\n- ✅ Pick the best model for your Mac\n- ✅ Download the model from HuggingFace (~4GB)\n- ✅ Configure everything automatically\n\n**First run takes 2-5 minutes** (downloads binaries \u0026 models). Subsequent runs are instant.\n\n**Example output:**\n```text\nQwenvert Initialization\n\n✓ Detected: M1 Pro, 16GB RAM, 16 GPU cores, Active cooling\n✓ Selected: Qwen2.5 Coder 7B Q5\n✓ Downloading from HuggingFace...\n✓ Model downloaded: ~/.qwenvert/models/qwen25-coder-7b-q5.gguf (4.2GB)\n✓ Configuration saved: ~/.config/qwenvert/config.yaml\n\nNext step: qwenvert start\n```\n\n### 3. Start Qwenvert\n\n```bash\nqwenvert start\n```\n\n**You'll see:**\n```text\nStarting Qwenvert\n\n✓ Backend: Ollama with qwen2.5-coder:7b\n✓ Backend server: http://localhost:11434 (healthy)\n✓ Qwenvert adapter: http://localhost:8088\n✓ Ready for Claude Code!\n\nConfigure Claude Code:\n  export ANTHROPIC_BASE_URL=http://localhost:8088\n  export ANTHROPIC_API_KEY=local-qwen\n  export ANTHROPIC_MODEL=qwenvert-default\n```\n\nLeave this terminal running.\n\n**Missing Dependencies?** If Ollama isn't installed, qwenvert will offer to install it automatically:\n\n```bash\nqwenvert start\n```\n\nYou'll see:\n```text\n======================================================================\n  Missing Dependency: Ollama\n======================================================================\n\nOllama is not installed (required for running local models)\n\nTo install Ollama using Homebrew:\n  1. Run: brew install ollama\n  2. Wait for installation to complete\n  3. Run: qwenvert init\n\nLearn more: https://ollama.ai\n\n======================================================================\n\nWould you like to install Ollama automatically using Homebrew? [Y/n]:\n```\n\n**Non-interactive mode:**\n```bash\nqwenvert start --auto-install\n```\n\nAutomatically installs missing dependencies via Homebrew without prompting.\n\n\u003e **Note:** Auto-installation only works for supported dependencies (Ollama, llama.cpp) when Homebrew is available.\n\n### 4. Configure Claude Code (New Terminal)\n\n```bash\nexport ANTHROPIC_BASE_URL=http://localhost:8088\nexport ANTHROPIC_API_KEY=local-qwen\nexport ANTHROPIC_MODEL=qwenvert-default\n\nclaude\n```\n\n**That's it!** Claude Code now uses your local model. Your code stays on your machine.\n\n### What Just Happened?\n\n**Without qwenvert (default):**\n```\nClaude Code → api.anthropic.com → Claude Sonnet/Opus\n              (internet)           (cloud)\n              💰 Costs money      ☁️ Code leaves machine\n```\n\n**With qwenvert (configured):**\n```\nClaude Code → localhost:8088 → Ollama → Qwen Model\n              (no internet)     (local)  (your Mac)\n              💰 Free            🔒 Code stays local\n```\n\nClaude Code doesn't know the difference - it just uses whatever `ANTHROPIC_BASE_URL` points to!\n\n---\n\n## 📖 How to Use\n\n### Basic Workflow\n\n```bash\n# Start qwenvert (terminal 1)\nqwenvert start\n\n# Use Claude Code (terminal 2)\nexport ANTHROPIC_BASE_URL=http://localhost:8088\nexport ANTHROPIC_API_KEY=local-qwen\nexport ANTHROPIC_MODEL=qwenvert-default\nclaude\n\n# When done, stop qwenvert\nqwenvert stop\n```\n\n### Make Environment Variables Permanent\n\nAdd to your `~/.zshrc` or `~/.bashrc`:\n\n```bash\n# Qwenvert - Local Claude Code\nexport ANTHROPIC_BASE_URL=http://localhost:8088\nexport ANTHROPIC_API_KEY=local-qwen\nexport ANTHROPIC_MODEL=qwenvert-default\n```\n\nThen reload: `source ~/.zshrc`\n\nNow `claude` will automatically use qwenvert!\n\n### Verify Claude Code is Using Qwenvert\n\nAfter setting environment variables, verify the setup:\n\n```bash\n# Check environment variables are set\necho $ANTHROPIC_BASE_URL\n# Should show: http://localhost:8088\n\necho $ANTHROPIC_API_KEY\n# Should show: local-qwen\n\necho $ANTHROPIC_MODEL\n# Should show: qwenvert-default\n\n# Make sure qwenvert is running\ncurl http://localhost:8088/health\n# Should return: {\"status\":\"healthy\",\"backend\":\"connected\"}\n\n# Test with Claude Code\nclaude\n# In Claude Code, ask: \"What model are you?\"\n# It should respond as Qwen2.5-Coder (though it might say Claude)\n```\n\n**How to tell it's working:**\n- ✅ Claude Code starts without asking for an API key\n- ✅ Responses come quickly (no network delay)\n- ✅ `qwenvert monitor` shows requests appearing\n- ✅ Works offline (disconnect wifi and try)\n\n**If it's NOT working:**\n- ❌ \"Invalid API key\" error → Check `ANTHROPIC_API_KEY=local-qwen`\n- ❌ \"Connection refused\" → Check `ANTHROPIC_BASE_URL` and qwenvert is running\n- ❌ \"Model not found\" → Check `ANTHROPIC_MODEL=qwenvert-default`\n\n---\n\n## 🎯 Common Commands\n\n### Check Status\n\n```bash\nqwenvert status\n```\n\n**Output:**\n```\nQwenvert Status\n\nConfiguration\n  Model:              qwen2.5-coder-7b-q5\n  Backend:            ollama\n  Backend URL:        http://localhost:11434\n  Adapter:            http://localhost:8088\n  Context Length:     32,768 tokens\n\nServer Health:\n  Backend:  ✓ Running\n  Adapter:  ✓ Running\n```\n\n### Monitor Performance (Optional)\n\n```bash\nqwenvert monitor\n```\n\nShows a live dashboard with:\n- Requests per second\n- Token generation speed\n- System resources (CPU, memory, temp)\n- Recent request history\n\n**OpenTelemetry Support**: The monitor now uses OpenTelemetry-compliant metrics. Enable OTLP export for integration with observability platforms:\n\n```bash\n# Enable with local OTLP collector (secure)\nexport OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317\nqwenvert monitor --enable-otel\n```\n\nSee [TELEMETRY_SECURITY.md](./TELEMETRY_SECURITY.md) for complete security details.\n\nPress `Ctrl+C` to exit.\n\n### Binary Management Commands\n\n**Check llama-server installation:**\n```bash\nqwenvert binary info\n```\n\n**Output:**\n```text\n┌──────────────┬────────────────────────────────────────┐\n│ Property     │ Value                                  │\n├──────────────┼────────────────────────────────────────┤\n│ Path         │ ~/.cache/qwenvert/bin/llama-server     │\n│ Version      │ b3600                                  │\n│ Source       │ downloaded                             │\n│ Architecture │ arm64                                  │\n│ Valid        │ ✓ Yes                                  │\n└──────────────┴────────────────────────────────────────┘\n```\n\n**List available versions:**\n```bash\nqwenvert binary list\n```\n\n**Install specific version:**\n```bash\nqwenvert binary install --version b3600\n```\n\n**Update to latest:**\n```bash\nqwenvert binary update\n```\n\n**Verify integrity:**\n```bash\nqwenvert binary verify\n```\n\n**Rollback to backup:**\n```bash\nqwenvert binary rollback\n```\n\n### Detect Available Backends\n\n```bash\nqwenvert backends\n```\n\nShows which backends (llama.cpp, Ollama) are available on your system and recommends the fastest option.\n\n### List Available Models\n\n```bash\nqwenvert models list\n```\n\n**Output:**\n```\nAvailable Models\n\nID                           Size    RAM    Context\nqwen2.5-coder-7b-q4          4.1GB   8GB    32K\nqwen2.5-coder-7b-q5          4.8GB   16GB   32K\nqwen2.5-coder-14b-q4         8.5GB   16GB   32K\nqwen2.5-coder-14b-q5         10GB    32GB   32K\n```\n\n### Clean Up Downloaded Models\n\nRemove downloaded model files to free disk space:\n\n```bash\n# Interactive selection\nqwenvert models clean\n\n# Remove specific model\nqwenvert models clean --model-id qwen2.5-coder-7b-instruct-q4_k_m.gguf\n\n# Remove all models (with confirmation)\nqwenvert models clean --all\n\n# Preview what would be deleted (dry run)\nqwenvert models clean --dry-run\n```\n\n**Example output:**\n```\nModel Cleanup\n\nModels disk usage: 12.3 GB\nAvailable disk space: 45.2 GB\n\nDownloaded models:\n\n  1. qwen2.5-coder-7b-instruct-q4_k_m.gguf (4.2 GB)\n  2. qwen2.5-coder-14b-instruct-q5_k_m.gguf (8.1 GB)\n  3. All models\n  4. Cancel\n\nEnter number(s) separated by commas: 1\n\nModels to be deleted:\n\nFilename                                   Size\nqwen2.5-coder-7b-instruct-q4_k_m.gguf     4.2 GB\n\nTotal space to free: 4.2 GB\n\nDelete these models? [y/N]: y\n\n✓ Cleanup complete! Deleted 1 model(s), freed 4.2 GB\n```\n\n### Check Your Hardware\n\n```bash\nqwenvert hardware\n```\n\n**Output:**\n```\nHardware Information\n\nChip:               M1 Pro\nTotal Memory:       16GB\nGPU Cores:          16\nPerformance Cores:  8\nCooling:            Active (fan)\nRecommended:        32K tokens context\n```\n\n---\n\n## 📦 Dependencies \u0026 Auto-Installation\n\n### Required Dependencies\n\nQwenvert requires one of these backends to run:\n\n- **Ollama** (recommended) - Easy to install via Homebrew: `brew install ollama`\n- **llama.cpp** - Manual build required, see [llama.cpp docs](https://github.com/ggerganov/llama.cpp)\n\n### Supported Auto-Install Dependencies\n\nWhen you run `qwenvert start`, it automatically detects missing dependencies and offers to install them via Homebrew. The following dependencies support auto-installation:\n\n| Dependency | Package Name | Installation Command |\n|------------|--------------|---------------------|\n| Ollama | `ollama` | `brew install ollama` |\n| llama.cpp | `llama.cpp` | (Not yet supported for auto-install) |\n\n\u003e **Security Note:** Auto-installation only works for whitelisted dependencies defined in `ALLOWED_AUTO_INSTALL_DEPENDENCIES`. This prevents accidental installation of arbitrary packages.\n\n### Auto-Install Modes\n\n**Interactive (default):**\n```bash\nqwenvert start\n# Prompts: \"Would you like to install Ollama automatically using Homebrew? [Y/n]:\"\n```\n\n**Non-interactive (CI/automation):**\n```bash\nqwenvert start --auto-install\n# Automatically installs without prompting\n```\n\n**Manual installation (traditional):**\n```bash\n# Install Ollama manually\nbrew install ollama\n\n# Then start qwenvert\nqwenvert start\n```\n\n### Checking Dependencies\n\nTo check if dependencies are installed, qwenvert automatically detects them when you run commands. You can also manually check:\n\n```bash\nwhich ollama        # Check if Ollama is in PATH\nollama --version    # Verify Ollama version\n```\n\n### Adding More Dependencies\n\nCurrently, only Ollama and llama.cpp are supported as backends. Other dependencies (like Homebrew itself) require manual installation.\n\nIf you need support for additional backends, please [open an issue](https://github.com/kmesiab/qwenvert/issues).\n\n---\n\n## 🔧 Advanced Usage\n\n### Use a Specific Model\n\n```bash\n# List models\nqwenvert models list\n\n# Re-initialize with different model\nqwenvert init --model qwen2.5-coder-14b-q5\n\n# Restart\nqwenvert stop\nqwenvert start\n```\n\n### Use llama.cpp Instead of Ollama\n\n```bash\n# Initialize with llama.cpp backend\nqwenvert init --backend llamacpp\n\n# Start (same command)\nqwenvert start\n```\n\n**Why llama.cpp?**\n- More control over inference parameters\n- Slightly faster on some Macs\n- Lower memory overhead\n\n**Why Ollama?** (default)\n- Easier to install\n- Better model management\n- More beginner-friendly\n\n### Custom Context Length\n\n```bash\n# Longer context = more memory\nqwenvert init --context-length 65536  # 64K tokens\n\n# Shorter context = less memory\nqwenvert init --context-length 16384  # 16K tokens\n```\n\n**Rule of thumb:**\n- 8GB Mac: 16K max\n- 16GB Mac: 32K safe\n- 32GB+ Mac: 64K works\n\n---\n\n## ❓ Troubleshooting\n\n### \"Connection refused\" when starting Claude Code\n\n**Check if qwenvert is running:**\n```bash\ncurl http://localhost:8088/health\n```\n\n**Should return:**\n```json\n{\"status\": \"healthy\", \"backend\": \"connected\"}\n```\n\n**If not running:**\n```bash\nqwenvert start\n```\n\n---\n\n### Model download fails\n\n**Problem:** HuggingFace download interrupted\n\n**Solution:**\n```bash\n# Try again (downloads resume automatically)\nqwenvert init\n\n# Or download manually and place in ~/.qwenvert/models/\n```\n\n---\n\n### Slow response times\n\n**Check memory usage:**\n```bash\nqwenvert status\n```\n\n**Solutions:**\n1. **Use smaller model:**\n   ```bash\n   qwenvert init --model qwen2.5-coder-7b-q4\n   ```\n\n2. **Reduce context length:**\n   ```bash\n   qwenvert init --context-length 16384\n   ```\n\n3. **Close other apps** to free RAM\n\n**Expected speeds:**\n- 8GB Mac: 15-20 tokens/sec\n- 16GB Mac: 25-35 tokens/sec\n- 32GB+ Mac: 30-40 tokens/sec\n\n---\n\n### MacBook Air overheating\n\n**Enable thermal pacing:**\n\nEdit `~/.config/qwenvert/config.yaml`:\n```yaml\nthermal_pacing: true\nthermal_threshold: 70  # Celsius\n```\n\nOr re-run init with thermal protection:\n```bash\nqwenvert init --thermal-pacing\n```\n\n---\n\n### Can't install - Python version error\n\n**Problem:** Python 3.13 not supported yet\n\n**Solution:** Use Python 3.12 or earlier\n```bash\n# Check version\npython3 --version\n\n# Install Python 3.12 via Homebrew\nbrew install python@3.12\n\n# Use it\npip3.12 install -e .\n```\n\n---\n\n### Environment variables not persisting\n\n**Problem:** Variables reset when you close terminal\n\n**Solution:** Add to shell config\n\n```bash\n# Open your shell config\nnano ~/.zshrc  # or ~/.bashrc for bash\n\n# Add these lines\nexport ANTHROPIC_BASE_URL=http://localhost:8088\nexport ANTHROPIC_API_KEY=local-qwen\nexport ANTHROPIC_MODEL=qwenvert-default\n\n# Save and reload\nsource ~/.zshrc\n```\n\n---\n\n### \"externally-managed-environment\" error on install\n\n**Problem:** `pip install` fails with error about externally managed environment\n\n**macOS Python 3.11+ Context:** Apple now protects system Python to prevent breaking macOS tools. This is [PEP 668](https://peps.python.org/pep-0668/) in action.\n\n**Solution 1 - Virtual Environment (Recommended for development):**\n```bash\n# Clone the repository\ngit clone https://github.com/kmesiab/qwenvert.git\ncd qwenvert\n\n# Create and activate virtual environment\nmake venv\nsource .venv/bin/activate\n\n# Install\nmake install-dev\n```\n\n**Solution 2 - pipx (Recommended for end users):**\n```bash\n# Install pipx if needed\nbrew install pipx\n\n# Install qwenvert in isolated environment\npipx install qwenvert\n```\n\n**Solution 3 - Disable protection (NOT recommended):**\n```bash\n# This breaks the system protection - avoid unless you know what you're doing\npip install qwenvert --break-system-packages\n```\n\n**Why virtual environments?**\n- Isolated dependencies (won't conflict with other projects)\n- Easy to delete and recreate if something breaks\n- Standard Python best practice\n- Doesn't require disabling system protections\n\n---\n\n## 🔒 Privacy \u0026 Security\n\n### What Data Stays Local?\n\n**Everything.** Qwenvert is designed for security-conscious developers.\n\n✅ **Your code** - Never sent to any server\n✅ **Prompts** - Processed only on your Mac\n✅ **Responses** - Generated locally\n✅ **Model weights** - Stored in `~/.qwenvert/models/`\n\n### How We Guarantee This\n\n1. **Localhost-only binding** - Adapter listens on `127.0.0.1` only (not accessible from network)\n2. **No external calls** - Code explicitly blocks external connections\n3. **Telemetry security** - All telemetry exporters disabled by default; OTLP endpoints validated to be localhost-only (see [TELEMETRY_SECURITY.md](./TELEMETRY_SECURITY.md))\n4. **Test-proven** - 23 dedicated security tests verify isolation and telemetry safety\n5. **Transparent code** - Full source available for audit\n\n**Perfect for:**\n- HIPAA/SOC2 compliance\n- Proprietary code bases\n- Air-gapped development\n- Security research\n- Offline work\n\n---\n\n## 📊 Performance Expectations\n\n### What to Expect\n\n| Mac Type | Model | Speed | Memory | Context |\n|----------|-------|-------|--------|---------|\n| 8GB M1 (Air) | 7B Q4 | 15-20 t/s | ~4GB | 16K tokens |\n| 16GB M1 Pro | 7B Q5 | 25-35 t/s | ~6GB | 32K tokens |\n| 32GB M1 Max | 14B Q5 | 20-30 t/s | ~12GB | 64K tokens |\n\n**t/s** = tokens per second\n\n### Compared to Cloud APIs\n\n| Feature | Qwenvert | Claude API |\n|---------|----------|------------|\n| Speed | 20-35 t/s | 40-60 t/s |\n| Latency | ~0ms (local) | 100-300ms (network) |\n| Cost | $0/month | $15-300/month |\n| Privacy | 100% local | Cloud |\n| Offline | ✅ Yes | ❌ No |\n| Code quality | Good | Excellent |\n\n**Best for:** Security/privacy-critical work, cost-sensitive projects, offline development\n\n**Not ideal for:** Highest code quality, fastest possible responses\n\n---\n\n## 🎓 Understanding Qwenvert\n\n### What Is It?\n\nQwenvert is an **HTTP adapter** that sits between Claude Code CLI and your local LLM:\n\n```\nClaude Code → Qwenvert → Ollama/llama.cpp → Qwen Model\n```\n\n**Not just config** - It's a full translation layer:\n- Translates Anthropic API → Ollama/llama.cpp format\n- Converts responses back to Anthropic format\n- Handles streaming (Server-Sent Events)\n- Manages backend processes\n- Monitors performance\n\n### Why Not Use Ollama Directly?\n\nOllama has basic Anthropic API support, but:\n- ❌ Limited streaming support\n- ❌ Missing some API features\n- ❌ No thermal management\n- ❌ No hardware optimization\n- ❌ Can't switch backends easily\n\nQwenvert provides:\n- ✅ Full Anthropic Messages API\n- ✅ Works with Ollama **or** llama.cpp\n- ✅ Thermal monitoring for MacBook Air\n- ✅ Hardware-aware model selection\n- ✅ Easy to extend with new backends\n\n### Performance \u0026 Backend Comparison\n\nQwenvert supports two backends for running local LLMs: **llama.cpp** (default, faster) and **Ollama** (easier setup). Based on extensive research and benchmarking, **llama.cpp is 3-7x faster** than Ollama for local inference on Apple Silicon.\n\n#### Benchmark Results\n\n| Backend | Throughput | Performance vs Ollama | Best For |\n|---------|-----------|----------------------|----------|\n| **llama.cpp** | ~150 tok/s | **Baseline (fastest)** | Production, performance-critical |\n| Ollama | 20-40 tok/s | 3-7x slower | Quick testing, simple setup |\n| MLX¹ | ~230 tok/s | 1.5x faster than llama.cpp | Advanced users, Python integration |\n\n*Benchmarks from [Comparative Study (2025)](https://arxiv.org/pdf/2511.05502)*\n\n¹ MLX (Apple's ML framework) is fastest but not yet integrated into qwenvert.\n\n#### Why llama.cpp is Faster\n\nllama.cpp provides **direct Metal GPU acceleration** for Apple Silicon, while Ollama adds a Go wrapper layer that introduces overhead:\n\n- **Metal Acceleration**: 2.4x speedup over CPU-only inference ([source](https://github.com/ggml-org/llama.cpp/discussions/4167))\n- **Optimized for Apple Silicon**: Full GPU layer offload (`-ngl 99`)\n- **Continuous Batching**: Better throughput for multiple requests\n- **Lower Memory Overhead**: Direct model access without wrapper\n\n#### Apple Silicon Performance by Model\n\n| Mac Model | RAM | Model Size | llama.cpp Throughput | Expected Response Time |\n|-----------|-----|------------|---------------------|----------------------|\n| M1 Air | 8GB | 1.5B Q4 | 30-40 tok/s | 1-2 seconds |\n| M1 Pro | 16GB | 7B Q4 | 28-35 tok/s | 2-4 seconds |\n| M2 Max | 32GB | 14B Q4 | 22-30 tok/s | 3-5 seconds |\n| M3 Pro/Max | 18GB+ | 7B Q4 | 28-35 tok/s | 2-3 seconds |\n\n*Performance data from [llama.cpp Apple Silicon benchmarks](https://github.com/ggml-org/llama.cpp/discussions/4167)*\n\n#### Research-Backed Optimizations\n\nQwenvert's llama.cpp configuration includes several **research-backed optimizations** for Apple Silicon:\n\n- **Full GPU Offload** (`-ngl 99`): Offloads all model layers to Metal GPU\n- **Performance Core Threading**: Uses only P-cores (E-cores reduce performance)\n- **Continuous Batching** (`--cont-batching`): Improves throughput for concurrent requests\n- **Metal Split Mode**: Optimizes VRAM usage for 7B+ models on limited memory\n\n**Removed flags** (based on community research):\n- ❌ `--mlock`: Causes crashes on macOS ([issue #18152](https://github.com/ggml-org/llama.cpp/issues/18152))\n- ❌ `-fa` (flash attention): Can slow generation unless KV cache fits in VRAM ([discussion](https://github.com/ggml-org/llama.cpp/discussions/15650))\n\n#### Choosing a Backend\n\n**Use llama.cpp (default)** if:\n- ✅ You want maximum performance (3-7x faster)\n- ✅ You're comfortable with command-line tools\n- ✅ You need the best inference speed\n\n**Use Ollama** if:\n- ✅ You prefer simpler setup (one-line install)\n- ✅ You already have Ollama installed\n- ✅ Performance is not critical\n\nTo switch backends:\n```bash\nqwenvert init --backend ollama    # Use Ollama\nqwenvert init --backend llamacpp  # Use llama.cpp (default)\n```\n\n### Architecture\n\n```text\n┌─────────────────────────────────────────────────────────────┐\n│                      Claude Code CLI                         │\n└────────────────────────┬────────────────────────────────────┘\n                         │\n                    POST /v1/messages\n                         │\n┌────────────────────────▼────────────────────────────────────┐\n│                 Qwenvert HTTP Adapter                        │\n│                     (localhost:8088)                         │\n│  • Validates requests                                        │\n│  • Translates Anthropic → Backend format                    │\n│  • Handles streaming (SSE)                                   │\n│  • Monitors performance                                      │\n└────────────────────────┬────────────────────────────────────┘\n                         │\n                Backend-specific API\n                         │\n┌────────────────────────▼────────────────────────────────────┐\n│              Ollama or llama.cpp Server                      │\n│                  (localhost:11434 or :8080)                  │\n└────────────────────────┬────────────────────────────────────┘\n                         │\n                  ┌──────▼───────┐\n                  │  Qwen Model  │\n                  │    (GGUF)    │\n                  └──────────────┘\n```\n\n---\n\n## 🚀 Next Steps\n\n### After Installation\n\n1. **Optimize for your use case:**\n   - Heavy coding? Use Q5 quantization for better quality\n   - Low RAM? Use Q4 quantization to save memory\n   - Need speed? Use llama.cpp backend\n\n2. **Set up convenience aliases:**\n   ```bash\n   # Add to ~/.zshrc\n   alias qw-start='qwenvert start'\n   alias qw-stop='qwenvert stop'\n   alias qw-status='qwenvert status'\n   ```\n\n3. **Monitor performance:**\n   ```bash\n   qwenvert monitor\n   ```\n\n4. **Read advanced docs:**\n   - [ARCHITECTURE.md](./ARCHITECTURE.md) - How it works\n   - [SIMPLIFIED_ARCHITECTURE.md](./docs/SIMPLIFIED_ARCHITECTURE.md) - Beginner-friendly overview\n\n---\n\n## 💡 Tips \u0026 Best Practices\n\n### For Best Performance\n\n1. **Close other apps** when running inference\n2. **Use appropriate model size** for your RAM\n3. **Monitor temperature** on MacBook Air (use `qwenvert monitor`)\n4. **Don't use Rosetta** - qwenvert is native Apple Silicon\n\n### For Best Code Quality\n\n1. **Use Q5 quantization** if you have 16GB+ RAM\n2. **Give it more context** - longer prompts = better results\n3. **Be specific** in your prompts (same as with Claude)\n4. **Iterate** - local models benefit from refinement\n\n### For Development\n\n1. **Keep qwenvert running** in a dedicated terminal\n2. **Check logs** if something seems wrong: `qwenvert status`\n3. **Update models** periodically - new versions improve quality\n4. **Share feedback** - open issues for bugs/improvements\n\n---\n\n## 📊 Performance Benchmarks\n\nMeasure qwenvert performance on your Mac:\n\n```bash\n# Start qwenvert\nqwenvert start\n\n# Run benchmarks (separate terminal)\nmake benchmark\n```\n\n**What it tests:**\n- Different prompt lengths (short, medium, long)\n- Streaming vs non-streaming\n- Different token limits (50, 100, 200)\n- Code generation tasks\n\n**Metrics:**\n- Latency (ms)\n- Throughput (tokens/sec)\n- Time to first token (TTFT)\n- Success rate\n\n**Example output:**\n```\n┌────────────────┬─────────┬──────┬─────────┬────────┬─────────┬────────┐\n│ Benchmark      │ Backend │ Quant│ Latency │ Tokens │ Speed   │ Status │\n├────────────────┼─────────┼──────┼─────────┼────────┼─────────┼────────┤\n│ prompt_short   │ ollama  │ Q4_K │ 1234ms  │ 5      │ 4.1 t/s │   ✓    │\n│ prompt_medium  │ ollama  │ Q4_K │ 2456ms  │ 89     │ 36.2t/s │   ✓    │\n└────────────────┴─────────┴──────┴─────────┴────────┴─────────┴────────┘\n\nSummary:\n  Average latency: 1845ms\n  Average throughput: 32.4 tokens/sec\n```\n\nResults saved to `benchmarks/results/` for tracking over time.\n\nSee [benchmarks/README.md](./benchmarks/README.md) for details.\n\n---\n\n## 🤝 Contributing\n\nWe welcome contributions! Areas where help is needed:\n\n- **Model support** - Add Qwen3-Coder, other model families\n- **Backend support** - MLX, vLLM, TensorRT-LLM\n- **Performance** - Optimization for specific Mac models\n- **Testing** - More edge cases, hardware configurations\n- **Documentation** - Tutorials, examples, translations\n\nSee [CONTRIBUTING.md](./CONTRIBUTING.md) for guidelines.\n\n---\n\n## 📚 More Documentation\n\n- **[ARCHITECTURE.md](./ARCHITECTURE.md)** - System design and component details\n- **[TELEMETRY_SECURITY.md](./TELEMETRY_SECURITY.md)** - OpenTelemetry security guarantees and configuration\n- **[SIMPLIFIED_ARCHITECTURE.md](./docs/SIMPLIFIED_ARCHITECTURE.md)** - Beginner-friendly architecture overview\n- **[AGENTS.md](./AGENTS.md)** - AI agents for development and security auditing\n- **[TASKS.md](./TASKS.md)** - Development roadmap and task tracking\n- **[tests/](./tests/)** - Test suite with 23 dedicated security tests\n\n---\n\n## 🙏 Acknowledgments\n\n- **Qwen Team** (Alibaba) - Excellent Qwen2.5-Coder models\n- **Apple ML Team** - Metal acceleration, unified memory\n- **llama.cpp community** - High-performance inference engine\n- **Ollama team** - Making local LLMs accessible\n- **Anthropic** - Claude Code CLI and Messages API\n\n---\n\n## 📝 License\n\nApache 2.0 License - see [LICENSE](./LICENSE)\n\n---\n\n## ⚠️ Limitations \u0026 Disclaimers\n\n### Known Limitations\n\n- **Mac only** - Designed for M1/M2/M3 Macs (Intel/Windows not supported)\n- **Python 3.9-3.12** - Python 3.13 not yet compatible\n- **Large downloads** - Models are 4-10GB (one-time download)\n- **Code quality** - Good, but not as good as Claude Opus/Sonnet\n- **First run slow** - Model loading takes 10-30 seconds\n\n### Not Affiliated\n\nQwenvert is an independent project and is not affiliated with, endorsed by, or supported by Anthropic. Claude Code is a trademark of Anthropic.\n\n---\n\n## 📖 Research \u0026 Methodology\n\nThis project implements research-backed development practices for AI agent collaboration:\n\n### Repository-Level Instructions\n\nOur [AGENTS.md](./AGENTS.md) file follows findings from:\n\n\u003e **\"Repository-Level Instructions Enhance AI Assistant Completion and Efficiency\"**\n\u003e Li et al., 2025. arXiv:2601.20404\n\u003e https://arxiv.org/abs/2601.20404\n\n**Key findings from the research:**\n- **28.64% reduction** in AI agent task completion time\n- **16.58% reduction** in token usage\n- Repository-level instructions significantly improve code generation accuracy\n\n**How we apply it:**\n- Structured project conventions in [AGENTS.md](./AGENTS.md)\n- Security-critical rules documented upfront\n- File modification requirements clearly specified\n- Specialized agent catalog with use cases\n\nThis approach makes qwenvert development more efficient and maintainable when working with AI coding assistants like Claude Code.\n\n---\n\n**Questions? Issues? Feedback?**\n\nOpen an issue: https://github.com/kmesiab/qwenvert/issues\n\n---\n\n**Built with care for the Mac M1 community** 🚀\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkmesiab%2Fqwenvert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkmesiab%2Fqwenvert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkmesiab%2Fqwenvert/lists"}