Run Qwen3-Coder locally with llama.cpp + Unsloth GGUFs
- Host: GitHub
- URL: https://github.com/rajeevbarnwal/qwen3-coder-local-runner
- Owner: rajeevbarnwal
- Created: 2025-07-25T08:04:50.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-07-25T10:41:24.000Z (2 months ago)
- Last Synced: 2025-07-25T15:23:01.404Z (2 months ago)
- Topics: ai, coding, llamacpp, openai, qwen3
- Language: Jupyter Notebook
- Homepage:
- Size: 15.6 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Qwen3-Coder-Local-Runner
A guide and scripts to run Qwen3-Coder-480B-A35B locally with llama.cpp + Unsloth GGUFs.
> Run Qwen3-Coder-480B-A35B locally using llama.cpp + Unsloth Dynamic GGUFs

[Build Status](https://github.com/rajeevbarnwal/Qwen3-Coder-Local-Runner/actions) · [Stargazers](https://github.com/rajeevbarnwal/Qwen3-Coder-Local-Runner/stargazers) · [Forks](https://github.com/rajeevbarnwal/Qwen3-Coder-Local-Runner/network/members) · [License](./LICENSE) · [Model on Hugging Face](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF)

---
## About This Project
This repo showcases how to run the powerful **Qwen3-Coder-480B-A35B** locally using `llama.cpp` and Unsloth's optimized **GGUF** models. It includes:
- Setup scripts for llama.cpp
- Model download from Hugging Face
- Example inference commands
- Tool-calling demo
- Extended context (1M tokens) config
- Performance optimization tips

> Related blog post: [How I Ran Qwen3-Coder Locally](https://medium.com/@rajeevbarnwal)
---
## Setup Instructions
### 1. Install Prerequisites
```bash
sudo apt update
sudo apt install build-essential cmake curl pciutils libcurl4-openssl-dev -y
```

### 2. Clone & Build llama.cpp
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake . -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build build --config Release -j --target llama-cli llama-gguf-split
cp build/bin/llama-* .
```

> Tip: use `-DGGML_CUDA=OFF` for a CPU-only build.
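Before moving on, you can sanity-check the freshly built binary from Python. A minimal sketch, assuming `llama-cli` was copied into the current directory as in the step above:

```python
import subprocess

# Minimal sanity check: `llama-cli --version` should print build info and exit 0.
try:
    result = subprocess.run(
        ["./llama-cli", "--version"],
        capture_output=True, text=True, timeout=30,
    )
    print("OK" if result.returncode == 0 else f"exit code {result.returncode}")
except FileNotFoundError:
    print("llama-cli not found -- did the build and copy steps succeed?")
```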
---
## Download the Qwen3 GGUF Model
```bash
pip install huggingface_hub hf_transfer
```

```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF",
    local_dir="Qwen3-Coder",
    allow_patterns=["*UD-Q2_K_XL*"],  # only fetch the Q2_K_XL dynamic-quant shards
)
```
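The download produces a multi-part GGUF. To avoid hard-coding the shard filename in the run command below, a short helper (illustrative, assuming the `local_dir` used above) can locate the first shard:

```python
from pathlib import Path

# llama.cpp only needs the first shard of a split GGUF; it discovers the rest.
shards = sorted(Path("Qwen3-Coder").rglob("*UD-Q2_K_XL*-00001-of-*.gguf"))
if not shards:
    raise SystemExit("No GGUF shard found -- check the download step above.")
print(shards[0])
```

---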
## Run the Model
```bash
./llama-cli \
--model ./Qwen3-Coder/...UD-Q2_K_XL-00001-of-00004.gguf \
--threads -1 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--repeat-penalty 1.05
```

> Note: the `-ot ".ffn_.*_exps.=CPU"` flag keeps the MoE expert tensors in system RAM, which dramatically reduces VRAM requirements at some cost in raw generation speed.
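If you additionally build the `llama-server` target (not included in the build command above), llama.cpp exposes an OpenAI-compatible HTTP API. A minimal sketch, assuming the server runs with the same model and flags on the default port 8080:

```python
# pip install openai
from openai import OpenAI

# llama-server accepts any API key; only the base URL matters.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-coder",  # llama-server accepts an arbitrary model name
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    temperature=0.7,
    top_p=0.8,
)
print(response.choices[0].message.content)
```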
---
## Tool Calling Demo
```python
def get_current_temperature(location: str, unit: str = "celsius"):
    # Demo stub: a real tool would query a weather API here.
    return {"temperature": 26.1, "location": location, "unit": unit}
```

You can format prompts using the `transformers` tokenizer with ChatML-style templates:
```python
from transformers import AutoTokenizer

messages = [...]
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
```
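To wire the tool into the prompt end to end, recent `transformers` versions let you pass Python functions directly via the `tools` argument of `apply_chat_template`, which renders the signature and docstring into the model's tool-definition block. A sketch, assuming a `transformers` version whose chat templates support `tools`:

```python
from transformers import AutoTokenizer

def get_current_temperature(location: str, unit: str = "celsius"):
    """Get the current temperature at a location.

    Args:
        location: The city to query.
        unit: Temperature unit, "celsius" or "fahrenheit".
    """
    return {"temperature": 26.1, "location": location, "unit": unit}

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-Coder-480B-A35B-Instruct")
messages = [{"role": "user", "content": "What's the temperature in Paris?"}]

# The chat template serializes the function signature + docstring into the
# tool schema the model expects to see in its prompt.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```

---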
## Extend Context to 1M Tokens
```bash
--cache-type-k q5_1
--flash-attn
```

Add these flags to the `llama-cli` command above, and make sure to use the **YaRN 1M context** GGUFs from Hugging Face.
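To see why quantizing the KV cache matters at this scale, here is a rough back-of-the-envelope estimate. The layer/head counts below are illustrative placeholders, not the model's published config; substitute the real values from the GGUF metadata:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim
# * context length * bytes per element. Architecture numbers here are
# PLACEHOLDERS for illustration only.
layers, kv_heads, head_dim = 60, 8, 128   # hypothetical values
ctx = 1_000_000                           # 1M-token context

for name, bytes_per_elem in [("f16", 2.0), ("q5_1", 0.75)]:
    size = 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem
    print(f"{name}: ~{size / 1e9:.0f} GB")
```

---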
---
## Benchmark Highlights
| Benchmark        | Qwen3-Coder | Claude 4 | GPT-4.1 | Kimi-K2 |
|------------------|-------|----------|---------|---------|
| Aider Polyglot | 61.8 | 56.4 | 52.4 | 60.0 |
| SWE-Bench (100T) | 67.0 | 68.0 | 48.6 | 65.4 |
| WebArena | 49.9 | 51.1 | 44.3 | 47.4 |
| Mind2Web         | 55.8  | 47.4     | 49.6    | 42.7    |

Source: [Unsloth GitHub](https://github.com/unslothai/unsloth)
---
## Repo Structure
```
Qwen3-Coder-Local-Runner/
├── llama_cpp_setup.sh
├── run_qwen.sh
├── model_download.py
├── tool_calling_example.py
├── prompts/
│   ├── chat_template.txt
│   └── tool_call_prompt.json
├── benchmarks.md
└── assets/
    └── screenshots/
```

---
## License
This repo is licensed under the [MIT License](./LICENSE).
---
## Contributing
Pull requests are welcome. Let's make Qwen3 easier to run for everyone!
---
## Credits
- [Unsloth.ai](https://github.com/unslothai/unsloth)
- [Hugging Face](https://huggingface.co/unsloth/Qwen3-Coder-480B-A35B-Instruct-GGUF)
- [llama.cpp by Georgi Gerganov](https://github.com/ggerganov/llama.cpp)

---
Happy Hacking!