{"id":31505682,"url":"https://github.com/hurbalurba/quick-llama.cpp-server","last_synced_at":"2025-10-02T20:12:33.110Z","repository":{"id":313614914,"uuid":"1051887495","full_name":"HurbaLurba/quick-llama.cpp-server","owner":"HurbaLurba","description":"The framework for posting a more modern cuda image for llama.cpp with cuda13 for just newer cards with RPC support. Started as just learning how to compile llama.cpp custom.","archived":false,"fork":false,"pushed_at":"2025-09-20T01:07:22.000Z","size":122,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-20T03:35:48.024Z","etag":null,"topics":["cuda","cuda13","devops","docker","dockerbuild","gguf","llamacpp","llm","rpc"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HurbaLurba.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-06T23:45:34.000Z","updated_at":"2025-09-20T01:07:25.000Z","dependencies_parsed_at":"2025-09-07T11:26:59.791Z","dependency_job_id":"60fe4a33-bb7f-4264-802a-f3ade4f716aa","html_url":"https://github.com/HurbaLurba/quick-llama.cpp-server","commit_stats":null,"previous_names":["ergonomech/quick-llama.cpp-server","hurbalurba/quick-llama.cpp-server"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/HurbaLurba/quick-llama.cpp-server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/HurbaLurba%2Fquick-llama.cpp-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HurbaLurba%2Fquick-llama.cpp-server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HurbaLurba%2Fquick-llama.cpp-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HurbaLurba%2Fquick-llama.cpp-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HurbaLurba","download_url":"https://codeload.github.com/HurbaLurba/quick-llama.cpp-server/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HurbaLurba%2Fquick-llama.cpp-server/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278063132,"owners_count":25923599,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-02T02:00:08.890Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","cuda13","devops","docker","dockerbuild","gguf","llamacpp","llm","rpc"],"created_at":"2025-10-02T20:12:29.282Z","updated_at":"2025-10-02T20:12:33.104Z","avatar_url":"https://github.com/HurbaLurba.png","language":"Shell","readme":"# LLaMA.cpp Enhanced Docker Image for Modern GPUs\n\n**🎯 What is this?**  \nThis is an enhanced version of the official llama.cpp Docker image, specifically optimized for modern NVIDIA GPUs (RTX 30/40/50 series). 
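Not sure whether your card qualifies? As a quick sketch (the `compute_cap` query needs a reasonably recent NVIDIA driver, and the `supports_this_build` helper is illustrative, not part of this image), you can compare your GPU's compute capability against the 8.6 minimum this build targets:

```bash
# Print each GPU's name and compute capability, if the driver tools are installed
if command -v nvidia-smi >/dev/null; then
  nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
fi

# Illustrative helper: succeed if a compute capability string meets the 8.6 minimum
supports_this_build() {
  awk -v cap="$1" 'BEGIN { exit !(cap + 0 >= 8.6) }'
}

supports_this_build "8.9" && echo "supported"      # e.g. RTX 4090
supports_this_build "6.1" || echo "not supported"  # e.g. GTX 1080
```

Anything reporting 8.6 or higher (RTX 30 series and newer) should run the kernels in this image.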
It upgrades CUDA from 12.4.0 to 13.0.1 and adds RPC backend support for distributed processing.\n\n**🚀 Why use this instead of the official image?**  \n- **Better RTX 40/50 series support** with CUDA 13.0.1\n- **RPC backend** for distributed inference across multiple machines\n- **Smaller, faster** - only targets modern GPU architectures (no legacy bloat)\n- **Same functionality** as the official `ghcr.io/ggml-org/llama.cpp:full-cuda` image, but enhanced\n\n**📦 Ready to use - No building required!**  \nAvailable on Docker Hub: [`philglod/llamacpp-cuda13-modern-full:latest`](https://hub.docker.com/r/philglod/llamacpp-cuda13-modern-full)\n\n## 🚀 Quick Start (Most Users Start Here!)\n\n### What You Need\n- NVIDIA RTX 30/40/50 series GPU (older GPUs won't work with this optimized build)\n- [Docker Desktop](https://www.docker.com/products/docker-desktop/) with GPU support\n- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)\n\n### Get Started in 2 Minutes\n\n**1. Pull the image:**\n```bash\ndocker pull philglod/llamacpp-cuda13-modern-full:latest\n```\n\n**2. Verify it works:**\n```bash\ndocker run --rm --gpus all philglod/llamacpp-cuda13-modern-full:latest --server --help | grep -i cuda\n```\nIf CUDA support is compiled in, the help output will mention CUDA-related options. When you actually start the server, its startup log should report your GPU, e.g. `Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9`\n\n**3. 
Start using it!** Choose what you want to do:\n\n#### 🌐 Run a Web Server\n```bash\ndocker run --rm --gpus all -p 8080:8080 \\\n  philglod/llamacpp-cuda13-modern-full:latest \\\n  --server --host 0.0.0.0 --port 8080\n```\nThen visit `http://localhost:8080` for the web interface! (Note: the server needs a model to serve; mount your models directory with `-v $(pwd)/models:/models` and add `--model /models/your-model.gguf`, or convert one first as shown next.)\n\n#### 📥 Download \u0026 Convert a Model from HuggingFace\n```bash\nmkdir ./models\ndocker run --rm --gpus all -v $(pwd)/models:/models \\\n  philglod/llamacpp-cuda13-modern-full:latest \\\n  --convert --hf-repo microsoft/Phi-3-mini-4k-instruct --outtype f16\n```\n\n#### 🚀 Run a Complete AI Server\n```bash\n# After converting a model (like above), run a full server:\ndocker run -d --name my-ai-server --gpus all -p 8080:8080 -v $(pwd)/models:/models \\\n  philglod/llamacpp-cuda13-modern-full:latest \\\n  --server --host 0.0.0.0 --port 8080 \\\n  --model /models/Phi-3-mini-4k-instruct-f16.gguf \\\n  --ctx-size 4096 --n-gpu-layers 999\n```\nAccess the web UI at `http://localhost:8080` or the OpenAI-compatible API at `http://localhost:8080/v1/chat/completions`.\n\n## 💡 What Can This Do?\n\nThis image includes everything you need for AI model work:\n\n- **🌐 Web Server** - Run models with a web interface\n- **🔄 Model Conversion** - Convert HuggingFace models to llama.cpp format\n- **📊 Benchmarking** - Test your GPU performance\n- **💬 Interactive Chat** - Talk to models directly\n- **🔧 All Tools** - Complete llama.cpp toolkit included\n\n## 🎯 Who Should Use This?\n\n### ✅ **Perfect for you if:**\n- You have an RTX 30/40/50 series GPU\n- You want the latest CUDA performance improvements\n- You need RPC support for distributed setups\n- You want a ready-to-use solution (no building required)\n\n### ❌ **Not for you if:**\n- You have older GPUs (GTX 10 / RTX 20 series, Tesla K80, etc.)\n- You need to customize the build extensively\n- You're fine with the official CUDA 12.4.0 images\n\n### 🔄 **Alternative: Use Official Images**\nFor older GPUs or standard setups: `ghcr.io/ggml-org/llama.cpp:full-cuda`\n\n## 📋 More Usage 
Examples\n\n### Interactive Chat with a Model\n```bash\ndocker run --rm -it --gpus all -v $(pwd)/models:/models \\\n  philglod/llamacpp-cuda13-modern-full:latest \\\n  --run -m /models/your-model.gguf -p \"Hello, how are you?\"\n```\n\n### Benchmark Your GPU\n```bash\ndocker run --rm --gpus all -v $(pwd)/models:/models \\\n  philglod/llamacpp-cuda13-modern-full:latest \\\n  --bench -m /models/your-model.gguf\n```\n\n### Convert Your Own Model\n```bash\ndocker run --rm --gpus all -v $(pwd)/my-model:/input -v $(pwd)/converted:/output \\\n  philglod/llamacpp-cuda13-modern-full:latest \\\n  --convert --outtype f16 /input/ --output-dir /output/\n```\n\n## 🔧 GPU Compatibility\n\n### ✅ Supported (Modern GPUs Only)\n| Series | Examples | CUDA Compute |\n|--------|----------|--------------|\n| **RTX 30** | 3060, 3070, 3080, 3090 | 8.6 |\n| **RTX 40** | 4060, 4070, 4080, 4090 | 8.9 |\n| **RTX 50** | 5090, etc. | 9.0 |\n\n### ❌ Not Supported (Use Official Images Instead)\n- GTX 10 / RTX 20 series (Pascal, Turing)\n- Tesla K80, P100, V100 (older data center GPUs)\n- Any GPU with compute capability below 8.6\n\n## 🏗️ For Developers: Building from Source\n\n**Most users don't need this section!** Only read on if you want to customize the build.\n\n### Prerequisites\n- Docker with GPU support\n- Git\n- This repository cloned locally\n\n### Build Process\n```bash\n# Fetch the llama.cpp source (tracked as a git submodule)\ngit submodule update --init --recursive\n\n# Build the image\ndocker build -t my-custom-llamacpp:latest --target full -f docker/cuda-13.0.1-custom.Dockerfile .\n\n# Test it\ndocker run --rm --gpus all my-custom-llamacpp:latest --help\n```\n\n### Publishing Your Own Version\n```bash\n# Tag for Docker Hub\ndocker tag my-custom-llamacpp:latest YOUR_USERNAME/llamacpp-custom:latest\n\n# Push to Docker Hub\ndocker login\ndocker push YOUR_USERNAME/llamacpp-custom:latest\n```\n\n## 🔍 Technical Details\n\n### Custom CMake Configuration\nBuilt with optimized flags for modern GPUs:\n```cmake\n-DGGML_CUDA=ON    
        # CUDA support\n-DGGML_FORCE_CUBLAS=ON            # Force cuBLAS usage\n-DGGML_RPC=ON                     # RPC backend support\n-DCMAKE_CUDA_ARCHITECTURES=\"86;89;90\"  # Modern GPUs only\n```\n\n### Docker Hub Information\n- **Repository**: [`philglod/llamacpp-cuda13-modern-full`](https://hub.docker.com/r/philglod/llamacpp-cuda13-modern-full)\n- **Tags**: `latest`, `4067f07` (specific commit)\n- **Base Image**: `nvidia/cuda:13.0.1-devel-ubuntu24.04`\n\n### System Requirements\n- NVIDIA GPU with compute capability 8.6+\n- NVIDIA Container Toolkit installed\n- Docker with GPU support enabled\n- Sufficient VRAM for your target models\n\n## 🆘 Troubleshooting\n\n### GPU Not Detected?\n```bash\n# Check that the NVIDIA Container Toolkit is working:\ndocker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi\n```\n\n### Image Won't Start?\nMake sure you're using the `--gpus all` flag and have a compatible GPU (RTX 30/40/50 series).\n\n### Performance Issues?\nThis image is optimized for modern GPUs. For older GPUs, use the official images instead.\n\n## 📜 License \u0026 Credits\n\nBased on the official llama.cpp project. See the [llama.cpp repository](https://github.com/ggerganov/llama.cpp) for licensing terms.\n\nSpecial thanks to the llama.cpp team for the excellent foundation this build enhances.","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhurbalurba%2Fquick-llama.cpp-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhurbalurba%2Fquick-llama.cpp-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhurbalurba%2Fquick-llama.cpp-server/lists"}