{"id":20020447,"url":"https://github.com/shakfu/cyllama","last_synced_at":"2026-04-02T20:23:35.158Z","repository":{"id":260217637,"uuid":"878550857","full_name":"shakfu/cyllama","owner":"shakfu","description":"A thin cython wrapper around llama.cpp, whisper.cpp and stable-diffusion.cpp","archived":false,"fork":false,"pushed_at":"2026-03-28T06:48:20.000Z","size":30838,"stargazers_count":19,"open_issues_count":2,"forks_count":17,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-03-28T09:56:29.183Z","etag":null,"topics":["cython","cython-wrapper","llama-cpp","python3","stable-diffusion-cpp","whisper-cpp"],"latest_commit_sha":null,"homepage":"https://shakfu.github.io/cyllama/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shakfu.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-25T15:44:36.000Z","updated_at":"2026-03-25T20:51:54.000Z","dependencies_parsed_at":"2026-03-28T07:13:59.756Z","dependency_job_id":null,"html_url":"https://github.com/shakfu/cyllama","commit_stats":null,"previous_names":["shakfu/cyllama"],"tags_count":12,"template":false,"template_full_name":null,"purl":"pkg:github/shakfu/cyllama","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shakfu%2Fcyllama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shakfu%2Fcyllama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shakfu%2Fcyllama/re
leases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shakfu%2Fcyllama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shakfu","download_url":"https://codeload.github.com/shakfu/cyllama/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shakfu%2Fcyllama/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31315742,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cython","cython-wrapper","llama-cpp","python3","stable-diffusion-cpp","whisper-cpp"],"created_at":"2024-11-13T08:32:22.177Z","updated_at":"2026-04-02T20:23:35.152Z","avatar_url":"https://github.com/shakfu.png","language":"Python","readme":"# cyllama - Fast, Pythonic AI Inference\n\ncyllama is a comprehensive no-dependencies Python library for local AI inference built on the state-of-the-art `.cpp` ecosystem:\n\n- **[llama.cpp](https://github.com/ggml-org/llama.cpp)** - Text generation, chat, embeddings, and text-to-speech\n- **[whisper.cpp](https://github.com/ggerganov/whisper.cpp)** - Speech-to-text transcription and translation\n- 
**[stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp)** - Image and video generation\n\nIt combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.\n\n**[Documentation](https://shakfu.github.io/cyllama/)** | **[PyPI](https://pypi.org/project/cyllama/)** | **[Changelog](CHANGELOG.md)**\n\n## Features\n\n- High-level API -- `complete()`, `chat()`, `LLM` class for quick prototyping / text generation\n- Streaming -- token-by-token output with callbacks\n- Batch processing -- process multiple prompts 3-10x faster\n- GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)\n- Speculative decoding -- 2-3x speedup with draft models\n- Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling\n- RAG -- retrieval-augmented generation with local embeddings and SQLite-vector\n- Speech recognition -- whisper.cpp transcription and translation\n- Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models\n- OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer\n- Framework integrations -- OpenAI API client, LangChain LLM interface\n\n## Installation\n\n### From PyPI\n\n```sh\npip install cyllama\n```\n\nThis installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.\n\n### GPU-Accelerated Variants\n\nGPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only):\n\n```sh\npip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)\npip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc \u003e= 2.35)\npip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)\npip install cyllama-vulkan   # Cross-platform GPU (Vulkan)\n```\n\nAll variants install the same `cyllama` Python package -- only the compiled backend differs. Install one at a time (they replace each other). 
GPU variants require the corresponding driver/runtime installed on your system.\n\nYou can verify which backend is active after installation:\n\n```sh\npython -m cyllama info\n```\n\nYou can also query the backend configuration at runtime:\n\n```python\nfrom cyllama import _backend\nprint(_backend.cuda)   # True if built with CUDA\nprint(_backend.metal)  # True if built with Metal\n```\n\n### Build from source with a specific backend\n\n```sh\nGGML_CUDA=1 pip install cyllama --no-binary cyllama\nGGML_VULKAN=1 pip install cyllama --no-binary cyllama\n```\n\n## Quick Start\n\n```python\nfrom cyllama import complete\n\n# One line is all you need\nresponse = complete(\n    \"Explain quantum computing in simple terms\",\n    model_path=\"models/llama.gguf\",\n    temperature=0.7,\n    max_tokens=200\n)\nprint(response)\n```\n\n## Key Features\n\n### Simple by Default, Powerful When Needed\n\n**High-Level API** - Get started in seconds:\n\n```python\nfrom cyllama import complete, chat, LLM\n\n# One-shot completion\nresponse = complete(\"What is Python?\", model_path=\"model.gguf\")\n\n# Multi-turn chat\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n    {\"role\": \"user\", \"content\": \"What is machine learning?\"}\n]\nresponse = chat(messages, model_path=\"model.gguf\")\n\n# Reusable LLM instance (faster for multiple prompts)\nllm = LLM(\"model.gguf\")\nresponse1 = llm(\"Question 1\")\nresponse2 = llm(\"Question 2\")  # Model stays loaded!\n```\n\n**Streaming Support** - Real-time token-by-token output:\n\n```python\nfor chunk in complete(\"Tell me a story\", model_path=\"model.gguf\", stream=True):\n    print(chunk, end=\"\", flush=True)\n```\n\n### Performance Optimized\n\n**Batch Processing** - Process multiple prompts 3-10x faster:\n\n```python\nfrom cyllama import batch_generate\n\nprompts = [\"What is 2+2?\", \"What is 3+3?\", \"What is 4+4?\"]\nresponses = batch_generate(prompts, 
model_path=\"model.gguf\")\n```\n\n**Speculative Decoding** - 2-3x speedup with draft models:\n\n```python\nfrom cyllama.llama.llama_cpp import Speculative, SpeculativeParams\n\nparams = SpeculativeParams(n_max=16, p_min=0.75)\nspec = Speculative(params, ctx_target)\ndraft_tokens = spec.draft(prompt_tokens, last_token)\n```\n\n**Memory Optimization** - Smart GPU layer allocation:\n\n```python\nfrom cyllama import estimate_gpu_layers\n\nestimate = estimate_gpu_layers(\n    model_path=\"model.gguf\",\n    available_vram_mb=8000\n)\nprint(f\"Recommended GPU layers: {estimate.n_gpu_layers}\")\n```\n\n**N-gram Cache** - 2-10x speedup for repetitive text:\n\n```python\nfrom cyllama.llama.llama_cpp import NgramCache\n\ncache = NgramCache()\ncache.update(tokens, ngram_min=2, ngram_max=4)\ndraft = cache.draft(input_tokens, n_draft=16)\n```\n\n**Response Caching** - Cache LLM responses for repeated prompts:\n\n```python\nfrom cyllama import LLM\n\n# Enable caching with 100 entries and 1 hour TTL\nllm = LLM(\"model.gguf\", cache_size=100, cache_ttl=3600, seed=42)\n\nresponse1 = llm(\"What is Python?\")  # Cache miss - generates response\nresponse2 = llm(\"What is Python?\")  # Cache hit - returns cached response instantly\n\n# Check cache statistics\ninfo = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)\n\n# Clear cache when needed\nllm.cache_clear()\n```\n\nNote: Caching requires a fixed seed (`seed != -1`) since random seeds produce non-deterministic output. 
Streaming responses are not cached.\n\n### Framework Integrations\n\n**OpenAI-Compatible API** - Drop-in replacement:\n\n```python\nfrom cyllama.integrations import OpenAIClient\n\nclient = OpenAIClient(model_path=\"model.gguf\")\n\nresponse = client.chat.completions.create(\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n    temperature=0.7\n)\nprint(response.choices[0].message.content)\n```\n\n**LangChain Integration** - Seamless ecosystem access:\n\n```python\nfrom cyllama.integrations import CyllamaLLM\nfrom langchain.chains import LLMChain\n\nllm = CyllamaLLM(model_path=\"model.gguf\", temperature=0.7)\nchain = LLMChain(llm=llm, prompt=prompt_template)\nresult = chain.run(topic=\"AI\")\n```\n\n### Agent Framework\n\nCyllama includes a zero-dependency agent framework with three agent architectures:\n\n**ReActAgent** - Reasoning + Acting agent with tool calling:\n\n```python\nfrom cyllama import LLM\nfrom cyllama.agents import ReActAgent, tool\nfrom simpleeval import simple_eval\n\n@tool\ndef calculate(expression: str) -\u003e str:\n    \"\"\"Evaluate a math expression safely.\"\"\"\n    return str(simple_eval(expression))\n\nllm = LLM(\"model.gguf\")\nagent = ReActAgent(llm=llm, tools=[calculate])\nresult = agent.run(\"What is 25 * 4?\")\nprint(result.answer)\n```\n\n**ConstrainedAgent** - Grammar-enforced tool calling for 100% reliability:\n\n```python\nfrom cyllama.agents import ConstrainedAgent\n\nagent = ConstrainedAgent(llm=llm, tools=[calculate])\nresult = agent.run(\"Calculate 100 / 4\")  # Guaranteed valid tool calls\n```\n\n**ContractAgent** - Contract-based agent with C++26-inspired pre/post conditions:\n\n```python\nfrom cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy\n\n@tool\n@pre(lambda args: args['x'] != 0, \"cannot divide by zero\")\n@post(lambda r: r is not None, \"result must not be None\")\ndef divide(a: float, x: float) -\u003e float:\n    \"\"\"Divide a by x.\"\"\"\n    return a / x\n\nagent = 
ContractAgent(\n    llm=llm,\n    tools=[divide],\n    policy=ContractPolicy.ENFORCE,\n    task_precondition=lambda task: len(task) \u003e 10,\n    answer_postcondition=lambda ans: len(ans) \u003e 0,\n)\nresult = agent.run(\"What is 100 divided by 4?\")\n```\n\nSee [Agents Overview](docs/agents_overview.md) for detailed agent documentation.\n\n### Speech Recognition\n\n**Whisper Transcription** - Transcribe audio files with timestamps:\n\n```python\nfrom cyllama.whisper import WhisperContext, WhisperFullParams\nimport numpy as np\n\n# Load model and audio\nctx = WhisperContext(\"models/ggml-base.en.bin\")\nsamples = load_audio_as_16khz_float32(\"audio.wav\")  # Your audio loading function\n\n# Transcribe\nparams = WhisperFullParams()\nctx.full(samples, params)\n\n# Get results\nfor i in range(ctx.full_n_segments()):\n    start = ctx.full_get_segment_t0(i) / 100.0\n    end = ctx.full_get_segment_t1(i) / 100.0\n    text = ctx.full_get_segment_text(i)\n    print(f\"[{start:.2f}s - {end:.2f}s] {text}\")\n```\n\nSee [Whisper docs](docs/whisper.md) for full documentation.\n\n### Stable Diffusion\n\n**Image Generation** - Generate images from text using stable-diffusion.cpp:\n\n```python\nfrom cyllama.sd import text_to_image\n\n# Simple text-to-image\nimages = text_to_image(\n    model_path=\"models/sd_xl_turbo_1.0.q8_0.gguf\",\n    prompt=\"a photo of a cute cat\",\n    width=512,\n    height=512,\n    sample_steps=4,\n    cfg_scale=1.0\n)\nimages[0].save(\"output.png\")\n```\n\n**Advanced Generation** - Full control with SDContext:\n\n```python\nfrom cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler\n\nparams = SDContextParams()\nparams.model_path = \"models/sd_xl_turbo_1.0.q8_0.gguf\"\nparams.n_threads = 4\n\nctx = SDContext(params)\nimages = ctx.generate(\n    prompt=\"a beautiful mountain landscape\",\n    negative_prompt=\"blurry, ugly\",\n    width=512,\n    height=512,\n    sample_method=SampleMethod.EULER,\n    
scheduler=Scheduler.DISCRETE\n)\n```\n\n**CLI Tool** - Command-line interface:\n\n```bash\n# Text to image\npython -m cyllama.sd txt2img \\\n    --model models/sd_xl_turbo_1.0.q8_0.gguf \\\n    --prompt \"a beautiful sunset\" \\\n    --output sunset.png\n\n# Image to image\npython -m cyllama.sd img2img \\\n    --model models/sd-v1-5.gguf \\\n    --init-img input.png \\\n    --prompt \"oil painting style\" \\\n    --strength 0.7\n\n# Show system info\npython -m cyllama.sd info\n```\n\nSupports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See [Stable Diffusion docs](docs/stable_diffusion.md) for full documentation.\n\n### RAG (Retrieval-Augmented Generation)\n\n**Simple RAG** - Query your documents with LLMs:\n\n```python\nfrom cyllama.rag import RAG\n\n# Create RAG instance with embedding and generation models\nrag = RAG(\n    embedding_model=\"models/bge-small-en-v1.5-q8_0.gguf\",\n    generation_model=\"models/llama.gguf\"\n)\n\n# Add documents\nrag.add_texts([\n    \"Python is a high-level programming language.\",\n    \"Machine learning is a subset of artificial intelligence.\",\n    \"Neural networks are inspired by biological neurons.\"\n])\n\n# Query\nresponse = rag.query(\"What is Python?\")\nprint(response.text)\n```\n\n**Load Documents** - Support for multiple file formats:\n\n```python\nfrom cyllama.rag import RAG, load_directory\n\nrag = RAG(\n    embedding_model=\"models/bge-small-en-v1.5-q8_0.gguf\",\n    generation_model=\"models/llama.gguf\"\n)\n\n# Load all documents from a directory\ndocuments = load_directory(\"docs/\", glob=\"**/*.md\")\nrag.add_documents(documents)\n\nresponse = rag.query(\"How do I configure the system?\")\n```\n\n**Hybrid Search** - Combine vector and keyword search:\n\n```python\nfrom cyllama.rag import RAG, HybridStore, Embedder\n\nembedder = Embedder(\"models/bge-small-en-v1.5-q8_0.gguf\")\nstore = HybridStore(\"knowledge.db\", 
embedder)\n\nstore.add_texts([\"Document content...\"])\n\n# Hybrid search with configurable weights\nresults = store.search(\"query\", k=5, vector_weight=0.7, fts_weight=0.3)\n```\n\n**Embedding Cache** - Speed up repeated queries with LRU caching:\n\n```python\nfrom cyllama.rag import Embedder\n\n# Enable cache with 1000 entries\nembedder = Embedder(\"models/bge-small-en-v1.5-q8_0.gguf\", cache_size=1000)\n\nembedder.embed(\"hello\")  # Cache miss\nembedder.embed(\"hello\")  # Cache hit - instant return\n\ninfo = embedder.cache_info()\nprint(f\"Hits: {info.hits}, Misses: {info.misses}\")\n```\n\n**Agent Integration** - Use RAG as an agent tool:\n\n```python\nfrom cyllama import LLM\nfrom cyllama.agents import ReActAgent\nfrom cyllama.rag import RAG, create_rag_tool\n\nrag = RAG(\n    embedding_model=\"models/bge-small-en-v1.5-q8_0.gguf\",\n    generation_model=\"models/llama.gguf\"\n)\nrag.add_texts([\"Your knowledge base...\"])\n\n# Create a tool from the RAG instance\nsearch_tool = create_rag_tool(rag)\n\nllm = LLM(\"models/llama.gguf\")\nagent = ReActAgent(llm=llm, tools=[search_tool])\nresult = agent.run(\"Find information about X in the knowledge base\")\n```\n\nSupports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage.\n\n### Common Utilities\n\n**GGUF File Manipulation** - Inspect and modify model files:\n\n```python\nfrom cyllama.llama.llama_cpp import GGUFContext\n\nctx = GGUFContext.from_file(\"model.gguf\")\nmetadata = ctx.get_all_metadata()\nprint(f\"Model: {metadata['general.name']}\")\n```\n\n**Structured Output** - JSON schema to grammar conversion (pure Python, no C++ dependency):\n\n```python\nfrom cyllama.llama.llama_cpp import json_schema_to_grammar\n\nschema = {\"type\": \"object\", \"properties\": {\"name\": {\"type\": \"string\"}}}\ngrammar = json_schema_to_grammar(schema)\n```\n\n**Huggingface Model Downloads**:\n\n```python\nfrom 
cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file\n\n# Download from HuggingFace (saves to ~/.cache/llama.cpp/)\ndownload_model(\"bartowski/Llama-3.2-1B-Instruct-GGUF:latest\")\n\n# Or with explicit parameters\ndownload_model(hf_repo=\"bartowski/Llama-3.2-1B-Instruct-GGUF:latest\")\n\n# Download specific file to custom path\ndownload_model(\n    hf_repo=\"bartowski/Llama-3.2-1B-Instruct-GGUF\",\n    hf_file=\"Llama-3.2-1B-Instruct-Q8_0.gguf\",\n    model_path=\"./models/my_model.gguf\"\n)\n\n# Get file info without downloading\ninfo = get_hf_file(\"bartowski/Llama-3.2-1B-Instruct-GGUF:latest\")\nprint(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}\n\n# List cached models\nmodels = list_cached_models()\n```\n\n## What's Inside\n\n### Text Generation (llama.cpp)\n\n- [x] **Full llama.cpp API** - Complete Cython wrapper with strong typing\n- [x] **High-Level API** - Simple, Pythonic interface (`LLM`, `complete`, `chat`)\n- [x] **Streaming Support** - Token-by-token generation with callbacks\n- [x] **Batch Processing** - Efficient parallel inference\n- [x] **Multimodal** - LLAVA and vision-language models\n- [x] **Speculative Decoding** - 2-3x inference speedup with draft models\n\n### Speech Recognition (whisper.cpp)\n\n- [x] **Full whisper.cpp API** - Complete Cython wrapper\n- [x] **High-Level API** - Simple `transcribe()` function\n- [x] **Multiple Formats** - WAV, MP3, FLAC, and more\n- [x] **Language Detection** - Automatic or specified language\n- [x] **Timestamps** - Word and segment-level timing\n\n### Image \u0026 Video Generation (stable-diffusion.cpp)\n\n- [x] **Full stable-diffusion.cpp API** - Complete Cython wrapper\n- [x] **Text-to-Image** - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2\n- [x] **Image-to-Image** - Transform existing images\n- [x] **Inpainting** - Mask-based editing\n- [x] **ControlNet** - Guided generation with edge/pose/depth\n- [x] **Video Generation** - Wan, CogVideoX models\n- [x] **Upscaling** 
- ESRGAN 4x upscaling\n\n### Cross-Cutting Features\n\n- [x] **GPU Acceleration** - Metal, CUDA, Vulkan backends\n- [x] **Memory Optimization** - Smart GPU layer allocation\n- [x] **Agent Framework** - ReActAgent, ConstrainedAgent, ContractAgent\n- [x] **Framework Integration** - OpenAI API, LangChain, FastAPI\n\n## Why Cyllama?\n\n**Performance**: Compiled Cython wrappers with minimal overhead\n\n- Strong type checking at compile time\n- Zero-copy data passing where possible\n- Efficient memory management\n- Native integration with llama.cpp optimizations\n\n**Simplicity**: From 50 lines to 1 line for basic generation\n\n- Intuitive, Pythonic API design\n- Automatic resource management\n- Sensible defaults, full control when needed\n\n**Production-Ready**: Battle-tested and comprehensive\n\n- 1150+ passing tests with extensive coverage\n- Comprehensive documentation and examples\n- Proper error handling and logging\n- Framework integration for real applications\n\n**Up-to-Date**: Tracks bleeding-edge llama.cpp\n\n- Regular updates with latest features\n- All high-priority APIs wrapped\n- Performance optimizations included\n\n## Status\n\n**Current Version**: 0.2.1 (Mar 2026)\n**llama.cpp Version**: b8429\n**Build System**: scikit-build-core + CMake\n**Test Coverage**: 1150+ tests passing\n**Platform**: macOS (tested), Linux (tested), Windows (tested)\n\n### Recent Releases\n\n- **v0.2.1** (Mar 2026) - Code quality hardening: GIL release for whisper/encode, async stream fixes, memory-aware embedding cache, CI robustness, 30+ bug fixes, 1150+ tests\n- **v0.2.0** (Mar 2026) - Dynamic-linked GPU wheels (CUDA, ROCm, SYCL, Vulkan) on PyPI, unified ggml, sqlite-vector vendored\n- **v0.1.21** (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled\n- **v0.1.20** (Feb 2026) - Update llama.cpp + stable-diffusion.cpp\n- **v0.1.19** (Dec 2025) - Metal fix for stable-diffusion.cpp\n- **v0.1.18** (Dec 2025) - Remaining stable-diffusion.cpp wrapped\n- **v0.1.16** (Dec 
2025) - Response class, Async API, Chat templates\n- **v0.1.12** (Nov 2025) - Initial wrapper of stable-diffusion.cpp\n- **v0.1.11** (Nov 2025) - ACP support, build improvements\n- **v0.1.10** (Nov 2025) - Agent Framework, bug fixes\n- **v0.1.9** (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation\n- **v0.1.8** (Nov 2025) - Speculative decoding API\n- **v0.1.7** (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache\n- **v0.1.6** (Nov 2025) - Multimodal test fixes\n- **v0.1.5** (Oct 2025) - Mongoose server, embedded server\n- **v0.1.4** (Oct 2025) - Memory estimation, performance optimizations\n\nSee [CHANGELOG.md](CHANGELOG.md) for complete release history.\n\n## Building from Source\n\nTo build `cyllama` from source:\n\n1. Ensure you have a recent version of `python3` (currently tested on Python 3.13)\n\n2. Git clone the latest version of `cyllama`:\n\n    ```sh\n    git clone https://github.com/shakfu/cyllama.git\n    cd cyllama\n    ```\n\n3. We use [uv](https://github.com/astral-sh/uv) for package management:\n\n    If you don't have it, see the link above to install it; otherwise:\n\n    ```sh\n    uv sync\n    ```\n\n4. Type `make` in the terminal.\n\n    This will:\n\n    1. Download and build `llama.cpp`, `whisper.cpp` and `stable-diffusion.cpp`\n    2. Install them into the `thirdparty` folder\n    3. 
Build `cyllama` using scikit-build-core + CMake\n\n### Build Commands\n\n```sh\n# Full build (default: static linking, builds llama.cpp from source)\nmake              # Build dependencies + editable install\n\n# Dynamic linking (downloads pre-built llama.cpp release)\nmake build-dynamic  # No source compilation needed for llama.cpp\n\n# Build wheel for distribution\nmake wheel        # Creates wheel in dist/\nmake dist         # Creates sdist + wheel in dist/\n\n# Backend-specific builds\nmake build-metal  # macOS Metal (default on macOS)\nmake build-cuda   # NVIDIA CUDA\nmake build-vulkan # Vulkan (cross-platform)\nmake build-cpu    # CPU only\n\n# Clean and rebuild\nmake clean        # Remove build artifacts\nmake reset        # Full reset including thirdparty\nmake remake       # Clean rebuild with tests\n\n# Code quality\nmake lint         # Lint with ruff (auto-fix)\nmake format       # Format with ruff\nmake typecheck    # Type check with mypy\nmake qa           # Run all: lint, typecheck, format\n\n# Memory leak detection\nmake leaks        # RSS-growth leak check (10 cycles, 20% threshold)\n\n# Publishing\nmake check        # Validate wheels with twine\nmake publish      # Upload to PyPI\nmake publish-test # Upload to TestPyPI\n```\n\n### GPU Acceleration\n\nBy default, cyllama builds with Metal support on macOS and CPU-only on Linux. 
To enable other GPU backends (CUDA, Vulkan, etc.):\n\n```sh\n# CUDA (NVIDIA GPUs)\nmake build-cuda\n\n# Vulkan (Cross-platform GPU)\nmake build-vulkan\n\n# Multiple backends\nexport GGML_CUDA=1 GGML_VULKAN=1\nmake build\n```\n\nSee [Build Backends](docs/build_backends.md) for comprehensive backend build instructions.\n\n### Multi-GPU Configuration\n\nFor systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:\n\n```python\nfrom cyllama import LLM, GenerationConfig\n\n# Use a specific GPU (GPU index 1)\nllm = LLM(\"model.gguf\", main_gpu=1)\n\n# Multi-GPU with layer splitting (default mode)\nllm = LLM(\"model.gguf\", split_mode=1, n_gpu_layers=99)\n\n# Multi-GPU with tensor parallelism (row splitting)\nllm = LLM(\"model.gguf\", split_mode=2, n_gpu_layers=99)\n\n# Custom tensor split: 30% GPU 0, 70% GPU 1\nllm = LLM(\"model.gguf\", tensor_split=[0.3, 0.7])\n\n# Full configuration via GenerationConfig\nconfig = GenerationConfig(\n    main_gpu=0,\n    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW\n    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1\n    n_gpu_layers=99\n)\nllm = LLM(\"model.gguf\", config=config)\n```\n\n**Split Modes:**\n\n- `0` (NONE): Single GPU only, uses `main_gpu`\n- `1` (LAYER): Split layers and KV cache across GPUs (default)\n- `2` (ROW): Tensor parallelism - split layers with row-wise distribution\n\n## Testing\n\nThe `tests` directory in this repo provides extensive examples of using cyllama.\n\nHowever, as a first step, you should download a smallish LLM in `.gguf` format from [huggingface](https://huggingface.co/models?search=gguf). A good small model to start with, and the one assumed by the tests, is [Llama-3.2-1B-Instruct-Q8_0.gguf](https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf). `cyllama` expects models to be stored in a `models` folder in the cloned `cyllama` directory. 
So, to create the `models` directory if it doesn't exist and download this model, you can just type:\n\n```sh\nmake download\n```\n\nThis basically just does:\n\n```sh\ncd cyllama\nmkdir models \u0026\u0026 cd models\nwget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf\n```\n\nNow you can test it using `llama-cli` or `llama-simple`:\n\n```sh\nbin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \\\n -p \"Is mathematics discovered or invented?\"\n```\n\nWith 1150+ passing tests, the library is ready for both quick prototyping and production use:\n\n```sh\nmake test  # Run full test suite\n```\n\nYou can also explore interactively:\n\n```python\npython3 -i scripts/start.py\n\n\u003e\u003e\u003e from cyllama import complete\n\u003e\u003e\u003e response = complete(\"What is 2+2?\", model_path=\"models/Llama-3.2-1B-Instruct-Q8_0.gguf\")\n\u003e\u003e\u003e print(response)\n```\n\n## Documentation\n\nFull documentation is available at [https://shakfu.github.io/cyllama/](https://shakfu.github.io/cyllama/) (built with MkDocs).\n\nTo serve docs locally: `make docs-serve`\n\n- **[User Guide](docs/user_guide.md)** - Comprehensive guide covering all features\n- **[API Reference](docs/api_reference.md)** - Complete API documentation\n- **[Cookbook](docs/cookbook.md)** - Practical recipes and patterns\n- **[Changelog](CHANGELOG.md)** - Complete release history\n- **Examples** - See `tests/examples/` for working code samples\n\n## Roadmap\n\n### Completed\n\n- [x] Full llama.cpp API wrapper with Cython\n- [x] High-level API (`LLM`, `complete`, `chat`)\n- [x] Async API support (`AsyncLLM`, `complete_async`, `chat_async`)\n- [x] Response class with stats and serialization\n- [x] Built-in chat template system (llama.cpp templates)\n- [x] Batch processing utilities\n- [x] OpenAI-compatible API client\n- [x] LangChain integration\n- [x] Speculative decoding\n- [x] GGUF file manipulation\n- [x] JSON schema to grammar 
conversion\n- [x] Model download helper\n- [x] N-gram cache\n- [x] OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer)\n- [x] Whisper.cpp integration\n- [x] Multimodal support (LLAVA)\n- [x] Memory estimation utilities\n- [x] Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)\n- [x] Stable Diffusion (stable-diffusion.cpp) - image/video generation\n- [x] RAG utilities (text chunking, document processing)\n\n### Future\n\n- [ ] Web UI for testing\n\n## Contributing\n\nContributions are welcome! Please see the [User Guide](docs/user_guide.md) for development guidelines.\n\n## License\n\nThis project wraps [llama.cpp](https://github.com/ggml-org/llama.cpp), [whisper.cpp](https://github.com/ggml-org/whisper.cpp), and [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) which all follow the MIT licensing terms, as does cyllama.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshakfu%2Fcyllama","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshakfu%2Fcyllama","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshakfu%2Fcyllama/lists"}