https://github.com/shakfu/cyllama

A thin cython wrapper around llama.cpp, whisper.cpp and stable-diffusion.cpp
cython cython-wrapper llama-cpp python3 stable-diffusion-cpp whisper-cpp
Host: GitHub
URL: https://github.com/shakfu/cyllama
Owner: shakfu
License: mit
Created: 2024-10-25T15:44:36.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2026-03-28T06:48:20.000Z (3 months ago)
Last Synced: 2026-03-28T09:56:29.183Z (3 months ago)
Topics: cython, cython-wrapper, llama-cpp, python3, stable-diffusion-cpp, whisper-cpp
Language: Python
Homepage: https://shakfu.github.io/cyllama/
Size: 29.4 MB
Stars: 19
Watchers: 3
Forks: 17
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

README

# cyllama - Fast, Pythonic AI Inference

cyllama is a comprehensive no-dependencies Python library for local AI inference built on the state-of-the-art `.cpp` ecosystem:

- **[llama.cpp](https://github.com/ggml-org/llama.cpp)** - Text generation, chat, embeddings, and text-to-speech

- **[whisper.cpp](https://github.com/ggml-org/whisper.cpp)** - Speech-to-text transcription and translation

- **[stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp)** - Image and video generation

It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.

**[Documentation](https://shakfu.github.io/cyllama/)** | **[PyPI](https://pypi.org/project/cyllama/)** | **[Changelog](CHANGELOG.md)**

## Features

- High-level API -- `complete()`, `chat()`, and the `LLM` class for quick prototyping and text generation

- Streaming -- token-by-token output with callbacks

- Batch processing -- process multiple prompts 3-10x faster

- GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)

- Speculative decoding -- 2-3x speedup with draft models

- Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling

- RAG -- retrieval-augmented generation with local embeddings and SQLite-vector

- Speech recognition -- whisper.cpp transcription and translation

- Image/video generation -- stable-diffusion.cpp image, image-edit, and video models

- OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer

- Framework integrations -- OpenAI API client, LangChain LLM interface

## Installation

### From PyPI

```sh

pip install cyllama

```

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.

### GPU-Accelerated Variants

GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only):

```sh

pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)

pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)

pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)

pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)

```

All variants install the same `cyllama` Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.

You can verify which backend is active after installation:

```sh

python -m cyllama info

```

You can also query the backend configuration at runtime:

```python

from cyllama import _backend

print(_backend.cuda)   # True if built with CUDA

print(_backend.metal)  # True if built with Metal

```

### Build from source with a specific backend

```sh

GGML_CUDA=1 pip install cyllama --no-binary cyllama

GGML_VULKAN=1 pip install cyllama --no-binary cyllama

```

## Quick Start

```python

from cyllama import complete

# One line is all you need

response = complete(

    "Explain quantum computing in simple terms",

    model_path="models/llama.gguf",

    temperature=0.7,

    max_tokens=200

)

print(response)

```

## Key Features

### Simple by Default, Powerful When Needed

**High-Level API** - Get started in seconds:

```python

from cyllama import complete, chat, LLM

# One-shot completion

response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat

messages = [

    {"role": "system", "content": "You are a helpful assistant."},

    {"role": "user", "content": "What is machine learning?"}

]

response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)

llm = LLM("model.gguf")

response1 = llm("Question 1")

response2 = llm("Question 2")  # Model stays loaded!

```

**Streaming Support** - Real-time token-by-token output:

```python

for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):

    print(chunk, end="", flush=True)

```

### Performance Optimized

**Batch Processing** - Process multiple prompts 3-10x faster:

```python

from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]

responses = batch_generate(prompts, model_path="model.gguf")

```

**Speculative Decoding** - 2-3x speedup with draft models:

```python

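from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

# Assumes `ctx_target` is an already-created context for the target model and
# `prompt_tokens` / `last_token` are token ids from your tokenized prompt.
params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)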

```

**Memory Optimization** - Smart GPU layer allocation:

```python

from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(

    model_path="model.gguf",

    available_vram_mb=8000

)

print(f"Recommended GPU layers: {estimate.n_gpu_layers}")

```

**N-gram Cache** - 2-10x speedup for repetitive text:

```python

from cyllama.llama.llama_cpp import NgramCache

# Assumes `tokens` and `input_tokens` are lists of token ids.
cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)

```
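
To illustrate the idea behind n-gram drafting, here is a minimal pure-Python sketch. It is not the `NgramCache` implementation (llama.cpp's cache is more elaborate); the class name `TinyNgramCache` and its internals are hypothetical. It records which token followed each n-gram and drafts by repeatedly looking up the longest matching suffix:

```python
from collections import Counter, defaultdict


class TinyNgramCache:
    """Toy n-gram lookup table (illustrative only, not cyllama's NgramCache)."""

    def __init__(self, ngram_min=2, ngram_max=4):
        self.ngram_min = ngram_min
        self.ngram_max = ngram_max
        # Maps an n-gram (tuple of token ids) to a count of the tokens seen after it
        self._next = defaultdict(Counter)

    def update(self, tokens):
        """Record, for every n-gram in `tokens`, which token followed it."""
        for n in range(self.ngram_min, self.ngram_max + 1):
            for i in range(len(tokens) - n):
                self._next[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    def draft(self, tokens, n_draft=16):
        """Propose up to `n_draft` tokens by following the most frequent continuation."""
        context = list(tokens)
        drafted = []
        while len(drafted) < n_draft:
            # Prefer the longest n-gram suffix that has been seen before
            for n in range(self.ngram_max, self.ngram_min - 1, -1):
                if len(context) < n:
                    continue
                followers = self._next.get(tuple(context[-n:]))
                if followers:
                    token = followers.most_common(1)[0][0]
                    break
            else:
                break  # no known n-gram matches the current suffix
            drafted.append(token)
            context.append(token)
        return drafted
```

On repetitive input most drafted tokens match what the model would produce anyway, which is why this kind of cache helps most with repetitive text.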

**Response Caching** - Cache LLM responses for repeated prompts:

```python

from cyllama import LLM

# Enable caching with 100 entries and 1 hour TTL

llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)

response1 = llm("What is Python?")  # Cache miss - generates response

response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics

info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed

llm.cache_clear()

```

Note: Caching requires a fixed seed (`seed != -1`) since random seeds produce non-deterministic output. Streaming responses are not cached.
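
The interaction between `cache_size`, `cache_ttl`, and the seed rule can be pictured with a small standalone sketch. This is not cyllama's implementation; `TTLResponseCache` and `cached_generate` are hypothetical names, and `generate` stands in for any callable that maps a prompt to a response:

```python
import time
from collections import OrderedDict


class TTLResponseCache:
    """Illustrative LRU cache with per-entry expiry (not cyllama's implementation)."""

    def __init__(self, maxsize=100, ttl=3600.0, clock=time.monotonic):
        self.maxsize = maxsize
        self.ttl = ttl
        self.clock = clock
        self.hits = 0
        self.misses = 0
        self._entries = OrderedDict()  # key -> (stored_at, response)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and self.clock() - entry[0] < self.ttl:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[1]
        self._entries.pop(key, None)  # drop the expired entry, if there was one
        self.misses += 1
        return None

    def put(self, key, response):
        self._entries[key] = (self.clock(), response)
        self._entries.move_to_end(key)
        while len(self._entries) > self.maxsize:
            self._entries.popitem(last=False)  # evict the least recently used entry


def cached_generate(cache, generate, prompt, seed):
    """Serve from the cache when possible; bypass it for random seeds."""
    if seed == -1:
        return generate(prompt)  # output is not reproducible, so do not cache it
    key = (prompt, seed)
    response = cache.get(key)
    if response is None:
        response = generate(prompt)
        cache.put(key, response)
    return response
```

Keying on `(prompt, seed)` is one simple choice; a real cache would likely also include other generation parameters such as temperature in the key.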

### Framework Integrations

**OpenAI-Compatible API** - Drop-in replacement:

```python

from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")

response = client.chat.completions.create(

    messages=[{"role": "user", "content": "Hello!"}],

    temperature=0.7

)

print(response.choices[0].message.content)

```

**LangChain Integration** - Seamless ecosystem access:

```python

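from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)

# Any prompt with a {topic} variable works; this one is only an example
prompt_template = PromptTemplate.from_template("Write a short paragraph about {topic}.")
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")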

```

### Agent Framework

Cyllama includes a zero-dependency agent framework with three agent architectures:

**ReActAgent** - Reasoning + Acting agent with tool calling:

```python

from cyllama import LLM

from cyllama.agents import ReActAgent, tool

from simpleeval import simple_eval

@tool

def calculate(expression: str) -> str:

    """Evaluate a math expression safely."""

    return str(simple_eval(expression))

llm = LLM("model.gguf")

agent = ReActAgent(llm=llm, tools=[calculate])

result = agent.run("What is 25 * 4?")

print(result.answer)

```

**ConstrainedAgent** - Grammar-constrained tool calling, so every tool call is syntactically valid:

```python

from cyllama.agents import ConstrainedAgent

# Reuses `llm` and the `calculate` tool from the ReActAgent example above
agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls

```

**ContractAgent** - Contract-based agent with C++26-inspired pre/post conditions:

```python

from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool

@pre(lambda args: args['x'] != 0, "cannot divide by zero")

@post(lambda r: r is not None, "result must not be None")

def divide(a: float, x: float) -> float:

    """Divide a by x."""

    return a / x

agent = ContractAgent(

    llm=llm,

    tools=[divide],

    policy=ContractPolicy.ENFORCE,

    task_precondition=lambda task: len(task) > 10,

    answer_postcondition=lambda ans: len(ans) > 0,

)

result = agent.run("What is 100 divided by 4?")

```
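
For readers unfamiliar with contract-style checks, the following standalone sketch shows how `pre`/`post` decorators of this shape can work in plain Python. It does not import cyllama and is not its implementation; `ContractViolation` is a hypothetical exception name, and raising on failure is only one possible policy:

```python
import functools
import inspect


class ContractViolation(Exception):
    """Raised when a pre- or postcondition fails (hypothetical name)."""


def pre(predicate, message):
    """Check `predicate` against a dict of the call's arguments before running."""
    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            if not predicate(dict(bound.arguments)):
                raise ContractViolation(f"precondition failed: {message}")
            return func(*args, **kwargs)

        return wrapper
    return decorator


def post(predicate, message):
    """Check `predicate` against the return value after running."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not predicate(result):
                raise ContractViolation(f"postcondition failed: {message}")
            return result

        return wrapper
    return decorator


@pre(lambda args: args["x"] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x
```

With these definitions, `divide(100, 4)` returns `25.0`, while `divide(1, 0)` raises `ContractViolation` before the division is attempted.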

See [Agents Overview](docs/agents_overview.md) for detailed agent documentation.

### Speech Recognition

**Whisper Transcription** - Transcribe audio files with timestamps:

```python

from cyllama.whisper import WhisperContext, WhisperFullParams

import numpy as np

# Load model and audio

ctx = WhisperContext("models/ggml-base.en.bin")

samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe

params = WhisperFullParams()

ctx.full(samples, params)

# Get results

for i in range(ctx.full_n_segments()):

    start = ctx.full_get_segment_t0(i) / 100.0

    end = ctx.full_get_segment_t1(i) / 100.0

    text = ctx.full_get_segment_text(i)

    print(f"[{start:.2f}s - {end:.2f}s] {text}")

```
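
The segment timestamps above are divided by 100 because whisper.cpp reports them in 10 ms ticks. As a small sketch of what can be done with them, the helpers below (hypothetical names, standard library only) turn `(t0, t1, text)` tuples into SRT subtitle text:

```python
def format_srt_time(ticks):
    """Convert a whisper timestamp (10 ms ticks) to an SRT 'HH:MM:SS,mmm' string."""
    total_ms = int(ticks) * 10
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render (t0, t1, text) tuples, with times in 10 ms ticks, as SRT text."""
    blocks = []
    for index, (t0, t1, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{format_srt_time(t0)} --> {format_srt_time(t1)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)
```

The tuples can be collected in the loop above from `ctx.full_get_segment_t0(i)`, `ctx.full_get_segment_t1(i)`, and `ctx.full_get_segment_text(i)`, without dividing by 100 first.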

See [Whisper docs](docs/whisper.md) for full documentation.

### Stable Diffusion

**Image Generation** - Generate images from text using stable-diffusion.cpp:

```python

from cyllama.sd import text_to_image

# Simple text-to-image

images = text_to_image(

    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",

    prompt="a photo of a cute cat",

    width=512,

    height=512,

    sample_steps=4,

    cfg_scale=1.0

)

images[0].save("output.png")

```

**Advanced Generation** - Full control with SDContext:

```python

from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()

params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"

params.n_threads = 4

ctx = SDContext(params)

images = ctx.generate(

    prompt="a beautiful mountain landscape",

    negative_prompt="blurry, ugly",

    width=512,

    height=512,

    sample_method=SampleMethod.EULER,

    scheduler=Scheduler.DISCRETE

)

```

**CLI Tool** - Command-line interface:

```bash

# Text to image
python -m cyllama.sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
python -m cyllama.sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
python -m cyllama.sd info

```

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See [Stable Diffusion docs](docs/stable_diffusion.md) for full documentation.

### RAG (Retrieval-Augmented Generation)

**Simple RAG** - Query your documents with LLMs:

```python

from cyllama.rag import RAG

# Create RAG instance with embedding and generation models

rag = RAG(

    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",

    generation_model="models/llama.gguf"

)

# Add documents

rag.add_texts([

    "Python is a high-level programming language.",

    "Machine learning is a subset of artificial intelligence.",

    "Neural networks are inspired by biological neurons."

])

# Query

response = rag.query("What is Python?")

print(response.text)

```

**Load Documents** - Support for multiple file formats:

```python

from cyllama.rag import RAG, load_directory

rag = RAG(

    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",

    generation_model="models/llama.gguf"

)

# Load all documents from a directory

documents = load_directory("docs/", glob="**/*.md")

rag.add_documents(documents)

response = rag.query("How do I configure the system?")

```

**Hybrid Search** - Combine vector and keyword search:

```python

from cyllama.rag import RAG, HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")

store = HybridStore("knowledge.db", embedder)

store.add_texts(["Document content..."])

# Hybrid search with configurable weights

results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)

```
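
How `HybridStore` combines the two result sets internally is not described here. As a rough mental model of what `vector_weight` and `fts_weight` can mean, the sketch below fuses two score tables with a weighted sum after min-max normalization. The function names and the normalization choice are assumptions for illustration only:

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def fuse(vector_scores, fts_scores, vector_weight=0.7, fts_weight=0.3, k=5):
    """Return the top-k (doc_id, score) pairs by weighted sum of normalized scores."""
    vec = min_max(vector_scores)
    fts = min_max(fts_scores)
    combined = {
        # A document missing from one result set contributes 0 for that signal
        doc: vector_weight * vec.get(doc, 0.0) + fts_weight * fts.get(doc, 0.0)
        for doc in set(vec) | set(fts)
    }
    # Highest score first; ties broken by doc id for deterministic output
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:k]
```

Normalizing first matters because similarity scores and full-text ranking scores usually live on different scales, so raw values cannot be added directly.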

**Embedding Cache** - Speed up repeated queries with LRU caching:

```python

from cyllama.rag import Embedder

# Enable cache with 1000 entries

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)

embedder.embed("hello")  # Cache miss

embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()

print(f"Hits: {info.hits}, Misses: {info.misses}")

```

**Agent Integration** - Use RAG as an agent tool:

```python

from cyllama import LLM

from cyllama.agents import ReActAgent

from cyllama.rag import RAG, create_rag_tool

rag = RAG(

    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",

    generation_model="models/llama.gguf"

)

rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance

search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")

agent = ReActAgent(llm=llm, tools=[search_tool])

result = agent.run("Find information about X in the knowledge base")

```

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage.
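
Text chunking is the step that splits documents into pieces small enough to embed. A minimal word-based sketch with overlapping windows is shown below; it is one common approach, written with a hypothetical `chunk_text` helper, and cyllama's own chunker may work differently:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split `text` into chunks of up to `chunk_size` words sharing `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk, at the cost of storing some words twice.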

### Common Utilities

**GGUF File Manipulation** - Inspect and modify model files:

```python

from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")

metadata = ctx.get_all_metadata()

print(f"Model: {metadata['general.name']}")

```

**Structured Output** - JSON schema to grammar conversion (pure Python, no C++ dependency):

```python

from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}

grammar = json_schema_to_grammar(schema)

```

**Hugging Face Model Downloads**:

```python

from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)

download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters

download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download specific file to custom path

download_model(

    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",

    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",

    model_path="./models/my_model.gguf"

)

# Get file info without downloading

info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models

models = list_cached_models()

```

## What's Inside

### Text Generation (llama.cpp)

- [x] **Full llama.cpp API** - Complete Cython wrapper with strong typing

- [x] **High-Level API** - Simple, Pythonic interface (`LLM`, `complete`, `chat`)

- [x] **Streaming Support** - Token-by-token generation with callbacks

- [x] **Batch Processing** - Efficient parallel inference

- [x] **Multimodal** - LLAVA and vision-language models

- [x] **Speculative Decoding** - 2-3x inference speedup with draft models

### Speech Recognition (whisper.cpp)

- [x] **Full whisper.cpp API** - Complete Cython wrapper

- [x] **High-Level API** - Simple `transcribe()` function

- [x] **Multiple Formats** - WAV, MP3, FLAC, and more

- [x] **Language Detection** - Automatic or specified language

- [x] **Timestamps** - Word and segment-level timing

### Image & Video Generation (stable-diffusion.cpp)

- [x] **Full stable-diffusion.cpp API** - Complete Cython wrapper

- [x] **Text-to-Image** - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2

- [x] **Image-to-Image** - Transform existing images

- [x] **Inpainting** - Mask-based editing

- [x] **ControlNet** - Guided generation with edge/pose/depth

- [x] **Video Generation** - Wan, CogVideoX models

- [x] **Upscaling** - ESRGAN 4x upscaling

### Cross-Cutting Features

- [x] **GPU Acceleration** - Metal, CUDA, ROCm, SYCL, Vulkan backends

- [x] **Memory Optimization** - Smart GPU layer allocation

- [x] **Agent Framework** - ReActAgent, ConstrainedAgent, ContractAgent

- [x] **Framework Integration** - OpenAI API, LangChain, FastAPI

## Why Cyllama?

**Performance**: Compiled Cython wrappers with minimal overhead

- Strong type checking at compile time

- Zero-copy data passing where possible

- Efficient memory management

- Native integration with llama.cpp optimizations

**Simplicity**: From 50 lines to 1 line for basic generation

- Intuitive, Pythonic API design

- Automatic resource management

- Sensible defaults, full control when needed

**Production-Ready**: Battle-tested and comprehensive

- 1150+ passing tests with extensive coverage

- Comprehensive documentation and examples

- Proper error handling and logging

- Framework integration for real applications

**Up-to-Date**: Tracks bleeding-edge llama.cpp

- Regular updates with latest features

- All high-priority APIs wrapped

- Performance optimizations included

## Status

**Current Version**: 0.2.1 (Mar 2026)

**llama.cpp Version**: b8429

**Build System**: scikit-build-core + CMake

**Test Coverage**: 1150+ tests passing

**Platform**: macOS (tested), Linux (tested), Windows (tested)

### Recent Releases

- **v0.2.1** (Mar 2026) - Code quality hardening: GIL release for whisper/encode, async stream fixes, memory-aware embedding cache, CI robustness, 30+ bug fixes, 1150+ tests

- **v0.2.0** (Mar 2026) - Dynamic-linked GPU wheels (CUDA, ROCm, SYCL, Vulkan) on PyPI, unified ggml, sqlite-vector vendored

- **v0.1.21** (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled

- **v0.1.20** (Feb 2026) - Update llama.cpp + stable-diffusion.cpp

- **v0.1.19** (Dec 2025) - Metal fix for stable-diffusion.cpp

- **v0.1.18** (Dec 2025) - Remaining stable-diffusion.cpp wrapped

- **v0.1.16** (Dec 2025) - Response class, Async API, Chat templates

- **v0.1.12** (Nov 2025) - Initial wrapper of stable-diffusion.cpp

- **v0.1.11** (Nov 2025) - ACP support, build improvements

- **v0.1.10** (Nov 2025) - Agent Framework, bug fixes

- **v0.1.9** (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation

- **v0.1.8** (Nov 2025) - Speculative decoding API

- **v0.1.7** (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache

- **v0.1.6** (Nov 2025) - Multimodal test fixes

- **v0.1.5** (Oct 2025) - Mongoose server, embedded server

- **v0.1.4** (Oct 2025) - Memory estimation, performance optimizations

See [CHANGELOG.md](CHANGELOG.md) for complete release history.

## Building from Source

To build `cyllama` from source:

1. Install a recent version of `python3` (currently tested on Python 3.13).

2. Git clone the latest version of `cyllama`:

    ```sh

    git clone https://github.com/shakfu/cyllama.git

    cd cyllama

    ```

3. We use [uv](https://github.com/astral-sh/uv) for package management:

    If you don't have it, see the link above to install it. Otherwise:

    ```sh

    uv sync

    ```

4. Type `make` in the terminal.

    This will:

    1. Download and build `llama.cpp`, `whisper.cpp` and `stable-diffusion.cpp`

    2. Install them into the `thirdparty` folder

    3. Build `cyllama` using scikit-build-core + CMake

### Build Commands

```sh

# Full build (default: static linking, builds llama.cpp from source)

make              # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)

make build-dynamic  # No source compilation needed for llama.cpp

# Build wheel for distribution

make wheel        # Creates wheel in dist/

make dist         # Creates sdist + wheel in dist/

# Backend-specific builds

make build-metal  # macOS Metal (default on macOS)

make build-cuda   # NVIDIA CUDA

make build-vulkan # Vulkan (cross-platform)

make build-cpu    # CPU only

# Clean and rebuild

make clean        # Remove build artifacts

make reset        # Full reset including thirdparty

make remake       # Clean rebuild with tests

# Code quality

make lint         # Lint with ruff (auto-fix)

make format       # Format with ruff

make typecheck    # Type check with mypy

make qa           # Run all: lint, typecheck, format

# Memory leak detection

make leaks        # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing

make check        # Validate wheels with twine

make publish      # Upload to PyPI

make publish-test # Upload to TestPyPI

```

### GPU Acceleration

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):

```sh

# CUDA (NVIDIA GPUs)

make build-cuda

# Vulkan (Cross-platform GPU)

make build-vulkan

# Multiple backends

export GGML_CUDA=1 GGML_VULKAN=1

make build

```

See [Build Backends](docs/build_backends.md) for comprehensive backend build instructions.

### Multi-GPU Configuration

For systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:

```python

from cyllama import LLM, GenerationConfig

# Use a specific GPU (GPU index 1)

llm = LLM("model.gguf", main_gpu=1)

# Multi-GPU with layer splitting (default mode)

llm = LLM("model.gguf", split_mode=1, n_gpu_layers=99)

# Multi-GPU with tensor parallelism (row splitting)

llm = LLM("model.gguf", split_mode=2, n_gpu_layers=99)

# Custom tensor split: 30% GPU 0, 70% GPU 1

llm = LLM("model.gguf", tensor_split=[0.3, 0.7])

# Full configuration via GenerationConfig

config = GenerationConfig(

    main_gpu=0,

    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW

    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1

    n_gpu_layers=99

)

llm = LLM("model.gguf", config=config)

```

**Split Modes:**

- `0` (NONE): Single GPU only, uses `main_gpu`

- `1` (LAYER): Split layers and KV cache across GPUs (default)

- `2` (ROW): Tensor parallelism - split layers with row-wise distribution
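
As the `tensor_split=[1, 2]` comment above indicates, the values are relative weights rather than absolute fractions. A tiny helper (hypothetical, for illustration only) makes the arithmetic explicit:

```python
def split_fractions(tensor_split):
    """Normalize relative weights such as [1, 2] into per-GPU fractions."""
    total = sum(tensor_split)
    if total <= 0:
        raise ValueError("tensor_split must contain at least one positive weight")
    return [weight / total for weight in tensor_split]


for split in ([1, 2], [0.3, 0.7], [1, 1, 2]):
    fractions = split_fractions(split)
    print(split, "->", ", ".join(f"GPU {i}: {f:.0%}" for i, f in enumerate(fractions)))
```

So `[1, 2]` and `[0.5, 1.0]` describe the same split: one third of the model on GPU 0 and two thirds on GPU 1.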

## Testing

The `tests` directory in this repo provides extensive examples of using cyllama.

As a first step, download a smallish LLM in `.gguf` format from [Hugging Face](https://huggingface.co/models?search=gguf). A good small model to start with, and the one assumed by the tests, is [Llama-3.2-1B-Instruct-Q8_0.gguf](https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf). `cyllama` expects models to be stored in a `models` folder in the cloned `cyllama` directory. To create the `models` directory if it doesn't exist and download this model, run:

```sh

make download

```

This is roughly equivalent to:

```sh

cd cyllama
mkdir -p models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

```

Now you can test it using `llama-cli` or `llama-simple`:

```sh

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

```

The full test suite (1150+ tests) can be run with:

```sh

make test  # Run full test suite

```

You can also explore interactively:

```python

python3 -i scripts/start.py
>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)

```

## Documentation

Full documentation is available at [https://shakfu.github.io/cyllama/](https://shakfu.github.io/cyllama/) (built with MkDocs).

To serve docs locally: `make docs-serve`

- **[User Guide](docs/user_guide.md)** - Comprehensive guide covering all features

- **[API Reference](docs/api_reference.md)** - Complete API documentation

- **[Cookbook](docs/cookbook.md)** - Practical recipes and patterns

- **[Changelog](CHANGELOG.md)** - Complete release history

- **Examples** - See `tests/examples/` for working code samples

## Roadmap

### Completed

- [x] Full llama.cpp API wrapper with Cython

- [x] High-level API (`LLM`, `complete`, `chat`)

- [x] Async API support (`AsyncLLM`, `complete_async`, `chat_async`)

- [x] Response class with stats and serialization

- [x] Built-in chat template system (llama.cpp templates)

- [x] Batch processing utilities

- [x] OpenAI-compatible API client

- [x] LangChain integration

- [x] Speculative decoding

- [x] GGUF file manipulation

- [x] JSON schema to grammar conversion

- [x] Model download helper

- [x] N-gram cache

- [x] OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer)

- [x] Whisper.cpp integration

- [x] Multimodal support (LLAVA)

- [x] Memory estimation utilities

- [x] Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)

- [x] Stable Diffusion (stable-diffusion.cpp) - image/video generation

- [x] RAG utilities (text chunking, document processing)

### Future

- [ ] Web UI for testing

## Contributing

Contributions are welcome! Please see the [User Guide](docs/user_guide.md) for development guidelines.

## License

This project wraps [llama.cpp](https://github.com/ggml-org/llama.cpp), [whisper.cpp](https://github.com/ggml-org/whisper.cpp), and [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp), all of which are MIT-licensed, as is cyllama.