{"id":24192707,"url":"https://github.com/tangledgroup/llama-cpp-cffi","last_synced_at":"2025-09-21T16:32:19.253Z","repository":{"id":247208858,"uuid":"824473596","full_name":"tangledgroup/llama-cpp-cffi","owner":"tangledgroup","description":"llama.cpp cffi python binding","archived":false,"fork":false,"pushed_at":"2025-01-09T09:31:22.000Z","size":3633,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-09T10:37:56.817Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tangledgroup.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-05T07:58:48.000Z","updated_at":"2025-01-09T09:31:26.000Z","dependencies_parsed_at":"2024-12-03T17:21:30.332Z","dependency_job_id":"1b1891b5-043e-476e-bf83-ca796f0a08ce","html_url":"https://github.com/tangledgroup/llama-cpp-cffi","commit_stats":null,"previous_names":["mtasic85/llama-cpp-cffi","tangledgroup/llama-cpp-cffi"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tangledgroup%2Fllama-cpp-cffi","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tangledgroup%2Fllama-cpp-cffi/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tangledgroup%2Fllama-cpp-cffi/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tangledgroup%2Fllama-cpp-cffi/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tangledgroup","download_url":"https://codeload.github.com/tangledgroup/llama-cpp-cffi/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":233770325,"owners_count":18727553,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-13T16:19:57.498Z","updated_at":"2025-09-21T16:32:19.234Z","avatar_url":"https://github.com/tangledgroup.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llama-cpp-cffi\n\n\u003c!--\n[![Build][build-image]]()\n[![Status][status-image]][pypi-project-url]\n[![Stable Version][stable-ver-image]][pypi-project-url]\n[![Coverage][coverage-image]]()\n[![Python][python-ver-image]][pypi-project-url]\n[![License][mit-image]][mit-url]\n--\u003e\n[![PyPI](https://img.shields.io/pypi/v/llama-cpp-cffi)](https://pypi.org/project/llama-cpp-cffi/)\n[![Supported Versions](https://img.shields.io/pypi/pyversions/llama-cpp-cffi)](https://pypi.org/project/llama-cpp-cffi)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/llama-cpp-cffi)](https://pypistats.org/packages/llama-cpp-cffi)\n[![Github Downloads](https://img.shields.io/github/downloads/tangledgroup/llama-cpp-cffi/total.svg?label=Github%20Downloads)]()\n[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)\n\n**Python** 3.10+ binding for [llama.cpp](https://github.com/ggerganov/llama.cpp) using **cffi**. 
Supports **CPU**, **Vulkan 1.x** (AMD, Intel and Nvidia GPUs) and **CUDA 12.8** (Nvidia GPUs) runtimes on **x86_64** (and soon **aarch64**) platforms.\n\nNOTE: The only operating system currently supported is **Linux** (`manylinux_2_28` and `musllinux_1_2`), but we are working on both **Windows** and **macOS** versions.\n\n## News\n\n- **Mar 04 2025, v0.4.38**: Conditional Structured Output using `CompletionsOptions.grammar_ignore_until`\n- **Feb 28 2025, v0.4.36**: CUDA 12.8.0 for x86_64; CUDA ARCHITECTURES: `50, 61, 70, 75, 80, 86, 89, 90, 100, 101, 120`\n- **Feb 17 2025, v0.4.21**: CUDA 12.8.0 for x86_64; CUDA ARCHITECTURES: `61, 70, 75, 80, 86, 89, 90, 100, 101, 120`\n- **Jan 15 2025, v0.4.15**: Dynamically load/unload models while executing prompts in parallel.\n- **Jan 14 2025, v0.4.14**: Modular llama.cpp build using the `cmake` build system. The `make` build system is deprecated.\n- **Jan 01 2025, v0.3.1**: OpenAI compatible API, **text** and **vision** models. Added support for **Qwen2-VL** models. Hot-swap of models on demand in server/API.\n- **Dec 09 2024, v0.2.0**: Low-level and high-level APIs: llama, llava, clip and ggml API.\n- **Nov 27 2024, v0.1.22**: Support for multimodal models such as **llava** and **minicpmv**.\n\n## Install\n\nBasic library install:\n\n```bash\npip install llama-cpp-cffi\n```\n\nIf you want an [OpenAI © Chat Completions API](https://platform.openai.com/docs/overview) compatible API:\n\n```bash\npip install llama-cpp-cffi[openai]\n```\n\n**IMPORTANT:** If you want to take advantage of **Nvidia** GPU acceleration, make sure that you have installed **CUDA 12**. If you don't have `CUDA 12.X.Y` installed, follow the instructions at https://developer.nvidia.com/cuda-downloads.\n\nGPU Compute Capability: `50;61;70;75;80;86;89;90;100;101;120`, covering most GPUs from **GeForce GTX 1050** to **Nvidia H100** and **Nvidia Blackwell**. 
[GPU Compute Capability](https://developer.nvidia.com/cuda-gpus).\n\n## LLM Example\n\n```python\nfrom llama import Model\n\n#\n# first define and load/init model\n#\nmodel = Model(\n    creator_hf_repo='HuggingFaceTB/SmolLM2-1.7B-Instruct',\n    hf_repo='bartowski/SmolLM2-1.7B-Instruct-GGUF',\n    hf_file='SmolLM2-1.7B-Instruct-Q4_K_M.gguf',\n)\n\nmodel.init(n_ctx=8 * 1024, gpu_layers=99)\n\n#\n# messages\n#\nmessages = [\n    {'role': 'system', 'content': 'You are a helpful assistant.'},\n    {'role': 'user', 'content': '1 + 1 = ?'},\n    {'role': 'assistant', 'content': '2'},\n    {'role': 'user', 'content': 'Evaluate 1 + 2 in Python.'},\n]\n\ncompletions = model.completions(\n    messages=messages,\n    predict=1 * 1024,\n    temp=0.7,\n    top_p=0.8,\n    top_k=100,\n)\n\nfor chunk in completions:\n    print(chunk, flush=True, end='')\n\n#\n# prompt\n#\nprompt='Evaluate 1 + 2 in Python. Result in Python is'\n\ncompletions = model.completions(\n    prompt=prompt,\n    predict=1 * 1024,\n    temp=0.7,\n    top_p=0.8,\n    top_k=100,\n)\n\nfor chunk in completions:\n    print(chunk, flush=True, end='')\n```\n\n### References\n- `examples/llm.py`\n- `examples/demo_text.py`\n\n## VLM Example\n\n```python\nfrom llama import Model\n\n#\n# first define and load/init model\n#\nmodel = Model( # 1.87B\n    creator_hf_repo='vikhyatk/moondream2',\n    hf_repo='vikhyatk/moondream2',\n    hf_file='moondream2-text-model-f16.gguf',\n    mmproj_hf_file='moondream2-mmproj-f16.gguf',\n)\n\nmodel.init(n_ctx=8 * 1024, gpu_layers=99)\n\n#\n# prompt\n#\nprompt = 'Describe this image.'\nimage = 'examples/llama-1.png'\n\ncompletions = model.completions(\n    prompt=prompt,\n    image=image,\n    predict=1 * 1024,\n)\n\nfor chunk in completions:\n    print(chunk, flush=True, end='')\n```\n\n### References\n- `examples/vlm.py`\n- `examples/demo_llava.py`\n- `examples/demo_minicpmv.py`\n- `examples/demo_qwen2vl.py`\n\n## API\n\n### Server - llama-cpp-cffi + OpenAI API\n\nRun server 
first:\n\n```bash\npython -B -u -m llama.server\n# or\npython -B -u -m gunicorn --bind '0.0.0.0:11434' --timeout 900 --workers 1 --worker-class aiohttp.GunicornWebWorker 'llama.server:build_app()'\n```\n\n### Client - llama-cpp-cffi API / curl\n\n```bash\n#\n# llm\n#\ncurl -XPOST 'http://localhost:11434/api/1.0/completions' \\\n-H \"Content-Type: application/json\" \\\n-d '{\n    \"gpu_layers\": 99,\n    \"prompt\": \"Evaluate 1 + 2 in Python.\"\n}'\n\ncurl -XPOST 'http://localhost:11434/api/1.0/completions' \\\n-H \"Content-Type: application/json\" \\\n-d '{\n    \"creator_hf_repo\": \"HuggingFaceTB/SmolLM2-1.7B-Instruct\",\n    \"hf_repo\": \"bartowski/SmolLM2-1.7B-Instruct-GGUF\",\n    \"hf_file\": \"SmolLM2-1.7B-Instruct-Q4_K_M.gguf\",\n    \"gpu_layers\": 99,\n    \"prompt\": \"Evaluate 1 + 2 in Python.\"\n}'\n\ncurl -XPOST 'http://localhost:11434/api/1.0/completions' \\\n-H \"Content-Type: application/json\" \\\n-d '{\n    \"creator_hf_repo\": \"Qwen/Qwen2.5-0.5B-Instruct\",\n    \"hf_repo\": \"Qwen/Qwen2.5-0.5B-Instruct-GGUF\",\n    \"hf_file\": \"qwen2.5-0.5b-instruct-q4_k_m.gguf\",\n    \"gpu_layers\": 99,\n    \"prompt\": \"Evaluate 1 + 2 in Python.\"\n}'\n\ncurl -XPOST 'http://localhost:11434/api/1.0/completions' \\\n-H \"Content-Type: application/json\" \\\n-d '{\n    \"creator_hf_repo\": \"Qwen/Qwen2.5-7B-Instruct\",\n    \"hf_repo\": \"bartowski/Qwen2.5-7B-Instruct-GGUF\",\n    \"hf_file\": \"Qwen2.5-7B-Instruct-Q4_K_M.gguf\",\n    \"gpu_layers\": 99,\n    \"prompt\": \"Evaluate 1 + 2 in Python.\"\n}'\n\n#\n# vlm - example 1\n#\nimage_path=\"examples/llama-1.jpg\"\nmime_type=$(file -b --mime-type \"$image_path\")\nbase64_data=$(base64 -w 0 \"$image_path\")\n\ncat \u003c\u003c EOF \u003e /tmp/temp.json\n{\n    \"creator_hf_repo\": \"Qwen/Qwen2-VL-2B-Instruct\",\n    \"hf_repo\": \"bartowski/Qwen2-VL-2B-Instruct-GGUF\",\n    \"hf_file\": \"Qwen2-VL-2B-Instruct-Q4_K_M.gguf\",\n    \"mmproj_hf_file\": \"mmproj-Qwen2-VL-2B-Instruct-f16.gguf\",\n    
\"gpu_layers\": 99,\n    \"prompt\": \"Describe this image.\",\n    \"image\": \"data:$mime_type;base64,$base64_data\"\n}\nEOF\n\ncurl -XPOST 'http://localhost:11434/api/1.0/completions' \\\n-H \"Content-Type: application/json\" \\\n--data-binary \"@/tmp/temp.json\"\n\n#\n# vlm - example 2\n#\nimage_path=\"examples/llama-1.jpg\"\nmime_type=$(file -b --mime-type \"$image_path\")\nbase64_data=$(base64 -w 0 \"$image_path\")\n\ncat \u003c\u003c EOF \u003e /tmp/temp.json\n{\n    \"creator_hf_repo\": \"Qwen/Qwen2-VL-2B-Instruct\",\n    \"hf_repo\": \"bartowski/Qwen2-VL-2B-Instruct-GGUF\",\n    \"hf_file\": \"Qwen2-VL-2B-Instruct-Q4_K_M.gguf\",\n    \"mmproj_hf_file\": \"mmproj-Qwen2-VL-2B-Instruct-f16.gguf\",\n    \"gpu_layers\": 99,\n    \"messages\": [\n        {\"role\": \"user\", \"content\": [\n            {\"type\": \"text\", \"text\": \"Describe this image.\"},\n            {\n                \"type\": \"image_url\",\n                \"image_url\": {\"url\": \"data:$mime_type;base64,$base64_data\"}\n            }\n        ]}\n    ]\n}\nEOF\n\ncurl -XPOST 'http://localhost:11434/api/1.0/completions' \\\n-H \"Content-Type: application/json\" \\\n--data-binary \"@/tmp/temp.json\"\n```\n\n### Client - OpenAI © compatible Chat Completions API\n\n```bash\n#\n# text\n#\ncurl -XPOST 'http://localhost:11434/v1/chat/completions' \\\n-H \"Content-Type: application/json\" \\\n-d '{\n    \"model\": \"HuggingFaceTB/SmolLM2-1.7B-Instruct:bartowski/SmolLM2-1.7B-Instruct-GGUF:SmolLM2-1.7B-Instruct-Q4_K_M.gguf\",\n    \"messages\": [\n        {\n            \"role\": \"user\",\n            \"content\": \"Evaluate 1 + 2 in Python.\"\n        }\n    ],\n    \"n_ctx\": 8192,\n    \"gpu_layers\": 99\n}'\n\n#\n# image\n#\nimage_path=\"examples/llama-1.jpg\"\nmime_type=$(file -b --mime-type \"$image_path\")\nbase64_data=$(base64 -w 0 \"$image_path\")\n\ncat \u003c\u003c EOF \u003e /tmp/temp.json\n{\n    \"model\": 
\"Qwen/Qwen2-VL-2B-Instruct:bartowski/Qwen2-VL-2B-Instruct-GGUF:Qwen2-VL-2B-Instruct-Q4_K_M.gguf:mmproj-Qwen2-VL-2B-Instruct-f16.gguf\",\n    \"messages\": [\n        {\"role\": \"user\", \"content\": [\n            {\"type\": \"text\", \"text\": \"Describe this image.\"},\n            {\n                \"type\": \"image_url\",\n                \"image_url\": {\"url\": \"data:$mime_type;base64,$base64_data\"}\n            }\n        ]}\n    ],\n    \"n_ctx\": 8192,\n    \"gpu_layers\": 99\n}\nEOF\n\ncurl -XPOST 'http://localhost:11434/v1/chat/completions' \\\n-H \"Content-Type: application/json\" \\\n--data-binary \"@/tmp/temp.json\"\n\n#\n# Client Python API for OpenAI\n#\npython -B examples/demo_openai.py\n```\n\n### References\n- `examples/demo_openai.py`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftangledgroup%2Fllama-cpp-cffi","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftangledgroup%2Fllama-cpp-cffi","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftangledgroup%2Fllama-cpp-cffi/lists"}