{"id":20115399,"url":"https://github.com/amperecomputingai/llama-cpp-python","last_synced_at":"2025-03-02T19:19:08.292Z","repository":{"id":223971437,"uuid":"762043533","full_name":"AmpereComputingAI/llama-cpp-python","owner":"AmpereComputingAI","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-16T14:44:15.000Z","size":1547,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-01-13T06:26:10.576Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AmpereComputingAI.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-23T01:07:39.000Z","updated_at":"2024-09-24T12:47:46.000Z","dependencies_parsed_at":"2024-11-13T18:48:03.187Z","dependency_job_id":null,"html_url":"https://github.com/AmpereComputingAI/llama-cpp-python","commit_stats":null,"previous_names":["amperecomputingai/llama-cpp-python"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmpereComputingAI%2Fllama-cpp-python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmpereComputingAI%2Fllama-cpp-python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmpereComputingAI%2Fllama-cpp-python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmpereComputingAI%2Fllama-cpp-python/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/AmpereComputingAI","download_url":"https://codeload.github.com/AmpereComputingAI/llama-cpp-python/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241557347,"owners_count":19981910,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T18:35:05.465Z","updated_at":"2025-03-02T19:19:08.272Z","avatar_url":"https://github.com/AmpereComputingAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🦙 Python Bindings for [`llama.cpp`](https://github.com/ggerganov/llama.cpp)\n\n[![Documentation Status](https://readthedocs.org/projects/llama-cpp-python/badge/?version=latest)](https://llama-cpp-python.readthedocs.io/en/latest/?badge=latest)\n[![Tests](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml/badge.svg?branch=main)](https://github.com/abetlen/llama-cpp-python/actions/workflows/test.yaml)\n[![PyPI](https://img.shields.io/pypi/v/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)\n[![PyPI - License](https://img.shields.io/pypi/l/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-cpp-python)](https://pypi.org/project/llama-cpp-python/)\n\nSimple Python bindings for **@ggerganov's** [`llama.cpp`](https://github.com/ggerganov/llama.cpp) library.\nThis package provides:\n\n- Low-level access to C API 
via `ctypes` interface.\n- High-level Python API for text completion\n    - OpenAI-like API\n    - [LangChain compatibility](https://python.langchain.com/docs/integrations/llms/llamacpp)\n    - [LlamaIndex compatibility](https://docs.llamaindex.ai/en/stable/examples/llm/llama_2_llama_cpp.html)\n- OpenAI compatible web server\n    - [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)\n    - [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)\n    - [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)\n    - [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)\n\nDocumentation is available at [https://llama-cpp-python.readthedocs.io/en/latest](https://llama-cpp-python.readthedocs.io/en/latest).\n\n## Installation\n\nRequirements:\n\n  - Python 3.8+\n  - C compiler\n      - Linux: gcc or clang\n      - Windows: Visual Studio or MinGW\n      - MacOS: Xcode\n\nTo install the package, run:\n\n```bash\npip install llama-cpp-python\n```\n\nThis will also build `llama.cpp` from source and install it alongside this Python package.\n\nIf this fails, add `--verbose` to the `pip install` command to see the full cmake build log.\n\n### Installation Configuration\n\n`llama.cpp` supports a number of hardware acceleration backends to speed up inference as well as backend-specific options. 
See the [llama.cpp README](https://github.com/ggerganov/llama.cpp#build) for a full list.\n\nAll `llama.cpp` cmake build options can be set via the `CMAKE_ARGS` environment variable or via the `--config-settings / -C` cli flag during installation.\n\n\u003cdetails open\u003e\n\u003csummary\u003eEnvironment Variables\u003c/summary\u003e\n\n```bash\n# Linux and Mac\nCMAKE_ARGS=\"-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS\" \\\n  pip install llama-cpp-python\n```\n\n```powershell\n# Windows\n$env:CMAKE_ARGS = \"-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS\"\npip install llama-cpp-python\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eCLI / requirements.txt\u003c/summary\u003e\n\nThey can also be set via `pip install -C / --config-settings` command and saved to a `requirements.txt` file:\n\n```bash\npip install --upgrade pip # ensure pip is up to date\npip install llama-cpp-python \\\n  -C cmake.args=\"-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS\"\n```\n\n```txt\n# requirements.txt\n\nllama-cpp-python -C cmake.args=\"-DLLAMA_BLAS=ON;-DLLAMA_BLAS_VENDOR=OpenBLAS\"\n```\n\n\u003c/details\u003e\n\nhttps://github.com/abetlen/llama-cpp-python/releases/download/${VERSION}/\n\n### Supported Backends\n\nBelow are some common backends, their build commands and any additional environment variables required.\n\n\u003cdetails open\u003e\n\u003csummary\u003eOpenBLAS (CPU)\u003c/summary\u003e\n\nTo install with OpenBLAS, set the `LLAMA_BLAS` and `LLAMA_BLAS_VENDOR` environment variables before installing:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ecuBLAS (CUDA)\u003c/summary\u003e\n\nTo install with cuBLAS, set the `LLAMA_CUBLAS=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" pip install 
llama-cpp-python\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eMetal\u003c/summary\u003e\n\nTo install with Metal (MPS), set the `LLAMA_METAL=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_METAL=on\" pip install llama-cpp-python\n```\n\n\u003c/details\u003e\n\u003cdetails\u003e\n\n\u003csummary\u003eCLBlast (OpenCL)\u003c/summary\u003e\n\nTo install with CLBlast, set the `LLAMA_CLBLAST=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_CLBLAST=on\" pip install llama-cpp-python\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003ehipBLAS (ROCm)\u003c/summary\u003e\n\nTo install with hipBLAS / ROCm support for AMD cards, set the `LLAMA_HIPBLAS=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_HIPBLAS=on\" pip install llama-cpp-python\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eVulkan\u003c/summary\u003e\n\nTo install with Vulkan support, set the `LLAMA_VULKAN=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_VULKAN=on\" pip install llama-cpp-python\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eKompute\u003c/summary\u003e\n\nTo install with Kompute support, set the `LLAMA_KOMPUTE=on` environment variable before installing:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_KOMPUTE=on\" pip install llama-cpp-python\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eSYCL\u003c/summary\u003e\n\nTo install with SYCL support, set the `LLAMA_SYCL=on` environment variable before installing:\n\n```bash\nsource /opt/intel/oneapi/setvars.sh   \nCMAKE_ARGS=\"-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx\" pip install llama-cpp-python\n```\n\u003c/details\u003e\n\n\n### Windows Notes\n\n\u003cdetails\u003e\n\u003csummary\u003eError: Can't find 'nmake' or 'CMAKE_C_COMPILER'\u003c/summary\u003e\n\nIf you run into issues where it complains it can't find 
`'nmake'` `'?'` or CMAKE_C_COMPILER, you can extract w64devkit as [mentioned in llama.cpp repo](https://github.com/ggerganov/llama.cpp#openblas) and add those manually to CMAKE_ARGS before running `pip` install:\n\n```ps\n$env:CMAKE_GENERATOR = \"MinGW Makefiles\"\n$env:CMAKE_ARGS = \"-DLLAMA_OPENBLAS=on -DCMAKE_C_COMPILER=C:/w64devkit/bin/gcc.exe -DCMAKE_CXX_COMPILER=C:/w64devkit/bin/g++.exe\"\n```\n\nSee the above instructions and set `CMAKE_ARGS` to the BLAS backend you want to use.\n\u003c/details\u003e\n\n### MacOS Notes\n\nDetailed MacOS Metal GPU install documentation is available at [docs/install/macos.md](https://llama-cpp-python.readthedocs.io/en/latest/install/macos/)\n\n\u003cdetails\u003e\n\u003csummary\u003eM1 Mac Performance Issue\u003c/summary\u003e\n\nNote: If you are using Apple Silicon (M1) Mac, make sure you have installed a version of Python that supports arm64 architecture. For example:\n\n```bash\nwget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh\nbash Miniforge3-MacOSX-arm64.sh\n```\n\nOtherwise, while installing it will build the llama.cpp x86 version which will be 10x slower on Apple Silicon (M1) Mac.\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eM Series Mac Error: `(mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))`\u003c/summary\u003e\n\nTry installing with\n\n```bash\nCMAKE_ARGS=\"-DCMAKE_OSX_ARCHITECTURES=arm64 -DCMAKE_APPLE_SILICON_PROCESSOR=arm64 -DLLAMA_METAL=on\" pip install --upgrade --verbose --force-reinstall --no-cache-dir llama-cpp-python\n```\n\u003c/details\u003e\n\n### Upgrading and Reinstalling\n\nTo upgrade and rebuild `llama-cpp-python` add `--upgrade --force-reinstall --no-cache-dir` flags to the `pip install` command to ensure the package is rebuilt from source.\n\n## High-level API\n\n[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#high-level-api)\n\nThe high-level API provides a simple 
managed interface through the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.\n\nBelow is a short example demonstrating how to use the high-level API for basic text completion:\n\n```python\n\u003e\u003e\u003e from llama_cpp import Llama\n\u003e\u003e\u003e llm = Llama(\n      model_path=\"./models/7B/llama-model.gguf\",\n      # n_gpu_layers=-1, # Uncomment to use GPU acceleration\n      # seed=1337, # Uncomment to set a specific seed\n      # n_ctx=2048, # Uncomment to increase the context window\n)\n\u003e\u003e\u003e output = llm(\n      \"Q: Name the planets in the solar system? A: \", # Prompt\n      max_tokens=32, # Generate up to 32 tokens, set to None to generate up to the end of the context window\n      stop=[\"Q:\", \"\\n\"], # Stop generating just before the model would generate a new question\n      echo=True # Echo the prompt back in the output\n) # Generate a completion, can also call create_completion\n\u003e\u003e\u003e print(output)\n{\n  \"id\": \"cmpl-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\",\n  \"object\": \"text_completion\",\n  \"created\": 1679561337,\n  \"model\": \"./models/7B/llama-model.gguf\",\n  \"choices\": [\n    {\n      \"text\": \"Q: Name the planets in the solar system? 
A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.\",\n      \"index\": 0,\n      \"logprobs\": None,\n      \"finish_reason\": \"stop\"\n    }\n  ],\n  \"usage\": {\n    \"prompt_tokens\": 14,\n    \"completion_tokens\": 28,\n    \"total_tokens\": 42\n  }\n}\n```\n\nText completion is available through the [`__call__`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__) and [`create_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) methods of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.\n\n### Pulling models from Hugging Face Hub\n\nYou can download `Llama` models in `gguf` format directly from Hugging Face using the [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) method.\nYou'll need to install the `huggingface-hub` package to use this feature (`pip install huggingface-hub`).\n\n```python\nllm = Llama.from_pretrained(\n    repo_id=\"Qwen/Qwen1.5-0.5B-Chat-GGUF\",\n    filename=\"*q8_0.gguf\",\n    verbose=False\n)\n```\n\nBy default [`from_pretrained`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.from_pretrained) will download the model to the huggingface cache directory, you can then manage installed model files with the [`huggingface-cli`](https://huggingface.co/docs/huggingface_hub/en/guides/cli) tool.\n\n### Chat Completion\n\nThe high-level API also provides a simple interface for chat completion.\n\nNote that `chat_format` option must be set for the particular model you are using.\n\n```python\n\u003e\u003e\u003e from llama_cpp import Llama\n\u003e\u003e\u003e llm = Llama(\n      model_path=\"path/to/llama-2/llama-model.gguf\",\n      chat_format=\"llama-2\"\n)\n\u003e\u003e\u003e llm.create_chat_completion(\n      messages = [\n          {\"role\": \"system\", 
\"content\": \"You are an assistant who perfectly describes images.\"},\n          {\n              \"role\": \"user\",\n              \"content\": \"Describe this image in detail please.\"\n          }\n      ]\n)\n```\n\nChat completion is available through the [`create_chat_completion`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) method of the [`Llama`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama) class.\n\nFor OpenAI API v1 compatibility, you use the [`create_chat_completion_openai_v1`](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion_openai_v1) method which will return pydantic models instead of dicts.\n\n\n### JSON and JSON Schema Mode\n\nTo constrain chat responses to only valid JSON or a specific JSON Schema use the `response_format` argument in [`create_chat_completion`](http://localhost:8000/api-reference/#llama_cpp.Llama.create_chat_completion).\n\n#### JSON Mode\n\nThe following example will constrain the response to valid JSON strings only.\n\n```python\n\u003e\u003e\u003e from llama_cpp import Llama\n\u003e\u003e\u003e llm = Llama(model_path=\"path/to/model.gguf\", chat_format=\"chatml\")\n\u003e\u003e\u003e llm.create_chat_completion(\n    messages=[\n        {\n            \"role\": \"system\",\n            \"content\": \"You are a helpful assistant that outputs in JSON.\",\n        },\n        {\"role\": \"user\", \"content\": \"Who won the world series in 2020\"},\n    ],\n    response_format={\n        \"type\": \"json_object\",\n    },\n    temperature=0.7,\n)\n```\n\n#### JSON Schema Mode\n\nTo constrain the response further to a specific JSON Schema add the schema to the `schema` property of the `response_format` argument.\n\n```python\n\u003e\u003e\u003e from llama_cpp import Llama\n\u003e\u003e\u003e llm = Llama(model_path=\"path/to/model.gguf\", chat_format=\"chatml\")\n\u003e\u003e\u003e 
llm.create_chat_completion(\n    messages=[\n        {\n            \"role\": \"system\",\n            \"content\": \"You are a helpful assistant that outputs in JSON.\",\n        },\n        {\"role\": \"user\", \"content\": \"Who won the world series in 2020\"},\n    ],\n    response_format={\n        \"type\": \"json_object\",\n        \"schema\": {\n            \"type\": \"object\",\n            \"properties\": {\"team_name\": {\"type\": \"string\"}},\n            \"required\": [\"team_name\"],\n        },\n    },\n    temperature=0.7,\n)\n```\n\n### Function Calling\n\nThe high-level API also provides a simple interface for function calling. This is possible through the `functionary` pre-trained models chat format or through the generic `chatml-function-calling` chat format.\n\nThe gguf-converted files for functionary can be found here: [functionary-7b-v1](https://huggingface.co/abetlen/functionary-7b-v1-GGUF)\n\n```python\n\u003e\u003e\u003e from llama_cpp import Llama\n\u003e\u003e\u003e llm = Llama(model_path=\"path/to/functionary/llama-model.gguf\", chat_format=\"functionary\")\n\u003e\u003e\u003e # or \n\u003e\u003e\u003e llm = Llama(model_path=\"path/to/chatml/llama-model.gguf\", chat_format=\"chatml-function-calling\")\n\u003e\u003e\u003e llm.create_chat_completion(\n      messages = [\n        {\n          \"role\": \"system\",\n          \"content\": \"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. 
The assistant calls functions with appropriate input when necessary\"\n\n        },\n        {\n          \"role\": \"user\",\n          \"content\": \"Extract Jason is 25 years old\"\n        }\n      ],\n      tools=[{\n        \"type\": \"function\",\n        \"function\": {\n          \"name\": \"UserDetail\",\n          \"parameters\": {\n            \"type\": \"object\",\n            \"title\": \"UserDetail\",\n            \"properties\": {\n              \"name\": {\n                \"title\": \"Name\",\n                \"type\": \"string\"\n              },\n              \"age\": {\n                \"title\": \"Age\",\n                \"type\": \"integer\"\n              }\n            },\n            \"required\": [ \"name\", \"age\" ]\n          }\n        }\n      }],\n      tool_choice=[{\n        \"type\": \"function\",\n        \"function\": {\n          \"name\": \"UserDetail\"\n        }\n      }]\n)\n```\n\n### Multi-modal Models\n\n`llama-cpp-python` supports the llava1.5 family of multi-modal models which allow the language model to\nread information from both text and images.\n\nYou'll first need to download one of the available multi-modal models in GGUF format:\n\n- [llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)\n- [llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)\n- [bakllava-1-7b](https://huggingface.co/mys/ggml_bakllava-1)\n\nThen you'll need to use a custom chat handler to load the clip model and process the chat messages and images.\n\n```python\n\u003e\u003e\u003e from llama_cpp import Llama\n\u003e\u003e\u003e from llama_cpp.llama_chat_format import Llava15ChatHandler\n\u003e\u003e\u003e chat_handler = Llava15ChatHandler(clip_model_path=\"path/to/llava/mmproj.bin\")\n\u003e\u003e\u003e llm = Llama(\n  model_path=\"./path/to/llava/llama-model.gguf\",\n  chat_handler=chat_handler,\n  n_ctx=2048, # n_ctx should be increased to accommodate the image embedding\n  logits_all=True, # needed to make llava 
work\n)\n\u003e\u003e\u003e llm.create_chat_completion(\n    messages = [\n        {\"role\": \"system\", \"content\": \"You are an assistant who perfectly describes images.\"},\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\"type\": \"image_url\", \"image_url\": {\"url\": \"https://.../image.png\"}},\n                {\"type\" : \"text\", \"text\": \"Describe this image in detail please.\"}\n            ]\n        }\n    ]\n)\n```\n\n### Speculative Decoding\n\n`llama-cpp-python` supports speculative decoding which allows the model to generate completions based on a draft model.\n\nThe fastest way to use speculative decoding is through the `LlamaPromptLookupDecoding` class.\n\nJust pass this as a draft model to the `Llama` class during initialization.\n\n```python\nfrom llama_cpp import Llama\nfrom llama_cpp.llama_speculative import LlamaPromptLookupDecoding\n\nllama = Llama(\n    model_path=\"path/to/model.gguf\",\n    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=10) # num_pred_tokens is the number of tokens to predict 10 is the default and generally good for gpu, 2 performs better for cpu-only machines.\n)\n```\n\n### Embeddings\n\nTo generate text embeddings use [`create_embedding`](http://localhost:8000/api-reference/#llama_cpp.Llama.create_embedding).\n\n```python\nimport llama_cpp\n\nllm = llama_cpp.Llama(model_path=\"path/to/model.gguf\", embeddings=True)\n\nembeddings = llm.create_embedding(\"Hello, world!\")\n\n# or create multiple embeddings at once\n\nembeddings = llm.create_embedding([\"Hello, world!\", \"Goodbye, world!\"])\n```\n\n### Adjusting the Context Window\n\nThe context window of the Llama models determines the maximum number of tokens that can be processed at once. 
By default, this is set to 512 tokens, but can be adjusted based on your requirements.\n\nFor instance, if you want to work with larger contexts, you can expand the context window by setting the `n_ctx` parameter when initializing the `Llama` object:\n\n```python\nllm = Llama(model_path=\"./models/7B/llama-model.gguf\", n_ctx=2048)\n```\n\n## OpenAI Compatible Web Server\n\n`llama-cpp-python` offers a web server which aims to act as a drop-in replacement for the OpenAI API.\nThis allows you to use llama.cpp compatible models with any OpenAI compatible client (language libraries, services, etc).\n\nTo install the server package and get started:\n\n```bash\npip install llama-cpp-python[server]\npython3 -m llama_cpp.server --model models/7B/llama-model.gguf\n```\n\nAs with the hardware acceleration backends described above, you can also install with GPU (cuBLAS) support like this:\n\n```bash\nCMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip install llama-cpp-python[server]\npython3 -m llama_cpp.server --model models/7B/llama-model.gguf --n_gpu_layers 35\n```\n\nNavigate to [http://localhost:8000/docs](http://localhost:8000/docs) to see the OpenAPI documentation.\n\nTo bind to `0.0.0.0` to enable remote connections, use `python3 -m llama_cpp.server --host 0.0.0.0`.\nSimilarly, to change the port (default is 8000), use `--port`.\n\nYou probably also want to set the prompt format. For chatml, use\n\n```bash\npython3 -m llama_cpp.server --model models/7B/llama-model.gguf --chat_format chatml\n```\n\nThat will format the prompt the way the model expects it. 
You can find the prompt format in the model card.\nFor possible options, see [llama_cpp/llama_chat_format.py](llama_cpp/llama_chat_format.py) and look for lines starting with \"@register_chat_format\".\n\n### Web Server Features\n\n- [Local Copilot replacement](https://llama-cpp-python.readthedocs.io/en/latest/server/#code-completion)\n- [Function Calling support](https://llama-cpp-python.readthedocs.io/en/latest/server/#function-calling)\n- [Vision API support](https://llama-cpp-python.readthedocs.io/en/latest/server/#multimodal-models)\n- [Multiple Models](https://llama-cpp-python.readthedocs.io/en/latest/server/#configuration-and-multi-model-support)\n\n## Docker image\n\nA Docker image is available on [GHCR](https://ghcr.io/abetlen/llama-cpp-python). To run the server:\n\n```bash\ndocker run --rm -it -p 8000:8000 -v /path/to/models:/models -e MODEL=/models/llama-model.gguf ghcr.io/abetlen/llama-cpp-python:latest\n```\n\n[Docker on termux (requires root)](https://gist.github.com/FreddieOliveira/efe850df7ff3951cb62d74bd770dce27) is currently the only known way to run this on phones, see [termux support issue](https://github.com/abetlen/llama-cpp-python/issues/389)\n\n## Low-level API\n\n[API Reference](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#low-level-api)\n\nThe low-level API is a direct [`ctypes`](https://docs.python.org/3/library/ctypes.html) binding to the C API provided by `llama.cpp`.\nThe entire low-level API can be found in [llama_cpp/llama_cpp.py](https://github.com/abetlen/llama-cpp-python/blob/master/llama_cpp/llama_cpp.py) and directly mirrors the C API in [llama.h](https://github.com/ggerganov/llama.cpp/blob/master/llama.h).\n\nBelow is a short example demonstrating how to use the low-level API to tokenize a prompt:\n\n```python\n\u003e\u003e\u003e import llama_cpp\n\u003e\u003e\u003e import ctypes\n\u003e\u003e\u003e llama_cpp.llama_backend_init(False) # Must be called once at the start of each program\n\u003e\u003e\u003e 
params = llama_cpp.llama_context_default_params()\n# use bytes for char * params\n\u003e\u003e\u003e model = llama_cpp.llama_load_model_from_file(b\"./models/7b/llama-model.gguf\", params)\n\u003e\u003e\u003e ctx = llama_cpp.llama_new_context_with_model(model, params)\n\u003e\u003e\u003e max_tokens = params.n_ctx\n# use ctypes arrays for array params\n\u003e\u003e\u003e tokens = (llama_cpp.llama_token * int(max_tokens))()\n\u003e\u003e\u003e n_tokens = llama_cpp.llama_tokenize(ctx, b\"Q: Name the planets in the solar system? A: \", tokens, max_tokens, llama_cpp.c_bool(True))\n\u003e\u003e\u003e llama_cpp.llama_free(ctx)\n```\n\nCheck out the [examples folder](examples/low_level_api) for more examples of using the low-level API.\n\n## Documentation\n\nDocumentation is available via [https://llama-cpp-python.readthedocs.io/](https://llama-cpp-python.readthedocs.io/).\nIf you find any issues with the documentation, please open an issue or submit a PR.\n\n## Development\n\nThis package is under active development and I welcome any contributions.\n\nTo get started, clone the repository and install the package in editable / development mode:\n\n```bash\ngit clone --recurse-submodules https://github.com/abetlen/llama-cpp-python.git\ncd llama-cpp-python\n\n# Upgrade pip (required for editable mode)\npip install --upgrade pip\n\n# Install with pip\npip install -e .\n\n# if you want to use the fastapi / openapi server\npip install -e .[server]\n\n# to install all optional dependencies\npip install -e .[all]\n\n# to clear the local build cache\nmake clean\n```\n\nYou can also test out specific commits of `llama.cpp` by checking out the desired commit in the `vendor/llama.cpp` submodule and then running `make clean` and `pip install -e .` again. 
Any changes in the `llama.h` API will require\nchanges to the `llama_cpp/llama_cpp.py` file to match the new API (additional changes may be required elsewhere).\n\n## FAQ\n\n### Are there pre-built binaries / binary wheels available?\n\nThe recommended installation method is to install from source as described above.\nThe reason for this is that `llama.cpp` is built with compiler optimizations that are specific to your system.\nUsing pre-built binaries would require disabling these optimizations or supporting a large number of pre-built binaries for each platform.\n\nThat being said there are some pre-built binaries available through the Releases as well as some community provided wheels.\n\nIn the future, I would like to provide pre-built binaries and wheels for common platforms and I'm happy to accept any useful contributions in this area.\nThis is currently being tracked in [#741](https://github.com/abetlen/llama-cpp-python/issues/741)\n\n### How does this compare to other Python bindings of `llama.cpp`?\n\nI originally wrote this package for my own use with two goals in mind:\n\n- Provide a simple process to install `llama.cpp` and access the full C API in `llama.h` from Python\n- Provide a high-level Python API that can be used as a drop-in replacement for the OpenAI API so existing apps can be easily ported to use `llama.cpp`\n\nAny contributions and changes to this package will be made with these goals in mind.\n\n## License\n\nThis project is licensed under the terms of the MIT license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famperecomputingai%2Fllama-cpp-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famperecomputingai%2Fllama-cpp-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famperecomputingai%2Fllama-cpp-python/lists"}