{"id":14964808,"url":"https://github.com/c0sogi/llama-api","last_synced_at":"2025-10-05T10:30:47.467Z","repository":{"id":182581691,"uuid":"668722849","full_name":"c0sogi/llama-api","owner":"c0sogi","description":"An OpenAI-like LLaMA inference API","archived":false,"fork":false,"pushed_at":"2023-09-17T16:41:20.000Z","size":680,"stargazers_count":113,"open_issues_count":16,"forks_count":9,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-09-30T07:02:29.001Z","etag":null,"topics":["api","exllama","fastapi","llama","llamacpp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/c0sogi.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-20T12:57:35.000Z","updated_at":"2025-08-07T12:29:12.000Z","dependencies_parsed_at":"2024-09-13T22:42:39.413Z","dependency_job_id":"a019d46b-8aa1-4346-86d8-db05c9a5b7dc","html_url":"https://github.com/c0sogi/llama-api","commit_stats":{"total_commits":108,"total_committers":4,"mean_commits":27.0,"dds":"0.14814814814814814","last_synced_commit":"6b254fdaab2ac2337e6b93d910b41a96f8de2a80"},"previous_names":["c0sogi/llama-api"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/c0sogi/llama-api","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c0sogi%2Fllama-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c0sogi%2Fllama-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c0sogi%2Fllama-api/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c0sogi%2Fllama-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/c0sogi","download_url":"https://codeload.github.com/c0sogi/llama-api/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/c0sogi%2Fllama-api/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278395909,"owners_count":25979690,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-04T02:00:05.491Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","exllama","fastapi","llama","llamacpp"],"created_at":"2024-09-24T13:33:48.915Z","updated_at":"2025-10-05T10:30:47.153Z","avatar_url":"https://github.com/c0sogi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"**Python 3.8 / 3.9 / 3.10 / 3.11** on **Windows / Linux / MacOS**\n\n[![Python 3.8 / 3.9 / 3.10 / 3.11 on Windows / Linux / MacOS](https://github.com/c0sogi/llama-api/actions/workflows/ci.yml/badge.svg)](https://github.com/c0sogi/llama-api/actions/workflows/ci.yml)\n\n---\n## About this repository\nThis project aims to provide a simple way to run **LLama.cpp** and **Exllama** models as a OpenAI-like API server.\n\nYou can use this server to run the models in your own application, or use it as a standalone API server!\n\n## Before you 
start\n\n1. **Python 3.8 / 3.9 / 3.10 / 3.11** is required to run the server. You can download it from https://www.python.org/downloads/\n\n2. **llama.cpp**: To use the cuBLAS (for NVIDIA GPUs) version of llama.cpp on **Windows**, download [CUDA Toolkit 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive).\n\n3. **ExLlama**: To use ExLlama, install the prerequisites of this [repository](https://github.com/turboderp/exllama). **Windows** users may need to install both [MSVC 2022](https://visualstudio.microsoft.com/downloads/) and [CUDA Toolkit 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive).\n\n\n\n## How to run server\n\nAll required packages will be installed automatically with this command.\n\n```bash\npython -m main --install-pkgs\n```\n\nIf you already have all required packages installed, you can skip the installation with this command.\n```bash\npython -m main\n```\nOptions:\n```b\nusage: main.py [-h] [--port PORT] [--max-workers MAX_WORKERS]\n               [--max-semaphores MAX_SEMAPHORES]\n               [--max-tokens-limit MAX_TOKENS_LIMIT] [--api-key API_KEY]\n               [--no-embed] [--tunnel] [--install-pkgs] [--force-cuda]\n               [--skip-torch-install] [--skip-tf-install] [--skip-compile]\n               [--no-cache-dir] [--upgrade]\n\noptions:\n  -h, --help            show this help message and exit\n  --port PORT, -p PORT  Port to run the server on; default is 8000\n  --max-workers MAX_WORKERS, -w MAX_WORKERS\n                        Maximum number of process workers to run; default is 1\n  --max-semaphores MAX_SEMAPHORES, -s MAX_SEMAPHORES\n                        Maximum number of process semaphores to permit;\n                        default is 1\n  --max-tokens-limit MAX_TOKENS_LIMIT, -l MAX_TOKENS_LIMIT\n                        Set the maximum number of tokens to `max_tokens`. 
This\n                        is needed to limit the number of tokens\n                        generated.Default is None, which means no limit.        \n  --api-key API_KEY, -k API_KEY\n                        API key to use for the server\n  --no-embed            Disable embeddings endpoint\n  --tunnel, -t          Tunnel the server through cloudflared\n  --install-pkgs, -i    Install all required packages before running the        \n                        server\n  --force-cuda, -c      Force CUDA version of pytorch to be used when\n                        installing pytorch. e.g. torch==2.0.1+cu118\n  --skip-torch-install, --no-torch\n                        Skip installing pytorch, if `install-pkgs` is set       \n  --skip-tf-install, --no-tf\n                        Skip installing tensorflow, if `install-pkgs` is set    \n  --skip-compile, --no-compile\n                        Skip compiling the shared library of LLaMA C++ code     \n  --no-cache-dir, --no-cache\n                        Disable caching of pip installs, if `install-pkgs` is   \n                        set\n  --upgrade, -u         Upgrade all packages and repositories before running    \n                        the server\n```\n\n### Unique features\n\n1. **On-Demand Model Loading**\n   - The project tries to load the model defined in `model_definitions.py` into the worker process when it is sent along with the request JSON body. The worker continually uses the cached model and when a request for a different model comes in, it unloads the existing model and loads the new one. \n\n2. **Parallelism and Concurrency Enabled**\n   - Due to the internal operation of the process pool, both parallelism and concurrency are secured. The `--max-workers $NUM_WORKERS` option needs to be provided when starting the server. This, however, only applies when requests are made simultaneously for different models. 
If requests are made for the same model, they will wait until a slot becomes available due to the semaphore.\n\n3. **Auto Dependency Installation**\n   - The project automatically clones the required repositories and installs the required dependencies, including **pytorch** and **tensorflow**, when the server is started. This is done by checking the `pyproject.toml` or `requirements.txt` file in the root directory of this project or other repositories. `pyproject.toml` will be parsed into `requirements.txt` with `poetry`. If you want to add more dependencies, simply add them to the file.\n\n\n## How can I get the models?\n\n   #### 1. **Automatic download** (_Recommended_)\n   \u003e ![image](contents/auto-download-model.png)\n\n   - Set the **model_path** of your model definition in `model_definitions.py` to an actual **Hugging Face repository** and run the server. The server will automatically download the model from Hugging Face the first time the model is requested.\n   \n   #### 2. **Manual download**\n   \u003e ![image](contents/example-models.png)\n\n   - You can download the models manually if you want. I prefer to use the [following link](https://huggingface.co/TheBloke) to download the models.\n\n\n\n1. For **LLama.cpp** models: Download the **gguf** file from the GGML model page. Choose the quantization method you prefer. The gguf file name will be the **model_path**.\n\n   The LLama.cpp model must be put here as a **gguf** file, in `models/ggml/`.\n\n   For example, if you downloaded a q4_k_m quantized model from [this link](https://huggingface.co/TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF),\n   the path of the model has to be **mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf**.\n\n     *Available quantizations: q4_0, q4_1, q5_0, q5_1, q8_0, q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K*\n\n2. For **Exllama** models: Download three files from the GPTQ model page: **config.json / tokenizer.model / \\*.safetensors** and put them in a folder. 
The folder name will be the **model_path**.\n\n   The Exllama GPTQ model must be put here as a **folder**, in `models/gptq/`.\n\n   For example, if you downloaded 3 files from [this link](https://huggingface.co/TheBloke/orca_mini_7B-GPTQ/tree/main),\n\n   - orca-mini-7b-GPTQ-4bit-128g.no-act.order.safetensors\n   - tokenizer.model\n   - config.json\n\n   then you need to put them in a folder.\n   The path of the model has to be the folder name, for instance **orca_mini_7b**, which contains the 3 files.\n\n\n## Where to define the models\nDefine llama.cpp \u0026 exllama models in `model_definitions.py`. You can define all necessary parameters to load the models there. Refer to the example in the file.\nAlternatively, you can define the models in a Python script file whose name includes `model` and `def`, e.g. `my_model_def.py`.\nThe file must include at least one LLM model (LlamaCppModel or ExLlamaModel) definition.\nYou can also define an `openai_replacement_models` dictionary in the file to replace the OpenAI models with your own models. For example:\n\n```python\n# my_model_def.py\nfrom llama_api.schemas.models import LlamaCppModel, ExLlamaModel\n\n# `my_ggml` and `my_ggml2` are two definitions of the same model.\nmy_ggml = LlamaCppModel(model_path=\"TheBloke/MythoMax-L2-Kimiko-v2-13B-GGUF\", max_total_tokens=4096)\nmy_ggml2 = LlamaCppModel(model_path=\"models/ggml/mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf\", max_total_tokens=4096)\n\n# `my_gptq` and `my_gptq2` are two definitions of the same model.\nmy_gptq = ExLlamaModel(model_path=\"TheBloke/orca_mini_7B-GPTQ\", max_total_tokens=8192)\nmy_gptq2 = ExLlamaModel(model_path=\"models/gptq/orca_mini_7b\", max_total_tokens=8192)\n\n# You can replace the OpenAI models with your own models.\nopenai_replacement_models = {\"gpt-3.5-turbo\": \"my_ggml\", \"gpt-4\": \"my_gptq2\"}\n\n```\nThe RoPE frequency and scaling factor will be automatically calculated and set if you don't set them in the model definition. 
This assumes that you are using a Llama 2 model.\n\n\n## Usage: Langchain integration\n\nLangchain allows you to incorporate custom language models seamlessly. This guide will walk you through setting up your own custom model, replacing OpenAI models, and running text or chat completions.\n\n1. Defining Your Custom Model\n\nFirst, you need to define your custom language model in a Python file, for instance, `my_model_def.py`. This file should include the definition of your custom model.\n\n```python\n# my_model_def.py\nfrom llama_api.schemas.models import LlamaCppModel, ExllamaModel\n\nmythomax_l2_13b_gptq = ExllamaModel(\n    model_path=\"TheBloke/MythoMax-L2-13B-GPTQ\",  # automatic download\n    max_total_tokens=4096,\n)\n```\n\nIn the example above, we've defined a custom model named `mythomax_l2_13b_gptq` using the `ExllamaModel` class.\n\n2. Replacing OpenAI Models\n\nYou can replace an OpenAI model with your custom model using the `openai_replacement_models` dictionary. Add your custom model to this dictionary in the `my_model_def.py` file.\n\n```python\n# my_model_def.py (Continued)\nopenai_replacement_models = {\"gpt-3.5-turbo\": \"mythomax_l2_13b_gptq\"}\n```\n\nHere, we replaced the `gpt-3.5-turbo` model with our custom `mythomax_l2_13b_gptq` model.\n\n3. Running Text/Chat Completions\n\nFinally, you can use your custom model in Langchain to run text and chat completions.\n\n```python\n# langchain_test.py\nfrom langchain.chat_models import ChatOpenAI\nfrom os import environ\n\nenviron[\"OPENAI_API_KEY\"] = \"Bearer foo\"\n\nchat_model = ChatOpenAI(\n    model=\"gpt-3.5-turbo\",\n    openai_api_base=\"http://localhost:8000/v1\",\n)\nprint(chat_model.predict(\"hi!\"))\n```\n\nNow, running the `langchain_test.py` file will make use of your custom model for completions. \nNote that the 'function call' feature only works with LlamaCppModel.\nThat's it! You've successfully integrated a custom model into Langchain. 
Enjoy your enhanced text and chat completions!\n\n## Usage: Text Completion\nNow, you can send a request to the server.\n\n```python\nimport requests\n\nurl = \"http://localhost:8000/v1/completions\"\npayload = {\n    \"model\": \"my_ggml\",\n    \"prompt\": \"Hello, my name is\",\n    \"max_tokens\": 30,\n    \"top_p\": 0.9,\n    \"temperature\": 0.9,\n    \"stop\": [\"\\n\"]\n}\nresponse = requests.post(url, json=payload)\nprint(response.json())\n\n# Output:\n# {'id': 'cmpl-243b22e4-6215-4833-8960-c1b12b49aa60', 'object': 'text_completion', 'created': 1689857470, 'model': 'D:/llama-api/models/ggml/mythomax-l2-kimiko-v2-13b.Q4_K_M.gguf', 'choices': [{'text': \" John and I'm excited to share with you how I built a 6-figure online business from scratch! In this video series, I will\", 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 6, 'completion_tokens': 30, 'total_tokens': 36}}\n```\n\n## Usage: Chat Completion\n\n```python\nimport requests\n\nurl = \"http://localhost:8000/v1/chat/completions\"\npayload = {\n    \"model\": \"gpt-4\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello there!\"}],\n    \"max_tokens\": 30,\n    \"top_p\": 0.9,\n    \"temperature\": 0.9,\n    \"stop\": [\"\\n\"]\n}\nresponse = requests.post(url, json=payload)\nprint(response.json())\n\n# Output:\n# {'id': 'chatcmpl-da87a0b1-0f20-4e10-b731-ba483e13b450', 'object': 'chat.completion', 'created': 1689868843, 'model': 'D:/llama-api/models/gptq/orca_mini_7b', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': \" Hi there! Sure, I'd be happy to help you with that. What can I assist you with?\"}, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 11, 'completion_tokens': 23, 'total_tokens': 34}}\n```\n\n\n## Usage: Vector Embedding\n\nYou can also use the server to get embeddings of a text.\nFor sentence encoder(e.g. universal-sentence-encoder/4), **TensorFlow Hub** is used. 
For the other models, the embedding model will automatically be downloaded from **HuggingFace**, and inference will be done using **Transformers** and **PyTorch**.\n```python\nimport requests\n\nurl = \"http://localhost:8000/v1/embeddings\"\npayload = {\n  \"model\": \"intfloat/e5-large-v2\",  # You can also use `universal-sentence-encoder/4`\n  \"input\": \"hello world!\"\n}\nresponse = requests.post(url, json=payload)\nprint(response.json())\n\n# Output:\n# {'object': 'list', 'model': 'intfloat/e5-large-v2', 'data': [{'index': 0, 'object': 'embedding', 'embedding': [0.28619545698165894, -0.8573919534683228, ...,  1.0349756479263306]}], 'usage': {'prompt_tokens': -1, 'total_tokens': -1}}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fc0sogi%2Fllama-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fc0sogi%2Fllama-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fc0sogi%2Fllama-api/lists"}