{"id":14964804,"url":"https://github.com/1b5d/llm-api","last_synced_at":"2026-03-11T03:03:08.005Z","repository":{"id":152111307,"uuid":"622709062","full_name":"1b5d/llm-api","owner":"1b5d","description":"Run any Large Language Model behind a unified API","archived":false,"fork":false,"pushed_at":"2023-11-13T14:37:33.000Z","size":55,"stargazers_count":171,"open_issues_count":5,"forks_count":28,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-02-18T09:54:22.572Z","etag":null,"topics":["chatgpt","gptq","huggingface","langchain","llama","llamacpp","llm","llm-inference","machine-learning","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/1b5d.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-02T22:15:29.000Z","updated_at":"2026-02-05T01:24:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"998fca41-0065-4f89-8af6-99e5e2f648e9","html_url":"https://github.com/1b5d/llm-api","commit_stats":{"total_commits":31,"total_committers":1,"mean_commits":31.0,"dds":0.0,"last_synced_commit":"c247d50ef644e48905213dc1391bf50c4b4f61b1"},"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/1b5d/llm-api","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1b5d%2Fllm-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1b5d%2Fllm-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1b5d%2Fllm-api/releases","manifests_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/1b5d%2Fllm-api/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/1b5d","download_url":"https://codeload.github.com/1b5d/llm-api/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1b5d%2Fllm-api/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30368595,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-10T21:41:54.280Z","status":"online","status_checked_at":"2026-03-11T02:00:07.027Z","response_time":84,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chatgpt","gptq","huggingface","langchain","llama","llamacpp","llm","llm-inference","machine-learning","python"],"created_at":"2024-09-24T13:33:48.488Z","updated_at":"2026-03-11T03:03:07.985Z","avatar_url":"https://github.com/1b5d.png","language":"Python","funding_links":[],"categories":["Building"],"sub_categories":["LLM Models"],"readme":"# LLM API\n\nWelcome to the LLM-API project! This endeavor opens the door to the exciting world of Large Language Models (LLMs) by offering a versatile API that allows you to effortlessly run a variety of LLMs on different consumer hardware configurations. 
Whether you prefer to operate these powerful models within Docker containers or directly on your local machine, this solution is designed to adapt to your preferences.\n\nWith LLM-API, all you need to get started is a simple YAML configuration file. The app streamlines the process by automatically downloading the model of your choice and executing it seamlessly. Once initiated, the model becomes accessible through a unified and intuitive API.\n\nThere is also a client that's reminiscent of OpenAI's approach, making it easy to harness the capabilities of your chosen LLM. You can find the Python client at [llm-api-python](https://github.com/1b5d/llm-api-python).\n\nIn addition to this, a LangChain integration exists, further expanding the possibilities and potential applications of LLM-API. You can explore this integration at [langchain-llm-api](https://github.com/1b5d/langchain-llm-api).\n\nWhether you're a developer, researcher, or enthusiast, the LLM-API project simplifies the use of Large Language Models, making their power and potential accessible to all.\n\nLLM enthusiasts, developers, researchers, and creators are invited to join this growing community. Your contributions, ideas, and feedback are invaluable in shaping the future of LLM-API. Whether you want to collaborate on improving the core functionality, develop new integrations, or suggest enhancements, your expertise is highly appreciated.\n\n### Tested with\n\n- [x] Different Llama-based models in different versions (such as Llama, Alpaca, Vicuna, and Llama 2) on CPU using llama.cpp\n- [x] Llama \u0026 Llama 2 quantized models using GPTQ-for-LLaMa\n- [x] Generic huggingface pipeline e.g. gpt-2, MPT\n- [x] Mistral 7b\n- [x] Several quantized models using AWQ\n- [x] OpenAI-like interface using [llm-api-python](https://github.com/1b5d/llm-api-python)\n- [ ] Support RWKV-LM\n\n# Usage\n\nTo run LLM-API on a local machine, you must have a functioning Docker engine. 
The following steps outline the process for running LLM-API:\n\n1. **Create a Configuration File**: Begin by creating a `config.yaml` file with the configurations as described below (use the examples in `config.yaml.example`).\n\n```\nmodels_dir: /models     # dir inside the container\nmodel_family: llama     # also `gptq_llama`, `huggingface` or `autoawq`\nsetup_params:\n  key: value\nmodel_params:\n  key: value\n```\n\n`setup_params` and `model_params` are model-specific; see below for model-specific configs.\n\nYou can override any of the above-mentioned configs using environment variables prefixed with `LLM_API_`, for example: `LLM_API_MODELS_DIR=/models`\n\n2. **Run LLM-API Using Docker**: Execute the following command in your terminal:\n\n```\ndocker run -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 --ulimit memlock=16000000000 1b5d/llm-api\n```\n\nThis command launches a Docker container, mounts your local models directory and the configuration file, and maps port 8000 for API access.\n\n**Alternatively**, you can use the provided `docker-compose.yaml` file within this repository and run the application using Docker Compose. To do so, execute the following command:\n\n```\ndocker compose up\n```\n\nUpon the first run, LLM-API will download the model from Hugging Face, based on the configurations defined in the `setup_params` of your `config.yaml` file. It will then name the local model file accordingly. Subsequent runs will reference the same local model file and load it into memory.\n\n\n## Endpoints\n\nThe LLM-API provides a standardized set of endpoints that are applicable across all Large Language Models (LLMs). These endpoints enable you to interact with the models effectively. Here are the primary endpoints:\n\n### 1. 
Generate Text\n\n- **POST /generate**\n  - Request Example:\n    ```json\n    {\n        \"prompt\": \"What is the capital of France?\",\n        \"params\": {\n            // Additional parameters...\n        }\n    }\n    ```\n  - Description: Use this endpoint to generate text based on a given prompt. You can include additional parameters to tune and customize generation.\n\n### 2. Async Text Generation\n\n- **POST /agenerate**\n  - Request Example:\n    ```json\n    {\n        \"prompt\": \"What is the capital of France?\",\n        \"params\": {\n            // Additional parameters...\n        }\n    }\n    ```\n  - Description: This endpoint is designed for asynchronous text generation. It allows you to initiate text generation tasks that can run in the background while your application continues to operate.\n\n### 3. Text Embeddings\n\n- **POST /embeddings**\n  - Request Example:\n    ```json\n    {\n        \"text\": \"What is the capital of France?\"\n    }\n    ```\n  - Description: Use this endpoint to obtain embeddings for a given text. This is valuable for various natural language processing tasks such as semantic similarity and text analysis.\n\n\n## Huggingface transformers\n\nGenerally, models that can be loaded using transformers' `AutoConfig`, `AutoModelForCausalLM` and `AutoTokenizer` can run using the `model_family: huggingface` config. The following is an example (runs one of the MPT models):\n\n```\nmodels_dir: /models\nmodel_family: huggingface\nsetup_params:\n  repo_id: \u003crepo_id\u003e\n  tokenizer_repo_id: \u003crepo_id\u003e\n  trust_remote_code: True\n  config_params:\n    init_device: cuda:0\n    attn_config:\n      attn_impl: triton\nmodel_params:\n  device_map: \"cuda:0\"\n  trust_remote_code: True\n  torch_dtype: torch.bfloat16\n```\n\nLeverage the flexibility of LLM-API by configuring various attributes using the following methods:\n\n1. 
Pass specific configuration attributes within the `config_params` to fine-tune `AutoConfig`. These attributes allow you to tailor the behavior of your language model further.\n\n2. Modify the model's parameters directly by specifying them within the `model_params`. These parameters are passed into the `AutoModelForCausalLM.from_pretrained` and `AutoTokenizer.from_pretrained` initialization calls.\n\nHere's an example of how you can use parameters in the `generate` (or `agenerate`) endpoints. You can pass any arguments accepted by [transformers' GenerationConfig](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig):\n\n```\nPOST /generate\n\ncurl --location 'localhost:8000/generate' \\\n--header 'Content-Type: application/json' \\\n--data '{\n    \"prompt\": \"What is the capital of paris\",\n    \"params\": {\n        \"max_length\": 25,\n        \"max_new_tokens\": 25,\n        \"do_sample\": true,\n        \"top_k\": 40,\n        \"top_p\": 0.95\n    }\n}'\n```\n\nIf you're looking to accelerate inference using a GPU, the `1b5d/llm-api:latest-gpu` image is designed for this purpose. When running the Docker image using Compose, consider utilizing a dedicated Compose file for GPU support:\n\n```\ndocker compose -f docker-compose.gpu.yaml up\n```\n\n**Note**: currently only the `linux/amd64` architecture is supported for GPU images.\n\n## Llama models on CPU - using llama.cpp\n\nUtilizing Llama on a CPU is made simple by configuring the model usage in a local `config.yaml` file. 
Below are the possible configurations:\n\n```\nmodels_dir: /models\nmodel_family: llama\nsetup_params:\n  repo_id: user/repo_id\n  filename: ggml-model-q4_0.bin\nmodel_params:\n  n_ctx: 512\n  n_parts: -1\n  n_gpu_layers: 0\n  seed: -1\n  use_mmap: True\n  n_threads: 8\n  n_batch: 2048\n  last_n_tokens_size: 64\n  lora_base: null\n  lora_path: null\n  low_vram: False\n  tensor_split: null\n  rope_freq_base: 10000.0\n  rope_freq_scale: 1.0\n  verbose: True\n```\n\nBe sure to specify the `repo_id` and `filename` parameters to point to a Hugging Face repository where the desired model is hosted. The application will then handle the download for you.\n\nThis mode runs with the Docker image `1b5d/llm-api:latest`. Several other images are also available to support different BLAS backends:\n- OpenBLAS: `1b5d/llm-api:latest-openblas`\n- cuBLAS: `1b5d/llm-api:latest-cublas`\n- CLBlast: `1b5d/llm-api:latest-clblast`\n- hipBLAS: `1b5d/llm-api:latest-hipblas`\n\n\nThe following example demonstrates the various parameters that can be sent to the Llama generate and agenerate endpoints:\n\n```\nPOST /generate\n\ncurl --location 'localhost:8000/generate' \\\n--header 'Content-Type: application/json' \\\n--data '{\n    \"prompt\": \"What is the capital of paris\",\n    \"params\": {\n        \"suffix\": null or string,\n        \"max_tokens\": 128,\n        \"temperature\": 0.8,\n        \"top_p\": 0.95,\n        \"logprobs\": null or integer,\n        \"echo\": false,\n        \"stop\": [\"\\Q\"],\n        \"frequency_penalty\": 0.0,\n        \"presence_penalty\": 0.0,\n        \"repeat_penalty\": 1.1,\n        \"top_k\": 40\n    }\n}'\n```\n\n## AWQ quantized models - using AutoAWQ\n\nAWQ quantization is supported using the AutoAWQ implementation. Below is an example config:\n\n```\nmodels_dir: /models\nmodel_family: autoawq\nsetup_params:\n  repo_id: \u003crepo id\u003e\n  tokenizer_repo_id: \u003crepo id\u003e\n  filename: \u003cmodel file name\u003e\nmodel_params:\n  
trust_remote_code: False\n  fuse_layers: False\n  safetensors: True\n  device_map: \"cuda:0\"\n```\n\nTo run this model, you need the GPU-enabled Docker image `1b5d/llm-api:latest-gpu`:\n\n```\ndocker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu\n```\n\nOr you can use the docker-compose.gpu.yaml file available in this repo:\n\n```\ndocker compose -f docker-compose.gpu.yaml up\n```\n\n## Llama on GPU - using GPTQ-for-LLaMa\n\n**Important Note**: Before running Llama or Llama 2 on GPU, make sure to install the [NVIDIA Driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html) on your host machine. You can verify the NVIDIA environment by executing the following command:\n\n```\ndocker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi\n```\n\nYou should see a table displaying the current NVIDIA driver version and related information, confirming the proper setup.\n\nWhen running the Llama model with GPTQ-for-LLaMa 4-bit quantization, you can use a specialized Docker image designed for this purpose, `1b5d/llm-api:latest-gpu`, as an alternative to the default image. You can run this mode using a separate Docker Compose file:\n\n```\ndocker compose -f docker-compose.gpu.yaml up\n```\n\nOr by directly running the container:\n\n```\ndocker run --gpus all -v $PWD/models/:/models:rw -v $PWD/config.yaml:/llm-api/config.yaml:ro -p 8000:8000 1b5d/llm-api:latest-gpu\n```\n\n**Important Note**: The `llm-api:x.x.x-gptq-llama-cuda` and `llm-api:x.x.x-gptq-llama-triton` images have been deprecated. 
Please switch to the `1b5d/llm-api:latest-gpu` image when GPU support is required\n\nExample config file:\n\n```\nmodels_dir: /models\nmodel_family: gptq_llama\nsetup_params:\n  repo_id: user/repo_id\n  filename: \u003cmodel.safetensors or model.pt\u003e\nmodel_params:\n  group_size: 128\n  wbits: 4\n  cuda_visible_devices: \"0\"\n  device: \"cuda:0\"\n```\n\nExample request:\n\n```\nPOST /generate\n\ncurl --location 'localhost:8000/generate' \\\n--header 'Content-Type: application/json' \\\n--data '{\n    \"prompt\": \"What is the capital of paris\",\n    \"params\": {\n        \"temp\": 0.8,\n        \"top_p\": 0.95,\n        \"min_length\": 10,\n        \"max_length\": 50\n    }\n}'\n```\n\n# Credits\n\n- [llama.cpp](https://github.com/ggerganov/llama.cpp) for making it possible to run Llama models on CPU. \n- [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for the python bindings lib for `llama.cpp`.\n- [GPTQ-for-LLaMa](https://github.com/qwopqwop200/GPTQ-for-LLaMa) for providing a GPTQ quantization implementation for Llama based models.\n- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) for providing an implementation for AWQ quantization\n- Huggingface for the great ecosystem of tooling they provide.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1b5d%2Fllm-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F1b5d%2Fllm-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1b5d%2Fllm-api/lists"}