{"id":17637881,"url":"https://github.com/iaalm/llama-api-server","last_synced_at":"2025-05-16T08:04:39.771Z","repository":{"id":152007143,"uuid":"624859013","full_name":"iaalm/llama-api-server","owner":"iaalm","description":"A OpenAI API compatible REST server for llama.","archived":false,"fork":false,"pushed_at":"2025-02-24T12:05:59.000Z","size":155,"stargazers_count":207,"open_issues_count":21,"forks_count":11,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-11T01:37:20.789Z","etag":null,"topics":["language-model","llama","llm","openai","openai-api","privatization","rest-api","selfhost"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iaalm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":null,"patreon":null,"open_collective":null,"ko_fi":"iaalm","tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":null}},"created_at":"2023-04-07T12:39:59.000Z","updated_at":"2025-05-05T13:46:28.000Z","dependencies_parsed_at":null,"dependency_job_id":"5c2aa98e-4972-4f69-b38c-ecdedbac9bc7","html_url":"https://github.com/iaalm/llama-api-server","commit_stats":{"total_commits":92,"total_committers":3,"mean_commits":"30.666666666666668","dds":0.3695652173913043,"last_synced_commit":"1cdcf59bacb79934c20724df6b947534596160c1"},"previous_names":[],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/iaalm%2Fllama-api-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iaalm%2Fllama-api-server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iaalm%2Fllama-api-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iaalm%2Fllama-api-server/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iaalm","download_url":"https://codeload.github.com/iaalm/llama-api-server/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253672735,"owners_count":21945482,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-model","llama","llm","openai","openai-api","privatization","rest-api","selfhost"],"created_at":"2024-10-23T03:06:33.893Z","updated_at":"2025-05-16T08:04:36.356Z","avatar_url":"https://github.com/iaalm.png","language":"Python","funding_links":["https://ko-fi.com/iaalm"],"categories":[],"sub_categories":[],"readme":"🎭🦙 llama-api-server\n=======\n\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Release](https://github.com/iaalm/llama-api-server/actions/workflows/release.yml/badge.svg)](https://github.com/iaalm/llama-api-server/actions/workflows/release.yml)\n[![PyPI version](https://badge.fury.io/py/llama-api-server.svg)](https://badge.fury.io/py/llama-api-server)\n\nThis project is under active deployment. Breaking changes could be made any time.\n\nLlama as a Service! 
This project tries to build a RESTful API server compatible with the OpenAI API, using open-source backends like llama/llama2.\n\nWith this project, many common GPT tools and frameworks can work with your own model.\n\n# 🚀Get started\n\n## Try it online!\n\nFollow the instructions in [this Colab notebook](https://colab.research.google.com/drive/1uF77533sIQLA_EIG83jjfayvqJiyalqp) to try it online.\nThanks to [anythingbutme](https://github.com/anythingbutme) for building it!\n\n## Prepare model\n\n### llama.cpp\nIf you don't have a quantized llama.cpp model, follow the [instructions](https://github.com/ggerganov/llama.cpp#usage) to prepare one.\n\n### pyllama\nIf you don't have a quantized pyllama model, follow the [instructions](https://github.com/juncongmoo/pyllama#-quantize-llama-to-run-in-a-4gb-gpu) to prepare one.\n\n\n## Install\nUse the following script to install the package from [PyPI](https://pypi.org/project/llama-api-server) and generate the model config file `config.yml` and the security token file `tokens.txt`.\n```\npip install llama-api-server\n\n# to run with pyllama\npip install llama-api-server[pyllama]\n\ncat \u003e config.yml \u003c\u003c EOF\nmodels:\n  completions:\n    # completions and chat_completions use the same model\n    text-ada-002:\n      type: llama_cpp\n      params:\n        path: /absolute/path/to/your/7B/ggml-model-q4_0.bin\n    text-davinci-002:\n      type: pyllama_quant\n      params:\n        path: /absolute/path/to/your/pyllama-7B4b.pt\n    text-davinci-003:\n      type: pyllama\n      params:\n        ckpt_dir: /absolute/path/to/your/7B/\n        tokenizer_path: /absolute/path/to/your/tokenizer.model\n      # keep to 1 instance to speed up loading of model\n  embeddings:\n    text-embedding-davinci-002:\n      type: pyllama_quant\n      params:\n        path: /absolute/path/to/your/pyllama-7B4b.pt\n      min_instance: 1\n      max_instance: 1\n      idle_timeout: 3600\n    text-embedding-ada-002:\n      type: llama_cpp\n      params:\n        
path: /absolute/path/to/your/7B/ggml-model-q4_0.bin\nEOF\n\necho \"SOME_TOKEN\" \u003e tokens.txt\n\n# start web server\npython -m llama_api_server\n# or make it reachable across the network\npython -m llama_api_server --host=0.0.0.0\n\n```\n\n## Call with openai-python\n```\nexport OPENAI_API_KEY=SOME_TOKEN\nexport OPENAI_API_BASE=http://127.0.0.1:5000/v1\n\nopenai api completions.create -e text-ada-002 -p \"hello?\"\n# or using chat\nopenai api chat_completions.create -e text-ada-002 -g user \"hello?\"\n# or calling embedding\ncurl -X POST http://127.0.0.1:5000/v1/embeddings -H 'Content-Type: application/json' -d '{\"model\":\"text-embedding-ada-002\", \"input\":\"It is good.\"}'  -H \"Authorization: Bearer SOME_TOKEN\"\n```\n\n# 🛣️Roadmap\n\n### Tested with\n- [X] [openai-python](https://github.com/openai/openai-python)\n    - [X] OPENAI\\_API\\_TYPE=default\n    - [X] OPENAI\\_API\\_TYPE=azure\n- [X] [llama-index](https://github.com/jerryjliu/llama_index)\n\n### Supported APIs\n- [X] Completions\n    - [X] set `temperature`, `top_p`, and `top_k`\n    - [X] set `max_tokens`\n    - [X] set `echo`\n    - [ ] set `stop`\n    - [ ] set `stream`\n    - [ ] set `n`\n    - [ ] set `presence_penalty` and `frequency_penalty`\n    - [ ] set `logit_bias`\n- [X] Embeddings\n    - [X] batch process\n- [X] Chat\n    - [ ] Prefix cache for chat\n- [ ] List model\n\n### Supported backends\n- [X] [llama.cpp](https://github.com/ggerganov/llama.cpp) via [llamacpp-python](https://github.com/thomasantony/llamacpp-python)\n- [X] [llama](https://github.com/facebookresearch/llama) via [pyllama](https://github.com/juncongmoo/pyllama)\n    - [X] Without Quantization\n    - [X] With Quantization\n    - [X] Support LLAMA2\n\n### Others\n- [X] Performance parameters like `n_batch` and `n_thread`\n- [X] Token auth\n- [ ] Documents\n- [ ] Integration tests\n- [ ] A tool to download/prepare pretrained models\n- [ ] Make config.ini and token file 
configurable\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiaalm%2Fllama-api-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiaalm%2Fllama-api-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiaalm%2Fllama-api-server/lists"}