{"id":15021727,"url":"https://github.com/evilfreelancer/docker-llama.cpp-rpc","last_synced_at":"2025-04-10T19:43:17.943Z","repository":{"id":257131138,"uuid":"844727977","full_name":"EvilFreelancer/docker-llama.cpp-rpc","owner":"EvilFreelancer","description":"Данный проект основан на llama.cpp и компилирует только RPC-сервер, а так же вспомогательные утилиты, работающие в режиме RPC-клиента, необходимые для реализации распределённого инференса конвертированных в GGUF формат Больших Языковых Моделей (БЯМ) и Эмбеддинговых Моделей.","archived":false,"fork":false,"pushed_at":"2025-02-23T09:16:37.000Z","size":297,"stargazers_count":15,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-24T17:21:24.734Z","etag":null,"topics":["ai","docker","docker-compose","embedding","grpc","llamacpp","llm","rpc"],"latest_commit_sha":null,"homepage":"","language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EvilFreelancer.png","metadata":{"files":{"readme":"README.en.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-19T21:14:24.000Z","updated_at":"2025-02-23T09:16:40.000Z","dependencies_parsed_at":"2024-09-15T01:18:48.961Z","dependency_job_id":"092e7916-6292-4120-a11f-ecafce79071d","html_url":"https://github.com/EvilFreelancer/docker-llama.cpp-rpc","commit_stats":{"total_commits":23,"total_committers":1,"mean_commits":23.0,"dds":0.0,"last_synced_commit":"68f778d4e7d92e6a3bca3e7369f742760d8df75a"},"previous_names":["evilfreelancer/docker-llama.cpp-rpc"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvilFreelancer%2Fdocker-llama.cpp-rpc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvilFreelancer%2Fdocker-llama.cpp-rpc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvilFreelancer%2Fdocker-llama.cpp-rpc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EvilFreelancer%2Fdocker-llama.cpp-rpc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EvilFreelancer","download_url":"https://codeload.github.com/EvilFreelancer/docker-llama.cpp-rpc/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248281423,"owners_count":21077423,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","docker","docker-compose","embedding","grpc","llamacpp","llm","rpc"],"created_at":"2024-09-24T19:56:57.223Z","updated_at":"2025-04-10T19:43:17.921Z","avatar_url":"https://github.com/EvilFreelancer.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llama.cpp RPC-server in 
[Русский](./README.md) | [中文](./README.zh.md) | **English**\n\nThis project is based on [llama.cpp](https://github.com/ggerganov/llama.cpp) and compiles only\nthe [RPC](https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc) server, along with auxiliary utilities\noperating in RPC client mode, which are necessary for implementing distributed inference of Large Language Models (LLMs)\nand Embedding Models converted into the GGUF format.\n\n## Overview\n\nThe general architecture of an application using the RPC server looks as follows:\n\n![schema](./assets/schema.png)\n\nInstead of `llama-server`, you can use `llama-cli` or `llama-embedding`, which are included in the container image.\n\nDocker images are built with support for the following architectures:\n\n* **CPU-only** - amd64, arm64, arm/v7\n* **CUDA** - amd64\n\nUnfortunately, CUDA builds for arm64 fail due to an error, so they are temporarily disabled.\n\n## Environment Variables\n\n| Name               | Default                                    | Description                                                                                       |\n|--------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------|\n| APP_MODE           | backend                                    | Container operation mode, available options: server, backend, and none                            |\n| APP_BIND           | 0.0.0.0                                    | Interface to bind to                                                                               |\n| APP_PORT           | `8080` for `server`, `50052` for `backend` | Port number on which the server is running                                                        |\n| APP_MEM            | 1024                                       | Amount of RAM in MiB available to the backend; in CUDA mode, this is the amount of GPU memory     |\n| APP_RPC_BACKENDS   | backend-cuda:50052,backend-cpu:50052       | Comma-separated addresses of backends that the container will try to connect to in `server` mode  |\n| APP_MODEL          | /app/models/TinyLlama-1.1B-q4_0.gguf       | Path to the model weights inside the container                                                    |\n| APP_REPEAT_PENALTY | 1.0                                        | Repeat penalty                                                                                     |\n| APP_GPU_LAYERS     | 99                                         | Number of layers offloaded to the backend                                                          |\n
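\nFor a quick standalone test, a single backend can also be started without Compose, using plain `docker run` and the\nvariables from this table (a minimal sketch; the image tag is the same one used in the Compose example below):\n\n```shell\n# Start a CPU backend (RPC server) on the default backend port\ndocker run --rm \\\n    -e APP_MODE=backend \\\n    -e APP_MEM=2048 \\\n    -p 50052:50052 \\\n    evilfreelancer/llama.cpp-rpc:latest\n```\n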
\n## Example of docker-compose.yml\n\nIn this example, `llama-server` (the `main` container) is launched and loads the\nmodel [TinyLlama-1.1B-q4_0.gguf](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/tree/main), which must\nfirst be downloaded into the `./models` directory located next to `docker-compose.yml`. The `./models` directory is\nmounted inside the `main` container and is available at the path `/app/models`.\n\n```yaml\nversion: \"3.9\"\n\nservices:\n\n  main:\n    image: evilfreelancer/llama.cpp-rpc:latest\n    restart: unless-stopped\n    volumes:\n      - ./models:/app/models\n    environment:\n      # Operation mode (RPC client in API server format)\n      APP_MODE: server\n      # Path to the model weights, preloaded inside the container\n      APP_MODEL: /app/models/TinyLlama-1.1B-q4_0.gguf\n      # Addresses of the RPC servers the client will interact with\n      APP_RPC_BACKENDS: backend-cuda:50052,backend-cpu:50052\n    ports:\n      - \"127.0.0.1:8080:8080\"\n\n  backend-cpu:\n    image: evilfreelancer/llama.cpp-rpc:latest\n    restart: unless-stopped\n    environment:\n      # Operation mode (RPC server)\n      APP_MODE: backend\n      # Amount of system RAM available to the RPC server (in Megabytes)\n      APP_MEM: 2048\n\n  backend-cuda:\n    image: evilfreelancer/llama.cpp-rpc:latest-cuda\n    restart: \"unless-stopped\"\n    environment:\n      # Operation mode (RPC server)\n      APP_MODE: backend\n      # Amount of GPU memory available to the RPC server (in Megabytes)\n      APP_MEM: 1024\n    deploy:\n      resources:\n        reservations:\n          devices:\n            - driver: nvidia\n              count: 1\n              capabilities: [ gpu ]\n```\n\nA complete example is available in [docker-compose.dist.yml](./docker-compose.dist.yml).\n\nAs a result, we obtain the following diagram:\n\n![schema-example](./assets/schema-example.png)\n\nOnce launched, you can make HTTP requests like this:\n\n```shell\ncurl \\\n    --request POST \\\n    --url http://localhost:8080/completion \\\n    --header \"Content-Type: application/json\" \\\n    --data '{\"prompt\": \"Building a website can be done in 10 simple steps:\"}'\n```\n
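\nIn `server` mode the `main` container ultimately runs a regular `llama-server` process, and the `APP_*` variables\nmap onto its command-line flags. The bare-metal equivalent of the setup above looks roughly like this (an illustrative\nsketch only; the exact command line is assembled by the container's entrypoint):\n\n```shell\n# Approximate llama-server invocation corresponding to the Compose example above\nllama-server \\\n    --host 0.0.0.0 --port 8080 \\\n    --model /app/models/TinyLlama-1.1B-q4_0.gguf \\\n    --gpu-layers 99 \\\n    --rpc backend-cuda:50052,backend-cpu:50052\n```\n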
\n## Manual Docker Build\n\nBuilding the container in CPU-only mode:\n\n```shell\ndocker build ./llama.cpp/\n```\n\nBuilding the container for CUDA:\n\n```shell\ndocker build ./llama.cpp/ --file ./llama.cpp/Dockerfile.cuda\n```\n\nUsing the build argument `LLAMACPP_VERSION`, you can specify the release tag, branch name, or commit hash of llama.cpp\nto build the container from. By default, the `master` branch is used.\n\n```shell\n# Build the container from the tag https://github.com/ggerganov/llama.cpp/releases/tag/b3700\ndocker build ./llama.cpp/ --build-arg LLAMACPP_VERSION=b3700\n```\n\n```shell\n# Build the container from the master branch\ndocker build ./llama.cpp/ --build-arg LLAMACPP_VERSION=master\n# or simply\ndocker build ./llama.cpp/\n```\n\n## Manual Build Using Docker Compose\n\nAn example of docker-compose.yml that builds the images with an explicitly specified tag:\n\n```yaml\nversion: \"3.9\"\n\nservices:\n\n  main:\n    restart: \"unless-stopped\"\n    build:\n      context: ./llama.cpp\n      args:\n        - LLAMACPP_VERSION=b3700\n    volumes:\n      - ./models:/app/models\n    environment:\n      APP_MODE: none\n    ports:\n      - \"8080:8080\"\n\n  backend:\n    restart: \"unless-stopped\"\n    build:\n      context: ./llama.cpp\n      args:\n        - LLAMACPP_VERSION=b3700\n    environment:\n      APP_MODE: backend\n    ports:\n      - \"50052:50052\"\n```\n\n## Links\n\n- https://github.com/ggerganov/ggml/pull/761\n- https://github.com/ggerganov/llama.cpp/issues/7293\n- https://github.com/ggerganov/llama.cpp/pull/6829\n- https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc\n- https://github.com/mudler/LocalAI/commit/fdb45153fed10d8a2c775633e952fdf02de60461\n- https://github.com/mudler/LocalAI/pull/2324\n- https://github.com/ollama/ollama/issues/4643\n\n## Citing\n\n```text\n[Pavel Rykov]. (2024). llama.cpp RPC-server in Docker. GitHub. https://github.com/EvilFreelancer/docker-llama.cpp-rpc\n```\n\n```text\n@misc{pavelrykov2024llamacpprpc,\n  author = {Pavel Rykov},\n  title  = {llama.cpp RPC-server in Docker},\n  year   = {2024},\n  url    = {https://github.com/EvilFreelancer/docker-llama.cpp-rpc}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevilfreelancer%2Fdocker-llama.cpp-rpc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fevilfreelancer%2Fdocker-llama.cpp-rpc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fevilfreelancer%2Fdocker-llama.cpp-rpc/lists"}