{"id":21040558,"url":"https://github.com/pixelspark/poly","last_synced_at":"2025-05-15T16:33:20.502Z","repository":{"id":197443319,"uuid":"698559328","full_name":"pixelspark/poly","owner":"pixelspark","description":"A single-binary, GPU-accelerated LLM server (HTTP and WebSocket API) written in Rust","archived":true,"fork":false,"pushed_at":"2024-01-14T19:00:58.000Z","size":1815,"stargazers_count":79,"open_issues_count":0,"forks_count":8,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-24T20:14:51.668Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pixelspark.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-09-30T09:21:48.000Z","updated_at":"2025-02-17T15:30:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"6b3ab82a-3d7f-41d8-a0f5-655b1262bc19","html_url":"https://github.com/pixelspark/poly","commit_stats":null,"previous_names":["pixelspark/poly"],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelspark%2Fpoly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelspark%2Fpoly/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelspark%2Fpoly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pixelspark%2Fpoly/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pixelspark","download_url":"https://codeload.githu
b.com/pixelspark/poly/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254377440,"owners_count":22061140,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T13:47:09.872Z","updated_at":"2025-05-15T16:33:20.086Z","avatar_url":"https://github.com/pixelspark.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Poly\n\nPoly is a versatile LLM serving back-end. What it offers:\n\n- High-performance, efficient and reliable serving of multiple local LLM models\n- Optional GPU acceleration through either CUDA or Metal\n- Configurable LLM completion tasks (prompts, recall, stop tokens, etc.)\n- Streaming completion responses through HTTP SSE, chat using WebSockets\n- Biased sampling of completion output using JSON schema\n- Memory retrieval using vector databases (either built-in file based, or external such as Qdrant)\n- Accepts and automatically chunks PDF and DOCX files for storage to memory\n- API secured using either static API keys or JWT tokens\n- Simple, single binary + config file server deployment, horizontally scalable\n\nNice extras:\n\n- A web client to easily test and fine-tune configuration\n- A single-binary cross platform desktop client for locally running models\n\nSupported models include:\n\n- LLaMa and derivatives (Alpaca, Vicuna, Guanaco, etc.)\n- LLaMA2\n- RedPajamas\n- MPT\n- Orca-mini\n\n|                    Web client                     |                  Desktop app                   |\n| 
:-----------------------------------------------: | :--------------------------------------------: |\n| ![Web client demonstration](./docs/webclient.gif) | ![Desktop client demonstration](./docs/ui.gif) |\n\nSample of a model+task+memory configuration:\n\n```toml\n[models.gpt2dutch]\nmodel_path = \"./data/gpt2-small-dutch-f16.bin\"\narchitecture = \"gpt2\"\nuse_gpu = false\n\n[memories.dutch_qdrant]\nstore = { qdrant = { url = \"http://127.0.0.1:6333\", collection = \"nl\" } }\ndimensions = 768\nembedding_model = \"gpt2dutch\"\n\n[tasks.dutch_completion]\nmodel = \"gpt2dutch\"\nprelude = \"### System:\\nYou are an AI assistant that follows instruction extremely well. Help as much as you can.\\n\"\nprefix =  \"\\n### User:\\n\"\npostfix = \"\\n### Response:\"\nmemorization = { memory = \"dutch_qdrant\", retrieve = 2 }\n```\n\nSee [config.example.toml](./config.example.toml) for more example configurations.\n\nCustom samplers can be configured using a string-based description; see [here](https://github.com/rustformers/llm/blob/18b2a7d37e56220487e851a45badc46bf9dcb9d3/crates/llm-base/src/samplers.rs#L222). Any biaser (i.e. JSON biaser) is injected as the first sampler in the chain.\n\n## Concepts\n\nIn Poly, _models_ are LLMs that support basic text generation and embedding operations. Models can be run on the GPU and have specific context lengths, but are otherwise unconfigurable.\n\nA _task_ uses a model in a specific way (i.e. using specific prompts, stop tokens, sampling, et cetera). Tasks are highly configurable. A model may be shared by multiple tasks.\n\nA _memory_ is a database that stores _chunks_ of text, and allows retrieval of such chunks using vector similarity (where each chunk has a vector calculated as an embedding from an LLM). 
Memories can be re-used between tasks.\n\n```mermaid\nclassDiagram\n    class Model {\n        String name\n        Boolean use_gpu\n        Int context_length\n    }\n    class Task {\n        String name\n        String prompts\n        Inference parameters\n    }\n    class Memory {\n        String name\n        Chunking parameters\n    }\n    class Chunk {\n        String text\n        Vector embedding\n    }\n\n    Task \"*\" --\u003e \"1\" Model: Generation model\n    Task \"*\" --\u003e \"1\" Memory\n    Memory \"*\"--\u003e \"1\" Model: Embedding model\n    Memory \"1\" --\u003e \"*\" Chunk: Stored chunks\n```\n\nThe API exposes models, tasks and memories by their name (which is unique within their category).\n\n### Tasks\n\nA task configures the way user input is transformed before it is fed to an LLM and the way the LLM output is transformed before it is returned to the user, in order to perform a specific task. A task can be configured to use (optional) `prelude`, `prefix` and `postfix` prompts. The prelude is fed once to the model for each session. The prefix and postfix are applied to each user input (i.e. each chat message):\n\n```mermaid\nsequenceDiagram\n    actor User\n    Task-\u003e\u003eLLM: Prelude\n    loop\n        User-\u003e\u003e+Task: Prompt\n\t\talt When recall is enabled\n\t\t\tTask-\u003e\u003eLLM: Recalled memory items (based on user prompt)\n\t\tend\n        Task-\u003e\u003e+LLM: Prefix\n        Task-\u003e\u003eLLM: Prompt\n        Task-\u003e\u003eLLM: Postfix\n\n        note over LLM: Generate until stop token occurs or token/context limit reached\n        LLM-\u003e\u003e-Task: Response\n        Task-\u003e\u003e-User: Completion\n    end\n```\n\nWhen biasing is enabled, an optional `bias prompt` can be configured. When configured, the model is first asked to generate a response (following the flow shown above). This response is, however, not returned directly to the user. 
Instead, the bias prompt is then fed, after which the biaser is enabled (and the biased response is returned to the user).\n\n```mermaid\nsequenceDiagram\n    actor User\n    Task-\u003e\u003eLLM: Prelude\n    loop\n        User-\u003e\u003e+Task: Prompt\n\t\talt When recall is enabled\n\t\t\tTask-\u003e\u003eLLM: Recalled memory items (based on user prompt)\n\t\tend\n        Task-\u003e\u003e+LLM: Prefix\n        Task-\u003e\u003eLLM: Prompt\n        Task-\u003e\u003eLLM: Postfix\n\n        note over LLM: Generate until stop token occurs or token/context limit reached\n        Task-\u003e\u003eLLM: Bias prompt\n        note over LLM: Generate with biaser enabled\n\n        LLM-\u003e\u003e-Task: Response (biased)\n        Task-\u003e\u003e-User: Completion\n    end\n```\n\n## Architecture\n\nPoly is divided into separate crates that can be used independently:\n\n- [poly-server](./poly-server): Serve LLMs through HTTP and WebSocket APIs (provides `llmd`)\n- [poly-backend](./poly-backend): Back-end implementation of LLM tasks\n- [poly-extract](./poly-extract): Crate for extracting plaintext from various document types\n- [poly-bias](./poly-bias): Crate for biasing LLM output to e.g. JSON following a schema\n- [poly-ui](./poly-ui): Simple desktop UI for local LLMs\n\nApplications that want to employ Poly's functionality should use the HTTP REST API exposed by `poly-server`. 
Rust applications looking to integrate Poly's capabilities can also depend on `poly-backend` directly.\n\n```mermaid\nflowchart TD\n\nsubgraph Poly\n\tPW[poly web-client]\n\tPS[poly-server]\n\tPB[poly-backend]\n\tPE[poly-extract]\n\tPU[poly-ui]\n\tPBs[poly-bias]\n\n\tPW--\u003e|HTTP,WS,SSE|PS\n\tPB--\u003ePBs\n\tPS\u003c-.-\u003ePE\n\n\tPS--\u003ePB\n\tPU--\u003ePB\nend\n\nLLM[rustformers/llm]\nPBs\u003c-.-\u003eLLM\nPB\u003c-.-\u003eQdrant\n\nT[tokenizers]\nGGML\nMetal\nPB--\u003eLLM\nLLM\u003c-.-\u003eT\nLLM--\u003eGGML\nGGML--\u003eMetal\nGGML--\u003eCUDA\n\n```\n\n## Authors\n\n- Tommy van der Vorst (vandervorst@dialogic.nl)\n\n## License\n\nPoly is [licensed under the Apache 2.0 license](http://www.apache.org/licenses/LICENSE-2.0) from Git revision `f329d35`\nonwards (only). Kindly note that this license provides no entitlement to support whatsoever. If you have any specific\nlicensing or support requirements, please contact the authors.\n\nWe will accept contributions only if accompanied by a written statement that the contributed code is also\n(at least) Apache 2.0 licensed.\n\nSpecific licenses apply to the following files:\n\n- [./data/gpt2.bin](./data/gpt2.bin): MIT, source [here](https://huggingface.co/marella/gpt-2-ggml), license [here](https://github.com/marella/ctransformers/blob/main/LICENSE).\n- [./data/gpt2-small-dutch-f16.bin](./data/gpt2-small-dutch-f16.bin): Apache 2.0, source [here](https://huggingface.co/GroNLP/gpt2-small-dutch-embeddings), license [here](https://github.com/wietsedv/gpt2-recycle/blob/master/LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpixelspark%2Fpoly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpixelspark%2Fpoly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpixelspark%2Fpoly/lists"}