{"id":26220519,"url":"https://github.com/lyrcaxis/llamba","last_synced_at":"2025-04-16T04:30:30.860Z","repository":{"id":250511308,"uuid":"834664594","full_name":"Lyrcaxis/Llamba","owner":"Lyrcaxis","description":"Minimalistic batching application for LLMs using ASP.NET Core and LLamaSharp","archived":false,"fork":false,"pushed_at":"2024-10-23T14:46:27.000Z","size":15219,"stargazers_count":12,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-06T04:44:25.259Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Lyrcaxis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-28T01:18:25.000Z","updated_at":"2025-03-24T22:46:13.000Z","dependencies_parsed_at":"2024-10-23T16:49:16.765Z","dependency_job_id":null,"html_url":"https://github.com/Lyrcaxis/Llamba","commit_stats":null,"previous_names":["lyrcaxis/llamba"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lyrcaxis%2FLlamba","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lyrcaxis%2FLlamba/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lyrcaxis%2FLlamba/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lyrcaxis%2FLlamba/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Lyrcaxis","download_url":"https://codeload.github.com/Lyrcaxis/Llamba/tar.gz/r
efs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249195126,"owners_count":21228127,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-12T15:17:49.903Z","updated_at":"2025-04-16T04:30:30.830Z","avatar_url":"https://github.com/Lyrcaxis.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LLaMbA (Large Language Model Batching Application)\nLLaMbA is a minimalistic cross-platform batching engine/server for LLMs, powered by [ASP.NET Core](https://dotnet.microsoft.com/en-us/apps/aspnet) and [LLamaSharp](https://github.com/SciSharp/LLamaSharp).\n\nThe engine's goal is to serve multiple requests with small models as quickly as possible. It was built with its primary purposes in mind: Serving, Classifying, and Generating Synthetic Data, within a minimal and extensible environment.\n\n\n\n## Why is it fast\nLLaMbA introduces quick and customizable ways to sample, made possible by .NET's `System.Numerics.Tensors` and threading. The out-of-the-box sampling is arguably not as extensive as llama.cpp's, but it serves its purposes well and is considerably faster (up to ~10x, with the speedup increasing as model size decreases).\n\nIn addition, it hosts a Python tokenizer and uses llama.cpp's token grouping features to shrink the batch: tokens that share the same position in multiple sequences are reused, reducing the total number of tokens the model sees. 
This is especially useful when running multiple classifications of the same prompt, where most tokens are the same but the classification purpose changes.\n\n\n\n## What it isn't\nWhile LLaMbA contains a basic Web UI for chatting with the LLM, it wasn't designed for rich features or single-user session efficiency, but with ease of testing in mind. The primary use of the Web UI is testing code changes, custom samplers, or systems.\n\nIt also isn't an all-in-one \u0026 one-for-all deliverable; the user is expected to get hands-on and adjust parts of the code to their needs.\n\n\n\n## Who it's intended for\nAnyone can use LLaMbA for Synthetic Data generation locally as it is, but for more advanced purposes like Serving or Classifying, the primary target audience is developers, who should create safeguards (e.g. auth, limits for max_tokens, moderation) and other systems to complement the backend and take advantage of the high speeds.\n\nDevelopers are encouraged to experiment and customize the engine to their specs.\n\n\n\n## Requirements\n- [CUDA 12](https://developer.nvidia.com/cuda-12-0-0-download-archive) or the backend of your choice (CUDA11, CUDA12, Vulkan, OpenCL, Metal, CPU).\n- [.NET 8 SDK](https://dotnet.microsoft.com/en-us/download). Necessary for building and running the project.\n- [Python](https://www.python.org/downloads) (+ packages). 
After installing Python, install the necessary packages:\n```\npip install tokenizers uvicorn fastapi asyncio requests\n```\n\n\n\n## Videos\n###### The model used in the videos is LLama3.1-Instruct-8B-Q8, on a single RTX 4080, utilizing ~12GB of VRAM.\n\n\n#### Batching Test (w/ flash attention)\n\n###### About double the speed in comparison to using the llama.cpp sampler.\n\nhttps://github.com/user-attachments/assets/12c2845e-2a20-41df-99fe-fce641512cf0\n\n\n#### WebUI - Made for Testing:\n\n###### Chat UI supports basic back \u0026 forth functionality \u0026 message editing/deleting.\n\nhttps://github.com/user-attachments/assets/11ecd164-5311-4047-bae0-16a8adef621c\n\n###### Batches sent with Completion mode get passed without formatting, whereas Chat mode formats them to the model's prompt format.\n\nhttps://github.com/user-attachments/assets/a32f9bfb-f6ca-4c77-ab3e-8aef477a7093\n\n###### It's easy and fast to steer a model to generate a specific JSON field from your specs.\n\nhttps://github.com/user-attachments/assets/ccaced6c-c47e-4c7b-87ae-dcaef508d31f\n\n\n\n## General tips\nCheck out [the General Guide](https://github.com/Lyrcaxis/Llamba/wiki/General-Guide) and [Example Usage](https://github.com/Lyrcaxis/Llamba/wiki/Example-Usage) for example usage of the API and a quick code tour.\n\nContext Size can be increased in [`Model.cs`](https://github.com/Lyrcaxis/Llamba/blob/main/Model.cs) to further increase throughput. The default parameters are for LLaMA3.1-8B-Q8 with ~12GB of VRAM.\n\nEnabling Flash Attention will also increase generation throughput.\n\n\n\n## Supported models\nLLaMbA supports all language models currently supported by llama.cpp.\n- See [InferenceFormat.cs](https://github.com/Lyrcaxis/Llamba/blob/main/InferenceFormat.cs) to add your own prompt format.\n- See [Tokenizer.cs](https://github.com/Lyrcaxis/Llamba/blob/main/Tokenization/Tokenizer.cs) to add a tokenizer. 
It's easy!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyrcaxis%2Fllamba","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flyrcaxis%2Fllamba","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyrcaxis%2Fllamba/lists"}