{"id":14483111,"url":"https://github.com/chenhunghan/ialacol","last_synced_at":"2025-09-30T20:31:56.254Z","repository":{"id":168673292,"uuid":"644304130","full_name":"chenhunghan/ialacol","owner":"chenhunghan","description":"🪶 Lightweight OpenAI drop-in replacement for Kubernetes","archived":true,"fork":false,"pushed_at":"2024-02-05T18:39:30.000Z","size":256,"stargazers_count":142,"open_issues_count":10,"forks_count":17,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-09-30T05:02:24.258Z","etag":null,"topics":["ai","cloudnative","cuda","ggml","gptq","gpu","helm","kubernetes","langchain","llamacpp","llm","llm-inference","llm-serving","openai","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chenhunghan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-23T08:42:05.000Z","updated_at":"2024-09-05T18:49:54.000Z","dependencies_parsed_at":"2023-10-24T19:35:24.470Z","dependency_job_id":"9465034f-81d8-457f-8093-34fcdcde925a","html_url":"https://github.com/chenhunghan/ialacol","commit_stats":{"total_commits":154,"total_committers":6,"mean_commits":"25.666666666666668","dds":0.538961038961039,"last_synced_commit":"410f6729b6d3a3829fc16b720b0f71af4ec458d2"},"previous_names":["chenhunghan/ialacol"],"tags_count":41,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenhunghan%2Fialacol","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenhunghan%2Fialacol/tags",
"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenhunghan%2Fialacol/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chenhunghan%2Fialacol/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chenhunghan","download_url":"https://codeload.github.com/chenhunghan/ialacol/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234774951,"owners_count":18884529,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","cloudnative","cuda","ggml","gptq","gpu","helm","kubernetes","langchain","llamacpp","llm","llm-inference","llm-serving","openai","python"],"created_at":"2024-09-03T00:01:30.953Z","updated_at":"2025-09-30T20:31:50.953Z","avatar_url":"https://github.com/chenhunghan.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# ialacol (l-o-c-a-l-a-i)\n\n🚧 being rewritten from Python to Rust/WebAssembly, see details \u003chttps://github.com/chenhunghan/ialacol/pull/93\u003e\n\n## Introduction\n\nialacol (pronounced \"localai\") is a lightweight drop-in replacement for OpenAI API.\n\nIt is an OpenAI API-compatible wrapper [ctransformers](https://github.com/marella/ctransformers) supporting [GGML](https://github.com/ggerganov/ggml)/[GPTQ](https://github.com/PanQiWei/AutoGPTQ) with optional CUDA/Metal acceleration.\n\nialacol is inspired by other similar projects like [LocalAI](https://github.com/go-skynet/LocalAI), [privateGPT](https://github.com/imartinez/privateGPT), 
[local.ai](https://github.com/louisgv/local.ai), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), [closedai](https://github.com/closedai-project/closedai), and [mlc-llm](https://github.com/mlc-ai/mlc-llm), with a specific focus on Kubernetes deployment.\n\n## Features\n\n- Compatible with the OpenAI API and with [langchain](https://github.com/hwchase17/langchain).\n- Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.\n- Streaming first! For better UX.\n- Optional CUDA acceleration.\n- Compatible with [Github Copilot VSCode Extension](https://marketplace.visualstudio.com/items?itemName=GitHub.copilot), see [Copilot](#copilot)\n\n## Supported Models\n\nSee [Receipts](#receipts) below for deployment instructions.\n\n- [LLaMa 2 variants](https://huggingface.co/meta-llama), including [OpenLLaMA](https://github.com/openlm-research/open_llama), [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1), [openchat_3.5](https://huggingface.co/openchat/openchat_3.5) and [zephyr](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta).\n- [StarCoder variants](https://huggingface.co/bigcode/starcoder)\n- [WizardCoder](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0)\n- [StarChat variants](https://huggingface.co/HuggingFaceH4/starchat-beta)\n- [MPT-7B](https://www.mosaicml.com/blog/mpt-7b)\n- [MPT-30B](https://huggingface.co/mosaicml/mpt-30b)\n- [Falcon](https://falconllm.tii.ae/)\n\nAnd all LLMs supported by [ctransformers](https://github.com/marella/ctransformers/tree/main/models/llms).\n\n## UI\n\n`ialacol` does not have a UI; however, it is compatible with any web UI that supports the OpenAI API, for example [chat-ui](https://github.com/huggingface/chat-ui) after [PR #541](https://github.com/huggingface/chat-ui/pull/541) was merged.\n\nAssuming `ialacol` is running on port 8000, you can configure [chat-ui](https://github.com/huggingface/chat-ui) to use 
[`zephyr-7b-beta.Q4_K_M.gguf`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) served by `ialacol`.\n```shell\nMODELS=`[\n  {\n      \"name\": \"zephyr-7b-beta.Q4_K_M.gguf\",\n      \"displayName\": \"Zephyr 7B β\",\n      \"preprompt\": \"\u003c|system|\u003e\\nYou are a friendly chatbot who always responds in the style of a pirate.\u003c/s\u003e\\n\",\n      \"userMessageToken\": \"\u003c|user|\u003e\\n\",\n      \"userMessageEndToken\": \"\u003c/s\u003e\\n\",\n      \"assistantMessageToken\": \"\u003c|assistant|\u003e\\n\",\n      \"assistantMessageEndToken\": \"\\n\",\n      \"parameters\": {\n        \"temperature\": 0.1,\n        \"top_p\": 0.95,\n        \"repetition_penalty\": 1.2,\n        \"top_k\": 50,\n        \"max_new_tokens\": 4096,\n        \"truncate\": 999999\n      },\n      \"endpoints\" : [{\n        \"type\": \"openai\",\n        \"baseURL\": \"http://localhost:8000/v1\",\n        \"completion\": \"chat_completions\"\n      }]\n  }\n]`\n```\n\n[openchat_3.5.Q4_K_M.gguf](https://huggingface.co/openchat/openchat_3.5)\n```shell\nMODELS=`[\n  {\n      \"name\": \"openchat_3.5.Q4_K_M.gguf\",\n      \"displayName\": \"OpenChat 3.5\",\n      \"preprompt\": \"\",\n      \"userMessageToken\": \"GPT4 User: \",\n      \"userMessageEndToken\": \"\u003c|end_of_turn|\u003e\",\n      \"assistantMessageToken\": \"GPT4 Assistant: \",\n      \"assistantMessageEndToken\": \"\u003c|end_of_turn|\u003e\",\n      \"parameters\": {\n        \"temperature\": 0.1,\n        \"top_p\": 0.95,\n        \"repetition_penalty\": 1.2,\n        \"top_k\": 50,\n        \"max_new_tokens\": 4096,\n        \"truncate\": 999999,\n        \"stop\": [\"\u003c|end_of_turn|\u003e\"]\n      },\n      \"endpoints\" : [{\n        \"type\": \"openai\",\n        \"baseURL\": \"http://localhost:8000/v1\",\n        \"completion\": \"chat_completions\"\n      }]\n  }\n]`\n```\n\n## Blogs\n\n- [Use Code Llama (and other open LLMs) as Drop-In Replacement for Copilot Code 
Completion](https://dev.to/chenhunghan/use-code-llama-and-other-open-llms-as-drop-in-replacement-for-copilot-code-completion-58hg)\n- [Containerized AI before Apocalypse 🐳🤖](https://dev.to/chenhunghan/containerized-ai-before-apocalypse-1569)\n- [Deploy Llama 2 AI on Kubernetes, Now](https://dev.to/chenhunghan/deploy-llama-2-ai-on-kubernetes-now-2jc5)\n- [Cloud Native Workflow for Private MPT-30B AI Apps](https://dev.to/chenhunghan/cloud-native-workflow-for-private-ai-apps-2omb)\n- [Offline AI 🤖 on Github Actions 🙅‍♂️💰](https://dev.to/chenhunghan/offline-ai-on-github-actions-38a1)\n\n## Quick Start\n\n### Kubernetes\n\n`ialacol` offers first-class support for Kubernetes, which means you can automate and configure everything, unlike when running without it.\n\nTo quickly get started with ialacol on Kubernetes, follow the steps below:\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install llama-2-7b-chat ialacol/ialacol\n```\n\nBy default, it deploys [Meta's Llama 2 Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat) model quantized by [TheBloke](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML).\n\nPort-forward\n\n```sh\nkubectl port-forward svc/llama-2-7b-chat 8000:8000\n```\n\nChat with the default model `llama-2-7b-chat.ggmlv3.q4_0.bin` using `curl`\n\n```sh\ncurl -X POST \\\n     -H 'Content-Type: application/json' \\\n     -d '{ \"messages\": [{\"role\": \"user\", \"content\": \"How are you?\"}], \"model\": \"llama-2-7b-chat.ggmlv3.q4_0.bin\", \"stream\": false}' \\\n     http://localhost:8000/v1/chat/completions\n```\n\nAlternatively, use OpenAI's client library (see more examples in the `examples/openai` folder).\n\n```sh\nopenai -k \"sk-fake\" \\\n     -b http://localhost:8000/v1 -vvvvv \\\n     api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \\\n     -g user \"Hello world!\"\n```\n\n### Configuration\n\nAll configuration is done via environment variables.\n\n| Parameter               
           | Description                                                          | Default | Example                                                                      |\n| :----------------------------------| :------------------------------------------------------------------- | :------ | :--------------------------------------------------------------------------- |\n| `DEFAULT_MODEL_HG_REPO_ID`         | The Hugging Face repo id to download the model                       | `None`  | `TheBloke/orca_mini_3B-GGML`                                                 |\n| `DEFAULT_MODEL_HG_REPO_REVISION`   | The Hugging Face repo revision                                       | `main`  | `gptq-4bit-32g-actorder_True`                                                |\n| `DEFAULT_MODEL_FILE`               | The file name to download from the repo, optional for GPTQ models    | `None`  | `orca-mini-3b.ggmlv3.q4_0.bin`                                               |\n| `MODE_TYPE`                        | Model type to override the auto model type detection                 | `None`  | `gptq`, `gpt_bigcode`, `llama`, `mpt`, `replit`, `falcon`, `gpt_neox` `gptj` |\n| `LOGGING_LEVEL`                    | Logging level                                                        | `INFO`  | `DEBUG`                                                                      |\n| `TOP_K`                            | top-k for sampling.                                                  | `40 `   | Integers                                                                     |\n| `TOP_P`                            | top-p for sampling.                                                  | `1.0`   | Floats                                                                       |\n| `REPETITION_PENALTY`               | rp for sampling.                                                     
| `1.1`   | Floats                                                                       |\n| `LAST_N_TOKENS`                    | The last n tokens for repetition penalty.                            | `1.1`   | Integers                                                                     |\n| `SEED`                             | The seed for sampling.                                               | `-1`    | Integers                                                                     |\n| `BATCH_SIZE`                       | The batch size for evaluating tokens, only for GGUF/GGML models      | `8`     | Integers                                                                     |\n| `THREADS`                          | Thread number override auto detect by CPU/2, set `1` for GPTQ models | `Auto`  | Integers                                                                     |\n| `MAX_TOKENS`                       | The max number of token to generate                                  | `512`   | Integers                                                                     |\n| `STOP`                             | The token to stop the generation                                     | `None`  | `\u003c|endoftext\u003e`                                                               |\n| `CONTEXT_LENGTH`                   | Override the auto detect context length                              | `512`   | Integers                                                                     |\n| `GPU_LAYERS`                       | The number of layers to off load to GPU                              | `0`     | Integers                                                                     |\n| `TRUNCATE_PROMPT_LENGTH`           | Truncate the prompt if set                                           | `0`     | Integers                                                                     |\n\nSampling parameters including `TOP_K`, `TOP_P`, `REPETITION_PENALTY`, `LAST_N_TOKENS`, `SEED`, 
`MAX_TOKENS`, `STOP` can be overridden per request via the request body, for example:\n\n```sh\ncurl -X POST \\\n     -H 'Content-Type: application/json' \\\n     -d '{ \"messages\": [{\"role\": \"user\", \"content\": \"Tell me a story.\"}], \"model\": \"llama-2-7b-chat.ggmlv3.q4_0.bin\", \"stream\": false, \"temperature\": \"2\", \"top_p\": \"1.0\", \"top_k\": \"0\" }' \\\n     http://localhost:8000/v1/chat/completions\n```\n\nThis request uses `temperature=2`, `top_p=1` and `top_k=0`.\n\n\n### Run in Container\n\n#### Image from Github Registry\n\nThere is an [image](https://github.com/chenhunghan/ialacol/pkgs/container/ialacol) hosted on ghcr.io (alternatively, the [CUDA11](https://github.com/chenhunghan/ialacol/pkgs/container/ialacol-cuda11), [CUDA12](https://github.com/chenhunghan/ialacol/pkgs/container/ialacol-cuda12), [METAL](https://github.com/chenhunghan/ialacol/pkgs/container/ialacol-metal), and [GPTQ](https://github.com/chenhunghan/ialacol/pkgs/container/ialacol-gptq) variants).\n\n```sh\ndocker run --rm -it -p 8000:8000 \\\n     -e DEFAULT_MODEL_HG_REPO_ID=\"TheBloke/Llama-2-7B-Chat-GGML\" \\\n     -e DEFAULT_MODEL_FILE=\"llama-2-7b-chat.ggmlv3.q4_0.bin\" \\\n     ghcr.io/chenhunghan/ialacol:latest\n```\n\n#### From Source\n\nFor developers/contributors\n\n##### Python\n\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate\npython3 -m pip install -r requirements.txt\nDEFAULT_MODEL_HG_REPO_ID=\"TheBloke/stablecode-completion-alpha-3b-4k-GGML\" DEFAULT_MODEL_FILE=\"stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin\" LOGGING_LEVEL=\"DEBUG\" THREADS=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999\n```\n\n##### Docker\n\nBuild image\n\n```sh\ndocker build --file ./Dockerfile -t ialacol .\n```\n\nRun container\n\n```sh\nexport DEFAULT_MODEL_HG_REPO_ID=\"TheBloke/orca_mini_3B-GGML\"\nexport DEFAULT_MODEL_FILE=\"orca-mini-3b.ggmlv3.q4_0.bin\"\ndocker run --rm -it -p 8000:8000 \\\n     -e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \\\n     -e 
DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol\n```\n\n## GPU Acceleration\n\nTo enable GPU/CUDA acceleration, you need to use the container image built for GPU and add the `GPU_LAYERS` environment variable. `GPU_LAYERS` is determined by the size of your GPU memory. See the PR/discussion in [llama.cpp](https://github.com/ggerganov/llama.cpp/pull/1412) to find the best value.\n\n### CUDA 11\n\n- `deployment.image` = `ghcr.io/chenhunghan/ialacol-cuda11:latest`\n- `deployment.env.GPU_LAYERS` is the number of layers to offload to the GPU.\n\n### CUDA 12\n\n- `deployment.image` = `ghcr.io/chenhunghan/ialacol-cuda12:latest`\n- `deployment.env.GPU_LAYERS` is the number of layers to offload to the GPU.\n\nOnly `llama`, `falcon`, `mpt` and `gpt_bigcode` (StarCoder/StarChat) support CUDA.\n\n#### Llama with CUDA12\n\n```sh\nhelm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml\n```\n\nDeploys the llama2 7b model with 40 layers offloaded to the GPU. The inference is accelerated by CUDA 12.\n\n#### StarCoderPlus with CUDA12\n\n```sh\nhelm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml\n```\n\nDeploys the [Starcoderplus-Guanaco-GPT4-15B-V1.0 model](https://huggingface.co/LoupGarou/Starcoderplus-Guanaco-GPT4-15B-V1.0) with 40 layers offloaded to the GPU. The inference is accelerated by CUDA 12.\n\n### CUDA Driver Issues\n\nIf you see `CUDA driver version is insufficient for CUDA runtime version` when making the request, you are likely using an NVIDIA driver that is not [compatible with the CUDA version](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html).\n\nUpgrade the driver manually on the node (see [here](https://github.com/awslabs/amazon-eks-ami/issues/1060) if you are using CUDA11 + AMI). 
Or try a different version of CUDA.\n\n### Metal\n\nTo enable Metal support, use the image `ialacol-metal` built for Metal.\n\n- `deployment.image` = `ghcr.io/chenhunghan/ialacol-metal:latest`\n\nFor example\n\n```sh\nhelm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml.yaml\n```\n\n### GPTQ\n\nTo use GPTQ, you must set\n\n- `deployment.image` = `ghcr.io/chenhunghan/ialacol-gptq:latest`\n- `deployment.env.MODEL_TYPE` = `gptq`\n\nFor example\n\n```sh\nhelm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml.yaml\n```\n\n```sh\nkubectl port-forward svc/llama2-7b-chat-gptq 8000:8000\nopenai -k \"sk-fake\" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user \"Hello world!\"\n```\n\n## Tips\n\n### Copilot\n\n`ialacol` can serve as the backend for the GitHub Copilot client, as Copilot's API is almost identical to the OpenAI completion API.\n\nHowever, there are a few things to keep in mind:\n\n1. The Copilot client sends a lengthy prompt that includes all the related context for code completion (see [copilot-explorer](https://github.com/thakkarparth007/copilot-explorer)), which puts a heavy load on the server. If you are running `ialacol` locally, set the `TRUNCATE_PROMPT_LENGTH` environment variable to truncate the prompt from the beginning and reduce the workload.\n\n2. 
Copilot sends requests in parallel; to increase throughput, you probably need a queue like [text-inference-batcher](https://github.com/ialacol/text-inference-batcher).\n\nStart two instances of ialacol:\n\n```bash\ngh repo clone chenhunghan/ialacol \u0026\u0026 cd ialacol \u0026\u0026 python3 -m venv .venv \u0026\u0026 source .venv/bin/activate \u0026\u0026 python3 -m pip install -r requirements.txt\n# Export the settings so that both uvicorn processes inherit them\nexport LOGGING_LEVEL=\"DEBUG\"\nexport THREADS=2\nexport DEFAULT_MODEL_HG_REPO_ID=\"TheBloke/stablecode-completion-alpha-3b-4k-GGML\"\nexport DEFAULT_MODEL_FILE=\"stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin\"\nexport TRUNCATE_PROMPT_LENGTH=100 # optional\n# Run the first instance in the background so the second one can start\nuvicorn main:app --host 0.0.0.0 --port 9998 \u0026\nuvicorn main:app --host 0.0.0.0 --port 9999\n```\n\nStart [tib](https://github.com/ialacol/text-inference-batcher), pointing to the upstream ialacol instances.\n\n```bash\ngh repo clone ialacol/text-inference-batcher \u0026\u0026 cd text-inference-batcher \u0026\u0026 npm install\nUPSTREAMS=\"http://localhost:9998,http://localhost:9999\" npm start\n```\n\nConfigure VSCode Github Copilot to use [tib](https://github.com/ialacol/text-inference-batcher).\n\n```json\n\"github.copilot.advanced\": {\n     \"debug.overrideEngine\": \"stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin\",\n     \"debug.testOverrideProxyUrl\": \"http://localhost:8000\",\n     \"debug.overrideProxyUrl\": \"http://localhost:8000\"\n}\n```\n\n### Creative vs. 
Conservative\n\nLLMs are known to be sensitive to sampling parameters. A higher `temperature` leads to more \"randomness\", so the LLM becomes more \"creative\"; `top_p` and `top_k` also contribute to the \"randomness\".\n\nTo make the LLM more creative:\n\n```sh\ncurl -X POST \\\n     -H 'Content-Type: application/json' \\\n     -d '{ \"messages\": [{\"role\": \"user\", \"content\": \"Tell me a story.\"}], \"model\": \"llama-2-7b-chat.ggmlv3.q4_0.bin\", \"stream\": false, \"temperature\": \"2\", \"top_p\": \"1.0\", \"top_k\": \"0\" }' \\\n     http://localhost:8000/v1/chat/completions\n```\n\nTo make the LLM more consistent, so that it tends to generate the same result for the same input:\n\n```sh\ncurl -X POST \\\n     -H 'Content-Type: application/json' \\\n     -d '{ \"messages\": [{\"role\": \"user\", \"content\": \"Tell me a story.\"}], \"model\": \"llama-2-7b-chat.ggmlv3.q4_0.bin\", \"stream\": false, \"temperature\": \"0.1\", \"top_p\": \"0.1\", \"top_k\": \"40\" }' \\\n     http://localhost:8000/v1/chat/completions\n```\n\n## Roadmap\n\n- [x] Support `starcoder` model type via [ctransformers](https://github.com/marella/ctransformers), including:\n  - StarChat \u003chttps://huggingface.co/TheBloke/starchat-beta-GGML\u003e\n  - StarCoder \u003chttps://huggingface.co/TheBloke/starcoder-GGML\u003e\n  - StarCoderPlus \u003chttps://huggingface.co/TheBloke/starcoderplus-GGML\u003e\n- [x] Mimic the rest of the OpenAI API, including `GET /models` and `POST /completions`\n- [ ] GPU acceleration (CUDA/METAL)\n- [ ] Support `POST /embeddings` backed by Hugging Face Apache-2.0 embedding models such as [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) and [hkunlp/instructor](https://huggingface.co/hkunlp/instructor-large)\n- [ ] Support Apache-2.0 [fastchat-t5-3b](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)\n- [ ] Support more Apache-2.0 models such as [codet5p](https://huggingface.co/Salesforce/codet5p-16b) and others listed 
[here](https://github.com/eugeneyan/open-llms)\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=chenhunghan/ialacol\u0026type=Date)](https://star-history.com/#chenhunghan/ialacol\u0026Date)\n\n## Receipts\n\n### Llama-2\n\nDeploy [Meta's Llama 2 Chat](https://huggingface.co/meta-llama) model quantized by [TheBloke](https://huggingface.co/TheBloke).\n\n7B Chat\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml\n```\n\n13B Chat\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml\n```\n\n70B Chat\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml\n```\n\n### OpenLM Research's OpenLLaMA Models\n\nDeploy [OpenLLaMA 7B](https://github.com/openlm-research/open_llama) model quantized by [rustformers](https://huggingface.co/rustformers/open-llama-ggml).\n\nℹ️ This is a base model, likely only useful for text completion.\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml\n```\n\n### VMWare's OpenLlama 13B Open Instruct\n\nDeploy [OpenLLaMA 13B Open Instruct](https://huggingface.co/VMware/open-llama-13b-open-instruct) model quantized by [TheBloke](https://huggingface.co/TheBloke).\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml\n```\n\n### Mosaic's MPT Models\n\nDeploy [MosaicML's MPT-7B](https://www.mosaicml.com/blog/mpt-7b) model quantized by [rustformers](https://huggingface.co/rustformers). 
ℹ️ This is a base model, likely only useful for text completion.\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml\n```\n\nDeploy [MosaicML's MPT-30B Chat](https://www.mosaicml.com/blog/mpt-30b) model quantized by [TheBloke](https://huggingface.co/TheBloke).\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml\n```\n\n### Falcon Models\n\nDeploy [Uncensored Falcon 7B](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-7b) model quantized by [TheBloke](https://huggingface.co/TheBloke).\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml\n```\n\nDeploy [Uncensored Falcon 40B](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-40b) model quantized by [TheBloke](https://huggingface.co/TheBloke).\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml\n```\n\n### StarCoder Models (starcoder, starchat, starcoderplus, WizardCoder)\n\nDeploy [`starchat-beta`](https://huggingface.co/TheBloke/starchat-beta-GGML) model quantized by [TheBloke](https://huggingface.co/TheBloke).\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml\n```\n\nDeploy [`WizardCoder`](https://huggingface.co/WizardLM/WizardCoder-15B-V1.0) model quantized by [TheBloke](https://huggingface.co/TheBloke).\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml\n```\n\n### Pythia Models\n\nDeploy the lightweight 
[`pythia-70m`](https://huggingface.co/rustformers/pythia-ggml) model with only 70 million parameters (~40MB) quantized by [rustformers](https://huggingface.co/rustformers).\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml\n```\n\n### RedPajama Models\n\nDeploy [`RedPajama` 3B](https://huggingface.co/rustformers/redpajama-3b-ggml) model\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml\n```\n\n### StableLM Models\n\nDeploy [`stableLM`](https://huggingface.co/rustformers/stablelm-ggml) 7B model\n\n```sh\nhelm repo add ialacol https://chenhunghan.github.io/ialacol\nhelm repo update\nhelm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml\n```\n\n## Development\n\n```sh\npython3 -m venv .venv\nsource .venv/bin/activate\npython3 -m pip install -r requirements.txt\npip freeze \u003e requirements.txt\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchenhunghan%2Fialacol","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchenhunghan%2Fialacol","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchenhunghan%2Fialacol/lists"}