{"id":16981096,"url":"https://github.com/jasonacox/tinyllm","last_synced_at":"2025-09-06T13:32:40.824Z","repository":{"id":196295835,"uuid":"690363284","full_name":"jasonacox/TinyLLM","owner":"jasonacox","description":"Setup and run a local LLM and Chatbot using consumer grade hardware. ","archived":false,"fork":false,"pushed_at":"2025-08-03T07:18:37.000Z","size":568,"stargazers_count":275,"open_issues_count":6,"forks_count":31,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-08-03T08:31:59.796Z","etag":null,"topics":["artificial-intelligence","chatbot","large-language-models","llama-cpp-python","llm","openai","rag","retrieval-augmented-generation","vllm"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jasonacox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-09-12T03:48:46.000Z","updated_at":"2025-08-03T08:01:18.000Z","dependencies_parsed_at":"2024-01-27T17:31:45.186Z","dependency_job_id":"2d251a8c-cd36-49eb-93e2-c6c60d708e59","html_url":"https://github.com/jasonacox/TinyLLM","commit_stats":null,"previous_names":["jasonacox/tinyllm"],"tags_count":33,"template":false,"template_full_name":null,"purl":"pkg:github/jasonacox/TinyLLM","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jasonacox%2FTinyLLM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jasonacox%2FTinyLLM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jasonacox%2FTinyLLM/releases","mani
fests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jasonacox%2FTinyLLM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jasonacox","download_url":"https://codeload.github.com/jasonacox/TinyLLM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jasonacox%2FTinyLLM/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273912534,"owners_count":25189969,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-06T02:00:13.247Z","response_time":2576,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","chatbot","large-language-models","llama-cpp-python","llm","openai","rag","retrieval-augmented-generation","vllm"],"created_at":"2024-10-14T02:04:37.178Z","updated_at":"2025-09-06T13:32:40.776Z","avatar_url":"https://github.com/jasonacox.png","language":"JavaScript","readme":"# TinyLLM\n\nTinyLLM? Yes, the name is a bit of a contradiction, but it means well. It's all about putting a large language model (LLM) on a tiny system that still delivers acceptable performance.\n\nThis project helps you build a small locally hosted LLM with a ChatGPT-like web interface using consumer grade hardware. 
To read more about my research with llama.cpp and LLMs, see [research.md](research.md).\n\n## Table of Contents\n\n- [Key Features](#key-features)\n- [Hardware Requirements](#hardware-requirements)\n- [Manual Setup](#manual-setup)\n- [Run a Local LLM](#run-a-local-llm)\n  - [Ollama Server (Option 1)](#ollama-server-option-1)\n  - [vLLM Server (Option 2)](#vllm-server-option-2)\n  - [Llama-cpp-python Server (Option 3)](#llama-cpp-python-server-option-3)\n- [Run a Chatbot](#run-a-chatbot)\n  - [Example Session](#example-session)\n  - [Read URLs](#read-urls)\n  - [Current News](#current-news)\n  - [Manual Setup](#manual-setup-1)\n- [LLM Models](#llm-models)\n- [LLM Tools](#llm-tools)\n- [References](#references)\n\n## Key Features\n\n* Supports multiple LLMs (see list below)\n* Builds a local OpenAI API web service via [Ollama](https://ollama.com/), [llama.cpp](https://github.com/ggerganov/llama.cpp) or [vLLM](https://github.com/vllm-project/vllm). \n* Serves up a Chatbot web interface with customizable prompts, accessing external websites (URLs), vector databases and other sources (e.g. news, stocks, weather).\n\n## Hardware Requirements\n\n* CPU: Intel, AMD or Apple Silicon\n* Memory: 8GB+ DDR4\n* Disk: 128G+ SSD\n* GPU: NVIDIA (e.g. GTX 1060 6GB, RTX 3090 24GB) or Apple M1/M2\n* OS: Ubuntu Linux, MacOS\n* Software: Python 3, CUDA Version: 12.2\n\n## Quickstart\n\nTODO - Quick start setup script.\n\n## Manual Setup\n\n```bash\n# Clone the project\ngit clone https://github.com/jasonacox/TinyLLM.git\ncd TinyLLM\n```\n\n## Run a Local LLM\n\nTo run a local LLM, you will need an inference server for the model. This project recommends these options: [vLLM](https://github.com/vllm-project/vllm), [llama-cpp-python](https://github.com/abetlen/llama-cpp-python), and [Ollama](https://ollama.com/). All of these provide a built-in OpenAI API compatible web server that will make it easier for you to integrate with other tools.  
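Because all three server options expose the same OpenAI-compatible API, client code written against one works against the others. Here is a minimal, stdlib-only sketch; the base URL `http://localhost:8000/v1` and model name `tinyllm` are assumptions that should be adjusted to match your server:

```python
# Minimal OpenAI-compatible chat client (stdlib only). Works against Ollama,
# vLLM, or llama-cpp-python since all three expose the same API surface.
# The base URL and model name in the example below are assumptions.
import json
import urllib.request

def build_chat_request(model, prompt, system=None):
    """Build an OpenAI-style /v1/chat/completions request payload."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {"model": model, "messages": messages}

def chat(base_url, model, prompt, system=None):
    """POST a prompt to an OpenAI-compatible server and return the reply text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(build_chat_request(model, prompt, system)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running server):
#   print(chat("http://localhost:8000/v1", "tinyllm", "What is love?"))
```

The same client also works with the official `openai` Python package by pointing its `base_url` at the local server.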
\n\n### Ollama Server (Option 1)\n\nThe Ollama project has made it super easy to install and run LLMs on a variety of systems (MacOS, Linux, Windows) with limited hardware. It serves up an OpenAI compatible API as well. The underlying LLM engine is llama.cpp. Like llama.cpp, the downside with this server is that it can only handle one session/prompt at a time. To run the Ollama server container:\n\n```bash\n# Install and run Ollama server\ndocker run -d --gpus=all \\\n    -v $PWD/ollama:/root/.ollama \\\n    -p 11434:11434 \\\n    -p 8000:11434 \\\n    --restart unless-stopped \\\n    --name ollama \\\n    ollama/ollama\n\n# Download and test run the llama3 model\ndocker exec -it ollama ollama run llama3\n\n# Tell server to keep model loaded in GPU\ncurl http://localhost:11434/api/generate -d '{\"model\": \"llama3\", \"keep_alive\": -1}'\n```\n\nOllama supports several models (LLMs): https://ollama.com/library. If you set up the Docker container mentioned above, you can download and run them using:\n\n```bash\n# Download and run Phi-3 Mini, open model by Microsoft.\ndocker exec -it ollama ollama run phi3\n\n# Download and run mistral 7B model, by Mistral AI\ndocker exec -it ollama ollama run mistral\n```\n\nIf you use the TinyLLM Chatbot (see below) with Ollama, make sure you specify the model via `LLM_MODEL=\"llama3\"`. This will cause Ollama to download and run this model. It may take a while to start on first run unless you run one of the `ollama run` or `curl` commands above.\n\n### vLLM Server (Option 2)\n\nvLLM offers a robust OpenAI API compatible web server that supports multiple simultaneous inference threads (sessions). It automatically downloads the models you specify from HuggingFace and runs extremely well in containers. vLLM requires GPUs with more VRAM since it uses non-quantized models. AWQ models are also available, and more optimizations are underway in the project to reduce the memory footprint. 
Note: for GPUs with a compute capability of 6 or lower (Pascal architecture; see the [GPU table](https://github.com/jasonacox/TinyLLM/tree/main/vllm#nvidia-gpu-and-torch-architecture)), follow the details [here](./vllm/) instead.\n\n```bash\n# Build Container\ncd vllm\n./build.sh\n\n# Make a Directory to store Models\nmkdir models\n\n# Edit run.sh or run-awq.sh to pull the model you want to use. Mistral is set by default.\n# Run the Container - This will download the model on the first run\n./run.sh\n\n# The trailing logs will be displayed so you can see the progress. Use ^C to exit without\n# stopping the container.\n```\n\n### Llama-cpp-python Server (Option 3)\n\nllama-cpp-python's OpenAI API compatible web server is easy to set up and use. It runs optimized GGUF models that work well on many consumer grade GPUs with small amounts of VRAM. As with Ollama, a downside with this server is that it can only handle one session/prompt at a time. The steps below outline how to set up and run the server via the command line. 
Read the details in [llmserver](./llmserver/) to see how to set it up as a persistent service or Docker container on your Linux host.\n\n```bash\n# Uninstall any old version of llama-cpp-python\npip3 uninstall llama-cpp-python -y\n\n# Linux Target with Nvidia CUDA support\nCMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip3 install llama-cpp-python==0.2.27 --no-cache-dir\nCMAKE_ARGS=\"-DLLAMA_CUBLAS=on\" FORCE_CMAKE=1 pip3 install 'llama-cpp-python[server]==0.2.27' --no-cache-dir\n\n# MacOS Target with Apple Silicon M1/M2\nCMAKE_ARGS=\"-DLLAMA_METAL=on\" pip3 install -U llama-cpp-python --no-cache-dir\npip3 install 'llama-cpp-python[server]'\n\n# Download Models from HuggingFace\ncd llmserver/models\n\n# Get the Mistral 7B GGUF Q5_K_M model and the Meta Llama-2 7B GGUF Q5_K_M model\nwget https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf\nwget https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf\n\n# Return to the llmserver directory so the --model path below resolves\ncd ..\n\n# Run Test - API Server\npython3 -m llama_cpp.server \\\n    --model ./models/mistral-7b-instruct-v0.1.Q5_K_M.gguf \\\n    --host localhost \\\n    --n_gpu_layers 99 \\\n    --n_ctx 2048 \\\n    --chat_format llama-2\n```\n\n## Run a Chatbot\n\nThe TinyLLM Chatbot is a simple web-based Python FastAPI app that allows you to chat with an LLM using the OpenAI API. It supports multiple sessions and remembers your conversational history. 
Its RAG (Retrieval Augmented Generation) features include:\n\n* Summarize external websites and PDFs (paste a URL in the chat window)\n* List the top 10 headlines from current news (use `/news`)\n* Display a company's stock symbol and current stock price (use `/stock \u003ccompany\u003e`)\n* Provide current weather conditions (use `/weather \u003clocation\u003e`)\n* Use a vector database for RAG queries - see the [RAG](rag) page for details\n\n```bash\n# Move to chatbot folder\ncd ../chatbot\ntouch prompts.json\n\n# Pull and run latest container - see run.sh\ndocker run \\\n    -d \\\n    -p 5000:5000 \\\n    -e PORT=5000 \\\n    -e OPENAI_API_BASE=\"http://localhost:8000/v1\" \\\n    -e LLM_MODEL=\"tinyllm\" \\\n    -e USE_SYSTEM=\"false\" \\\n    -e SENTENCE_TRANSFORMERS_HOME=/app/.tinyllm \\\n    -v $PWD/.tinyllm:/app/.tinyllm \\\n    --name chatbot \\\n    --restart unless-stopped \\\n    jasonacox/chatbot\n```\n\n### Example Session\n\nOpen http://localhost:5000 - Example session:\n\n\u003cimg width=\"930\" alt=\"image\" src=\"https://github.com/jasonacox/TinyLLM/assets/836718/9eef2769-a352-4cc9-9698-ce15e41c2c45\"\u003e\n\n### Read URLs\n\nIf a URL is pasted in the text box, the chatbot will read and summarize it.\n\n\u003cimg width=\"810\" alt=\"image\" src=\"https://github.com/jasonacox/TinyLLM/assets/836718/44d8a2f7-54c1-4b1c-8471-fdf13439be3b\"\u003e\n\n### Current News\n\nThe `/news` command will fetch the latest news and have the LLM summarize the top ten headlines. 
It will store the raw feed in the context prompt to allow follow-up questions.\n\n\u003cimg width=\"930\" alt=\"image\" src=\"https://github.com/jasonacox/TinyLLM/assets/836718/2732fe07-99ee-4795-a8ac-42d9a9712f6b\"\u003e\n\n### Manual Setup\n\nYou can also test the chatbot server without Docker using the following commands.\n\n```bash\n# Install required packages\npip3 install fastapi uvicorn python-socketio jinja2 openai bs4 pypdf requests lxml aiohttp\n\n# Run the chatbot web server\npython3 server.py\n```\n\n## LLM Models\n\nHere are some suggested models that work well with llmserver (llama-cpp-python). You can test other models and different quantizations, but in my experiments, the Q5_K_M models performed the best. Below are the download links from HuggingFace as well as each model card's suggested context length and chat prompt mode.\n\n| LLM | Quantized | Link to Download | Context Length | Chat Prompt Mode |\n| --- | --- | --- | --- | --- |\n|  |  | 7B Models |  |  |\n| Mistral v0.1 7B | 5-bit | [mistral-7b-instruct-v0.1.Q5_K_M.gguf](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q5_K_M.gguf) | 4096 | llama-2 |\n| Llama-2 7B | 5-bit | [llama-2-7b-chat.Q5_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf) | 2048 | llama-2 |\n| Mistrallite 32K 7B | 5-bit | [mistrallite.Q5_K_M.gguf](https://huggingface.co/TheBloke/MistralLite-7B-GGUF/resolve/main/mistrallite.Q5_K_M.gguf) | 16384 | mistrallite (can be glitchy) |\n|  |  | 10B Models |  |  |\n| Nous-Hermes-2-SOLAR 10.7B | 5-bit | [nous-hermes-2-solar-10.7b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Nous-Hermes-2-SOLAR-10.7B-GGUF/resolve/main/nous-hermes-2-solar-10.7b.Q5_K_M.gguf) | 4096 | chatml |\n|  |  | 13B Models |  |  |\n| Claude2 trained Alpaca 13B | 5-bit | [claude2-alpaca-13b.Q5_K_M.gguf](https://huggingface.co/TheBloke/claude2-alpaca-13B-GGUF/resolve/main/claude2-alpaca-13b.Q5_K_M.gguf) | 2048 | 
chatml |\n| Llama-2 13B | 5-bit | [llama-2-13b-chat.Q5_K_M.gguf](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q5_K_M.gguf) | 2048 | llama-2 |\n| Vicuna 13B v1.5| 5-bit | [vicuna-13b-v1.5.Q5_K_M.gguf](https://huggingface.co/TheBloke/vicuna-13B-v1.5-GGUF/resolve/main/vicuna-13b-v1.5.Q5_K_M.gguf) | 2048 | vicuna |\n|  |  | Mixture-of-Experts (MoE) Models |  |  |\n| Hai's Mixtral 11Bx2 MoE 19B | 5-bit | [mixtral_11bx2_moe_19b.Q5_K_M.gguf](https://huggingface.co/TheBloke/Mixtral_11Bx2_MoE_19B-GGUF/resolve/main/mixtral_11bx2_moe_19b.Q5_K_M.gguf) | 4096 | chatml |\n| Mixtral-8x7B v0.1 | 3-bit | [Mixtral-8x7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf) | 4096 | llama-2 |\n| Mixtral-8x7B v0.1 | 4-bit | [Mixtral-8x7B-Instruct-v0.1-GGUF](https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf) | 4096 | llama-2 |\n\nHere are some suggested models that work well with vLLM.\n\n| LLM | Quantized | Link to Download | Context Length | License |\n| --- | --- | --- | --- | --- |\n| Mistral v0.1 7B | None | [mistralai/Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1) | 32k | Apache 2 |\n| Mistral v0.2 7B | None | [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) | 32k | Apache 2 |\n| Mistral v0.1 7B AWQ | AWQ | [TheBloke/Mistral-7B-Instruct-v0.1-AWQ](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-AWQ) | 32k | Apache 2 |\n| Mixtral-8x7B | None | [mistralai/Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) | 32k | Apache 2 |\n| Pixtral-12B-2409 12B Vision | None | [mistralai/Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409) | 128k | Apache 2 |\n| Meta Llama-3 8B | None | 
[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 8k | Meta |\n| Meta Llama-3.2 11B Vision | FP8 | [neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic](https://huggingface.co/neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic) | 128k | Meta |\n| Qwen-2.5 7B | None | [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 128k | Apache 2 |\n| Yi-1.5 9B | None | [01-ai/Yi-1.5-9B-Chat-16K](https://huggingface.co/01-ai/Yi-1.5-9B-Chat-16K) | 16k | Apache 2 |\n| Phi-3 Small 7B | None | [microsoft/Phi-3-small-8k-instruct](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) | 16k | MIT |\n| Phi-3 Medium 14B | None | [microsoft/Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) | 4k | MIT |\n| Phi-3.5 Vision 4B | None | [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) | 128k | MIT |\n| Phi-4 14B | None | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) | 16k | MIT |\n\n## LLM Tools\n\n### LLM\n\nA CLI utility (`llm`) and Python library for interacting with Large Language Models. To configure this tool to use your local LLM's OpenAI API:\n\n```bash\n# Install llm command line tool\npipx install llm\n\n# Location to store configuration files:\ndirname \"$(llm logs path)\"\n```\n\nYou define the model in the `extra-openai-models.yaml` file. Create this file in the directory discovered above. 
Edit `model_name` and `api_base` to match your LLM's OpenAI API setup:\n\n```yaml\n- model_id: tinyllm\n  model_name: meta-llama/Meta-Llama-3.1-8B-Instruct\n  api_base: \"http://localhost:8000/v1\"\n```\n\n```bash\n# Configure llm to use your local model\nllm models default tinyllm\n\n# Test\nllm \"What is love?\"\n```\n\n## References\n\n* llama.cpp - https://github.com/ggerganov/llama.cpp\n* llama-cpp-python - https://github.com/abetlen/llama-cpp-python\n* vLLM - https://github.com/vllm-project/vllm\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjasonacox%2Ftinyllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjasonacox%2Ftinyllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjasonacox%2Ftinyllm/lists"}