{"id":13490069,"url":"https://github.com/liltom-eth/llama2-webui","last_synced_at":"2025-05-14T20:02:23.178Z","repository":{"id":182930711,"uuid":"668513898","full_name":"liltom-eth/llama2-webui","owner":"liltom-eth","description":"Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere (Linux/Windows/Mac). Use `llama2-wrapper` as your local llama2 backend for Generative Agents/Apps.  ","archived":false,"fork":false,"pushed_at":"2024-03-22T09:50:24.000Z","size":1075,"stargazers_count":1958,"open_issues_count":26,"forks_count":204,"subscribers_count":24,"default_branch":"main","last_synced_at":"2025-04-08T09:49:16.925Z","etag":null,"topics":["llama-2","llama2","llm","llm-inference"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/liltom-eth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-07-20T02:03:38.000Z","updated_at":"2025-04-08T09:43:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"a73f379a-51ef-4b06-9a35-cc8ba82a57ae","html_url":"https://github.com/liltom-eth/llama2-webui","commit_stats":null,"previous_names":["liltom-eth/llama2-webui"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liltom-eth%2Fllama2-webui","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liltom-eth%2Fllama2-webui/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liltom-eth%2Fllama2-webui/releases","manifests_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/liltom-eth%2Fllama2-webui/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/liltom-eth","download_url":"https://codeload.github.com/liltom-eth/llama2-webui/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248384926,"owners_count":21094796,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llama-2","llama2","llm","llm-inference"],"created_at":"2024-07-31T19:00:40.236Z","updated_at":"2025-04-11T11:38:10.874Z","avatar_url":"https://github.com/liltom-eth.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","Tools for Self-Hosting","Project List","A01_文本生成_文本对话","HarmonyOS","GitHub projects","Applications","Repos","LLMs ChatUI"],"sub_categories":["LLMs","\u003cspan id=\"tool\"\u003eLLM (LLM \u0026 Tool)\u003c/span\u003e","大语言对话模型及数据","Windows Manager","提示语（魔法）"],"readme":"# llama2-webui\r\n\r\nRunning Llama 2 with gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac). \r\n- Supporting all Llama 2 models (7B, 13B, 70B, GPTQ, GGML, GGUF, [CodeLlama](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ)) with 8-bit, 4-bit mode. \r\n- Use [llama2-wrapper](https://pypi.org/project/llama2-wrapper/) as your local llama2 backend for Generative Agents/Apps; [colab example](./colab/Llama_2_7b_Chat_GPTQ.ipynb). 
\r\n- [Run OpenAI Compatible API](#start-openai-compatible-api) on Llama2 models.\r\n\r\n![screenshot](./static/screenshot.png)\r\n\r\n![code_llama_playground](https://i.imgur.com/FgMUiT6.gif)\r\n\r\n## Features\r\n\r\n- Supporting models: [Llama-2-7b](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)/[13b](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)/[70b](https://huggingface.co/llamaste/Llama-2-70b-chat-hf), [Llama-2-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ), [Llama-2-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML),  [Llama-2-GGUF](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF),  [CodeLlama](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ) ...\r\n- Supporting model backends: [transformers](https://github.com/huggingface/transformers), [bitsandbytes(8-bit inference)](https://github.com/TimDettmers/bitsandbytes), [AutoGPTQ(4-bit inference)](https://github.com/PanQiWei/AutoGPTQ), [llama.cpp](https://github.com/ggerganov/llama.cpp)\r\n- Demos: [Run Llama2 on MacBook Air](https://twitter.com/liltom_eth/status/1682791729207070720?s=20); [Run Llama2 on free Colab T4 GPU](./colab/Llama_2_7b_Chat_GPTQ.ipynb)\r\n- Use  [llama2-wrapper](https://pypi.org/project/llama2-wrapper/)  as your local llama2 backend for Generative Agents/Apps; [colab example](./colab/Llama_2_7b_Chat_GPTQ.ipynb).  
\r\n- [Run OpenAI Compatible API](#start-openai-compatible-api) on Llama2 models.\r\n- [News](./docs/news.md), [Benchmark](./docs/performance.md), [Issue Solutions](./docs/issues.md)\r\n\r\n## Contents\r\n\r\n- [Install](#install)\r\n- [Usage](#usage)\r\n  - [Start Chat UI](#start-chat-ui)\r\n  - [Start Code Llama UI](#start-code-llama-ui)\r\n  - [Use llama2-wrapper for Your App](#use-llama2-wrapper-for-your-app)\r\n  - [Start OpenAI Compatible API](#start-openai-compatible-api)\r\n- [Benchmark](#benchmark)\r\n- [Download Llama-2 Models](#download-llama-2-models)\r\n  - [Model List](#model-list)\r\n  - [Download Script](#download-script)\r\n- [Tips](#tips)\r\n  - [Env Examples](#env-examples)\r\n  - [Run on Nvidia GPU](#run-on-nvidia-gpu)\r\n    - [Run bitsandbytes 8 bit](#run-bitsandbytes-8-bit)\r\n    - [Run GPTQ 4 bit](#run-gptq-4-bit)\r\n  - [Run on CPU](#run-on-cpu)\r\n    - [Mac Metal Acceleration](#mac-metal-acceleration)\r\n    - [AMD/Nvidia GPU Acceleration](#amdnvidia-gpu-acceleration)\r\n- [License](#license)\r\n- [Contributing](#contributing)\r\n\r\n\r\n\r\n## Install\r\n### Method 1: From [PyPI](https://pypi.org/project/llama2-wrapper/)\r\n```\r\npip install llama2-wrapper\r\n```\r\nThe newest `llama2-wrapper\u003e=0.1.14` supports llama.cpp's `gguf` models.\r\n\r\nIf you would like to use old `ggml` models, install `llama2-wrapper\u003c=0.1.13` or manually install `llama-cpp-python==0.1.77`.\r\n\r\n### Method 2: From Source:\r\n\r\n```\r\ngit clone https://github.com/liltom-eth/llama2-webui.git\r\ncd llama2-webui\r\npip install -r requirements.txt\r\n```\r\n### Install Issues:\r\n`bitsandbytes \u003e= 0.39` may not work on older NVIDIA GPUs. 
In that case, to use `LOAD_IN_8BIT`, you may have to downgrade like this:\r\n\r\n-  `pip install bitsandbytes==0.38.1`\r\n\r\n`bitsandbytes` also needs a special install on Windows:\r\n\r\n```\r\npip uninstall bitsandbytes\r\npip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.0-py3-none-win_amd64.whl\r\n```\r\n\r\n## Usage\r\n\r\n### Start Chat UI\r\n\r\nRun the chatbot with the web UI:\r\n\r\n```bash\r\npython app.py\r\n```\r\n\r\n`app.py` loads the default config `.env`, which uses `llama.cpp` as the backend to run the `llama-2-7b-chat.ggmlv3.q4_0.bin` model for inference. The model `llama-2-7b-chat.ggmlv3.q4_0.bin` will be downloaded automatically.\r\n\r\n```bash\r\nRunning on backend llama.cpp.\r\nUse default model path: ./models/llama-2-7b-chat.Q4_0.gguf\r\nStart downloading model to: ./models/llama-2-7b-chat.Q4_0.gguf\r\n```\r\n\r\nYou can also customize `MODEL_PATH`, `BACKEND_TYPE`, and the model configs in the `.env` file to run different llama2 models on different backends (llama.cpp, transformers, gptq). \r\n\r\n### Start Code Llama UI\r\n\r\nWe provide a code completion / filling UI for Code Llama.\r\n\r\nThe base model **Code Llama** and the extended model **Code Llama - Python** are not fine-tuned to follow instructions. They should be prompted so that the expected answer is the natural continuation of the prompt. That means these two models focus on code filling and code completion.\r\n\r\nHere is an example that runs CodeLlama code completion on the llama.cpp backend:\r\n\r\n``` \r\npython code_completion.py --model_path ./models/codellama-7b.Q4_0.gguf\r\n```\r\n\r\n![code_llama_playground](https://i.imgur.com/FgMUiT6.gif)\r\n\r\n`codellama-7b.Q4_0.gguf` can be downloaded from [TheBloke/CodeLlama-7B-GGUF](https://huggingface.co/TheBloke/CodeLlama-7B-GGUF/blob/main/codellama-7b.Q4_0.gguf).\r\n\r\n**Code Llama - Instruct** was trained with “natural language instruction” inputs paired with anticipated outputs. 
This training helps the model understand what users expect from a prompt. That means instruct models can be used in a chatbot-like app.\r\n\r\nExample: run CodeLlama chat on the gptq backend:\r\n\r\n```\r\npython app.py --backend_type gptq --model_path ./models/CodeLlama-7B-Instruct-GPTQ/ --share True\r\n```\r\n\r\n![code_llama_chat](https://i.imgur.com/lQLfemB.gif)\r\n\r\n`CodeLlama-7B-Instruct-GPTQ` can be downloaded from [TheBloke/CodeLlama-7B-Instruct-GPTQ](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ).\r\n\r\n### Use llama2-wrapper for Your App\r\n\r\n🔥 For developers, we released `llama2-wrapper` as a llama2 backend wrapper on [PyPI](https://pypi.org/project/llama2-wrapper/).\r\n\r\nUse `llama2-wrapper` as your local llama2 backend to answer questions and more, [colab example](./colab/ggmlv3_q4_0.ipynb):\r\n\r\n```python\r\n# pip install llama2-wrapper\r\nfrom llama2_wrapper import LLAMA2_WRAPPER, get_prompt\r\nllama2_wrapper = LLAMA2_WRAPPER()\r\n# Default running on backend llama.cpp.\r\n# Automatically downloading model to: ./models/llama-2-7b-chat.ggmlv3.q4_0.bin\r\nprompt = \"Do you know Pytorch\"\r\nanswer = llama2_wrapper(get_prompt(prompt), temperature=0.9)\r\n```\r\n\r\nRun a gptq llama2 model on an Nvidia GPU, [colab example](./colab/Llama_2_7b_Chat_GPTQ.ipynb):\r\n\r\n```python\r\nfrom llama2_wrapper import LLAMA2_WRAPPER\r\nllama2_wrapper = LLAMA2_WRAPPER(backend_type=\"gptq\")\r\n# Automatically downloading model to: ./models/Llama-2-7b-Chat-GPTQ\r\n```\r\n\r\nRun llama2 7b with bitsandbytes 8 bit with a `model_path`:\r\n\r\n```python\r\nfrom llama2_wrapper import LLAMA2_WRAPPER\r\nllama2_wrapper = LLAMA2_WRAPPER(\r\n    model_path=\"./models/Llama-2-7b-chat-hf\",\r\n    backend_type=\"transformers\",\r\n    load_in_8bit=True,\r\n)\r\n```\r\nCheck the [API Document](https://pypi.org/project/llama2-wrapper/) for more usage examples.\r\n\r\n### Start OpenAI Compatible API\r\n\r\n`llama2-wrapper` offers a web server that acts as a 
drop-in replacement for the OpenAI API. This allows you to use Llama2 models with any OpenAI-compatible clients, libraries, or services.\r\n\r\nStart the FastAPI server:\r\n\r\n```\r\npython -m llama2_wrapper.server\r\n```\r\n\r\nBy default, it uses `llama.cpp` as the backend to run the `llama-2-7b-chat.ggmlv3.q4_0.bin` model.\r\n\r\nStart the FastAPI server with the `gptq` backend:\r\n\r\n```\r\npython -m llama2_wrapper.server --backend_type gptq\r\n```\r\n\r\nNavigate to http://localhost:8000/docs to see the OpenAPI documentation.\r\n\r\n#### Basic settings\r\n\r\n| Flag             | Description                                                  |\r\n| ---------------- | ------------------------------------------------------------ |\r\n| `-h`, `--help`   | Show this help message.                                      |\r\n| `--model_path`   | The path to the model to use for generating completions.     |\r\n| `--backend_type` | Backend for llama2, options: llama.cpp, gptq, transformers   |\r\n| `--max_tokens`   | Maximum context size.                                        |\r\n| `--load_in_8bit` | Whether to use bitsandbytes to run model in 8 bit mode (only for transformers models). |\r\n| `--verbose`      | Whether to print verbose output to stderr.                   
|\r\n| `--host`         | API address                                                  |\r\n| `--port`         | API port                                                     |\r\n\r\n## Benchmark\r\n\r\nRun the benchmark script to measure performance on your device. `benchmark.py` loads the same `.env` as `app.py`:\r\n\r\n```bash\r\npython benchmark.py\r\n```\r\n\r\nYou can also set `iter`, `backend_type`, and `model_path` for the benchmark run (these override the `.env` args):\r\n\r\n```bash\r\npython benchmark.py --iter NB_OF_ITERATIONS --backend_type gptq\r\n```\r\n\r\nBy default, the number of iterations is 5. You can lower it for a faster result or raise it for a more accurate one, but please only report results with at least 5 iterations.\r\n\r\nThis [colab example](./colab/Llama_2_7b_Chat_GPTQ.ipynb) also shows you how to benchmark a gptq model on a free Google Colab T4 GPU.\r\n\r\nSome benchmark results:\r\n\r\n| Model                       | Precision | Device             | RAM / GPU VRAM | Speed (tokens/sec) | Load time (s) |\r\n| --------------------------- | --------- | ------------------ | -------------- | ------------------ | ------------- |\r\n| Llama-2-7b-chat-hf          | 8 bit     | NVIDIA RTX 2080 Ti | 7.7 GB VRAM    | 3.76               | 641.36        |\r\n| Llama-2-7b-Chat-GPTQ        | 4 bit     | NVIDIA RTX 2080 Ti | 5.8 GB VRAM    | 18.85              | 192.91        |\r\n| Llama-2-7b-Chat-GPTQ        | 4 bit     | Google Colab T4    | 5.8 GB VRAM    | 18.19              | 37.44         |\r\n| llama-2-7b-chat.ggmlv3.q4_0 | 4 bit     | Apple M1 Pro CPU   | 5.4 GB RAM     | 17.90              | 0.18          |\r\n| llama-2-7b-chat.ggmlv3.q4_0 | 4 bit     | Apple M2 CPU       | 5.4 GB RAM     | 13.70              | 0.13          |\r\n| llama-2-7b-chat.ggmlv3.q4_0 | 4 bit     | Apple M2 Metal     | 5.4 GB RAM     | 12.60              | 0.10          |\r\n| llama-2-7b-chat.ggmlv3.q2_K | 2 bit     | Intel i7-8700    
  | 4.5 GB RAM     | 7.88               | 31.90         |\r\n\r\nCheck/contribute the performance of your device in the full [performance doc](./docs/performance.md).\r\n\r\n## Download Llama-2 Models\r\n\r\nLlama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.\r\n\r\nLlama-2-7b-Chat-GPTQ is the GPTQ model files for [Meta's Llama 2 7b Chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). GPTQ 4-bit Llama-2 model require less GPU VRAM to run it.\r\n\r\n### Model List\r\n\r\n| Model Name                          | set MODEL_PATH in .env                   | Download URL                                                 |\r\n| ----------------------------------- | ---------------------------------------- | ------------------------------------------------------------ |\r\n| meta-llama/Llama-2-7b-chat-hf       | /path-to/Llama-2-7b-chat-hf              | [Link](https://huggingface.co/llamaste/Llama-2-7b-chat-hf)   |\r\n| meta-llama/Llama-2-13b-chat-hf      | /path-to/Llama-2-13b-chat-hf             | [Link](https://huggingface.co/llamaste/Llama-2-13b-chat-hf)  |\r\n| meta-llama/Llama-2-70b-chat-hf      | /path-to/Llama-2-70b-chat-hf             | [Link](https://huggingface.co/llamaste/Llama-2-70b-chat-hf)  |\r\n| meta-llama/Llama-2-7b-hf            | /path-to/Llama-2-7b-hf                   | [Link](https://huggingface.co/meta-llama/Llama-2-7b-hf)      |\r\n| meta-llama/Llama-2-13b-hf           | /path-to/Llama-2-13b-hf                  | [Link](https://huggingface.co/meta-llama/Llama-2-13b-hf)     |\r\n| meta-llama/Llama-2-70b-hf           | /path-to/Llama-2-70b-hf                  | [Link](https://huggingface.co/meta-llama/Llama-2-70b-hf)     |\r\n| TheBloke/Llama-2-7b-Chat-GPTQ       | /path-to/Llama-2-7b-Chat-GPTQ            | [Link](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ) |\r\n| TheBloke/Llama-2-7b-Chat-GGUF       | /path-to/llama-2-7b-chat.Q4_0.gguf       | 
[Link](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/blob/main/llama-2-7b-chat.Q4_0.gguf) |\r\n| TheBloke/Llama-2-7B-Chat-GGML       | /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin | [Link](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML) |\r\n| TheBloke/CodeLlama-7B-Instruct-GPTQ | TheBloke/CodeLlama-7B-Instruct-GPTQ      | [Link](https://huggingface.co/TheBloke/CodeLlama-7B-Instruct-GPTQ) |\r\n| ...                                 | ...                                      | ...                                                          |\r\n\r\nRunning 4-bit model `Llama-2-7b-Chat-GPTQ` needs GPU with 6GB VRAM. \r\n\r\nRunning 4-bit model `llama-2-7b-chat.ggmlv3.q4_0.bin` needs CPU with 6GB RAM. There is also a list of other 2, 3, 4, 5, 6, 8-bit GGML models that can be used from [TheBloke/Llama-2-7B-Chat-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML).\r\n\r\n### Download Script\r\n\r\nThese models can be downloaded through:\r\n\r\n```bash\r\npython -m llama2_wrapper.download --repo_id TheBloke/CodeLlama-7B-Python-GPTQ\r\n\r\npython -m llama2_wrapper.download --repo_id TheBloke/Llama-2-7b-Chat-GGUF --filename llama-2-7b-chat.Q4_0.gguf --save_dir ./models\r\n```\r\n\r\nOr use CMD like:\r\n\r\n```bash\r\n# Make sure you have git-lfs installed (https://git-lfs.com)\r\ngit lfs install\r\ngit clone git@hf.co:meta-llama/Llama-2-7b-chat-hf\r\n```\r\n\r\nTo download Llama 2 models, you need to request access from [https://ai.meta.com/llama/](https://ai.meta.com/llama/) and also enable access on repos like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main). 
Requests will be processed in hours.\r\n\r\nFor GPTQ models like [TheBloke/Llama-2-7b-Chat-GPTQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ) and GGML models like [TheBloke/Llama-2-7B-Chat-GGML](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML), you can download directly without requesting access.\r\n\r\n## Tips\r\n\r\n### Env Examples\r\n\r\nThere are some examples in the `./env_examples/` folder.\r\n\r\n| Model Setup                                            | Example .env                |\r\n| ------------------------------------------------------ | --------------------------- |\r\n| Llama-2-7b-chat-hf 8-bit (transformers backend)        | .env.7b_8bit_example        |\r\n| Llama-2-7b-Chat-GPTQ 4-bit (gptq transformers backend) | .env.7b_gptq_example        |\r\n| Llama-2-7B-Chat-GGML 4-bit (llama.cpp backend)         | .env.7b_ggmlv3_q4_0_example |\r\n| Llama-2-13b-chat-hf (transformers backend)             | .env.13b_example            |\r\n| ...                                                    | ...                         |\r\n\r\n### Run on Nvidia GPU\r\n\r\nRunning requires around 14 GB of GPU VRAM for Llama-2-7b and 28 GB of GPU VRAM for Llama-2-13b. \r\n\r\nIf you are running on multiple GPUs, the model is automatically loaded across them, splitting the VRAM usage. That allows you to run Llama-2-7b (requires 14 GB of GPU VRAM) on a setup like 2 GPUs (11 GB VRAM each).\r\n\r\n#### Run bitsandbytes 8 bit\r\n\r\nIf you do not have enough memory, set `LOAD_IN_8BIT` to `True` in `.env`. This can reduce memory usage by around half with slightly degraded model quality. 
It is compatible with the CPU, GPU, and Metal backend.\r\n\r\nLlama-2-7b with 8-bit compression can run on a single GPU with 8 GB of VRAM, like an Nvidia RTX 2080Ti, RTX 4080, T4, V100 (16GB).\r\n\r\n#### Run GPTQ 4 bit\r\n\r\nTo run a 4-bit Llama-2 model like `Llama-2-7b-Chat-GPTQ`, set `BACKEND_TYPE` to `gptq` in `.env`, as in the example `.env.7b_gptq_example`. \r\n\r\nMake sure you have downloaded the 4-bit model from `Llama-2-7b-Chat-GPTQ` and set the `MODEL_PATH` and arguments in the `.env` file.\r\n\r\n`Llama-2-7b-Chat-GPTQ` can run on a single GPU with 6 GB of VRAM.\r\n\r\nIf you encounter an issue like `NameError: name 'autogptq_cuda_256' is not defined`, please refer to [this discussion](https://huggingface.co/TheBloke/open-llama-13b-open-instruct-GPTQ/discussions/1):\r\n\u003e pip install https://github.com/PanQiWei/AutoGPTQ/releases/download/v0.3.0/auto_gptq-0.3.0+cu117-cp310-cp310-linux_x86_64.whl  \r\n\r\n### Run on CPU\r\n\r\nRunning a Llama-2 model on CPU requires the [llama.cpp](https://github.com/ggerganov/llama.cpp) dependency and the [llama.cpp Python Bindings](https://github.com/abetlen/llama-cpp-python), which are already installed. \r\n\r\n\r\nDownload GGML models like `llama-2-7b-chat.ggmlv3.q4_0.bin` following the [Download Llama-2 Models](#download-llama-2-models) section. 
The `llama-2-7b-chat.ggmlv3.q4_0.bin` model requires at least 6 GB of RAM to run on CPU.\r\n\r\nCopy a config such as `.env.7b_ggmlv3_q4_0_example` from `env_examples` to `.env`.\r\n\r\nRun the web UI with `python app.py`.\r\n\r\n#### Mac Metal Acceleration\r\n\r\nMac users can also enable Metal acceleration by installing these dependencies:\r\n\r\n```bash\r\npip uninstall llama-cpp-python -y\r\nCMAKE_ARGS=\"-DLLAMA_METAL=on\" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir\r\npip install 'llama-cpp-python[server]'\r\n```\r\n\r\nOr check the details:\r\n\r\n- [MacOS Install with Metal GPU](https://github.com/abetlen/llama-cpp-python/blob/main/docs/install/macos.md)\r\n\r\n#### AMD/Nvidia GPU Acceleration\r\n\r\nTo use an AMD or Nvidia GPU for acceleration, see:\r\n\r\n- [Installation with OpenBLAS / cuBLAS / CLBlast / Metal](https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal)\r\n\r\n## License\r\n\r\nMIT - see [MIT License](LICENSE)\r\n\r\nThe MIT license lets users adapt this project freely, including for proprietary purposes, provided the copyright and license notice are retained.\r\n\r\n## Contributing\r\n\r\nPlease read our [Contributing Guide](CONTRIBUTING.md) to learn about our development process.\r\n\r\n### All Contributors\r\n\r\n\u003ca href=\"https://github.com/liltom-eth/llama2-webui/graphs/contributors\"\u003e\r\n  \u003cimg src=\"https://contrib.rocks/image?repo=liltom-eth/llama2-webui\" /\u003e\r\n\u003c/a\u003e\r\n\r\n### Review\r\n\u003ca href='https://github.com/repo-reviews/repo-reviews.github.io/blob/main/create.md' target=\"_blank\"\u003e\u003cimg alt='Github' src='https://img.shields.io/badge/review-100000?style=flat\u0026logo=Github\u0026logoColor=white\u0026labelColor=888888\u0026color=555555'/\u003e\u003c/a\u003e\r\n\r\n### Star History\r\n\r\n[![Star History 
Chart](https://api.star-history.com/svg?repos=liltom-eth/llama2-webui\u0026type=Date)](https://star-history.com/#liltom-eth/llama2-webui\u0026Date)\r\n\r\n## Credits\r\n\r\n- https://huggingface.co/meta-llama/Llama-2-7b-chat-hf\r\n- https://huggingface.co/spaces/huggingface-projects/llama-2-7b-chat\r\n- https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ\r\n- [https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)\r\n- [https://github.com/TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes)\r\n- [https://github.com/PanQiWei/AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)\r\n- [https://github.com/abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliltom-eth%2Fllama2-webui","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fliltom-eth%2Fllama2-webui","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fliltom-eth%2Fllama2-webui/lists"}