{"id":25373916,"url":"https://github.com/nikolasent/ollama-webui-intel","last_synced_at":"2025-04-09T09:15:45.178Z","repository":{"id":277464636,"uuid":"931267883","full_name":"NikolasEnt/ollama-webui-intel","owner":"NikolasEnt","description":"Ollama with intel (i)GPU acceleration in docker and benchmark","archived":false,"fork":false,"pushed_at":"2025-04-06T01:58:48.000Z","size":1598,"stargazers_count":8,"open_issues_count":2,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-06T02:33:56.998Z","etag":null,"topics":["benchmark","gpu-acceleration","intel","llm-inference","ollama"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NikolasEnt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-12T01:47:18.000Z","updated_at":"2025-04-06T01:58:51.000Z","dependencies_parsed_at":"2025-04-06T02:31:11.453Z","dependency_job_id":"b023ac6b-e883-4eb4-afaa-fcd067018713","html_url":"https://github.com/NikolasEnt/ollama-webui-intel","commit_stats":null,"previous_names":["nikolasent/ollama-webui-intel"],"tags_count":0,"template":false,"template_full_name":"NikolasEnt/AI-project-template","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikolasEnt%2Follama-webui-intel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikolasEnt%2Follama-webui-intel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NikolasEnt%2Follama-webui-intel/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/NikolasEnt%2Follama-webui-intel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NikolasEnt","download_url":"https://codeload.github.com/NikolasEnt/ollama-webui-intel/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248008626,"owners_count":21032556,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","gpu-acceleration","intel","llm-inference","ollama"],"created_at":"2025-02-15T03:19:48.214Z","updated_at":"2025-04-09T09:15:45.160Z","avatar_url":"https://github.com/NikolasEnt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ollama with Intel GPUs\n\nThis repository demonstrates running [Ollama](https://github.com/ollama/ollama) with [`ipex-llm`](https://github.com/intel/ipex-llm) as an accelerated backend, compatible with both Intel iGPUs and dedicated GPUs (such as Arc, Flex, and Max). The provided `docker-compose.yml` file includes a patched version of Ollama for Intel acceleration with the required parameters and settings, along with the [Open WebUI](https://docs.openwebui.com/) interface for convenience.\n\n![Benchmark results](/readme_imgs/title.png)\n\nRead more about the project and benchmark results in the blog post: [https://nikolasent.github.io/hardware/deeplearning/2025/02/09/iGPU-Benchmark-VLM.html](https://nikolasent.github.io/hardware/deeplearning/2025/02/09/iGPU-Benchmark-VLM.html).\n\n## Quick Start\n\nUsing Intel GPUs requires that you have Intel firmware installed. 
For example, on Debian-like systems:\n\n```bash\nsudo apt-get install firmware-misc-nonfree firmware-intel-graphics\nsudo update-initramfs -u -k all  # Required after kernel updates as well.\n```\n\n[Docker](https://docs.docker.com/engine/install/) and [Docker Compose](https://docs.docker.com/compose/install/) are also required.\n\n```bash\ndocker compose build\ndocker compose up -d\n```\n\nOpen [http://127.0.0.1:18080](http://127.0.0.1:18080) to access Open WebUI with the accelerated Ollama backend.\n\n_Tip:_ For performance monitoring, including GPU utilization and power usage, `intel_gpu_top` is a useful tool, provided as part of the `intel-gpu-tools` package:\n\n```bash\nsudo apt install intel-gpu-tools\nsudo intel_gpu_top\n```\n\nAlternatively, one may compile [btop++](https://github.com/aristocratos/btop) with Intel GPU support.\n\n## Parameters\n\nThe Docker environment is pre-configured to run on Intel iGPUs. Here are some parameters that may need adjustment:\n\nThe [Dockerfile](Dockerfile) environment variables:\n* Change the `DEVICE` variable if other hardware, such as a dedicated GPU, is used.\n* Customize `OLLAMA_NUM_GPU` if required to manage GPU offload.\n\nIn the [docker-compose.yml](docker-compose.yml) file:\n* Configure the volumes of services to set up where data and models will be stored. Prefer using disks with fast I/O.\n* Use a memory limit, such as `mem_limit: \"32G\"`, to cap the RAM used by the `ipex_ollama` service.\n\n## Advice on performance\n\n1. If using CPU inference, tuning the `num_thread` model parameter in Ollama for specific tasks (given the model and context length) may improve performance.\n2. Use the `cpuset` option in `docker-compose.yml` to pin the `ipex_ollama` service to specific CPU cores. For example, use `cpuset: \"0-3\"` to utilize the first four CPU cores (e.g., to use only performance cores). 
Determine the best-performing value empirically.\n\n## Benchmarks\n\nThe script [scripts/benchmark.py](scripts/benchmark.py) is a benchmarking tool that measures the generation speed (tokens/s) of any OpenAI-compatible API, for both large language models (LLMs) and vision-language models (VLMs). The Ultra 5 results below were measured on an Intel Core Ultra 5 125H (Meteor Lake) SoC with 64 GB of RAM.\n\nWith sufficient RAM, this SoC can handle relatively large models locally, making it a power-efficient solution for low-cost experiments with local models.\n\nFeel free to explore the benchmark code and adjust it as needed for your specific experimentation and setup. The provided code is configured to produce the results below, so ensure that the required models are pulled before running the benchmark script.\n\nThe benchmark script is designed to be a standalone script that can be executed from the host machine (not from inside the Docker environment). You can use this benchmark code to test any OpenAI-compatible API by adjusting `API_URI` and specifying the required model names.\n\n### Language models\n\n| Model              | Ultra 5 CPU tokens/s | Ultra 5 iGPU tokens/s | RTX 3090 tokens/s |\n|--------------------|----------------------|-----------------------|-------------------|\n| deepseek-r1:70b    | 1.12 ± 0.07          | 1.65 ± 0.08           | NA                |\n| llama3.3:70b       | 1.16 ± 0.01          | 1.58 ± 0.00           | NA                |\n| llama3.1:70b       | 1.17 ± 0.00          | 1.57 ± 0.00           | NA                |\n| llama3.1:8b        | 9.76 ± 0.18          | 12.69 ± 0.20          | 104.31 ± 2.06     |\n| qwen2.5:72b        | 1.11 ± 0.01          | 1.24 ± 0.00           | NA                |\n| qwen2.5:32b        | 2.46 ± 0.01          | 3.44 ± 0.02           | 31.91 ± 0.34      |\n| qwen2.5:7b         | 10.26 ± 0.18         | 13.06 ± 0.09          | 101.03 ± 1.01     |\n| qwq                | 2.29 ± 0.08          | 3.01 ± 0.04           | 
30.53 ± 0.75      |\n| mistral-small:24b  | 3.37 ± 0.03          | 4.87 ± 0.02           | 45.31 ± 0.25      |\n| phi4:14b           | 5.27 ± 0.08          | 7.11 ± 0.06           | 64.09 ± 0.95      |\n| phi3.5:3.8b        | 19.07 ± 0.86         | 19.60 ± 2.42          | 171.51 ± 1.15     |\n| llama3.2:3b        | 20.63 ± 0.44         | 23.20 ± 0.26          | 161.96 ± 3.01     |\n| smallthinker:3b    | 13.83 ± 0.63         | 14.66 ± 0.42          | 105.53 ± 1.84     |\n| smollm2:1.7b       | 27.41 ± 0.66         | 27.84 ± 0.65          | 209.49 ± 1.78     |\n| smollm2:360m       | 57.56 ± 2.63         | 35.13 ± 0.32          | 250.60 ± 8.13     |\n| starcoder2:3b      | 19.47 ± 1.51         | 22.30 ± 2.38          | 177.34 ± 3.42     |\n| qwen2.5-coder:1.5b | 27.19 ± 0.26         | 36.74 ± 0.23          | 170.02 ± 4.20     |\n| opencoder:1.5b     | 32.88 ± 1.60         | 17.67 ± 0.90          | 207.72 ± 3.92     |\n\n### VLMs\n\n| Model               | Ultra 5 iGPU tokens/s | RTX 3090 tokens/s |\n|---------------------|-----------------------|-------------------|\n| llama3.2-vision:90b | 0.92 ± 0.01           | NA                |\n| llama3.2-vision:11b | 5.73 ± 0.03           | 61.90 ± 0.20      |\n| minicpm-v:8b        | 14.94 ± 0.41          | 98.69 ± 0.18      |\n| llava-phi3:3.8b     | 18.93 ± 0.12          | 154.73 ± 1.62     |\n| moondream:1.8b      | 35.53 ± 1.48          | 280.98 ± 45.34    |\n\n## Links\n\n1. Intel docs on `ipex-llm`: [Run Ollama with IPEX-LLM on Intel GPU](https://ipex-llm-latest.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html).\n2. [`ipex-llm` repo](https://github.com/intel/ipex-llm/tree/main).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikolasent%2Follama-webui-intel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnikolasent%2Follama-webui-intel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnikolasent%2Follama-webui-intel/lists"}