{"id":19418987,"url":"https://github.com/playform/llama","last_synced_at":"2025-02-25T03:43:31.434Z","repository":{"id":227824183,"uuid":"754611546","full_name":"PlayForm/Llama","owner":"PlayForm","description":null,"archived":false,"fork":false,"pushed_at":"2024-09-03T22:42:54.000Z","size":16178,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"Current","last_synced_at":"2025-02-23T01:17:31.831Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://playform.cloud","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PlayForm.png","metadata":{"files":{"readme":"README-sycl.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-08T12:27:09.000Z","updated_at":"2024-08-07T03:33:47.000Z","dependencies_parsed_at":"2024-03-15T10:54:20.523Z","dependency_job_id":null,"html_url":"https://github.com/PlayForm/Llama","commit_stats":null,"previous_names":["playform/llama"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayForm%2FLlama","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayForm%2FLlama/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayForm%2FLlama/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PlayForm%2FLlama/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PlayForm","download_url":"https://codeload.github.com/PlayForm/Llama/tar.gz/refs/heads/Current","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240599180,"owners_count":19826959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-10T13:15:48.302Z","updated_at":"2025-02-25T03:43:31.404Z","avatar_url":"https://github.com/PlayForm.png","language":"C++","readme":"# llama.cpp for SYCL\n\n- [Background](#background)\n- [OS](#os)\n- [Intel GPU](#intel-gpu)\n- [Docker](#docker)\n- [Linux](#linux)\n- [Windows](#windows)\n- [Environment Variable](#environment-variable)\n- [Known Issue](#known-issue)\n- [Q\u0026A](#q\u0026a)\n- [Todo](#todo)\n\n## Background\n\nSYCL is a higher-level programming model to improve programming productivity on various hardware accelerators—such as CPUs, GPUs, and FPGAs. It is a single-source embedded domain-specific language based on pure C++17.\n\noneAPI is a specification that is open and standards-based, supporting multiple architecture types including but not limited to GPU, CPU, and FPGA. 
## OS

|OS|Status|Verified|
|-|-|-|
|Linux|Support|Ubuntu 22.04, Fedora Silverblue 39|
|Windows|Support|Windows 11|

## Intel GPU

### Verified

|Intel GPU|Status|Verified Model|
|-|-|-|
|Intel Data Center Max Series|Support|Max 1550|
|Intel Data Center Flex Series|Support|Flex 170|
|Intel Arc Series|Support|Arc 770, 730M|
|Intel built-in Arc GPU|Support|built-in Arc GPU in Meteor Lake|
|Intel iGPU|Support|iGPU in i5-1250P, i7-1260P, i7-1165G7|

Note: if the iGPU has fewer than 80 EUs (Execution Units), inference will be too slow for practical use.

### Memory

Memory is the main constraint when running an LLM on a GPU.

When llama.cpp runs, it logs how much GPU memory it allocates, e.g. `llm_load_tensors: buffer size = 3577.56 MiB`, so you can see how much memory your case needs.

For an iGPU, make sure enough shared host memory is available; for llama-2-7b.Q4_0, we recommend 8 GB+ of host memory. For a dGPU, make sure enough device memory is available; for llama-2-7b.Q4_0, we recommend 4 GB+ of device memory.

## Docker

Note:
- Only Docker on Linux has been tested; Docker on WSL may not work.
- You may need to install the Intel GPU driver on the host machine (see the [Linux](#linux) section for how).

### Build the image

You can choose between the **F16** and **F32** builds. F16 is faster for long-prompt inference.

```sh
# For F16:
#docker build -t llama-cpp-sycl --build-arg="LLAMA_SYCL_F16=ON" -f .devops/main-intel.Dockerfile .

# Or, for F32:
docker build -t llama-cpp-sycl -f .devops/main-intel.Dockerfile .

# Note: you can also use ".devops/main-server.Dockerfile", which builds the "server" example
```

### Run

```sh
# First, find all the DRI cards:
ls -la /dev/dri
# Then pick the card that you want to use.

# For example, with "/dev/dri/card1":
docker run -it --rm -v "$(pwd):/app:Z" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card1:/dev/dri/card1 llama-cpp-sycl -m "/app/models/YOUR_MODEL_FILE" -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```

## Linux

### Setup Environment

1. Install the Intel GPU driver.

a. Install the Intel GPU driver following the official guide: [Install GPU Drivers](https://dgpu-docs.intel.com/driver/installation.html).

Note: for an iGPU, install the client GPU driver.

b. Add your user to the video and render groups:

```sh
sudo usermod -aG render username
sudo usermod -aG video username
```

Note: log out and back in for this to take effect.

c. Check:

```sh
sudo apt install clinfo
sudo clinfo -l
```

Output (example):

```
Platform #0: Intel(R) OpenCL Graphics
 `-- Device #0: Intel(R) Arc(TM) A770 Graphics

Platform #0: Intel(R) OpenCL HD Graphics
 `-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
```
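As an optional cross-check beyond clinfo, once the oneAPI toolkit from step 2 below is installed, you can compile a small one-file probe, sketched here (not shipped with this repository), that asks the SYCL runtime directly for a GPU:

```cpp
// probe.cpp - ask the SYCL runtime for a GPU and print what it found.
// Illustrative sketch; build after step 2 with: icpx -fsycl probe.cpp -o probe
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    try {
        // Throws if no SYCL-visible GPU exists.
        sycl::queue q{sycl::gpu_selector_v};
        auto dev = q.get_device();
        std::cout << "GPU: " << dev.get_info<sycl::info::device::name>()
                  << " (driver "
                  << dev.get_info<sycl::info::device::driver_version>()
                  << ")\n";
    } catch (const sycl::exception &e) {
        std::cout << "No SYCL GPU visible: " << e.what() << "\n";
        return 1;
    }
    return 0;
}
```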
2. Install the Intel® oneAPI Base Toolkit.

a. Follow the procedure in [Get the Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).

We recommend installing to the default folder, **/opt/intel/oneapi**. The following guide assumes the default folder; if you use another folder, adjust the paths below accordingly.

b. Check:

```sh
source /opt/intel/oneapi/setvars.sh

sycl-ls
```

There should be one or more level-zero devices. Confirm that at least one GPU is present, e.g. **[ext_oneapi_level_zero:gpu:0]**.

Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700K OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.30.26918.50]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26918]
```

3. Build locally.

Note:
- You can choose between the **F16** and **F32** builds. F16 is faster for long-prompt inference.
- By default, all binaries are built, which takes longer. To reduce build time, we recommend building **example/main** only.

```sh
mkdir -p build
cd build
source /opt/intel/oneapi/setvars.sh

# For FP16:
#cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DLLAMA_SYCL_F16=ON

# Or, for FP32:
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx

# Build example/main only
#cmake --build . --config Release --target main

# Or, build all binaries
cmake --build . --config Release -v

cd ..
```

or

```sh
./examples/sycl/build.sh
```

### Run

1. Put the model file in the **models** folder.

You can download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) as an example.

2. Enable the oneAPI runtime environment:

```
source /opt/intel/oneapi/setvars.sh
```

3. List the device IDs.

Run without parameters:

```sh
./build/bin/ls-sycl-device

# or run the "main" executable and look at the output log:

./build/bin/main
```

Check the IDs in the startup log, like:

```
found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,  compute capability 1.3,
    max compute_units 512,  max work group size 1024,  max sub group size 32,  global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,  compute capability 1.2,
    max compute_units 24,  max work group size 67108864,  max sub group size 64,  global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,  compute capability 3.0,
    max compute_units 24,  max work group size 8192,  max sub group size 64,  global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,  compute capability 3.0,
    max compute_units 512,  max work group size 1024,  max sub group size 32,  global mem size 16225243136
```

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than level-zero in most cases|
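If you are curious what `ls-sycl-device` is reporting, a sketch along the following lines (illustrative only; the exact ordering and fields may differ from llama.cpp's own log) enumerates the devices the SYCL runtime exposes:

```cpp
// list_devices.cpp - enumerate all SYCL devices, similar in spirit to
// ls-sycl-device. Illustrative sketch; build with: icpx -fsycl list_devices.cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // All devices from all backends (level-zero, OpenCL, ...).
    auto devices = sycl::device::get_devices();
    int id = 0;
    for (const auto &dev : devices) {
        std::cout << "Device " << id++ << ": "
                  << dev.get_info<sycl::info::device::name>()
                  << ", max compute_units "
                  << dev.get_info<sycl::info::device::max_compute_units>()
                  << ", global mem size "
                  << dev.get_info<sycl::info::device::global_mem_size>()
                  << "\n";
    }
    return 0;
}
```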
4. Set the device ID and run llama.cpp.

Set device ID = 0 with **GGML_SYCL_DEVICE=0**:

```sh
GGML_SYCL_DEVICE=0 ./build/bin/main -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33
```
or run the script:

```sh
./examples/sycl/run_llama2.sh
```

Note:

- By default, mmap is used to read the model file. On some systems this leads to a hang; add the **--no-mmap** parameter to disable mmap() and avoid the issue.

5. Check the device ID in the output, like:

```
Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
```
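Most users drive inference through the `main` binary as shown above, but the same GPU offload is reachable programmatically through llama.cpp's C API in `llama.h`. The outline below is a rough sketch only: the C API changes between llama.cpp versions, so verify every name against the `llama.h` in your checkout. The `n_gpu_layers` field corresponds to the `-ngl 33` flag used above.

```cpp
// offload_sketch.cpp - rough outline of loading a model with GPU offload
// via llama.h. The llama.cpp C API evolves between versions; verify all
// names and signatures against the llama.h in your checkout.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();  // some versions take a NUMA flag here

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 33;  // same effect as "-ngl 33" on the command line

    llama_model *model =
        llama_load_model_from_file("models/llama-2-7b.Q4_0.gguf", mparams);
    if (model == nullptr) {
        std::fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // ... create a context with llama_new_context_with_model() and run
    // inference; omitted to keep the sketch minimal ...

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```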
## Windows

### Setup Environment

1. Install the Intel GPU driver.

Install the Intel GPU driver following the official guide: [Install GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).

Note: **the driver is mandatory for the compute function**.

2. Install Visual Studio.

Install [Visual Studio](https://visualstudio.microsoft.com/); it is required for enabling the oneAPI environment on Windows.

3. Install the Intel® oneAPI Base Toolkit.

a. Follow the procedure in [Get the Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html).

We recommend installing to the default folder, **C:\Program Files (x86)\Intel\oneAPI**. The following guide assumes the default folder; if you use another folder, adjust the paths below accordingly.

b. Enable the oneAPI runtime environment:

- In Search, type 'oneAPI'.

Open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022".

- In that command prompt, run:

```
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
```

c. Check the GPU.

In the oneAPI command prompt:

```
sycl-ls
```

There should be one or more level-zero devices. Confirm that at least one GPU is present, e.g. **[ext_oneapi_level_zero:gpu:0]**.

Output (example):
```
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.10.0.17_160000]
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]
```

4. Install cmake & make.

a. Download & install CMake for Windows: https://cmake.org/download/

b. Download & install mingw-w64 make for Windows, provided by w64devkit:

- Download the latest Fortran version of [w64devkit](https://github.com/skeeto/w64devkit/releases).

- Extract `w64devkit` on your PC.

- Add the **bin** folder path to the Windows system PATH environment variable, e.g. `C:\xxx\w64devkit\bin\`.

### Build locally:

In the oneAPI command prompt window:

```
mkdir -p build
cd build
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

::  for FP16
::  faster for long-prompt inference
::  cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON

::  for FP32
cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release

::  build example/main only
::  make main

::  build all binaries
make -j
cd ..
```

or

```
.\examples\sycl\win-build-sycl.bat
```

Note:

- By default, all binaries are built, which takes longer. To reduce build time, we recommend building **example/main** only.

### Run

1. Put the model file in the **models** folder.

You can download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) as an example.

2. Enable the oneAPI runtime environment:

- In Search, type 'oneAPI'.

Open "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022".

- In that command prompt, run:

```
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
```

3. List the device IDs.

Run without parameters:

```
build\bin\ls-sycl-device.exe

or

build\bin\main.exe
```

Check the IDs in the startup log, like:

```
found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A770 Graphics,  compute capability 1.3,
    max compute_units 512,  max work group size 1024,  max sub group size 32,  global mem size 16225243136
  Device 1: Intel(R) FPGA Emulation Device,  compute capability 1.2,
    max compute_units 24,  max work group size 67108864,  max sub group size 64,  global mem size 67065057280
  Device 2: 13th Gen Intel(R) Core(TM) i7-13700K,  compute capability 3.0,
    max compute_units 24,  max work group size 8192,  max sub group size 64,  global mem size 67065057280
  Device 3: Intel(R) Arc(TM) A770 Graphics,  compute capability 3.0,
    max compute_units 512,  max work group size 1024,  max sub group size 32,  global mem size 16225243136
```

|Attribute|Note|
|-|-|
|compute capability 1.3|Level-zero runtime, recommended|
|compute capability 3.0|OpenCL runtime, slower than level-zero in most cases|

4. Set the device ID and run llama.cpp.

Set device ID = 0 with **set GGML_SYCL_DEVICE=0**:

```
set GGML_SYCL_DEVICE=0
build\bin\main.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
```
or run the script:

```
.\examples\sycl\win-run-llama2.bat
```

Note:

- By default, mmap is used to read the model file. On some systems this leads to a hang; add the **--no-mmap** parameter to disable mmap() and avoid the issue.

5. Check the device ID in the output, like:

```
Using device **0** (Intel(R) Arc(TM) A770 Graphics) as main device
```

## Environment Variable

#### Build

|Name|Value|Function|
|-|-|-|
|LLAMA_SYCL|ON (mandatory)|Enable the SYCL code path.<br>Mandatory for both FP32 and FP16.|
|LLAMA_SYCL_F16|ON (optional)|Enable the FP16 build for the SYCL code path; faster for long-prompt inference.<br>Leave unset for FP32.|
|CMAKE_C_COMPILER|icx|Use the icx compiler for the SYCL code path|
|CMAKE_CXX_COMPILER|icpx (Linux), icx (Windows)|Use icpx/icx for the SYCL code path|

#### Running

|Name|Value|Function|
|-|-|-|
|GGML_SYCL_DEVICE|0 (default) or 1|Set the device ID to use. Check the device IDs in the default startup output|
|GGML_SYCL_DEBUG|0 (default) or 1|Enable debug logging via the GGML_SYCL_DEBUG macro|
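For illustration only, here is a hypothetical sketch of the pattern behind a runtime variable like GGML_SYCL_DEVICE, i.e. reading an index from the environment and selecting that device. This is not the actual ggml-sycl implementation, just the general shape of the mechanism:

```cpp
// env_device.cpp - hypothetical sketch of mapping GGML_SYCL_DEVICE to a
// SYCL device. Not the actual ggml-sycl code; illustrates the pattern only.
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <iostream>

int main() {
    const char *env = std::getenv("GGML_SYCL_DEVICE");
    size_t id = env ? std::strtoul(env, nullptr, 10) : 0;  // default: device 0

    auto devices = sycl::device::get_devices();
    if (id >= devices.size()) {
        std::cerr << "GGML_SYCL_DEVICE=" << id << " is out of range\n";
        return 1;
    }

    sycl::queue q{devices[id]};
    std::cout << "Using device " << id << " ("
              << q.get_device().get_info<sycl::info::device::name>()
              << ") as main device\n";
    return 0;
}
```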
## Known Issue

- Hang during startup

  llama.cpp uses mmap by default to read the model file and copy it to the GPU. On some systems, the memcpy misbehaves and blocks.

  Solution: add **--no-mmap** or **--mmap 0**.

## Q&A

- Error: `error while loading shared libraries: libsycl.so.7: cannot open shared object file: No such file or directory`.

  The oneAPI runtime environment is not enabled.

  Install the oneAPI Base Toolkit and enable it with `source /opt/intel/oneapi/setvars.sh`.

- On Windows: no output, but no error either.

  The oneAPI runtime environment is not enabled.

- Compile errors.

  Remove the **build** folder and try again.

- I can **not** see **[ext_oneapi_level_zero:gpu:0]** after installing the GPU driver on Linux.

  Run **sudo sycl-ls**.

  If the device appears in that output, add the video/render groups to your user:

  ```
  sudo usermod -aG render username
  sudo usermod -aG video username
  ```

  Then **log out and back in**.

  If the device still does not appear, check the GPU driver installation steps again.

## Todo

- Support multiple cards.