# LlamaCpp Install Procedure in Windows



## Introduction
I was trying to install Llama.cpp with CUDA support on my system as an LLM inference server to run my multi-agent environment. I had already tried a few other options, but for various reasons they came a cropper:

1. [Ollama](https://ollama.com/): Easy to use, but the server is constrained in the types of roles you can use. They only allow the **system**, **user** and **assistant** roles. However, fine-tuned models like NousResearch's [Theta](https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B) and [Pro](https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B) are fine-tuned specifically for [function calling using a new role, **tool**.](https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B#prompt-format-for-function-calling) Hence, a lot of wrangling and manipulation of the user prompts was required to get the right output. This also increased my token usage. I couldn't get a solution for this even on their Discord server.
On a side note, though, if the functionality of Ollama is enough for you, it is a brilliant inference server and I can't stop recommending it.

2. [llama-cpp-python](https://pypi.org/project/llama-cpp-python/): Pretty brilliant again, but there were reports of it being slower than the bare-bones Llama.cpp. Since I am GPU-poor and wanted to maximize my inference speed, I decided to install [Llama.cpp](https://github.com/ggerganov/llama.cpp) on my Windows laptop.

Oh boy!



## Issues and attempts:
* Initially, I tried building Llama.cpp using [w64devkit](https://github.com/skeeto/w64devkit/releases) and [OpenBLAS for Windows](https://github.com/xianyi/OpenBLAS/releases). The CPU version worked, but CUDA did not.
* Visual Studio would not detect CUDA while making the executable. I traversed multiple discussions on NVIDIA and Visual Studio forums that complained of similar errors.
* I tried installing stand-alone versions of CMake and the Windows SDK.
* I even tried editing the Makefile as shown here, but to no avail. Honestly, I am not a C++ guy, so I had no idea what I was doing.


## Solution:
I finally found the key to my problem here. More specifically, in the screenshot below:



Basically, the only Community version of Visual Studio that was available for download from Microsoft was incompatible even with the latest version of CUDA (as of writing this post, the latest NVIDIA release is CUDA 12.5). Hence, all my errors were fundamentally derived from there. I also saw a lot of questions on forums and issues on GitHub repos about how various libraries just weren't working together. Hence, I wrote this post to explain in detail all the steps I took to ensure a smooth installation and running of the Llama.cpp server on Windows with CUDA.

## Steps (All the way from the basics):
To be fair, the [README file of Llama.cpp](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) is pretty well written and the steps are easy to follow. The problems are with getting CUDA and the Desktop development with C++ workload of VS to talk to each other.

### CUDA:
1. Download and install CUDA from here: [CUDA Toolkit 12.5 downloads](https://developer.nvidia.com/cuda-downloads). If you are worried about PyTorch compatibility, currently [CUDA 12.1](https://developer.nvidia.com/cuda-12-1-1-download-archive) is supported by PyTorch.
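Once the installer finishes, you can quickly confirm that the toolkit and driver are visible from a Command Prompt (a simple sanity check; the version numbers printed will depend on what you installed):

```
nvcc --version
nvidia-smi
```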

### VISUAL STUDIO 2019:
2. Download and install Visual C++ as follows:
* Download and install the Visual Studio 2019 software from [here](https://www.techspot.com/downloads/7241-visual-studio-2019.html). Unless you have a Professional or an Enterprise license, Microsoft does not give you access to Visual Studio 2019 versions, i.e. there is no official download of Visual Studio 2019 Community available.
* Run Visual Studio Installer from the Start Menu. This software is the gateway to download all the libraries that you need to work within Visual Studio.

*(screenshot: Visual Studio Installer in the Start Menu)*
* Once the application has opened, click on the Modify option:

*(screenshot: Visual Studio Installer launch screen)*
* Select the **Desktop development with C++** workload.

*(screenshot: Visual Studio Installer launch screen)*
* Make sure the following components are selected on the right side of your window:


* Click on the "Install while downloading" link:

*(screenshot: Visual Studio Installer launch screen)*
3. There are 4 files that will be present in **C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\extras\visual_studio_integration\MSBuildExtensions** (replace "v12.5" in the path with your CUDA version). The file names carry the CUDA version number; for CUDA 11.8, for example, they are:


  1. CUDA 11.8.props

  2. CUDA 11.8.targets

  3. CUDA 11.8.xml

  4. Nvda.Build.CudaTasks.v11.8.dll

Copy and paste all these files into the relevant Visual Studio directory: **C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations**
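If you prefer the command line over File Explorer, the copy can also be done from an elevated Command Prompt. This is just a sketch, assuming CUDA 12.5 and the Community edition paths used above; adjust both paths to your own versions:

```
copy "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5\extras\visual_studio_integration\MSBuildExtensions\*.*" "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\BuildCustomizations\"
```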
### ENVIRONMENT VARIABLES IN WINDOWS:
4. Set the **CMAKE_ARGS** environment variable (ensure your Windows account has administrative rights to perform these steps) as follows; a command-line alternative for steps 4-6 is sketched after step 6:
* Click on the Start icon on the bottom left and type: environment
* Click on "edit environment variables for your account
![screenshot of start menu](images/environment/access_environment_variables.png)
* In the system variables section in the pop up window, click on "New"
* Set the variable name as "CMAKE_ARGS" and the Variable value as **"-DLLAMA_CUBLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS"** as shown below and click "OK":


5. Set the **CUDA_PATH** variable in a similar way:
* Similarly, create a second system variable. Set the variable name as CUDA_PATH. The variable value should be the path to your CUDA installation, for example **C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5**.


6. Set the **LLAMA_CUDA** variable:
* Create a third system variable. Set the variable name as LLAMA_CUDA and its value to **"on"** as shown below and click "OK":
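As an alternative to clicking through the dialog in steps 4-6, the same three variables can be set from an elevated Command Prompt with `setx`. This is only a sketch: adjust the CUDA path to your installed version, and note that `setx` changes only take effect in newly opened windows.

```
setx CMAKE_ARGS "-DLLAMA_CUBLAS=on -DLLAMA_BLAS_VENDOR=OpenBLAS" /M
setx CUDA_PATH "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.5" /M
setx LLAMA_CUDA "on" /M
```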


7. Ensure that the PATH variable for CUDA is set correctly. On installation of CUDA in step 1, the CUDA directory should have been added to PATH.
* Go to the environment variables as explained in step 4.
* Scroll through the system variables until you see a system variable named *PATH* or *path*
* Select the variable and click on "Edit".
* Ensure the CUDA path is configured in the list of entries provided:
![](images/environment/env_path.gif)
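A quick way to confirm that CUDA is actually on PATH is to open a new Command Prompt and check where `nvcc` resolves from (assuming the toolkit's `bin` folder was added during installation), and that CUDA_PATH is set:

```
where nvcc
echo %CUDA_PATH%
```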
8. Once all the variables are configured, restart Windows.
### INSTALLATION OF LLAMA-CPP

9. Clone the Llama.cpp repo. You will need Python (version 3.8+ just to be safe), pip and git installed.
* Run the following command in your command prompt:

```
git clone https://github.com/ggerganov/llama.cpp.git
```
* Navigate to the location where the "llama.cpp" folder was cloned:

```
cd llama.cpp
```
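If either of these commands fails, it may be worth confirming that the prerequisites from step 9 (git, Python and pip) are actually installed and on your PATH; a quick check, assuming standard installs:

```
git --version
python --version
pip --version
```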

10. Build the executable for usage
1. The **Release** version
```
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release -j 8
```
2. The **Debug** version: For some reason, I was getting a few weird artifacts in the LLM responses when using the Release version. I avoided these by switching to the Debug version of the build. If you face the same issue, you can repeat step 9 and, instead of the Release build above, build the executable as follows (without `--config Release`, the Visual Studio generator defaults to a Debug build):
```
cmake -B build -DLLAMA_CUDA=ON
cmake --build build -j 8
```
##### NOTE: The "-j 8" is optional. 'j' defines the number of workers that work in parallel to build the executable. The more the faster, but it is still optional
11. If you plan to deploy Llama.cpp as a server, you can build it in the following way:
1. For the **Release** version:
```
cmake --build build --config Release -t llama-server
```
2. For the **Debug** version:
```
cmake --build build --config Debug -t llama-server
```
### RUN THE SERVER
12. Set up and run the server as per your build (**Release** or **Debug**):
1. **Release** version
```
<path to llama-server.exe> -m <path to GGUF model> -c <context size> --n-gpu-layers <number of layers to offload to GPU> --host <host IP> --port <port>
```
An example is as follows:
```
"llama.cpp\build\bin\Release\server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080
```
2. **Debug** version

The syntax is similar to the Release version. The only difference is the location of llama-server.exe:
```
"llama.cpp\build\bin\Debug\llama-server.exe" -m "D:\Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q4_K_M.gguf" -c 2048 --n-gpu-layers 33 --host 0.0.0.0 --port 8080`
```
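Once the server is running, you can sanity-check it from another Command Prompt before wiring up any client code. This assumes the server is listening on port 8080 as in the examples above and that `curl` is available (it ships with recent Windows 10/11 builds); recent llama.cpp server builds expose a `/health` endpoint and an OpenAI-compatible `/v1/models` endpoint:

```
curl http://127.0.0.1:8080/health
curl http://127.0.0.1:8080/v1/models
```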

### RUNNING INFERENCE ON THE MODEL HOSTED ON LLAMA.CPP
13. You will first need to install the OpenAI library. This is because the Llama.cpp server exposes an OpenAI-compatible API for running inference on local models:

```
pip install openai
```
14. Create a Python file, e.g. test.py, and enter the following:
```
import openai

# point the client at the llama.cpp server's OpenAI-compatible endpoint
client = openai.OpenAI(
    base_url="http://<server IP>:<port>/v1",
    api_key="sk-no-key-required"
)
```
An example of this would be:
```
client = openai.OpenAI(
    base_url="http://192.168.0.1:8080/v1",
    api_key="sk-no-key-required"
)
```
15. You can now use this "client" object to run your queries:
```
messages = [{"role": "user", "content": "Hello! Who are you?"}]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the model name is not used to pick a model; the server runs the GGUF it was started with
    messages=messages,
    stream=True
)
for chunk in response:  # stream=True yields chunks as they are generated
    print(chunk.choices[0].delta.content or "", end="")
```