https://github.com/countzero/windows_llama.cpp

PowerShell automation to rebuild llama.cpp for a Windows environment.

cmake conda cuda llama-cpp openblas powershell windows

Host: GitHub
URL: https://github.com/countzero/windows_llama.cpp
Owner: countzero
Created: 2023-06-10T10:01:24.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2026-04-24T10:19:13.000Z (27 days ago)
Last Synced: 2026-04-24T11:33:48.536Z (27 days ago)
Topics: cmake, conda, cuda, llama-cpp, openblas, powershell, windows
Language: PowerShell
Homepage:
Size: 3.84 MB
Stars: 36
Watchers: 3
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

# Windows llama.cpp

A PowerShell automation to rebuild [llama.cpp](https://github.com/ggerganov/llama.cpp) for a Windows environment. It automates the following steps:

1. Fetching and extracting a specific release of [OpenBLAS](https://github.com/xianyi/OpenBLAS/releases)
2. Fetching the latest version of [llama.cpp](https://github.com/ggerganov/llama.cpp)
3. Fixing the OpenBLAS binding in the `CMakeLists.txt`
4. Rebuilding the binaries with CMake
5. Updating the Python dependencies
6. Automatically detecting the best BLAS acceleration

## BLAS support

This script currently supports `OpenBLAS` for CPU BLAS acceleration and `CUDA` for NVIDIA GPU BLAS acceleration.

## Installation

### 1. Install Prerequisites

Download and install the latest versions in the following order:

1. [Git](https://git-scm.com/download)
2. [Git Large File Storage](https://git-lfs.com)
3. [Miniconda](https://conda.io/projects/conda/en/stable/user-guide/install)
4. [Visual Studio 2022 - Community](https://aka.ms/vs/17/release/vs_community.exe)
5. [CUDA](https://developer.nvidia.com/cuda-downloads)
6. [CMake](https://cmake.org/download/)

> [!TIP]
> When installing Visual Studio 2022, it is sufficient to install only the `Build Tools for Visual Studio 2022` package. Also make sure that `Desktop development with C++` is enabled in the installer.
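
Before moving on, it can help to confirm that the command line tools from the list above resolve on your `PATH`. The following is a minimal sketch, not part of this repository: `Test-Prerequisite` is a hypothetical helper, and the command names (`git-lfs`, `nvcc` and so on) are assumptions about how each installer exposes its tool. Visual Studio is not checked, and `nvcc` only matters for CUDA builds.

```PowerShell
# Hypothetical helper: reports which prerequisite commands resolve on the PATH.
# The default command names are assumptions, adjust them to your setup.
function Test-Prerequisite {
    param(
        [string[]] $Command = @('git', 'git-lfs', 'conda', 'cmake', 'nvcc')
    )

    foreach ($name in $Command) {
        # With SilentlyContinue, Get-Command yields nothing when no match exists.
        $resolved = Get-Command -Name $name -ErrorAction SilentlyContinue |
            Select-Object -First 1

        $source = if ($resolved) { $resolved.Source } else { $null }

        [PSCustomObject]@{
            Command = $name
            Found   = [bool] $resolved
            Source  = $source
        }
    }
}

Test-Prerequisite | Format-Table -AutoSize
```

A row with `Found` set to `False` usually means the tool is not installed or the terminal was opened before the installer updated the `PATH`.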

### 2. Enable Hardware Accelerated GPU Scheduling (optional)

Execute the following in a PowerShell terminal with Administrator privileges to enable the [Hardware Accelerated GPU Scheduling](https://devblogs.microsoft.com/directx/hardware-accelerated-gpu-scheduling/) feature:

```PowerShell
New-ItemProperty `
-Path "HKLM:\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" `
-Name "HwSchMode" `
-Value "2" `
-PropertyType DWORD `
-Force
```

Then restart your computer to activate the feature.

### 3. Clone the repository from GitHub

Clone the repository to a convenient location on your machine via:

```PowerShell
git clone --recurse-submodules git@github.com:countzero/windows_llama.cpp.git
```

### 4. Create a new Conda environment

Create a new Conda environment for this project with a specific version of Python:

```PowerShell
conda create --name llama.cpp python=3.12
```

### 5. Initialize Conda for shell interaction

To make Conda available in your current shell, execute the following:

```PowerShell
conda init
```

> [!TIP]
> You can always revert this via `conda init --reverse`.

### 6. Execute the build script

To build llama.cpp binaries for a Windows environment with the best available BLAS acceleration, execute the script:

```PowerShell
./rebuild_llama.cpp.ps1
```

> [!TIP]
> If PowerShell is not configured to execute scripts, allow it by executing the following in an elevated PowerShell session: `Set-ExecutionPolicy RemoteSigned`

### 7. Download a large language model

Download a large language model (LLM) with weights in the GGUF format into the `./vendor/llama.cpp/models` directory. For example, you can download the [gemma-2-9b-it](https://huggingface.co/google/gemma-2-9b-it) model in a quantized GGUF format:

* https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-IQ4_XS.gguf
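
If you prefer to fetch the file from the terminal, the download can be sketched as follows. `Save-GgufModel` is a hypothetical helper and not part of this repository. It assumes the URL ends in the file name and skips the download when the file already exists.

```PowerShell
# Hypothetical helper: downloads a GGUF file into the models directory.
function Save-GgufModel {
    param(
        [Parameter(Mandatory)] [string] $Url,
        [string] $Directory = "./vendor/llama.cpp/models"
    )

    # Derive the file name from the last segment of the URL path.
    $fileName = [System.IO.Path]::GetFileName(([System.Uri] $Url).AbsolutePath)
    $target = Join-Path -Path $Directory -ChildPath $fileName

    # Skip the download when the file is already present.
    if (Test-Path -Path $target) {
        Write-Host "Already present: $target"
        return $target
    }

    New-Item -ItemType Directory -Path $Directory -Force | Out-Null
    Invoke-WebRequest -Uri $Url -OutFile $target -UseBasicParsing

    return $target
}
```

Calling `Save-GgufModel -Url "https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/resolve/main/gemma-2-9b-it-IQ4_XS.gguf"` would then place the file in the default models directory. The file is several gigabytes, so the download can take a while.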

> [!TIP]
> See the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) and [LMSYS Chatbot Arena Leaderboard](https://chat.lmsys.org/?leaderboard) for best in class open source LLMs.

## Usage

### Chat via server script

You can easily chat with a specific model by using the [.\examples\server.ps1](./examples/server.ps1) script:

```PowerShell
.\examples\server.ps1 -model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf"
```

> [!NOTE]
> The script will automatically start the llama.cpp server with an optimal configuration for your machine.

Execute the following to get detailed help on further options of the server script:

```PowerShell
Get-Help -Detailed .\examples\server.ps1
```

### Chat via CLI

You can now chat with the model:

```PowerShell
./vendor/llama.cpp/build/bin/Release/llama-cli `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33 `
--reverse-prompt '[[USER_NAME]]:' `
--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
--file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
--color `
--interactive
```

### Chat via web interface

You can start llama.cpp as a web server:

```PowerShell
./vendor/llama.cpp/build/bin/Release/llama-server `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33
```

Then access llama.cpp via the web interface at:

* http://127.0.0.1:8080/
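
Loading a model can take some time, so the page may not respond immediately. A minimal sketch for waiting until the server is ready is shown below. `Wait-LlamaServer` is a hypothetical helper, and it assumes the `/health` endpoint of `llama-server` answers with HTTP 200 once the model is loaded.

```PowerShell
# Hypothetical helper: polls the /health endpoint until it answers with
# HTTP 200 or the timeout expires. Returns $true when the server is ready.
function Wait-LlamaServer {
    param(
        [string] $BaseUrl = "http://127.0.0.1:8080",
        [int] $TimeoutSeconds = 120
    )

    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)

    while ((Get-Date) -lt $deadline) {
        try {
            $response = Invoke-WebRequest -Uri "$BaseUrl/health" -UseBasicParsing -TimeoutSec 5
            if ($response.StatusCode -eq 200) { return $true }
        }
        catch {
            # Connection refused or a non-success status, for example while
            # the model is still loading. Retry until the deadline.
        }

        Start-Sleep -Seconds 1
    }

    return $false
}
```

For example, `if (Wait-LlamaServer) { Start-Process "http://127.0.0.1:8080/" }` opens the page in the default browser once the server responds.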

### Serve multiple models via presets

You can run llama.cpp as a router that exposes several preconfigured models behind a single endpoint. Each model is defined as an INI section with `llama-server` flags as keys:

```PowerShell
./vendor/llama.cpp/build/bin/Release/llama-server `
--models-dir "D:\AI\LLM\gguf" `
--models-preset "./presets/models_24GB_VRAM.ini" `
--models-max 1
```

Clients select a model by its section header name in the OpenAI-compatible `"model"` request field. The router loads and unloads models on demand.

> [!NOTE]
> See [presets/README.md](./presets/README.md) for the INI format, shipped presets, and how to add your own.
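
To illustrate the model selection, here is a hedged sketch of a client request. `New-ChatRequestBody` and `Invoke-LlamaChat` are hypothetical helpers, the base URL assumes the default port `8080`, and the model name must match a section header from your own preset file.

```PowerShell
# Hypothetical helper: builds the JSON body for a single user message.
function New-ChatRequestBody {
    param(
        [Parameter(Mandatory)] [string] $Model,
        [Parameter(Mandatory)] [string] $Message
    )

    @{
        model    = $Model   # The section header name from the preset INI file.
        messages = @(@{ role = "user"; content = $Message })
    } | ConvertTo-Json -Depth 5
}

# Hypothetical helper: posts the body to the OpenAI-compatible chat endpoint
# and returns the text of the first choice.
function Invoke-LlamaChat {
    param(
        [Parameter(Mandatory)] [string] $Model,
        [Parameter(Mandatory)] [string] $Message,
        [string] $BaseUrl = "http://127.0.0.1:8080"
    )

    $json = New-ChatRequestBody -Model $Model -Message $Message

    # Send UTF-8 bytes so non-ASCII characters survive in Windows PowerShell 5.1.
    $response = Invoke-RestMethod `
        -Method Post `
        -Uri "$BaseUrl/v1/chat/completions" `
        -ContentType "application/json; charset=utf-8" `
        -Body ([System.Text.Encoding]::UTF8.GetBytes($json))

    $response.choices[0].message.content
}
```

A call such as `Invoke-LlamaChat -Model "<section header name>" -Message "Hello"` should then make the router load the matching model if it is not already in memory.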

### Increase the context size

You can increase the context size of a model with minimal quality loss by setting the RoPE parameters. The formula for the parameters is as follows:

```
context_scale = increased_context_size / original_context_size
rope_frequency_scale = 1 / context_scale
rope_frequency_base = 10000 * context_scale
```

> [!NOTE]
> For example, increasing the context size of an [openchat-3.6-8b-20240522](https://huggingface.co/openchat/openchat-3.6-8b-20240522) model from its original `8192` to `32768` gives a `context_scale` of `4.0`. The `rope_frequency_scale` is then `0.25` and the `rope_frequency_base` is `40000`.
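
The arithmetic above can be sketched as a small function. `Get-RopeParameter` is a hypothetical helper that only applies the formula from this section. It is not part of llama.cpp or of this repository.

```PowerShell
# Hypothetical helper: applies the formula above to two context sizes.
function Get-RopeParameter {
    param(
        [Parameter(Mandatory)] [int] $OriginalContextSize,
        [Parameter(Mandatory)] [int] $IncreasedContextSize,
        [double] $OriginalFrequencyBase = 10000
    )

    $contextScale = $IncreasedContextSize / $OriginalContextSize

    [PSCustomObject]@{
        ContextScale  = $contextScale                          # context_scale
        RopeFreqScale = 1 / $contextScale                      # --rope-freq-scale
        RopeFreqBase  = $OriginalFrequencyBase * $contextScale # --rope-freq-base
    }
}

# Yields ContextScale 4, RopeFreqScale 0.25 and RopeFreqBase 40000.
Get-RopeParameter -OriginalContextSize 8192 -IncreasedContextSize 32768
```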

To extend the context to 32k, execute the following:

```PowerShell
./vendor/llama.cpp/build/bin/Release/llama-cli `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 32768 `
--rope-freq-scale 0.25 `
--rope-freq-base 40000 `
--threads 16 `
--n-gpu-layers 33 `
--reverse-prompt '[[USER_NAME]]:' `
--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
--file "./vendor/llama.cpp/prompts/chat-with-vicuna-v1.txt" `
--color `
--interactive
```

### Enforce JSON response

You can enforce a specific grammar for the response generation. The following will always return a JSON response:

```PowerShell
./vendor/llama.cpp/build/bin/Release/llama-cli `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33 `
--prompt-cache "./cache/gemma-2-9b-it-IQ4_XS.gguf.prompt" `
--prompt "The scientific classification (Taxonomy) of a Llama: " `
--grammar-file "./vendor/llama.cpp/grammars/json.gbnf" `
--color
```
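
A grammar constrains which tokens can be sampled, but generation can still stop early, for example at a token limit, which leaves truncated JSON. The following minimal sketch checks a captured response before you use it. `Test-JsonText` is a hypothetical helper, and it assumes you have already isolated the generated text from the rest of the console output.

```PowerShell
# Hypothetical helper: returns $true when the text parses as JSON.
function Test-JsonText {
    param(
        [string] $Text
    )

    try {
        # ConvertFrom-Json throws on malformed input, such as truncated JSON.
        $null = $Text | ConvertFrom-Json -ErrorAction Stop
        return $true
    }
    catch {
        return $false
    }
}

Test-JsonText -Text '{"kingdom": "Animalia"}'   # complete JSON
Test-JsonText -Text '{"kingdom": "Animalia"'    # truncated JSON
```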

### Measure model perplexity

Execute the following to measure the perplexity of the GGUF formatted model:

```PowerShell
./vendor/llama.cpp/build/bin/Release/llama-perplexity `
--model "./vendor/llama.cpp/models/gemma-2-9b-it-IQ4_XS.gguf" `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 33 `
--file "./vendor/wikitext-2-raw-v1/wikitext-2-raw/wiki.test.raw"
```

### Count prompt tokens

You can easily count the tokens of a prompt for a specific model by using the [.\examples\count_tokens.ps1](./examples/count_tokens.ps1) script:

```PowerShell
.\examples\count_tokens.ps1 `
-model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
-file ".\prompts\chat_with_llm.txt"
```

To inspect the actual tokenization result you can use the `-debug` flag:

```PowerShell
.\examples\count_tokens.ps1 `
-model ".\vendor\llama.cpp\models\gemma-2-9b-it-IQ4_XS.gguf" `
-prompt "Hello World!" `
-debug
```

> [!NOTE]
> The script is a simple wrapper for the [tokenize.cpp](https://github.com/ggerganov/llama.cpp/blob/master/examples/tokenize/tokenize.cpp) example of the llama.cpp project.

Execute the following to get detailed help on further options of the count tokens script:

```PowerShell
Get-Help -Detailed .\examples\count_tokens.ps1
```

## Build

### Rebuild llama.cpp

Every time there is a new release of [llama.cpp](https://github.com/ggerganov/llama.cpp) you can simply execute the script to automatically rebuild everything:

| Command | Description |
| ----------------------------------------------------- | -------------------------------------------- |
| `./rebuild_llama.cpp.ps1` | Automatically detects best BLAS acceleration |
| `./rebuild_llama.cpp.ps1 -blasAccelerator "OFF"` | Without any BLAS acceleration |
| `./rebuild_llama.cpp.ps1 -blasAccelerator "OpenBLAS"` | With CPU BLAS acceleration |
| `./rebuild_llama.cpp.ps1 -blasAccelerator "CUDA"` | With NVIDIA GPU BLAS acceleration |

### Build a specific version of llama.cpp

You can build a specific version of llama.cpp by specifying a git tag, commit or pull request:

| Command | Description |
| ---------------------------------------------- | ------------------------ |
| `./rebuild_llama.cpp.ps1` | The latest release |
| `./rebuild_llama.cpp.ps1 -version "b1138"` | The tag `b1138` |
| `./rebuild_llama.cpp.ps1 -version "1d16309"` | The commit `1d16309` |
| `./rebuild_llama.cpp.ps1 -pullRequest "18675"` | The pull request `18675` |
