https://github.com/gravitee-io/llamaj.cpp

A port of https://github.com/ggml-org/llama.cpp on the JVM using jextract
Host: GitHub
URL: https://github.com/gravitee-io/llamaj.cpp
Owner: gravitee-io
License: apache-2.0
Created: 2025-02-14T15:17:06.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2026-03-16T17:25:59.000Z (4 months ago)
Last Synced: 2026-03-17T04:21:20.336Z (4 months ago)
Topics: security-scan
Language: Java
Homepage:
Size: 6.44 MB
Stars: 5
Watchers: 8
Forks: 0
Open Issues: 5
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.adoc
- License: LICENSE.txt
- Security: SECURITY.md

# Llamaj.cpp

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/gravitee-io/llamaj.cpp/LICENSE.txt)

[![Releases](https://img.shields.io/badge/semantic--release-conventional%20commits-e10079?logo=semantic-release)](https://github.com/gravitee-io/llamaj.cpp/releases)

[![CircleCI](https://dl.circleci.com/status-badge/img/gh/gravitee-io/llamaj.cpp/tree/main.svg?style=svg)](https://dl.circleci.com/status-badge/redirect/gh/gravitee-io/llamaj.cpp/tree/main)

[![Community Forum](https://img.shields.io/badge/Gravitee-Community%20Forum-white?logo=githubdiscussion&logoColor=white)](https://community.gravitee.io?utm_source=readme)

**Llamaj.cpp** is a Java/JVM port of llama.cpp built with jextract. It enables local large language model (LLM) inference through the Foreign Function & Memory (FFM) API. It natively supports macOS M-series and Linux x86_64 with GPU acceleration; support for other platforms and hardware (Windows, ARM, CUDA, etc.) can be added through custom builds.

## Keywords

`llama.cpp` · `java` · `jvm` · `llm` · `large language models` · `inference` · `ai` · `native interop` · `foreign function & memory api` · `jextract`

## Requirements

- Java 25

- Maven (`mvn`)

- macOS M-series or Linux x86_64 (CPU); see the last section if your platform is not listed

## How to use

Include the dependency in your `pom.xml`:

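```xml
<dependencies>
    ...
    <dependency>
        <groupId>io.gravitee.llama.cpp</groupId>
        <artifactId>llamaj.cpp</artifactId>
        <version>x.x.x</version>
    </dependency>
</dependencies>
```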

> **Note:** All examples below use `LlamaVocab` to handle tokenization. It's obtained from a loaded `LlamaModel` and is essential for converting between tokens and text representations.

### Example 1: Basic Conversation

```java

import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;

public class BasicExample {

    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        LlamaRuntime.llama_backend_init();

        // Load model
        var modelParams = new LlamaModelParams(arena);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Create context
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(model, contextParams);

        // Set up tokenizer and sampler
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena)
            .temperature(0.7f)
            .topK(40)
            .topP(0.9f, 1)
            .seed(42);

        // Create conversation state
        var state = ConversationState.create(arena, context, tokenizer, sampler, 0)
            .setMaxTokens(100)
            .initialize("What is the capital of France?");

        // Generate response
        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            var output = iterator.next();
            System.out.print(output.text());
        }

        // Cleanup
        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}

```
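
The example frees native resources by hand at the end of `main`, so an exception thrown during generation would skip the cleanup. One possible guard, sketched below with plain JDK code only, is to wrap each `free()` call in an `AutoCloseable` and rely on try-with-resources, which closes resources in reverse declaration order even when the body throws. The `Resource` class and its names are hypothetical stand-ins and are not part of llamaj.cpp.

```java
import java.util.ArrayList;
import java.util.List;

final class CleanupOrderSketch {

    /** Hypothetical stand-in for any object that exposes a free() method. */
    static final class Resource implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Resource(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            // In real code, this is where a call such as context.free() would go.
            log.add("free " + name);
        }
    }

    /** Simulates a failing generation step and returns the order of release. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        try (Resource model = new Resource("model", log);
             Resource sampler = new Resource("sampler", log);
             Resource context = new Resource("context", log)) {
            throw new IllegalStateException("simulated failure during generation");
        } catch (IllegalStateException e) {
            // Resources are already closed by the time the catch block runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Declaring the model first and the context last reproduces the release order used in the example (context, then sampler, then model).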

### Example 2: Log Probabilities

Enable log-probability collection to inspect the model's confidence at each token position.

Set `topLogprobs` to the number of top-alternative tokens you want alongside the sampled one (0 = disabled, no overhead):

```java

import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;

public class LogprobsExample {

    public static void main(String[] args) {
        var arena = Arena.ofConfined();
        LlamaRuntime.llama_backend_init();

        var model = new LlamaModel(arena, Path.of("models/model.gguf"), new LlamaModelParams(arena));
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(arena, model, contextParams);
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        var state = ConversationState.create(arena, context, tokenizer, sampler)
            .setMaxTokens(50)
            .setTopLogprobs(5)   // return top-5 alternatives at every token position
            .initialize("What is the capital of France?");

        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            var output = iterator.next();
            System.out.print(output.text());

            Logprobs lp = output.logprobs();
            System.out.printf("%n  chosen: \"%s\"  logprob=%.4f%n",
                lp.chosenToken().token(), lp.chosenToken().logprob());
            lp.topLogprobs().forEach(t ->
                System.out.printf("    alt: \"%s\"  logprob=%.4f%n", t.token(), t.logprob()));
        }

        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}

```

Each `LlamaOutput` carries a `Logprobs` object with:

- `chosenToken()` — the token that was sampled, its text, vocabulary ID, log-probability, and raw UTF-8 bytes

- `topLogprobs()` — up to N alternatives sorted by descending log-probability; the chosen token is always included

When `topLogprobs` is `0` (the default), `output.logprobs()` is `null` and no logit processing is done.
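
Log-probabilities are easier to read once converted back to probabilities. The sketch below assumes the values are natural logarithms (the usual convention, but not stated in this README) and uses made-up numbers rather than real model output. `LogprobMath` is a hypothetical helper, not a llamaj.cpp class: `probability` applies `exp`, and `perplexity` is the exponential of the negative mean log-probability of the chosen tokens.

```java
import java.util.List;

final class LogprobMath {

    /** Converts a natural-log probability into a probability in [0, 1]. */
    static double probability(double logprob) {
        return Math.exp(logprob);
    }

    /** Perplexity of a sequence: exp of the negative mean log-probability. */
    static double perplexity(List<Double> chosenLogprobs) {
        if (chosenLogprobs.isEmpty()) {
            throw new IllegalArgumentException("at least one log-probability is required");
        }
        double sum = 0.0;
        for (double lp : chosenLogprobs) {
            sum += lp;
        }
        return Math.exp(-sum / chosenLogprobs.size());
    }

    public static void main(String[] args) {
        // Made-up values for illustration only.
        List<Double> logprobs = List.of(-0.05, -1.20, -0.30);
        for (double lp : logprobs) {
            System.out.printf("logprob=%.2f -> p=%.3f%n", lp, probability(lp));
        }
        System.out.printf("perplexity=%.3f%n", perplexity(logprobs));
    }
}
```

A perplexity close to 1 means the model was nearly certain of every chosen token; higher values indicate more uncertainty.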

### Example 3: Parallel Conversations

Process multiple conversations simultaneously in a single batch:

```java

import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;

public class ParallelExample {

    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        LlamaRuntime.llama_backend_init();

        // Load model
        var modelParams = new LlamaModelParams(arena);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Create context with multi-sequence support
        var contextParams = new LlamaContextParams(arena)
                .nCtx(2048)
                .nBatch(512)
                .nSeqMax(4);  // Support up to 4 parallel conversations
        var context = new LlamaContext(model, contextParams);

        // Set up shared tokenizer and sampler
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        // Create multiple conversation states with unique sequence IDs
        var state1 = ConversationState.create(arena, context, tokenizer, sampler, 0)
                .setMaxTokens(100).initialize("What is the capital of France?");
        var state2 = ConversationState.create(arena, context, tokenizer, sampler, 1)
                .setMaxTokens(100).initialize("What is the capital of England?");
        var state3 = ConversationState.create(arena, context, tokenizer, sampler, 2)
                .setMaxTokens(100).initialize("What is the capital of Poland?");

        // Create parallel iterator - prompts are auto-processed when states are added
        var parallel = new BatchIterator(arena, context, 512, 4)
                .addState(state1)
                .addState(state2)
                .addState(state3);

        // Generate tokens in parallel
        System.out.println("=== Parallel Generation ===");
        while (parallel.hasNext()) {
            // Each hasNext() generates tokens for all active conversations
            // Get all outputs from this batch (one per active conversation)
            var outputs = parallel.getOutputs();
            for (var output : outputs) {
                System.out.println("Seq " + output.sequenceId() + ": " + output.text());
            }
        }
        System.out.println();

        // Print results
        System.out.println("Conversation 1: " + state1.getAnswer());
        System.out.println("  Tokens: " + state1.getAnswerTokens());
        System.out.println("Conversation 2: " + state2.getAnswer());
        System.out.println("  Tokens: " + state2.getAnswerTokens());
        System.out.println("Conversation 3: " + state3.getAnswer());
        System.out.println("  Tokens: " + state3.getAnswerTokens());

        // Cleanup
        parallel.free();
        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}

```

### Example 4: Distributed Inference with RPC

Offload model weights and KV-cache to remote machines using the RPC backend.

When using `--rpc`, weights are loaded **exclusively** on the remote servers -- the local GPU is not used.

Start RPC server nodes first (see [containers/README.md](containers/README.md)):

```bash

# On the remote machine (or another terminal)

./scripts/start-rpc-server.sh

```

Then connect from Java:

```java

import io.gravitee.llama.cpp.*;
import io.gravitee.llama.cpp.nativelib.LlamaLibLoader;
import java.lang.foreign.Arena;
import java.nio.file.Path;

public class RpcExample {

    public static void main(String[] args) {
        var arena = Arena.ofConfined();

        // Initialize runtime
        String libPath = LlamaLibLoader.load();
        LlamaRuntime.llama_backend_init();

        // Register remote RPC servers -- returns their device handles
        var rpcDevices = BackendRegistry.addRpcServer(arena, "127.0.0.1:50052");

        // Print all discovered backends and devices
        BackendRegistry.printSummary();

        // Load model, restricting offloading to only the RPC devices
        var modelParams = new LlamaModelParams(arena)
            .devices(arena, rpcDevices)
            .nGpuLayers(999);
        var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);

        // Everything else works exactly the same as local inference
        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
        var context = new LlamaContext(model, contextParams);
        var vocab = new LlamaVocab(model);
        var tokenizer = new LlamaTokenizer(vocab, context);
        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);

        var state = ConversationState.create(arena, context, tokenizer, sampler, 0)
            .setMaxTokens(100)
            .initialize("What is the capital of France?");

        var iterator = new DefaultLlamaIterator(state);
        while (iterator.hasNext()) {
            System.out.print(iterator.next().text());
        }

        context.free();
        sampler.free();
        model.free();
        LlamaRuntime.llama_backend_free();
    }
}

```

Or from the CLI:

```bash

$ java --enable-preview --enable-native-access=ALL-UNNAMED \
  -jar llamaj.cpp-x.x.x.jar \
  --model models/model.gguf \
  --rpc 127.0.0.1:50052

```

Multiple RPC servers:

```bash

$ java --enable-preview --enable-native-access=ALL-UNNAMED \
  -jar llamaj.cpp-x.x.x.jar \
  --model models/model.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052

```
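
The `--rpc` value is a comma-separated list of `host:port` pairs. If you build that string in your own tooling, it can help to validate it before starting the JVM. The following is a minimal sketch of such a check; `RpcEndpoints` and `Endpoint` are hypothetical names and do not reflect how llamaj.cpp parses the option internally.

```java
import java.util.ArrayList;
import java.util.List;

final class RpcEndpoints {

    /** Hypothetical value type holding one host:port pair. */
    record Endpoint(String host, int port) {}

    /** Splits a comma-separated endpoint list and validates each entry. */
    static List<Endpoint> parse(String value) {
        List<Endpoint> endpoints = new ArrayList<>();
        for (String part : value.split(",")) {
            String entry = part.trim();
            if (entry.isEmpty()) {
                continue; // tolerate stray or trailing commas
            }
            int colon = entry.lastIndexOf(':');
            if (colon <= 0 || colon == entry.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got: " + entry);
            }
            // Integer.parseInt throws NumberFormatException for a non-numeric port.
            int port = Integer.parseInt(entry.substring(colon + 1));
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("Port out of range: " + port);
            }
            endpoints.add(new Endpoint(entry.substring(0, colon), port));
        }
        return endpoints;
    }

    public static void main(String[] args) {
        parse("192.168.1.10:50052,192.168.1.11:50052").forEach(System.out::println);
    }
}
```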

## Build

The build uses a **platform-specific Maven profile** to download the correct jextract tool and pre-built llama.cpp native libraries, generate the Java FFM bindings, format the code, apply license headers, and install the artifact to your local Maven repository.

**macOS (Apple Silicon):**

```bash

cd llamaj.cpp/

mvn prettier:write license:format clean generate-sources -Pmacosx-aarch64 install

```

**Linux (x86_64):**

```bash

cd llamaj.cpp/

mvn prettier:write license:format clean generate-sources -Plinux-x86_64 install

```

> On Linux, you also need to set the library path at runtime:
>
> ```bash
> export LD_LIBRARY_PATH="$HOME/.llama.cpp:$LD_LIBRARY_PATH"
> ```
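
If you are unsure which of the two profiles applies, the JVM's `os.name` and `os.arch` system properties usually tell you. The helper below is a hypothetical illustration, not part of the build: it maps those properties to the profile names used above and returns an empty result for anything else. The accepted `os.arch` spellings are the ones commonly reported by JDKs on these platforms.

```java
import java.util.Locale;
import java.util.Optional;

final class BuildProfile {

    /** Returns the matching Maven profile name, or empty if the platform is not covered. */
    static Optional<String> profileFor(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        boolean mac = os.contains("mac");
        boolean linux = os.contains("linux");
        boolean arm64 = arch.equals("aarch64") || arch.equals("arm64");
        boolean x64 = arch.equals("amd64") || arch.equals("x86_64");

        if (mac && arm64) {
            return Optional.of("macosx-aarch64");
        }
        if (linux && x64) {
            return Optional.of("linux-x86_64");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        String osArch = System.getProperty("os.arch");
        System.out.println(
            profileFor(osName, osArch)
                .map(p -> "Use -P" + p)
                .orElse("No bundled profile for " + osName + " / " + osArch));
    }
}
```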

## Run

```bash

$ mvn exec:java -Dexec.mainClass=io.gravitee.llama.cpp.Main \
    -Dexec.args="--model /path/to/model/model.gguf --system 'You are a helpful assistant. Answer question to the best of your ability'"

```

or

```bash

$ java --enable-preview -jar llamaj.cpp-x.x.x.jar \
  --model models/model.gguf \
  --system 'You are a helpful assistant. Answer question to the best of your ability'

```

On Linux, remember to make the native libraries visible by setting the environment variable below:

```bash

$ export LD_LIBRARY_PATH="$HOME/.llama.cpp:$LD_LIBRARY_PATH"

```

There are plenty of models on Hugging Face. We suggest [Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF).

### Usage

```

Usage: java -jar llamaj.cpp-x.x.x.jar --model <path> [options...]

Options:
--system         : System message (default: "You are a helpful AI assistant.")
--n_gpu_layers   : Number of GPU layers (default: 999)
--use_mlock      : Use mlock (default: true)
--use_mmap       : Use mmap (default: true)
--rpc            : Comma-separated RPC server endpoints for distributed inference
                   (e.g., "127.0.0.1:50052,192.168.1.11:50052")
                   When set, weights are offloaded exclusively to the remote servers
--temperature    : Sampler temperature (default: 0.4)
--min_p          : Sampler min_p (default: 0.1)
--min_p_window   : Sampler min_p_window (default: 40)
--top_k          : Sampler top_k (default: 10)
--top_p          : Sampler top_p (default: 0.2)
--top_p_window   : Sampler top_p_window (default: 10)
--seed           : Sampler seed (default: random)
--n_ctx          : Context size (default: 512)
--n_batch        : Batch size (default: 512)
--n_seq_max      : Max sequence length (default: 512)
--quota          : Iterator quota (default: 512)
--n_keep         : Tokens to keep when exceeding ctx size (default: 256)
--log_level      : Logging level (ERROR, WARN, INFO, DEBUG, default: ERROR)

```

## Use your own llama.cpp build

1. Clone the `llama.cpp` repository

> Make sure the jextract folder is at the same directory level as your repository

```bash

$ git clone https://github.com/ggml-org/llama.cpp

$ cd llama.cpp

```

2. Compile sources

> Make sure you have a gcc / g++ compiler installed

```bash

$ gcc --help

$ g++ --help

```

On Linux:

```bash

$ cmake -B build

$ cmake --build build --config Release -j $(nproc)  

```

On macOS:

```bash

$ cmake -B build

$ cmake --build build --config Release  -j $(sysctl -n hw.ncpu)

```

If you wish to build llama.cpp with a particular configuration (CUDA, OpenBLAS, AVX2, AVX512, ...), please refer to the [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).

3. Link sources

Set the environment variable `LLAMA_CPP_LIB_PATH=/path/to/llama.cpp/build/bin/` to load the shared library files (`.so` on Linux, `.dylib` on macOS) directly from that directory.

Alternatively, set `LLAMA_CPP_USE_TMP_LIB_PATH=true` to copy these files into a temporary folder. The shared libraries are then loaded from that temporary path.
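
Before launching, it can be useful to confirm that `LLAMA_CPP_LIB_PATH` points at a directory that actually contains shared libraries. The sketch below is a hypothetical pre-flight check written with JDK classes only. It does not reproduce the library's own loading logic, and it matches plain `.so` / `.dylib` suffixes, so versioned file names such as `*.so.0` would need an extra rule.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

final class LibPathCheck {

    /** Reads LLAMA_CPP_LIB_PATH from the given environment, if set and non-blank. */
    static Optional<Path> configuredDir(Map<String, String> env) {
        String value = env.get("LLAMA_CPP_LIB_PATH");
        return (value == null || value.isBlank()) ? Optional.empty() : Optional.of(Path.of(value));
    }

    /** Shared-library suffix for the given os.name value (.dylib on macOS, .so otherwise). */
    static String sharedLibrarySuffix(String osName) {
        return osName.toLowerCase(Locale.ROOT).contains("mac") ? ".dylib" : ".so";
    }

    public static void main(String[] args) throws IOException {
        Optional<Path> dir = configuredDir(System.getenv());
        if (dir.isEmpty()) {
            System.out.println("LLAMA_CPP_LIB_PATH is not set");
            return;
        }
        if (!Files.isDirectory(dir.get())) {
            System.out.println("Not a directory: " + dir.get());
            return;
        }
        String suffix = sharedLibrarySuffix(System.getProperty("os.name"));
        // List the shared libraries found directly inside the configured directory.
        try (Stream<Path> files = Files.list(dir.get())) {
            files.map(p -> p.getFileName().toString())
                 .filter(name -> name.endsWith(suffix))
                 .sorted()
                 .forEach(System.out::println);
        }
    }
}
```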

## Beyond Apple M-Series and Linux x86_64

To add support for other platforms (Windows, ARM, CUDA, etc.), follow this approach:

### 1. Build llama.cpp

Clone and build llama.cpp for your target platform:

```bash

git clone https://github.com/ggml-org/llama.cpp.git

cd llama.cpp

cmake -B build

cmake --build build --config Release

```

### 2. Generate FFM API Bindings with jextract

Download jextract for your platform from [OpenJDK early-access builds](https://download.java.net/java/early_access/jextract/25/2/), then generate the Java bindings:

```bash

# Example for Windows x86_64
jextract -t io.gravitee.llama.cpp.windows.x86_64 \
  --include-dir /path/to/llama.cpp/ggml/include \
  --include-dir /path/to/llama.cpp/include \
  --output src/main/java \
  --header-class-name llama_h \
  /path/to/llama.cpp/tools/mtmd/mtmd.h \
  /path/to/llama.cpp/tools/mtmd/mtmd-helper.h \
  /path/to/llama.cpp/include/llama.h \
  /path/to/llama.cpp/ggml/include/ggml-rpc.h

```

### 3. Post-process Generated Sources

Check the generated sources and apply any necessary fixes (e.g., visibility modifiers, fully-qualified method calls).

### 4. Build the Bindings JAR

Compile the generated sources and build a JAR using your own build system (Maven, Gradle, etc.).

### 5. Integrate into Your Classpath

Add the generated JAR to your project's classpath and ensure the native libraries from step 1 are available at runtime.