{"id":44736959,"url":"https://github.com/gravitee-io/llamaj.cpp","last_synced_at":"2026-04-03T16:01:40.390Z","repository":{"id":300014749,"uuid":"932815851","full_name":"gravitee-io/llamaj.cpp","owner":"gravitee-io","description":"A port of https://github.com/ggml-org/llama.cpp on the JVM using jextract","archived":false,"fork":false,"pushed_at":"2026-03-16T17:25:59.000Z","size":6753,"stargazers_count":5,"open_issues_count":5,"forks_count":0,"subscribers_count":8,"default_branch":"main","last_synced_at":"2026-03-17T04:21:20.336Z","etag":null,"topics":["security-scan"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gravitee-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.adoc","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-02-14T15:17:06.000Z","updated_at":"2026-03-16T17:06:24.000Z","dependencies_parsed_at":"2025-07-04T13:53:26.180Z","dependency_job_id":"5e172e19-ec87-4914-9380-75b735a1310d","html_url":"https://github.com/gravitee-io/llamaj.cpp","commit_stats":null,"previous_names":["gravitee-io/llamaj.cpp"],"tags_count":62,"template":false,"template_full_name":null,"purl":"pkg:github/gravitee-io/llamaj.cpp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gravitee-io%2Fllamaj.cpp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gravitee-io%2Fllamaj.cpp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gra
vitee-io%2Fllamaj.cpp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gravitee-io%2Fllamaj.cpp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gravitee-io","download_url":"https://codeload.github.com/gravitee-io/llamaj.cpp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gravitee-io%2Fllamaj.cpp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31172543,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-29T21:28:10.185Z","status":"online","status_checked_at":"2026-03-30T02:00:06.831Z","response_time":138,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["security-scan"],"created_at":"2026-02-15T20:04:48.306Z","updated_at":"2026-04-03T16:01:40.284Z","avatar_url":"https://github.com/gravitee-io.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"#  
Llamaj.cpp\n\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://github.com/gravitee-io/llamaj.cpp/blob/main/LICENSE.txt)\n[![Releases](https://img.shields.io/badge/semantic--release-conventional%20commits-e10079?logo=semantic-release)](https://github.com/gravitee-io/llamaj.cpp/releases)\n[![CircleCI](https://dl.circleci.com/status-badge/img/gh/gravitee-io/llamaj.cpp/tree/main.svg?style=svg)](https://dl.circleci.com/status-badge/redirect/gh/gravitee-io/llamaj.cpp/tree/main)\n[![Community Forum](https://img.shields.io/badge/Gravitee-Community%20Forum-white?logo=githubdiscussion\u0026logoColor=white)](https://community.gravitee.io?utm_source=readme)\n\n**Llamaj.cpp** is a Java and JVM port of llama.cpp built with jextract. It enables local large language model (LLM) inference through the Foreign Function \u0026 Memory API. It natively supports macOS M-series and Linux x86_64 with GPU acceleration. Platform and hardware support (Windows, ARM, CUDA, etc.) can be extended through custom builds.\n\n## Keywords\n\n`llama.cpp` · `java` · `jvm` · `llm` · `large language models` · `inference` · `ai` · `native interop` · `foreign function \u0026 memory api` · `jextract`\n\n## Requirements\n\n- Java 25\n- Maven (`mvn`)\n- macOS M-series / Linux x86_64 (CPU) (see the last section if your platform is not listed)\n\n## How to use\n\nInclude the dependency in your `pom.xml`:\n```xml\n    \u003cdependencies\u003e\n        ...\n        \u003cdependency\u003e\n            \u003cgroupId\u003eio.gravitee.llama.cpp\u003c/groupId\u003e\n            \u003cartifactId\u003ellamaj.cpp\u003c/artifactId\u003e\n            \u003cversion\u003ex.x.x\u003c/version\u003e\n        \u003c/dependency\u003e\n    \u003c/dependencies\u003e\n```\n\n\u003e **Note:** All examples below use `LlamaVocab` to handle tokenization. 
It's obtained from a loaded `LlamaModel` and is essential for converting between tokens and text representations.\n\n### Example 1: Basic Conversation\n\n```java\nimport io.gravitee.llama.cpp.*;\nimport java.lang.foreign.Arena;\nimport java.nio.file.Path;\n\npublic class BasicExample {\n    public static void main(String[] args) {\n        var arena = Arena.ofConfined();\n\n        // Initialize runtime\n        LlamaRuntime.llama_backend_init();\n\n        // Load model\n        var modelParams = new LlamaModelParams(arena);\n        var model = new LlamaModel(arena, Path.of(\"models/model.gguf\"), modelParams);\n\n        // Create context\n        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);\n        var context = new LlamaContext(model, contextParams);\n\n        // Set up tokenizer and sampler\n        var vocab = new LlamaVocab(model);\n        var tokenizer = new LlamaTokenizer(vocab, context);\n        var sampler = new LlamaSampler(arena)\n            .temperature(0.7f)\n            .topK(40)\n            .topP(0.9f, 1)\n            .seed(42);\n\n        // Create conversation state\n        var state = ConversationState.create(arena, context, tokenizer, sampler, 0)\n            .setMaxTokens(100)\n            .initialize(\"What is the capital of France?\");\n\n        // Generate response\n        var iterator = new DefaultLlamaIterator(state);\n        while (iterator.hasNext()) {\n            var output = iterator.next();\n            System.out.print(output.text());\n        }\n\n        // Cleanup\n        context.free();\n        sampler.free();\n        model.free();\n        LlamaRuntime.llama_backend_free();\n    }\n}\n```\n\n### Example 2: Log Probabilities\n\nEnable log-probability collection to inspect the model's confidence at each token position.\nSet `topLogprobs` to the number of top-alternative tokens you want alongside the sampled one (0 = disabled, no overhead):\n\n```java\nimport 
io.gravitee.llama.cpp.*;\nimport java.lang.foreign.Arena;\nimport java.nio.file.Path;\n\npublic class LogprobsExample {\n    public static void main(String[] args) {\n        var arena = Arena.ofConfined();\n        LlamaRuntime.llama_backend_init();\n\n        var model = new LlamaModel(arena, Path.of(\"models/model.gguf\"), new LlamaModelParams(arena));\n        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);\n        var context = new LlamaContext(arena, model, contextParams);\n        var vocab = new LlamaVocab(model);\n        var tokenizer = new LlamaTokenizer(vocab, context);\n        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);\n\n        var state = ConversationState.create(arena, context, tokenizer, sampler)\n            .setMaxTokens(50)\n            .setTopLogprobs(5)   // return top-5 alternatives at every token position\n            .initialize(\"What is the capital of France?\");\n\n        var iterator = new DefaultLlamaIterator(state);\n        while (iterator.hasNext()) {\n            var output = iterator.next();\n            System.out.print(output.text());\n\n            Logprobs lp = output.logprobs();\n            System.out.printf(\"%n  chosen: \\\"%s\\\"  logprob=%.4f%n\",\n                lp.chosenToken().token(), lp.chosenToken().logprob());\n            lp.topLogprobs().forEach(t -\u003e\n                System.out.printf(\"    alt: \\\"%s\\\"  logprob=%.4f%n\", t.token(), t.logprob()));\n        }\n\n        context.free();\n        sampler.free();\n        model.free();\n        LlamaRuntime.llama_backend_free();\n    }\n}\n```\n\nEach `LlamaOutput` carries a `Logprobs` object with:\n- `chosenToken()` — the token that was sampled, its text, vocabulary ID, log-probability, and raw UTF-8 bytes\n- `topLogprobs()` — up to N alternatives sorted by descending log-probability; the chosen token is always included\n\nWhen `topLogprobs` is `0` (the default), `output.logprobs()` is `null` and no 
logit processing is done.\n\n### Example 3: Parallel Conversations\n\nProcess multiple conversations simultaneously in a single batch:\n\n```java\nimport io.gravitee.llama.cpp.*;\n\nimport java.lang.foreign.Arena;\nimport java.nio.file.Path;\n\npublic class ParallelExample {\n    public static void main(String[] args) {\n        var arena = Arena.ofConfined();\n\n        // Initialize runtime\n        LlamaRuntime.llama_backend_init();\n\n        // Load model\n        var modelParams = new LlamaModelParams(arena);\n        var model = new LlamaModel(arena, Path.of(\"models/model.gguf\"), modelParams);\n\n        // Create context with multi-sequence support\n        var contextParams = new LlamaContextParams(arena)\n                .nCtx(2048)\n                .nBatch(512)\n                .nSeqMax(4);  // Support up to 4 parallel conversations\n        var context = new LlamaContext(model, contextParams);\n\n        // Set up shared tokenizer and sampler\n        var vocab = new LlamaVocab(model);\n        var tokenizer = new LlamaTokenizer(vocab, context);\n        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);\n\n        // Create multiple conversation states with unique sequence IDs\n        var state1 = ConversationState.create(arena, context, tokenizer, sampler, 0)\n                .setMaxTokens(100).initialize(\"What is the capital of France?\");\n        var state2 = ConversationState.create(arena, context, tokenizer, sampler, 1)\n                .setMaxTokens(100).initialize(\"What is the capital of England?\");\n        var state3 = ConversationState.create(arena, context, tokenizer, sampler, 2)\n                .setMaxTokens(100).initialize(\"What is the capital of Poland?\");\n\n        // Create parallel iterator - prompts are auto-processed when states are added\n        var parallel = new BatchIterator(arena, context, 512, 4)\n                .addState(state1)\n                .addState(state2)\n                
.addState(state3);\n\n        // Generate tokens in parallel\n        System.out.println(\"=== Parallel Generation ===\");\n        while (parallel.hasNext()) {\n            // Each hasNext() generates tokens for all active conversations\n            // Get all outputs from this batch (one per active conversation)\n            var outputs = parallel.getOutputs();\n            for (var output : outputs) {\n                System.out.println(\"Seq \" + output.sequenceId() + \": \" + output.text());\n            }\n        }\n        System.out.println();\n\n        // Print results\n        System.out.println(\"Conversation 1: \" + state1.getAnswer());\n        System.out.println(\"  Tokens: \" + state1.getAnswerTokens());\n        System.out.println(\"Conversation 2: \" + state2.getAnswer());\n        System.out.println(\"  Tokens: \" + state2.getAnswerTokens());\n        System.out.println(\"Conversation 3: \" + state3.getAnswer());\n        System.out.println(\"  Tokens: \" + state3.getAnswerTokens());\n\n        // Cleanup\n        parallel.free();\n        context.free();\n        sampler.free();\n        model.free();\n        LlamaRuntime.llama_backend_free();\n    }\n}\n```\n\n### Example 4: Distributed Inference with RPC\n\nOffload model weights and KV-cache to remote machines using the RPC backend.\nWhen using `--rpc`, weights are loaded **exclusively** on the remote servers -- the local GPU is not used.\n\nStart RPC server nodes first (see [containers/README.md](containers/README.md)):\n\n```bash\n# On the remote machine (or another terminal)\n./scripts/start-rpc-server.sh\n```\n\nThen connect from Java:\n\n```java\nimport io.gravitee.llama.cpp.*;\nimport io.gravitee.llama.cpp.nativelib.LlamaLibLoader;\nimport java.lang.foreign.Arena;\nimport java.nio.file.Path;\n\npublic class RpcExample {\n    public static void main(String[] args) {\n        var arena = Arena.ofConfined();\n\n        // Initialize runtime\n        String libPath = 
LlamaLibLoader.load();\n        LlamaRuntime.llama_backend_init();\n\n        // Register remote RPC servers -- returns their device handles\n        var rpcDevices = BackendRegistry.addRpcServer(arena, \"127.0.0.1:50052\");\n\n        // Print all discovered backends and devices\n        BackendRegistry.printSummary();\n\n        // Load model, restricting offloading to only the RPC devices\n        var modelParams = new LlamaModelParams(arena)\n            .devices(arena, rpcDevices)\n            .nGpuLayers(999);\n        var model = new LlamaModel(arena, Path.of(\"models/model.gguf\"), modelParams);\n\n        // Everything else works exactly the same as local inference\n        var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);\n        var context = new LlamaContext(model, contextParams);\n        var vocab = new LlamaVocab(model);\n        var tokenizer = new LlamaTokenizer(vocab, context);\n        var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);\n\n        var state = ConversationState.create(arena, context, tokenizer, sampler, 0)\n            .setMaxTokens(100)\n            .initialize(\"What is the capital of France?\");\n\n        var iterator = new DefaultLlamaIterator(state);\n        while (iterator.hasNext()) {\n            System.out.print(iterator.next().text());\n        }\n\n        context.free();\n        sampler.free();\n        model.free();\n        LlamaRuntime.llama_backend_free();\n    }\n}\n```\n\nOr from the CLI:\n\n```bash\n$ java --enable-preview --enable-native-access=ALL-UNNAMED \\\n  -jar llamaj.cpp-\u003cversion\u003e.jar \\\n  --model models/model.gguf \\\n  --rpc 127.0.0.1:50052\n```\n\nMultiple RPC servers:\n\n```bash\n$ java --enable-preview --enable-native-access=ALL-UNNAMED \\\n  -jar llamaj.cpp-\u003cversion\u003e.jar \\\n  --model models/model.gguf \\\n  --rpc 192.168.1.10:50052,192.168.1.11:50052\n```\n\n## Build\n\nThe build uses a **platform-specific Maven profile** to download 
the correct jextract tool and pre-built llama.cpp native libraries, generate the Java FFM bindings, format the code, apply license headers, and install the artifact to your local Maven repository.\n\n**macOS (Apple Silicon):**\n\n```bash\ncd llamaj.cpp/\nmvn prettier:write license:format clean generate-sources -Pmacosx-aarch64 install\n```\n\n**Linux (x86_64):**\n\n```bash\ncd llamaj.cpp/\nmvn prettier:write license:format clean generate-sources -Plinux-x86_64 install\n```\n\n\u003e On Linux, you also need to set the library path at runtime:\n\u003e ```bash\n\u003e export LD_LIBRARY_PATH=\"$HOME/.llama.cpp:$LD_LIBRARY_PATH\"\n\u003e ```\n\n## Run\n\n```bash\n$ mvn exec:java -Dexec.mainClass=io.gravitee.llama.cpp.Main \\\n    -Dexec.args=\"--model /path/to/model/model.gguf --system 'You are a helpful assistant. Answer question to the best of your ability'\"\n```\n\nor\n\n```bash\n$ java --enable-preview -jar llamaj.cpp-\u003cversion\u003e.jar \\\n  --model models/model.gguf \\\n  --system 'You are a helpful assistant. 
Answer question to the best of your ability'\n```\n\nOn Linux, make the native libraries discoverable by setting the library path:\n```bash\n$ export LD_LIBRARY_PATH=\"$HOME/.llama.cpp:$LD_LIBRARY_PATH\"\n```\n\nMany GGUF models are available on Hugging Face; we suggest [Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF) as a starting point.\n\n### Usage\n```\nUsage: java -jar llamaj.cpp-\u003cversion\u003e.jar --model \u003cpath_to_gguf_model\u003e [options...]\nOptions:\n--system \u003cmessage\u003e       : System message (default: \"You are a helpful AI assistant.\")\n--n_gpu_layers \u003cint\u003e     : Number of GPU layers (default: 999)\n--use_mlock \u003cboolean\u003e    : Use mlock (default: true)\n--use_mmap \u003cboolean\u003e     : Use mmap (default: true)\n--rpc \u003cendpoints\u003e        : Comma-separated RPC server endpoints for distributed inference\n                           (e.g., \"127.0.0.1:50052,192.168.1.11:50052\")\n                           When set, weights are offloaded exclusively to the remote servers\n--temperature \u003cfloat\u003e    : Sampler temperature (default: 0.4)\n--min_p \u003cfloat\u003e          : Sampler min_p (default: 0.1)\n--min_p_window \u003cint\u003e     : Sampler min_p_window (default: 40)\n--top_k \u003cint\u003e            : Sampler top_k (default: 10)\n--top_p \u003cfloat\u003e          : Sampler top_p (default: 0.2)\n--top_p_window \u003cint\u003e     : Sampler top_p_window (default: 10)\n--seed \u003clong\u003e            : Sampler seed (default: random)\n--n_ctx \u003cint\u003e            : Context size (default: 512)\n--n_batch \u003cint\u003e          : Batch size (default: 512)\n--n_seq_max \u003cint\u003e        : Max number of parallel sequences (default: 512)\n--quota \u003cint\u003e            : Iterator quota (default: 512)\n--n_keep \u003cint\u003e           : Tokens to keep when exceeding ctx size (default: 256)\n--log_level \u003clevel\u003e      : Logging level (ERROR, WARN, INFO, DEBUG, default: 
ERROR)\n```\n\n## Use your own llama.cpp build\n\n1. Clone the `llama.cpp` repository\n\n\u003e Make sure the jextract folder sits at the same directory level as your repository\n\n```bash\n$ git clone https://github.com/ggml-org/llama.cpp\n$ cd llama.cpp\n```\n\n2. Compile sources\n\n\u003e Make sure the gcc / g++ compilers are installed\n\n```bash\n$ gcc --help\n$ g++ --help\n```\n\nOn Linux:\n```bash\n$ cmake -B build\n$ cmake --build build --config Release -j $(nproc)\n```\n\nOn macOS:\n```bash\n$ cmake -B build\n$ cmake --build build --config Release -j $(sysctl -n hw.ncpu)\n```\n\nIf you wish to build llama.cpp with a particular configuration (CUDA, OpenBLAS, AVX2, AVX512, ...),\nplease refer to the [llama.cpp build documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).\n\n3. Link sources\n\nSet the environment variable `LLAMA_CPP_LIB_PATH=/path/to/llama.cpp/build/bin/` to load the shared library files directly (`.so` on Linux, `.dylib` on macOS).\nAlternatively, set `LLAMA_CPP_USE_TMP_LIB_PATH=true` to copy these files into a temporary folder; the libraries are then loaded from that temporary path.\n\n## Beyond Apple M-Series and Linux x86_64\n\nTo add support for other platforms (Windows, ARM, CUDA, etc.), follow this approach:\n\n### 1. Build llama.cpp\n\nClone and build llama.cpp for your target platform:\n\n```bash\ngit clone https://github.com/ggml-org/llama.cpp.git\ncd llama.cpp\ncmake -B build\ncmake --build build --config Release\n```\n\n### 2. 
Generate FFM API Bindings with jextract\n\nDownload jextract for your platform from [OpenJDK early-access builds](https://download.java.net/java/early_access/jextract/25/2/), then generate the Java bindings:\n\n```bash\n# Example for Windows x86_64\njextract -t io.gravitee.llama.cpp.windows.x86_64 \\\n  --include-dir /path/to/llama.cpp/ggml/include \\\n  --include-dir /path/to/llama.cpp/include \\\n  --output src/main/java \\\n  --header-class-name llama_h \\\n  /path/to/llama.cpp/tools/mtmd/mtmd.h \\\n  /path/to/llama.cpp/tools/mtmd/mtmd-helper.h \\\n  /path/to/llama.cpp/include/llama.h \\\n  /path/to/llama.cpp/ggml/include/ggml-rpc.h\n```\n\n### 3. Post-process Generated Sources\n\nCheck the generated sources and apply any necessary fixes (e.g., visibility modifiers, fully-qualified method calls).\n\n### 4. Build the Bindings JAR\n\nCompile the generated sources and build a JAR using your own build system (Maven, Gradle, etc.).\n\n### 5. Integrate into Your Classpath\n\nAdd the generated JAR to your project's classpath and ensure the native libraries from step 1 are available at runtime.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgravitee-io%2Fllamaj.cpp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgravitee-io%2Fllamaj.cpp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgravitee-io%2Fllamaj.cpp/lists"}