https://github.com/tjake/jlama

Jlama is a modern LLM inference engine for Java
Host: GitHub
URL: https://github.com/tjake/jlama
Owner: tjake
License: apache-2.0
Created: 2023-07-31T03:15:38.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-05-06T16:26:31.000Z (2 months ago)
Last Synced: 2025-05-11T01:35:34.407Z (2 months ago)
Topics: ai, genai, gpt, huggingface, java, llama, llm, openai, simd, transformers
Language: Java
Homepage:
Size: 4.24 MB
Stars: 1,057
Watchers: 31
Forks: 116
Open Issues: 27
Metadata Files:
- Readme: README.md
- License: LICENSE

# 🦙 Jlama: A modern LLM inference engine for Java



  



[![Maven Central Version](https://img.shields.io/maven-central/v/com.github.tjake/jlama-parent?style=flat-square)](https://central.sonatype.com/artifact/com.github.tjake/jlama-core/overview)

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

[![Discord](https://img.shields.io/discord/1279855254812229642?style=flat-square&label=Discord&color=663399)](https://discord.gg/HsYXHrMu6J)

[![DeepWiki](https://img.shields.io/badge/DeepWiki-tjake%2FJlama-blue.svg)](https://deepwiki.com/tjake/Jlama)

## 🚀 Features

Model Support:

  * Gemma & Gemma 2 Models

  * Llama & Llama2 & Llama3 Models

  * Mistral & Mixtral Models

  * Qwen2 Models

  * IBM Granite Models

  * GPT-2 Models

  * BERT Models

  * BPE Tokenizers

  * WordPiece Tokenizers

Implements:

  * Paged Attention

  * Mixture of Experts

  * Tool Calling

  * Generate Embeddings

  * Classifier Support

  * Huggingface [SafeTensors](https://github.com/huggingface/safetensors) model and tokenizer format

  * Support for F32, F16, BF16 types

  * Support for Q8, Q4 model quantization

  * Fast GEMM operations

  * Distributed Inference!

Jlama requires Java 20 or later and uses the new [Vector API](https://openjdk.org/jeps/448) for faster inference.

## 🤔 What is it used for? 

Add LLM Inference directly to your Java application. To learn more read the [DeepWiki](https://deepwiki.com/tjake/Jlama) docs.

## 🔬 Quick Start

### 🕵️‍♀️ How to use as a local client (with jbang!)

Jlama includes a command line tool that makes it easy to use.

The CLI can be run with [jbang](https://www.jbang.dev/download/).

```shell
# Install jbang (or see https://www.jbang.dev/download/)
curl -Ls https://sh.jbang.dev | bash -s - app setup

# Install the Jlama CLI (jbang will ask whether you trust the source)
jbang app install --force jlama@tjake
```

Now that Jlama is installed, you can download a model from Hugging Face and chat with it.

Note: I have pre-quantized models available at https://hf.co/tjake

```shell
# Run the OpenAI-compatible chat API and UI on a model
jlama restapi tjake/Llama-3.2-1B-Instruct-JQ4 --auto-download
```

Then open a browser to http://localhost:8080/
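Because the `restapi` command serves an OpenAI-compatible API, a plain HTTP client should be able to talk to it as well. The sketch below uses the JDK's `java.net.http.HttpClient`. The `/chat/completions` route and the request fields are assumptions based on the OpenAI convention, not something this README states, and `JlamaRestClientSketch` is a hypothetical class name; check the running server for the exact paths.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class JlamaRestClientSketch {

    // Assumed route (OpenAI convention); verify against the running server.
    static final String CHAT_PATH = "/chat/completions";

    /** Escapes characters that would otherwise break a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds a minimal chat request body with a single user message. */
    static String chatBody(String model, String userMessage) {
        return "{\"model\":\"" + jsonEscape(model) + "\","
             + "\"messages\":[{\"role\":\"user\",\"content\":\"" + jsonEscape(userMessage) + "\"}]}";
    }

    /** Builds the POST request without sending it. */
    static HttpRequest chatRequest(String baseUrl, String model, String userMessage) {
        return HttpRequest.newBuilder(URI.create(baseUrl + CHAT_PATH))
                .timeout(Duration.ofMinutes(2))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chatBody(model, userMessage)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Requires `jlama restapi ...` to be running locally on port 8080.
        HttpRequest request = chatRequest("http://localhost:8080",
                "tjake/Llama-3.2-1B-Instruct-JQ4",
                "What is the best season to plant avocados?");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

The request is built separately from the call that sends it, so the URL and JSON body can be inspected without a server running.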



  



```shell
Usage:

jlama [COMMAND]

Description:

Jlama is a modern LLM inference engine for Java!
Quantized models are maintained at https://hf.co/tjake

Choose from the available commands:

Inference:
  chat                 Interact with the specified model
  restapi              Starts a openai compatible rest api for interacting with this model
  complete             Completes a prompt using the specified model

Distributed Inference:
  cluster-coordinator  Starts a distributed rest api for a model using cluster workers
  cluster-worker       Connects to a cluster coordinator to perform distributed inference

Other:
  download             Downloads a HuggingFace model - use owner/name format
  list                 Lists local models
  quantize             Quantize the specified model
  rm                   Removes local model
  version              Display JLama version information
```

### 👨‍💻 How to use in your Java project

The main purpose of Jlama is to provide a simple way to use large language models in Java.

The simplest way to embed Jlama in your app is with the [Langchain4j Integration](https://github.com/langchain4j/langchain4j-examples/tree/main/jlama-examples).  

If you would like to embed Jlama without LangChain4j, add the following [Maven](https://central.sonatype.com/artifact/com.github.tjake/jlama-core/) dependencies to your project:

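```xml
<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-core</artifactId>
  <version>${jlama.version}</version>
</dependency>

<dependency>
  <groupId>com.github.tjake</groupId>
  <artifactId>jlama-native</artifactId>
  <!-- os.detected.name and os.detected.arch are not built-in Maven properties; os-maven-plugin is one way to set them -->
  <classifier>${os.detected.name}-${os.detected.arch}</classifier>
  <version>${jlama.version}</version>
</dependency>
```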

Jlama uses Java 21 preview features. You can enable them globally with:

```shell

export JDK_JAVA_OPTIONS="--add-modules jdk.incubator.vector --enable-preview"

```

or enable the preview features by configuring the Maven compiler and failsafe plugins.
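A quick way to confirm that these options reached the JVM is to ask the runtime which modules it resolved. The following is a minimal sketch that uses only standard JDK APIs; `JlamaRuntimeCheck` is a hypothetical helper name and is not part of Jlama.

```java
class JlamaRuntimeCheck {

    /** Returns true when the incubating Vector API module was resolved at JVM startup. */
    static boolean vectorModuleResolved() {
        return ModuleLayer.boot().findModule("jdk.incubator.vector").isPresent();
    }

    /** Returns the feature release of the running JVM, for example 21. */
    static int javaFeatureVersion() {
        return Runtime.version().feature();
    }

    public static void main(String[] args) {
        System.out.println("Java feature version: " + javaFeatureVersion());
        System.out.println("jdk.incubator.vector resolved: " + vectorModuleResolved());
        // JDK_JAVA_OPTIONS is read by the `java` launcher; print it to see what was set.
        System.out.println("JDK_JAVA_OPTIONS: " + System.getenv("JDK_JAVA_OPTIONS"));
    }
}
```

Incubator modules are not resolved by default, so if the second line prints `false`, the `--add-modules jdk.incubator.vector` option did not take effect for this process.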

Then you can use the Model classes to run models:

```java
 // Needs java.io.File, java.io.IOException and java.util.UUID, plus the Jlama classes from jlama-core
 public void sample() throws IOException {
    String model = "tjake/Llama-3.2-1B-Instruct-JQ4";
    String workingDirectory = "./models";
    String prompt = "What is the best season to plant avocados?";

    // Downloads the model, or just returns the local path if it's already downloaded
    File localModelPath = new Downloader(workingDirectory, model).huggingFaceModel();

    // Loads the quantized model and specifies the use of quantized memory
    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

    PromptContext ctx;
    // Checks if the model supports chat prompting and, if so, adds the prompt in the format this model expects
    if (m.promptSupport().isPresent()) {
        ctx = m.promptSupport()
                .get()
                .builder()
                .addSystemMessage("You are a helpful chatbot who writes short responses.")
                .addUserMessage(prompt)
                .build();
    } else {
        ctx = PromptContext.of(prompt);
    }

    System.out.println("Prompt: " + ctx.getPrompt() + "\n");

    // Generates a response to the prompt and prints it.
    // The API allows for streaming or non-streaming responses; the empty callback here ignores streamed tokens.
    // The response is generated with a temperature of 0.0 and a maximum of 256 tokens.
    Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.0f, 256, (s, f) -> {});
    System.out.println(r.responseText);
 }
```

Or you can use a **Builder API**:

```java
 public void sample() throws IOException {
    String model = "tjake/Llama-3.2-1B-Instruct-JQ4";
    String workingDirectory = "./models";
    String prompt = "What is the best season to plant avocados?";

    // Downloads the model, or just returns the local path if it's already downloaded
    File localModelPath = new Downloader(workingDirectory, model).huggingFaceModel();

    // Loads the quantized model and specifies the use of quantized memory
    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

    PromptContext ctx;
    // Checks if the model supports chat prompting and, if so, adds the prompt in the format this model expects
    if (m.promptSupport().isPresent()) {
        ctx = m.promptSupport()
                .get()
                .builder()
                .addSystemMessage("You are a helpful chatbot who writes short responses.")
                .addUserMessage(prompt)
                .build();
    } else {
        ctx = PromptContext.of(prompt);
    }

    System.out.println("Prompt: " + ctx.getPrompt() + "\n");

    // Generates a response to the prompt and prints it.
    // The API allows for streaming or non-streaming responses.
    // The response is generated with a temperature of 0.0 and a maximum of 256 tokens.
    Generator.Response r = m.generateBuilder()
            .session(UUID.randomUUID()) // Defaults to UUID.randomUUID()
            .promptContext(ctx) // Required, or use prompt(String text)
            .ntokens(256) // Defaults to 256
            .temperature(0.0f) // Defaults to 0.0f
            .onTokenWithTimings((s, aFloat) -> {}) // Defaults to a no-op callback
            .generate();

    System.out.println(r.responseText);
 }
```

You can simplify the `promptSupport` handling by using `prompt()`:

```java
 public void sample() throws IOException {
    String model = "tjake/Llama-3.2-1B-Instruct-JQ4";
    String workingDirectory = "./models";
    String prompt = "What is the best season to plant avocados?";

    // Downloads the model, or just returns the local path if it's already downloaded
    File localModelPath = new Downloader(workingDirectory, model).huggingFaceModel();

    // Loads the quantized model and specifies the use of quantized memory
    AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

    var systemPrompt = "You are a helpful chatbot who writes short responses.";

    // build() creates a PromptContext; if the model doesn't support chat prompts,
    // a plain PromptContext is created instead
    PromptContext ctx = m.prompt()
                        .addUserMessage(prompt)
                        .addSystemMessage(systemPrompt)
                        .build();

    System.out.println("Prompt: " + ctx.getPrompt() + "\n");

    // Generates a response to the prompt and prints it.
    // The API allows for streaming or non-streaming responses.
    // The response is generated with a temperature of 0.0 and a maximum of 256 tokens.
    Generator.Response r = m.generateBuilder()
            .session(UUID.randomUUID()) // Defaults to UUID.randomUUID()
            .promptContext(ctx) // Required, or use prompt(String text)
            .ntokens(256) // Defaults to 256
            .temperature(0.0f) // Defaults to 0.0f
            .onTokenWithTimings((s, aFloat) -> {}) // Defaults to a no-op callback
            .generate();

    System.out.println(r.responseText);
 }
```
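The examples above pass an empty `(s, f) -> {}` callback, which discards tokens as they are produced. To stream a response, that callback can do real work instead. The sketch below assumes the callback has the shape of a `BiConsumer<String, Float>` (inferred from the lambda, not confirmed by this README) and that the second argument is a per-token timing whose unit is not stated here. `TokenCollector` is a hypothetical class, and the `main` method feeds it hand-written tokens as a stand-in for a real model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

class TokenCollector implements BiConsumer<String, Float> {
    private final StringBuilder text = new StringBuilder();
    private final List<Float> timings = new ArrayList<>();

    /** Called once per generated token: echoes it immediately and records it. */
    @Override
    public void accept(String token, Float timing) {
        System.out.print(token);
        System.out.flush();
        text.append(token);
        timings.add(timing);
    }

    /** Full text received so far. */
    String text() {
        return text.toString();
    }

    /** Number of tokens received so far. */
    int tokenCount() {
        return timings.size();
    }

    /** Mean of the reported timings, or 0.0 if no tokens were received. */
    double meanTiming() {
        return timings.stream().mapToDouble(Float::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        TokenCollector collector = new TokenCollector();
        // Stand-in for generation: a real model would invoke the callback itself.
        String[] fakeTokens = {"Spring", " is", " a", " good", " choice", "."};
        for (String t : fakeTokens) {
            collector.accept(t, 12.5f);
        }
        System.out.println();
        System.out.println(collector.tokenCount() + " tokens, mean timing " + collector.meanTiming());
    }
}
```

If the assumption about the callback type holds, an instance could be passed where the examples use the empty lambda, for example `.onTokenWithTimings(collector)`.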

## ⭐ Give us a Star! 

If you like or are using this project to build your own, please give us a star. It's a free way to show your support.

## 🗺️ Roadmap

* Support more and more models

* Add pure Java tokenizers

* Support Quantization (e.g. k-quantization)

* Add LoRA support

* GraalVM support

* Add distributed inference

## 🏷️ License and Citation

The code is available under [Apache License](./LICENSE).

If you find this project helpful in your research, please cite this work as:

```
@misc{jlama2024,
    title = {Jlama: A modern Java inference engine for large language models},
    url = {https://github.com/tjake/jlama},
    author = {T Jake Luciani},
    month = {January},
    year = {2024}
}
```