# Llama3.java

Practical [Llama 3](https://github.com/meta-llama/llama3) and [3.1](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1) inference implemented in a single Java file.



This project is the successor of [llama2.java](https://github.com/mukel/llama2.java),
itself based on [llama2.c](https://github.com/karpathy/llama2.c) by [Andrej Karpathy](https://twitter.com/karpathy) and his [excellent educational videos](https://www.youtube.com/c/AndrejKarpathy).

Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the [Graal compiler](https://www.graalvm.org/latest/reference-manual/java/compiler).

## Features

- Single file, no dependencies
- [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) parser
- Llama 3 tokenizer based on [minbpe](https://github.com/karpathy/minbpe)
- Llama 3 inference with Grouped-Query Attention
- Support for Llama 3.1 (ad-hoc RoPE scaling)
- Support for Q8_0 and Q4_0 quantizations
- Fast matrix-vector multiplication routines for quantized tensors using Java's [Vector API](https://openjdk.org/jeps/469); see the sketch right after this list
- Simple CLI with `--chat` and `--instruct` modes
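
To illustrate the Vector API point, here is a simplified dot-product sketch over plain `float[]` arrays; this is not the project's actual quantized routines, which apply the same pattern to blocks of quantized weights:

```java
// Simplified sketch of a Vector API dot product over plain float arrays.
// Compile/run with: --add-modules jdk.incubator.vector
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class DotProduct {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // lanewise fused multiply-add
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) { // scalar tail
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9};
        float[] b = {9, 8, 7, 6, 5, 4, 3, 2, 1};
        System.out.println(dot(a, b)); // prints 165.0
    }
}
```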


## Setup

Download pure `Q4_0` and (optionally) `Q8_0` quantized .gguf files from:
- https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF
- https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF

The ~4.3GB pure `Q4_0` quantized model is recommended; please be gentle with the [huggingface.co](https://huggingface.co) servers:
```bash
# Llama 3.1
curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_0.gguf

# Llama 3
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally, download the Q8_0 quantized models (~8GB)
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q8_0.gguf
```

#### Optional: quantize to pure `Q4_0` manually

In the wild, `Q8_0` quantizations are fine, but `Q4_0` quantizations are rarely pure; e.g. the `output.weights` tensor is often quantized with `Q6_K` instead of `Q4_0`.
A **pure** `Q4_0` quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source
with the `quantize` utility from [llama.cpp](https://github.com/ggerganov/llama.cpp) as follows:

```bash
./llama-quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```
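
For context, a `Q4_0` block packs 32 weights into 18 bytes: one fp16 scale followed by 16 bytes of packed 4-bit values. A minimal dequantization sketch in Java (an illustration of the format, not code from this project):

```java
// Illustrative Q4_0 block dequantization: 32 weights = fp16 scale + 16 packed bytes.
// weight = (nibble - 8) * scale; low nibbles hold elements 0..15, high nibbles 16..31.
static float[] dequantizeQ4_0(short scaleFp16, byte[] qs /* 16 bytes */) {
    float d = Float.float16ToFloat(scaleFp16); // fp16 -> float, Java 20+
    float[] out = new float[32];
    for (int i = 0; i < 16; i++) {
        out[i]      = ((qs[i] & 0x0F) - 8) * d;
        out[i + 16] = (((qs[i] >> 4) & 0x0F) - 8) * d;
    }
    return out;
}
```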

## Build and run

Java 21+ is required, in particular for the [`MemorySegment` mmap-ing feature](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,long,long,java.lang.foreign.Arena)).
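
That feature is the `FileChannel#map` overload returning a `MemorySegment` scoped to an `Arena`. A minimal sketch (assuming a hypothetical local `model.gguf`; this is not the project's code, and it needs `--enable-preview` on Java 21, where the FFM API is still a preview):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapGguf {
    public static void main(String[] args) throws Exception {
        try (Arena arena = Arena.ofConfined();
             FileChannel channel = FileChannel.open(Path.of("model.gguf"), StandardOpenOption.READ)) {
            // Map the whole file; the mapping is released when the Arena closes.
            MemorySegment segment = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
            // A GGUF file starts with the magic "GGUF" (0x46554747 read little-endian).
            int magic = segment.get(ValueLayout.JAVA_INT_UNALIGNED, 0);
            System.out.printf("magic: 0x%08X%n", magic);
        }
    }
}
```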

[`jbang`](https://www.jbang.dev/) is a perfect fit for this use case; just run:
```bash
jbang Llama3.java --help
```
Or execute directly, also via [`jbang`](https://www.jbang.dev/):
```bash
chmod +x Llama3.java
./Llama3.java --help
```

## Run from source

```bash
java --enable-preview --source 21 --add-modules jdk.incubator.vector Llama3.java -i --model Meta-Llama-3-8B-Instruct-Q4_0.gguf
```

#### Optional: Makefile + manual build and run

A simple [Makefile](./Makefile) is provided; run `make` to produce `llama3.jar`, or build manually:
```bash
javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .
```

Run the resulting `llama3.jar` as follows:
```bash
java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help
```

## Performance

**Important Note**
On GraalVM, the Graal compiler doesn't support the Vector API yet; run with `-Dllama.VectorAPI=false`, but expect sub-optimal performance.
Vanilla OpenJDK 21+, which supports the Vector API, is recommended for now.
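
The flag is an ordinary JVM system property; a gate on the Java side could look like this (a hypothetical sketch, not necessarily how Llama3.java reads it):

```java
// Hypothetical sketch: -Dllama.VectorAPI=false falls back to scalar routines.
static final boolean USE_VECTOR_API =
        Boolean.parseBoolean(System.getProperty("llama.VectorAPI", "true"));
```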

### llama.cpp

Vanilla `llama.cpp` built with `make -j 20`.
```bash
./main --version
version: 2879 (4f026363)
built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
```

Executed as follows:
```bash
./main -m ../Meta-Llama-3-8B-Instruct-Q4_0.gguf \
-n 512 \
-s 42 \
-p "<|start_of_header_id|>user<|end_of_header_id|>Why is the sky blue?<|eot_id|><|start_of_header_id|>assistant<|end_of_header_id|>\n\n" \
--interactive-specials
```
Collected the **"eval time"** metric, in tokens/s.

### Llama3.java
Running on OpenJDK 21.0.2.

```bash
jbang Llama3.java \
--model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf \
--max-tokens 512 \
--seed 42 \
--stream false \
--prompt "Why is the sky blue?"
```

### Results

#### Notebook: Intel Core i9-13900H (6P+8E cores, 20 threads), 64GB (5200 MT/s), Linux 6.6.26
| Model | tokens/s | Implementation |
|----------------------------------|----------|------------------|
| Llama-3-8B-Instruct-Q4_0.gguf | 7.53 | llama.cpp |
| Llama-3-8B-Instruct-Q4_0.gguf | 6.95 | llama3.java |
| Llama-3-8B-Instruct-Q8_0.gguf | 5.16 | llama.cpp |
| Llama-3-8B-Instruct-Q8_0.gguf | 4.02 | llama3.java |

#### Workstation: AMD Ryzen 9 3950X (16C/32T), 64GB (3200 MT/s), Linux 6.6.25

**Note:** *running on a single CCD, e.g. `taskset -c 0-15 jbang Llama3.java ...`, since inference is constrained by memory bandwidth.*

| Model | tokens/s | Implementation |
|----------------------------------|----------|------------------|
| Llama-3-8B-Instruct-Q4_0.gguf | 9.26 | llama.cpp |
| Llama-3-8B-Instruct-Q4_0.gguf | 8.03 | llama3.java |
| Llama-3-8B-Instruct-Q8_0.gguf | 5.79 | llama.cpp |
| Llama-3-8B-Instruct-Q8_0.gguf | 4.92 | llama3.java |

## License

MIT