https://github.com/mukel/lfm25.java

# LFM25.java



![Java 21+](https://img.shields.io/badge/Java-21%2B-007396?logo=java&logoColor=white)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg?logo=apache)](LICENSE)
[![GraalVM](https://img.shields.io/badge/GraalVM-Native_Image-F29111?labelColor=00758F)](https://www.graalvm.org/latest/reference-manual/native-image/)
![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows-lightgrey)

Fast, zero-dependency inference engine for [Liquid AI](https://www.liquid.ai/) [LFM2.5 models](https://www.liquid.ai/models) in pure Java.

----

## Features

- Single file, **no dependencies**, based on [llama3.java](https://github.com/mukel/llama3.java)
- Supports Liquid AI LFM2.5 GGUF models (dense and MoE)
- Fast [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) parser
- Supported dtypes/quantizations: `F16`, `BF16`, `F32`, `Q4_0`, `Q4_1`, `Q4_K`, `Q5_K`, `Q6_K`, `Q8_0`
- Fast kernels using Java's [Vector API](https://openjdk.org/jeps/469)
- CLI with `--chat` and `--prompt` modes
- Thinking mode control with `--think off|on|inline`
- GraalVM Native Image support
- AOT model preloading for **instant time-to-first-token**
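
As a rough guide to what the supported dtypes cost in storage, the sketch below lists elements per block and bytes per block for each one, following the block layouts used by ggml. The enum and its method names are illustrative and are not taken from `LFM25.java`.

```java
// Sketch: storage cost per supported dtype, following ggml's block layouts.
// GgmlDType, byteSizeFor and bitsPerWeight are hypothetical names for illustration.
enum GgmlDType {
    F32(1, 4),
    F16(1, 2),
    BF16(1, 2),
    Q4_0(32, 18),    // f16 scale + 16 bytes of packed 4-bit values
    Q4_1(32, 20),    // f16 scale + f16 min + 16 bytes of packed 4-bit values
    Q8_0(32, 34),    // f16 scale + 32 signed bytes
    Q4_K(256, 144),  // super-block of 256 elements
    Q5_K(256, 176),
    Q6_K(256, 210);

    final int blockSize;      // elements per block
    final int bytesPerBlock;  // bytes occupied by one block

    GgmlDType(int blockSize, int bytesPerBlock) {
        this.blockSize = blockSize;
        this.bytesPerBlock = bytesPerBlock;
    }

    /** Bytes needed to store numElements values of this dtype. */
    long byteSizeFor(long numElements) {
        if (numElements % blockSize != 0) {
            throw new IllegalArgumentException(
                    numElements + " is not a multiple of block size " + blockSize);
        }
        return numElements / blockSize * bytesPerBlock;
    }

    /** Average number of bits spent per weight, including scales. */
    double bitsPerWeight() {
        return 8.0 * bytesPerBlock / blockSize;
    }
}
```

For example, `GgmlDType.Q4_0.bitsPerWeight()` evaluates to 4.5 rather than 4, because every block of 32 weights also carries a 16-bit scale.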

## Setup

Download GGUF models from Hugging Face:

| Model | Architecture | GGUF Repository |
|-------|-------------|-----------------|
| 350M | Dense | [LiquidAI/LFM2.5-350M-GGUF](https://huggingface.co/LiquidAI/LFM2.5-350M-GGUF) |
| 1.2B-Thinking | Dense | [LiquidAI/LFM2.5-1.2B-Thinking-GGUF](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking-GGUF) |
| 1.2B-Instruct | Dense | [LiquidAI/LFM2.5-1.2B-Instruct-GGUF](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF) |
| 8B-A1B | Mixture of Experts (MoE) | [LiquidAI/LFM2.5-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF) |

Alternatively, convert an [LFM2.5 model](https://www.liquid.ai/models) to GGUF format yourself with [llama.cpp](https://github.com/ggml-org/llama.cpp).
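
To sanity-check a downloaded or converted file before running it, the fixed-size GGUF header can be read directly. The following is a minimal sketch, assuming the little-endian GGUF v2+ layout (magic, version, tensor count, metadata key/value count); the class and method names are hypothetical and this is not the parser used by `LFM25.java`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: read the 24-byte GGUF header (assumes little-endian GGUF v2 or later).
final class GgufHeaderCheck {
    static final int GGUF_MAGIC = 0x46554747; // bytes 'G','G','U','F' read as a little-endian int
    static final int HEADER_BYTES = 24;

    record Header(int version, long tensorCount, long metadataKvCount) {}

    /** Parses the header from the buffer's current position without moving it. */
    static Header parse(ByteBuffer buf) {
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        if (le.remaining() < HEADER_BYTES) {
            throw new IllegalArgumentException("Too short to be a GGUF file");
        }
        if (le.getInt() != GGUF_MAGIC) {
            throw new IllegalArgumentException("Missing GGUF magic");
        }
        int version = le.getInt();
        long tensorCount = le.getLong();
        long metadataKvCount = le.getLong();
        return new Header(version, tensorCount, metadataKvCount);
    }

    /** Reads only the first 24 bytes of the file and parses them. */
    static Header read(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES);
            while (buf.hasRemaining() && channel.read(buf) >= 0) {
                // keep reading until the header is complete or the file ends
            }
            buf.flip();
            return parse(buf);
        }
    }
}
```

Calling `GgufHeaderCheck.read(Path.of("model.gguf"))` on a truncated or non-GGUF download fails fast with an `IllegalArgumentException` instead of a confusing error later on.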

### Optional: pure quantizations

`Q4_0` files are often mixed-quant in practice. A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with `llama-quantize` from [llama.cpp](https://github.com/ggml-org/llama.cpp):

```bash
./llama-quantize --pure ./LFM2.5-1.2B-Instruct-BF16.gguf ./LFM2.5-1.2B-Instruct-Q4_0.gguf Q4_0
```

Pick any supported target quantization, for example `Q4_0`, `Q4_1`, `Q4_K`, `Q5_K`, `Q6_K`, or `Q8_0`.
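
To make the `Q4_0` format concrete, here is a minimal dequantization sketch. It assumes ggml's `Q4_0` block layout: 32 weights per 18-byte block, stored as a little-endian f16 scale followed by 16 bytes of packed 4-bit values with an implicit offset of 8. The class name is hypothetical, and this scalar loop is for illustration only; it is not the vectorized kernel in `LFM25.java`.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: scalar Q4_0 dequantization, assuming ggml's block layout.
final class Q4_0Sketch {
    static final int BLOCK_ELEMS = 32; // weights per block
    static final int BLOCK_BYTES = 18; // 2-byte f16 scale + 16 bytes of nibbles

    /** Dequantizes numElements weights starting at the buffer's current position. */
    static float[] dequantize(ByteBuffer data, int numElements) {
        if (numElements % BLOCK_ELEMS != 0) {
            throw new IllegalArgumentException("numElements must be a multiple of 32");
        }
        ByteBuffer le = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[numElements];
        for (int block = 0; block < numElements / BLOCK_ELEMS; block++) {
            int base = le.position() + block * BLOCK_BYTES;
            // Float.float16ToFloat is available since Java 20.
            float scale = Float.float16ToFloat(le.getShort(base));
            for (int j = 0; j < BLOCK_ELEMS / 2; j++) {
                int packed = le.get(base + 2 + j) & 0xFF;
                // Low nibbles hold elements 0..15, high nibbles hold elements 16..31.
                out[block * BLOCK_ELEMS + j] = scale * ((packed & 0x0F) - 8);
                out[block * BLOCK_ELEMS + j + 16] = scale * ((packed >>> 4) - 8);
            }
        }
        return out;
    }
}
```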

## Build and run

Java 21+ is required, in particular for the [`MemorySegment` mmap feature](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,long,long,java.lang.foreign.Arena)).

[`jbang`](https://www.jbang.dev/) is a good fit for this use case.

```bash
jbang LFM25.java --help
jbang LFM25.java --model ./LFM2.5-1.2B-Instruct-Q8_0.gguf --chat
jbang LFM25.java --model ./LFM2.5-1.2B-Instruct-Q8_0.gguf --prompt "Tell me a joke"
```

Or run it directly, still via [`jbang`](https://www.jbang.dev/):

```bash
chmod +x LFM25.java
./LFM25.java --help
```

## CLI

```text
Usage: jbang LFM25.java [options]

Options:
  --model, -m                  required, path to .gguf file
  --interactive, --chat, -i    run in chat mode
  --instruct                   run in instruct (once) mode, default mode
  --prompt, -p                 input prompt
  --suffix                     suffix for fill-in-the-middle request
  --system-prompt, -sp         system prompt for chat/instruct mode
  --temperature, -temp         temperature in [0,inf], default 1.0
  --top-p                      p value in top-p sampling in [0,1], default 0.95
  --seed                       random seed, default System.nanoTime()
  --max-tokens, -n             number of steps to run, default 1024
  --stream                     print tokens during generation, default true
  --echo                       print all tokens to stderr, default false
  --color                      colorize thinking output in terminal, default auto
  --think                      control thinking output
  --keep-past-thinking         keep prior assistant thinking in history, default false
  --raw-prompt                 bypass chat template and tokenize --prompt directly
```
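
To illustrate how `--temperature` and `--top-p` typically interact, here is a minimal sketch of temperature scaling followed by top-p (nucleus) sampling over a logits array. The class name is hypothetical, treating a temperature of `0` as greedy decoding is an assumption, and the actual sampler in `LFM25.java` may differ.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: temperature + top-p (nucleus) sampling; not the sampler from LFM25.java.
final class TopPSampler {

    /** Returns a token index sampled from the logits. */
    static int sample(float[] logits, float temperature, float topP, Random rng) {
        int n = logits.length;
        int best = 0;
        for (int i = 1; i < n; i++) {
            if (logits[i] > logits[best]) best = i;
        }
        if (temperature <= 0f) return best; // assumption: temperature 0 means greedy

        // Softmax with temperature; subtracting the max logit keeps exp() stable.
        double[] probs = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) {
            probs[i] = Math.exp((logits[i] - logits[best]) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < n; i++) probs[i] /= sum;

        // Token indices ordered by descending probability.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(probs[b], probs[a]));

        // Smallest prefix whose cumulative probability reaches topP.
        double mass = 0;
        int last = n - 1;
        for (int k = 0; k < n; k++) {
            mass += probs[order[k]];
            if (mass >= topP) { last = k; break; }
        }

        // Sample within that prefix, renormalized by its total mass.
        double r = rng.nextDouble() * mass;
        double acc = 0;
        for (int k = 0; k <= last; k++) {
            acc += probs[order[k]];
            if (r < acc) return order[k];
        }
        return order[last];
    }
}
```

Lower temperatures concentrate probability on the top tokens, while a smaller `topP` cuts off the low-probability tail before sampling.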

### GraalVM Native Image

Compile with `make native` to produce an `lfm25` executable, then run it:

```bash
./lfm25 --model ./LFM2.5-8B-A1B-Q8_0.gguf --chat
```

### AOT model preloading

`LFM25.java` supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).

To AOT-preload a GGUF model, set `PRELOAD_GGUF` when building the native image:

```bash
PRELOAD_GGUF=/path/to/model.gguf make native
```

This produces a larger binary specialized for that model, with its parsing overhead removed.
The binary can still run other models, which incur the usual parsing overhead.

## Benchmarks



*Hardware specs: AMD Ryzen 9950X 16C/32T 64GB (6400) Linux 6.18.12.*

[GraalVM 25+](https://www.graalvm.org/downloads) is recommended for the best performance (in JIT mode); it provides partial but good support for the [Vector API](https://openjdk.org/jeps/469), including in Native Image.

By default, the "preferred" vector size is used; it can be overridden with `-Dllama.VectorBitSize=0|128|256|512`, where `0` disables vectorization.

## Related Repositories

- [llama3.java](https://github.com/mukel/llama3.java)
- [gemma4.java](https://github.com/mukel/gemma4.java)
- [gptoss.java](https://github.com/mukel/gptoss.java)
- [qwen35.java](https://github.com/mukel/qwen35.java)
- [nemotron3.java](https://github.com/mukel/nemotron3.java)

## License

Apache 2.0