https://github.com/mukel/lfm25.java
Fast LFM (Liquid AI) inference in pure Java
- Host: GitHub
- URL: https://github.com/mukel/lfm25.java
- Owner: mukel
- License: apache-2.0
- Created: 2026-06-07T10:51:38.000Z (26 days ago)
- Default Branch: main
- Last Pushed: 2026-06-08T09:19:58.000Z (25 days ago)
- Last Synced: 2026-06-08T11:11:14.932Z (25 days ago)
- Topics: inference, java, jvm
- Language: Java
- Homepage:
- Size: 47.9 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# LFM25.java

Fast, zero-dependency inference engine for [Liquid AI](https://www.liquid.ai/) [LFM2.5 models](https://www.liquid.ai/models) in pure Java.
----
## Features
- Single file, **no dependencies**, based on [llama3.java](https://github.com/mukel/llama3.java)
- Supports Liquid AI LFM2.5 GGUF models (dense and MoE)
- Fast [GGUF format](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md) parser
- Supported dtypes/quantizations: `F16`, `BF16`, `F32`, `Q4_0`, `Q4_1`, `Q4_K`, `Q5_K`, `Q6_K`, `Q8_0`
- Fast kernels using Java's [Vector API](https://openjdk.org/jeps/469)
- CLI with `--chat` and `--prompt` modes
- Thinking mode control with `--think off|on|inline`
- GraalVM Native Image support
- AOT model preloading for **instant time-to-first-token**
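
To make the quantization list above more concrete, here is a minimal sketch of how a `Q8_0` tensor can be expanded back to floats. It assumes the standard ggml `Q8_0` layout (blocks of 32 weights, each block holding a little-endian fp16 scale followed by 32 signed 8-bit values). The class and method names are hypothetical and are not taken from `LFM25.java`, which may well operate on the quantized data directly instead of dequantizing it up front.

```java
final class Q8_0Sketch {
    static final int BLOCK_SIZE = 32;                  // weights per block
    static final int BYTES_PER_BLOCK = 2 + BLOCK_SIZE; // fp16 scale + 32 int8 values

    /** Expands raw Q8_0 blocks into floats: weight = scale * quant. */
    static float[] dequantize(byte[] raw) {
        if (raw.length % BYTES_PER_BLOCK != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + BYTES_PER_BLOCK);
        }
        int blocks = raw.length / BYTES_PER_BLOCK;
        float[] out = new float[blocks * BLOCK_SIZE];
        for (int b = 0; b < blocks; b++) {
            int base = b * BYTES_PER_BLOCK;
            // The scale is stored as a little-endian IEEE 754 half-precision value.
            short bits = (short) ((raw[base] & 0xFF) | ((raw[base + 1] & 0xFF) << 8));
            float scale = Float.float16ToFloat(bits); // available since Java 20
            for (int i = 0; i < BLOCK_SIZE; i++) {
                // Java bytes are signed, so the quant can be used as-is.
                out[b * BLOCK_SIZE + i] = scale * raw[base + 2 + i];
            }
        }
        return out;
    }
}
```

Under this layout a block costs 34 bytes for 32 weights, i.e. 8.5 bits per weight.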
## Setup
Download GGUF models from Hugging Face:
| Model | Architecture | GGUF Repository |
|-------|-------------|-----------------|
| 350M | Dense | [LiquidAI/LFM2.5-350M-GGUF](https://huggingface.co/LiquidAI/LFM2.5-350M-GGUF) |
| 1.2B-Thinking | Dense | [LiquidAI/LFM2.5-1.2B-Thinking-GGUF](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking-GGUF) |
| 1.2B-Instruct | Dense | [LiquidAI/LFM2.5-1.2B-Instruct-GGUF](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct-GGUF) |
| 8B-A1B | Mixture of Experts (MoE) | [LiquidAI/LFM2.5-8B-A1B-GGUF](https://huggingface.co/LiquidAI/LFM2.5-8B-A1B-GGUF) |

Alternatively, download any other [LFM2.5 model](https://www.liquid.ai/models) in GGUF format, or convert one to GGUF with [llama.cpp](https://github.com/ggml-org/llama.cpp).
#### Optional: pure quantizations
`Q4_0` files are often mixed-quant in practice. A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with `llama-quantize` from [llama.cpp](https://github.com/ggml-org/llama.cpp):
```bash
./llama-quantize --pure ./LFM2.5-1.2B-Instruct-BF16.gguf ./LFM2.5-1.2B-Instruct-Q4_0.gguf Q4_0
```
Pick any supported target quantization, for example `Q4_0`, `Q4_1`, `Q4_K`, `Q5_K`, `Q6_K`, or `Q8_0`.
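
For readers curious what a `Q4_0` tensor actually stores, the following is a rough sketch of decoding it, assuming the standard ggml `Q4_0` layout: blocks of 32 weights, each with a little-endian fp16 scale followed by 16 bytes of packed 4-bit values, where low nibbles hold the first 16 weights and high nibbles the last 16. All names are hypothetical and not taken from `LFM25.java`.

```java
final class Q4_0Sketch {
    static final int BLOCK_SIZE = 32;                      // weights per block
    static final int BYTES_PER_BLOCK = 2 + BLOCK_SIZE / 2; // fp16 scale + 16 packed bytes

    /** Expands raw Q4_0 blocks into floats: weight = scale * (nibble - 8). */
    static float[] dequantize(byte[] raw) {
        if (raw.length % BYTES_PER_BLOCK != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + BYTES_PER_BLOCK);
        }
        int blocks = raw.length / BYTES_PER_BLOCK;
        float[] out = new float[blocks * BLOCK_SIZE];
        for (int b = 0; b < blocks; b++) {
            int base = b * BYTES_PER_BLOCK;
            int o = b * BLOCK_SIZE;
            short bits = (short) ((raw[base] & 0xFF) | ((raw[base + 1] & 0xFF) << 8));
            float scale = Float.float16ToFloat(bits);
            for (int i = 0; i < BLOCK_SIZE / 2; i++) {
                int packed = raw[base + 2 + i] & 0xFF;
                out[o + i] = scale * ((packed & 0x0F) - 8);                  // low nibble
                out[o + i + BLOCK_SIZE / 2] = scale * ((packed >>> 4) - 8);  // high nibble
            }
        }
        return out;
    }
}
```

A block here costs 18 bytes for 32 weights, i.e. 4.5 bits per weight. In a mixed-quant file, some tensors use a different type than the one in the file name, which is what `--pure` avoids.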
## Build and run
Java 21+ is required, in particular for the [`MemorySegment` mmap feature](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/nio/channels/FileChannel.html#map(java.nio.channels.FileChannel.MapMode,long,long,java.lang.foreign.Arena)).
[`jbang`](https://www.jbang.dev/) is a good fit for this use case.
```bash
jbang LFM25.java --help
jbang LFM25.java --model ./LFM2.5-1.2B-Instruct-Q8_0.gguf --chat
jbang LFM25.java --model ./LFM2.5-1.2B-Instruct-Q8_0.gguf --prompt "Tell me a joke"
```
Or run it directly, still via [`jbang`](https://www.jbang.dev/):
```bash
chmod +x LFM25.java
./LFM25.java --help
```
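
The `MemorySegment` mmap feature mentioned at the top of this section can be illustrated with a small, standalone sketch that maps a GGUF file and reads its fixed-size header. It assumes the header layout from the GGUF specification (a 4-byte `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata key-value counts, all little-endian). The class, record, and method names are hypothetical, and this is not how `LFM25.java` itself is structured. Note that `java.lang.foreign` is a preview API on Java 21 (it needs `--enable-preview` there) and is final from Java 22.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class GgufHeaderPeek {
    // The bytes 'G','G','U','F' read as a little-endian 32-bit integer.
    static final int GGUF_MAGIC = 0x46554747;

    // Unaligned little-endian layouts, so reads work at any byte offset.
    static final ValueLayout.OfInt INT_LE =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);
    static final ValueLayout.OfLong LONG_LE =
            ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    record Header(int version, long tensorCount, long metadataKvCount) {}

    static Header readHeader(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            // Map the whole file; pages are loaded lazily by the OS on access.
            MemorySegment segment =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
            if (segment.byteSize() < 24 || segment.get(INT_LE, 0) != GGUF_MAGIC) {
                throw new IOException("Not a GGUF file: " + file);
            }
            return new Header(
                    segment.get(INT_LE, 4),
                    segment.get(LONG_LE, 8),
                    segment.get(LONG_LE, 16));
        } // closing the arena unmaps the segment
    }
}
```

Unlike a `ByteBuffer` mapping, a `MemorySegment` is addressed with `long` offsets, so a single mapping can cover model files larger than 2 GB.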
## CLI
```text
Usage: jbang LFM25.java [options]

Options:
  --model, -m                required, path to .gguf file
  --interactive, --chat, -i  run in chat mode
  --instruct                 run in instruct (once) mode, default mode
  --prompt, -p               input prompt
  --suffix                   suffix for fill-in-the-middle request
  --system-prompt, -sp       system prompt for chat/instruct mode
  --temperature, -temp       temperature in [0,inf], default 1.0
  --top-p                    p value in top-p sampling in [0,1], default 0.95
  --seed                     random seed, default System.nanoTime()
  --max-tokens, -n           number of steps to run, default 1024
  --stream                   print tokens during generation, default true
  --echo                     print all tokens to stderr, default false
  --color                    colorize thinking output in terminal, default auto
  --think                    control thinking output
  --keep-past-thinking       keep prior assistant thinking in history, default false
  --raw-prompt               bypass chat template and tokenize --prompt directly
```
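
The `--temperature` and `--top-p` options refer to standard temperature scaling and nucleus (top-p) sampling. As a rough sketch of what those two knobs do, assuming the common formulation (softmax over `logits / temperature`, then sampling from the smallest set of most likely tokens whose cumulative probability reaches `p`), one possible implementation looks like this. The class name is hypothetical, and treating a temperature of `0` as greedy decoding is an assumption rather than something the help text states.

```java
import java.util.Arrays;
import java.util.random.RandomGenerator;

final class TopPSamplerSketch {
    static int sample(float[] logits, float temperature, float topP, RandomGenerator rng) {
        int n = logits.length;
        int best = 0;
        for (int i = 1; i < n; i++) {
            if (logits[i] > logits[best]) best = i;
        }
        if (temperature == 0f) return best; // assumption: 0 means greedy decoding

        // Softmax with temperature; subtracting the max keeps exp() from overflowing.
        double[] probs = new double[n];
        double sum = 0;
        for (int i = 0; i < n; i++) {
            probs[i] = Math.exp((logits[i] - logits[best]) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < n; i++) probs[i] /= sum;

        // Token ids ordered from most to least likely.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(probs[b], probs[a]));

        // Smallest prefix whose cumulative probability reaches topP (the "nucleus").
        double mass = 0;
        int last = n - 1;
        for (int k = 0; k < n; k++) {
            mass += probs[order[k]];
            if (mass >= topP) { last = k; break; }
        }

        // Draw from the nucleus, renormalized by its total mass.
        double r = rng.nextDouble() * mass;
        double acc = 0;
        for (int k = 0; k < last; k++) {
            acc += probs[order[k]];
            if (r < acc) return order[k];
        }
        return order[last];
    }
}
```

Lower temperatures sharpen the distribution and a smaller `p` trims the unlikely tail, so both make the output more deterministic. A fixed `--seed` makes a run reproducible.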
### GraalVM Native Image
Compile with `make native` to produce an `lfm25` executable, then:
```bash
./lfm25 --model ./LFM2.5-8B-A1B-Q8_0.gguf --chat
```
### AOT model preloading
`LFM25.java` supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).
To preload a GGUF model ahead of time, set `PRELOAD_GGUF` when building the native image:
```bash
PRELOAD_GGUF=/path/to/model.gguf make native
```
This generates a larger, specialized binary with the parse overhead removed for that specific model.
The binary can still run other models, with the usual parsing overhead.
## Benchmarks
\*\**Hardware specs: AMD Ryzen 9950X 16C/32T 64GB (6400) Linux 6.18.12.*
[GraalVM 25+](https://www.graalvm.org/downloads) is recommended for the best performance (JIT mode). It provides partial but good support for the [Vector API](https://openjdk.org/jeps/469), including in Native Image.
By default, the "preferred" vector size is used. It can be forced with `-Dllama.VectorBitSize=0|128|256|512`, where `0` disables vectorization.
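
Since `-Dllama.VectorBitSize` is a JVM system property, it has to be passed to the JVM rather than to the program's own options. As an illustration only, a property like this could be read and validated as follows. The class name, the `-1` sentinel for "use the preferred size", and the error handling are all hypothetical, and the README does not show how `LFM25.java` actually handles the property.

```java
final class VectorBitSizeOption {
    static final String PROPERTY = "llama.VectorBitSize";

    /**
     * Returns -1 when the property is unset (hypothetical sentinel for "preferred size"),
     * 0 when vectorization is disabled, or one of 128, 256, 512.
     */
    static int parse(String value) {
        if (value == null) return -1;
        return switch (value.trim()) {
            case "0" -> 0;
            case "128" -> 128;
            case "256" -> 256;
            case "512" -> 512;
            default -> throw new IllegalArgumentException(
                    PROPERTY + " must be one of 0, 128, 256, 512 but was: " + value);
        };
    }

    static int fromSystemProperty() {
        return parse(System.getProperty(PROPERTY));
    }
}
```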
## Related Repositories
- [llama3.java](https://github.com/mukel/llama3.java)
- [gemma4.java](https://github.com/mukel/gemma4.java)
- [gptoss.java](https://github.com/mukel/gptoss.java)
- [qwen35.java](https://github.com/mukel/qwen35.java)
- [nemotron3.java](https://github.com/mukel/nemotron3.java)
## License
Apache 2.0