https://github.com/datalevin/dtlvnative

The native dependency of Datalevin database
Host: GitHub
URL: https://github.com/datalevin/dtlvnative
Owner: datalevin
License: other
Created: 2021-09-12T04:01:32.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2026-03-30T14:43:09.000Z (2 months ago)
Last Synced: 2026-03-30T15:08:47.659Z (2 months ago)
Language: Java
Homepage:
Size: 928 KB
Stars: 6
Watchers: 1
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# dtlvnative

Provides pre-built native dependencies for the
[Datalevin](https://github.com/juji-io/datalevin) database. This is done by
packaging the compiled native libraries and JavaCPP JNI library files in
platform-specific JAR files.

In addition to JavaCPP's JNI library, these native libraries are included:

* [`dlmdb`](https://github.com/huahaiy/dlmdb), a fork of the
  [LMDB](https://www.symas.com/mdb) key-value storage library.
* [`usearch`](https://github.com/unum-cloud/USearch), a vector indexing and
  similarity search library that is exposed directly to callers.
* [`llama.cpp`](https://github.com/ggml-org/llama.cpp), built as a CPU-only
  GGUF runtime for embeddings and prompt-based text generation.
* `dtlv`, which wraps DLMDB and implements Datalevin iterators, counters, and
  samplers.

The following platforms are currently supported:

* macosx-arm64

* freebsd-x86_64

* linux-arm64

* linux-x86_64

* windows-x86_64

The name of the released JAR is `org.clojars.huahaiy/dtlvnative-PLATFORM`, where
`PLATFORM` is one of the above.

Vector support using usearch on Windows is experimental.
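
As an illustration of how the `PLATFORM` suffix relates to a running JVM, the
sketch below maps `os.name` and `os.arch` values to one of the five platform
names listed above. The class `DtlvPlatform` and its methods are hypothetical
helpers written for this example and are not part of `dtlvnative`; the
matching rules for the system property values are assumptions.

```java
import java.util.Locale;
import java.util.Set;

final class DtlvPlatform {
    // The five platform suffixes listed in this README.
    static final Set<String> SUPPORTED = Set.of(
            "macosx-arm64", "freebsd-x86_64", "linux-arm64",
            "linux-x86_64", "windows-x86_64");

    /** Maps JVM os.name / os.arch values to a platform suffix (hypothetical helper). */
    static String classify(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String arch = osArch.toLowerCase(Locale.ROOT);

        String osPart;
        if (os.contains("mac")) osPart = "macosx";
        else if (os.contains("freebsd")) osPart = "freebsd";
        else if (os.contains("linux")) osPart = "linux";
        else if (os.contains("windows")) osPart = "windows";
        else throw new IllegalArgumentException("unsupported OS: " + osName);

        String archPart;
        if (arch.equals("aarch64") || arch.equals("arm64")) archPart = "arm64";
        else if (arch.equals("x86_64") || arch.equals("amd64")) archPart = "x86_64";
        else throw new IllegalArgumentException("unsupported arch: " + osArch);

        String platform = osPart + "-" + archPart;
        // Reject combinations that are not in the supported list (e.g. macosx-x86_64).
        if (!SUPPORTED.contains(platform)) {
            throw new IllegalArgumentException("no released JAR for " + platform);
        }
        return platform;
    }

    /** Builds the artifact name following the naming rule stated above. */
    static String artifact(String platform) {
        return "org.clojars.huahaiy/dtlvnative-" + platform;
    }

    public static void main(String[] args) {
        String platform = classify(System.getProperty("os.name"),
                                   System.getProperty("os.arch"));
        System.out.println(artifact(platform));
    }
}
```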

## llama.cpp text + embedding

`dtlvnative` packages the CPU backend of `llama.cpp` with OpenMP enabled. The
packaged native API now supports embedding models, decoder-only text models for
prompt-based generation, and multimodal OCR with PaddleOCR-VL GGUF models.

### Embedding API

| Function | Description |
|---|---|
| `dtlv_llama_embedder_create` | Load a GGUF model and create an embedder |
| `dtlv_llama_embedder_n_embd` | Return the embedding dimension |
| `dtlv_llama_embedder_n_ctx` | Return the context size (max tokens) |
| `dtlv_llama_token_count` | Count tokens for a string without allocating |
| `dtlv_llama_tokenize` | Tokenize a string into a caller-owned `int[]` buffer |
| `dtlv_llama_detokenize` | Convert tokens back to a UTF-8 string |
| `dtlv_llama_embed` | Compute an embedding for a single string |
| `dtlv_llama_embed_batch` | Compute embeddings for multiple strings in one call |
| `dtlv_llama_embedder_destroy` | Free the embedder |

The model must be a GGUF embedding model. The current smoke test uses
`multilingual-e5-small-Q8_0.gguf`.

`dtlv_llama_embedder_create` takes `model_path`, `n_ctx`, `n_batch`,
`n_threads`, and `normalize`. Pass `0` for `n_ctx` and `n_batch` to use model
defaults. A non-zero `normalize` returns L2-normalized embeddings.
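
To make "L2-normalized" concrete, here is a minimal plain-Java sketch of the
operation: each component is divided by the vector's Euclidean length, so the
dot product of two normalized embeddings equals their cosine similarity. The
class `L2Normalize` is a hypothetical helper for illustration only; it does
not show how the native layer implements normalization.

```java
final class L2Normalize {
    /** Returns a copy of v scaled to unit Euclidean length (zero vectors stay zero). */
    static float[] normalize(float[] v) {
        double sumSq = 0.0;
        for (float x : v) {
            sumSq += (double) x * x;
        }
        double norm = Math.sqrt(sumSq);
        float[] out = new float[v.length];
        if (norm == 0.0) {
            return out; // avoid dividing by zero
        }
        for (int i = 0; i < v.length; i++) {
            out[i] = (float) (v[i] / norm);
        }
        return out;
    }

    /** Dot product; for unit-length inputs this is the cosine similarity. */
    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] unit = normalize(new float[] {3f, 4f});
        System.out.println(unit[0] + ", " + unit[1]);
        System.out.println(dot(unit, unit));
    }
}
```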

### Single embedding

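```java
// Load the model: 0 for n_ctx and n_batch keeps the model defaults,
// 4 threads, and normalize = 1 requests L2-normalized output.
DTLV.dtlv_llama_embedder embedder = new DTLV.dtlv_llama_embedder();
int rc = DTLV.dtlv_llama_embedder_create(
        embedder,
        "multilingual-e5-small-Q8_0.gguf",
        0, 0, 4, 1);

// Size the output buffer from the model's embedding dimension.
int nEmbd = DTLV.dtlv_llama_embedder_n_embd(embedder);
float[] output = new float[nEmbd];
rc = DTLV.dtlv_llama_embed(embedder, "query: hello world", output, nEmbd);

// Free the embedder when done.
DTLV.dtlv_llama_embedder_destroy(embedder);
```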

### Token counting and tokenization

```java
import java.nio.charset.StandardCharsets;

// Check the token count against the context size before embedding.
int nTokens = DTLV.dtlv_llama_token_count(embedder, text);
int maxTokens = DTLV.dtlv_llama_embedder_n_ctx(embedder);

// Tokenize into a caller-owned buffer, truncate, then detokenize.
int[] tokens = new int[maxTokens];
int actual = DTLV.dtlv_llama_tokenize(embedder, text, tokens, maxTokens);
if (actual > maxTokens) {
    // truncate to fit
    actual = maxTokens;
}

// text.length() * 4 is a generous upper bound on the UTF-8 size of the text.
byte[] buf = new byte[text.length() * 4];
int len = DTLV.dtlv_llama_detokenize(embedder, tokens, actual, buf, buf.length);
String truncated = new String(buf, 0, len, StandardCharsets.UTF_8);
```

### Batch embedding

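```java
import org.bytedeco.javacpp.PointerPointer;

// Reuses `embedder`, `nEmbd`, and `rc` from the single-embedding example.
PointerPointer texts = new PointerPointer("query: hello", "query: world");
int nTexts = 2;

// One flat buffer holds all embeddings back to back.
float[] output = new float[nTexts * nEmbd];
rc = DTLV.dtlv_llama_embed_batch(embedder, texts, nTexts, output, output.length);
// output[0..nEmbd-1] = embedding for "query: hello"
// output[nEmbd..2*nEmbd-1] = embedding for "query: world"
```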
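
Because the batch call writes every embedding into one flat array, callers
usually need to slice it back into per-text vectors. The sketch below shows
one way to do that in plain Java, following the layout in the comments above.
The class `BatchSlices` is a hypothetical helper and not part of the
`dtlvnative` API.

```java
import java.util.Arrays;

final class BatchSlices {
    /** Splits a flat [nTexts * nEmbd] buffer into nTexts rows of nEmbd floats. */
    static float[][] split(float[] flat, int nTexts, int nEmbd) {
        if (nTexts < 0 || nEmbd < 0 || flat.length < nTexts * nEmbd) {
            throw new IllegalArgumentException("buffer too small for batch");
        }
        float[][] rows = new float[nTexts][];
        for (int i = 0; i < nTexts; i++) {
            // Row i occupies indices [i * nEmbd, (i + 1) * nEmbd).
            rows[i] = Arrays.copyOfRange(flat, i * nEmbd, (i + 1) * nEmbd);
        }
        return rows;
    }

    public static void main(String[] args) {
        float[] flat = {1f, 2f, 3f, 4f, 5f, 6f};
        float[][] rows = split(flat, 2, 3);
        System.out.println(Arrays.toString(rows[0]));
        System.out.println(Arrays.toString(rows[1]));
    }
}
```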

The Java test in `src/java/datalevin/dtlvnative/Test.java` uses
`target/embedding-models/multilingual-e5-small-Q8_0.gguf` if present, falls back
to a repository-root copy if present, and otherwise downloads the model from
Hugging Face before running the embedding smoke test.

### Text generation API

The text-generation API is aimed at decoder-only instruction models such as
Qwen 3.5 0.8B Instruct in GGUF format.

| Function | Description |
|---|---|
| `dtlv_llama_generator_create` | Load a GGUF decoder-only text model |
| `dtlv_llama_generator_n_ctx` | Return the context size |
| `dtlv_llama_generator_token_count` | Count tokens for a prompt/document |
| `dtlv_llama_generate` | Generate text for a raw prompt |
| `dtlv_llama_summarize` | Build a summarization prompt and generate a summary |
| `dtlv_llama_generator_destroy` | Free the generator |

`dtlv_llama_generate` and `dtlv_llama_summarize` return the number of UTF-8
bytes written to the caller-owned output buffer. When `n_predict <= 0`, they
default to a 128-token generation budget. Prompt text that exceeds the context
size is automatically truncated to the leading tokens that fit.

```java
import java.nio.charset.StandardCharsets;

DTLV.dtlv_llama_generator generator = new DTLV.dtlv_llama_generator();
int rc = DTLV.dtlv_llama_generator_create(
        generator,
        "Qwen3.5-0.8B-Instruct-Q4_K_M.gguf",
        2048, 0, 4);

// Summarize into a caller-owned buffer; the return value is the number of
// UTF-8 bytes written.
byte[] output = new byte[8192];
int len = DTLV.dtlv_llama_summarize(
        generator,
        "Datalevin embeds data locally and can pair vector search with LMDB-backed storage.",
        128,
        output,
        output.length);
String summary = new String(output, 0, len, StandardCharsets.UTF_8);

// Free the generator when done.
DTLV.dtlv_llama_generator_destroy(generator);
```
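
The defaulting and truncation rules stated before the example can be
restated as a small plain-Java sketch. It only mirrors the documented
behavior (a 128-token budget when `n_predict <= 0`, and keeping the leading
prompt tokens that fit) and is not the native implementation. The class
`GenerationLimits` is hypothetical, and whether the native layer also
reserves context space for the generated tokens is not stated here, so the
sketch simply keeps up to `nCtx` tokens.

```java
import java.util.Arrays;

final class GenerationLimits {
    // Documented default budget when n_predict <= 0.
    static final int DEFAULT_N_PREDICT = 128;

    /** Returns the generation budget that applies for a given n_predict. */
    static int effectiveBudget(int nPredict) {
        return nPredict <= 0 ? DEFAULT_N_PREDICT : nPredict;
    }

    /** Keeps only the leading tokens that fit into nCtx (assumed limit for this sketch). */
    static int[] keepLeading(int[] promptTokens, int nCtx) {
        if (nCtx < 0) {
            throw new IllegalArgumentException("nCtx must be non-negative");
        }
        return Arrays.copyOf(promptTokens, Math.min(promptTokens.length, nCtx));
    }

    public static void main(String[] args) {
        System.out.println(effectiveBudget(0));
        System.out.println(Arrays.toString(keepLeading(new int[] {1, 2, 3, 4, 5}, 3)));
    }
}
```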

If you want to supply your own instruction prompt instead of the built-in
summary helper, call `dtlv_llama_generate` directly.

### Vision / OCR API

The vision API is aimed at multimodal GGUF models with a matching projector
GGUF, such as `PaddleOCR-VL-1.5-GGUF`.

| Function | Description |
|---|---|
| `dtlv_llama_vision_generator_create` | Load a multimodal text GGUF and matching `mmproj` GGUF |
| `dtlv_llama_vision_generator_n_ctx` | Return the context size |
| `dtlv_llama_vision_generate` | Generate text for a single image plus prompt |
| `dtlv_llama_ocr` | Run OCR with the built-in `OCR:` prompt |
| `dtlv_llama_vision_generator_destroy` | Free the vision generator |

`dtlv_llama_vision_generator_create` takes `model_path`, `mmproj_path`,
`n_ctx`, `n_batch`, `n_threads`, `image_min_tokens`, and `image_max_tokens`.
Pass `0` for the numeric tuning parameters to keep the model defaults. The
runtime is CPU-only in this package.

`dtlv_llama_vision_generate` and `dtlv_llama_ocr` return the number of UTF-8
bytes written to the caller-owned output buffer. Each call accepts exactly one
image. If the prompt passed to `dtlv_llama_vision_generate` does not contain
the multimodal marker, the native layer prepends it automatically.

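```java
import java.nio.charset.StandardCharsets;

// Load the text model and its matching projector; 0 keeps the model defaults
// for the numeric tuning parameters, with 4 threads.
DTLV.dtlv_llama_vision_generator generator = new DTLV.dtlv_llama_vision_generator();
int rc = DTLV.dtlv_llama_vision_generator_create(
        generator,
        "PaddleOCR-VL-1.5.gguf",
        "PaddleOCR-VL-1.5-mmproj.gguf",
        0, 0, 4, 0, 0);

// Run OCR on one image; the return value is the number of UTF-8 bytes written.
byte[] output = new byte[8192];
int len = DTLV.dtlv_llama_ocr(
        generator,
        "page.png",
        16,
        output,
        output.length);
String text = new String(output, 0, len, StandardCharsets.UTF_8);

// Free the vision generator when done.
DTLV.dtlv_llama_vision_generator_destroy(generator);
```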
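
The marker handling described before the example amounts to a simple
"prepend if missing" rule, sketched below in plain Java. This README does not
name the marker; the value `<__media__>` used here is an assumption based on
the default media marker in llama.cpp's multimodal support, and the class
`MediaMarker` is a hypothetical helper. Whether the native layer inserts any
separator between the marker and the prompt is also not specified, so the
sketch adds none.

```java
final class MediaMarker {
    // Assumption: llama.cpp's default media marker; not confirmed by this README.
    static final String MARKER = "<__media__>";

    /** Returns the prompt unchanged if it already has the marker, else prepends it. */
    static String ensureMarker(String prompt) {
        return prompt.contains(MARKER) ? prompt : MARKER + prompt;
    }

    public static void main(String[] args) {
        System.out.println(ensureMarker("OCR:"));
        System.out.println(ensureMarker("Describe " + MARKER + " briefly."));
    }
}
```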

### Local llama smoke test

To refresh the JavaCPP platform libraries and run the local llama smoke tests
with a real decoder model:

```bash
script/test-llama-summarization --text-model=target/text-models/qwen2.5-0.5b-instruct-q5_k_m.gguf
```

The script runs `Test.java` with `--llama-only`, so it covers both the llama
embedding smoke test and the summarization flow. If you prefer, set
`DTLV_TEXT_MODEL_PATH=/abs/path/model.gguf` instead of passing `--text-model`.
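
For readers wiring a similar flag-or-environment option into their own test
harness, a minimal sketch follows. It is not taken from `Test.java` or the
script; the class `TextModelPath` is hypothetical, and giving the
`--text-model` flag precedence over `DTLV_TEXT_MODEL_PATH` is an assumption
made for this example.

```java
import java.util.Map;
import java.util.Optional;

final class TextModelPath {
    static final String FLAG = "--text-model=";
    static final String ENV = "DTLV_TEXT_MODEL_PATH";

    /** Resolves the model path from the flag first, then from the environment map. */
    static Optional<String> resolve(String[] args, Map<String, String> env) {
        for (String arg : args) {
            if (arg.startsWith(FLAG) && arg.length() > FLAG.length()) {
                return Optional.of(arg.substring(FLAG.length()));
            }
        }
        String fromEnv = env.get(ENV);
        if (fromEnv == null || fromEnv.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(fromEnv);
    }

    public static void main(String[] args) {
        System.out.println(resolve(args, System.getenv())
                .orElse("<no text model configured>"));
    }
}
```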

### OCR smoke test

To refresh the JavaCPP platform libraries and run only the PaddleOCR-VL smoke
test:

```bash
script/test-llama-ocr \
  --vision-model=/path/to/PaddleOCR-VL-1.5.gguf \
  --vision-mmproj=/path/to/PaddleOCR-VL-1.5-mmproj.gguf \
  --ocr-image=/path/to/image.png \
  --ocr-n-predict=16
```

The OCR script runs `Test.java` with `--ocr-only`, so it skips LMDB, usearch,
embedding, and summarization. It also prints the extracted OCR text. You can
set `DTLV_VISION_MODEL_PATH`, `DTLV_VISION_MMPROJ_PATH`, `DTLV_OCR_IMAGE_PATH`,
and `DTLV_OCR_N_PREDICT` instead of passing the flags explicitly.

For CPU-only smoke tests, keep `--ocr-n-predict` small. `16` is a practical
default for checking that OCR works end to end. Large document images are much
slower than small or resized inputs, so for quick validation it helps to reduce
the longest edge to around 512 pixels first.
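
One way to do that downscaling from Java is sketched below with the standard
`java.awt` and `javax.imageio` classes. The class `ResizeForOcr` is a
hypothetical helper and not part of `dtlvnative`; the bilinear interpolation
and PNG output are choices made for this example, not requirements of the OCR
API.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class ResizeForOcr {
    /** Scales src so that its longest edge is at most maxEdge, keeping the aspect ratio. */
    static BufferedImage fitLongestEdge(BufferedImage src, int maxEdge) {
        int w = src.getWidth();
        int h = src.getHeight();
        int longest = Math.max(w, h);
        if (longest <= maxEdge) {
            return src; // already small enough
        }
        double scale = (double) maxEdge / longest;
        int newW = Math.max(1, (int) Math.round(w * scale));
        int newH = Math.max(1, (int) Math.round(h * scale));

        BufferedImage dst = new BufferedImage(newW, newH, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, newW, newH, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    // Usage: java ResizeForOcr.java input.png output.png
    public static void main(String[] args) throws IOException {
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            throw new IOException("unsupported image format: " + args[0]);
        }
        ImageIO.write(fitLongestEdge(src, 512), "png", new File(args[1]));
    }
}
```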

## Additional dependencies

Right now, the included shared libraries depend on some system libraries:

* `libc`
* `libmvec`
* `libomp` or `libgomp`

We bundle `libomp` in the JAR. However, on systems where the bundled library
does not work, or where `libc` is not available, you will have to install them
yourself. For example, on Ubuntu/Debian, run `apt install libgomp1` or
`apt install gcc-12 g++-12`; on macOS, run `brew install libomp libllvm`.

## License

Copyright © 2021-2026 Huahai Yang

This program and the accompanying materials are made available under the
terms of the Eclipse Public License 2.0 which is available at
http://www.eclipse.org/legal/epl-2.0.

This Source Code may also be made available under the following Secondary
Licenses when the conditions for such availability set forth in the Eclipse
Public License, v. 2.0 are satisfied: GNU General Public License as published by
the Free Software Foundation, either version 2 of the License, or (at your
option) any later version, with the GNU Classpath Exception which is available
at https://www.gnu.org/software/classpath/license.html.