https://github.com/b7s/embedding.cpp

Host: GitHub
URL: https://github.com/b7s/embedding.cpp
Owner: b7s
License: mit
Created: 2026-05-27T11:54:07.000Z (22 days ago)
Default Branch: main
Last Pushed: 2026-05-27T17:26:44.000Z (22 days ago)
Last Synced: 2026-05-27T19:12:26.770Z (22 days ago)
Language: C++
Size: 137 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Embedding.cpp

A text embedding tool that runs `BERT` models on top of [ggml](https://github.com/ggerganov/ggml), with critical bug fixes and improvements over upstream.

## Improvements Over Upstream

This fork includes three critical bug fixes that make the library actually functional:

### 1. Fix SIGILL on tokenizer load (`tokenizer.cpp`)

`bert_tokenizer::load()` was declared to return `bool` but had no `return` statement. Flowing off the end of a non-void function is undefined behavior in C++; here the compiler emitted a `ud2` (undefined instruction) after the function body, causing an immediate **SIGILL (exit code 132)** on every call. This made the library completely unusable.

**Fix:** Added `return true;` at the end of `bert_tokenizer::load()`.
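
As a minimal sketch of this class of bug, the snippet below uses a hypothetical `toy_tokenizer` (not the real `bert_tokenizer`, whose body is not reproduced here) in which every control path of a `bool` function returns a value. The struct, its member, and `demo_load()` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a tokenizer; NOT the real bert_tokenizer class.
struct toy_tokenizer {
    std::string path;

    // Declared to return bool, so every control path must return a value.
    // Flowing off the end of a non-void function is undefined behavior.
    bool load(const std::string & p) {
        if (p.empty()) {
            return false;  // reject an obviously invalid path
        }
        path = p;
        return true;       // the kind of statement the original function lacked
    }
};

// Example use: both outcomes are well defined because every path returns.
inline void demo_load() {
    toy_tokenizer tok;
    assert(tok.load("tokenizer.json"));
    assert(!tok.load(""));
}
```

Compiling with `-Werror=return-type` (accepted by GCC and Clang) turns a missing `return` into a build error instead of a runtime crash.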

### 2. Fix SIGSEGV from use-after-free (`bert.cpp`)

In `bert_eval_batch()`, `ggml_free(ctx0)` was called *before* reading `gf->nodes[]` and `ggml_used_mem(ctx0)`, so those reads touched memory that had already been freed. In release builds (`-O2`) this caused a **SIGSEGV** crash.

**Fix:** Moved `ggml_free(ctx0)` to after all reads from `gf` and `ctx0`.
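
The ordering rule behind this fix (copy out everything you need, then release the owner) can be sketched without ggml. The types and the function `eval_once()` below are hypothetical stand-ins and do not come from ggml or from this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical scratch context that owns its buffers, standing in for a
// ggml context. None of these names come from ggml.
struct scratch_ctx {
    std::vector<float> result;   // memory owned by the context
    std::size_t used_bytes = 0;  // bookkeeping that lives inside the context
};

struct eval_stats {
    std::vector<float> output;
    std::size_t used_bytes;
};

eval_stats eval_once(std::size_t n) {
    auto ctx = std::make_unique<scratch_ctx>();
    ctx->result.assign(n, 0.5f);
    ctx->used_bytes = n * sizeof(float);

    // 1. Copy everything we need out of the context while it is still alive.
    eval_stats stats{ctx->result, ctx->used_bytes};

    // 2. Only now release the context; nothing below touches it.
    ctx.reset();

    // The copy is independent of the released context.
    assert(stats.output.size() == n);
    return stats;
}
```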

### 3. Fix garbage embeddings from wrong graph node (`bert.cpp`)

`bert_eval_batch()` read `gf->nodes[n_nodes - 2]` which is an intermediate `ggml_div` node producing the scalar `1.0f / length` — **not** the embedding vector. The actual normalized embedding is `gf->nodes[n_nodes - 1]` (the final `ggml_scale` output). This caused garbage embeddings with magnitude ~6.7e22 and mostly zero values.

**Fix:** Changed to `embeddings_tensor = gf->nodes[gf->n_nodes - 1]`.
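
Because the outputs are L2-normalized, a quick sanity check is that an embedding has finite values and a length close to 1. The helper below is a sketch for that check; the function names and the tolerance `1e-3` are assumptions, not part of this library.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Euclidean (L2) norm of a vector, accumulated in double for accuracy.
double l2_norm(const std::vector<float> & v) {
    double sum = 0.0;
    for (float x : v) {
        sum += static_cast<double>(x) * x;
    }
    return std::sqrt(sum);
}

// True when the vector looks like a normalized embedding: non-empty,
// all values finite, and length within `tol` of 1. The default tolerance
// is an arbitrary choice for illustration.
bool looks_normalized(const std::vector<float> & v, double tol = 1e-3) {
    assert(tol >= 0.0);
    if (v.empty()) {
        return false;
    }
    for (float x : v) {
        if (!std::isfinite(x)) {
            return false;
        }
    }
    return std::fabs(l2_norm(v) - 1.0) <= tol;
}
```

A vector with magnitude around 6.7e22, as produced before the fix, fails this check immediately.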

---

## Features (Original)

* Plain C/C++ implementation without dependencies
* Inherits support for various architectures from ggml (x86 with AVX2, ARM, etc.)
* Choose a model size of 32, 16, or 4 bits per weight
* all-MiniLM-L6-v2 with 4-bit quantization is only 14 MB. Inference RAM usage depends on the length of the input
* Sample C++ server over a TCP socket and a Python test client
* Benchmarks to validate correctness and speed of inference

## Features (Improved)

* Builds the tokenizer with [tokenizers-cpp](https://github.com/mlc-ai/tokenizers-cpp).
* Correctly handles Asian writing systems (CJK and so on).
* Handles cased and uncased models according to the original configuration in `tokenizer.json`.
* Uses the [GGUF](https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md) model file format, which is easy to extend while staying compatible.
* **Critical bug fixes** listed above. Without these, the upstream code does not produce usable embeddings.

> With the changes above, embedding.cpp can run more models, such as m3e and e5.

## Limitations

* Only BERT-based models are supported for embedding. Other architectures, such as SGPT, are not supported.
* Runs on CPU only.
* All outputs are mean pooled and normalized.
* Batching support is WIP.
* Without real batching, this library is slower than it could be in use cases with multiple sentences.
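
The "mean pooled and normalized" output mentioned in the list above can be illustrated with a small standalone sketch. This is not the library's implementation (which builds a ggml graph); `mean_pool_normalize()` and its input layout are assumptions chosen for clarity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean-pool per-token vectors into one sentence vector, then L2-normalize.
// `tokens` holds one vector per token; all must share the same dimension.
std::vector<float> mean_pool_normalize(const std::vector<std::vector<float>> & tokens) {
    assert(!tokens.empty());
    const std::size_t dim = tokens[0].size();
    std::vector<float> out(dim, 0.0f);

    // Sum the token vectors component-wise.
    for (const auto & t : tokens) {
        assert(t.size() == dim);
        for (std::size_t i = 0; i < dim; ++i) {
            out[i] += t[i];
        }
    }

    // Divide by the token count (mean) and accumulate the squared length.
    const float inv_n = 1.0f / static_cast<float>(tokens.size());
    double sq = 0.0;
    for (float & x : out) {
        x *= inv_n;
        sq += static_cast<double>(x) * x;
    }

    // Scale by 1 / length so the result has unit L2 norm.
    const double norm = std::sqrt(sq);
    if (norm > 0.0) {
        for (float & x : out) {
            x = static_cast<float>(x / norm);
        }
    }
    return out;
}
```

Because the outputs have unit length, the cosine similarity of two embeddings reduces to their dot product.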

## Usage

### Checkout submodules

```sh
git submodule update --init --recursive
```

### Build

By default, the build produces both:
- the native binaries, such as the example server, linked with static libraries;
- the dynamic library, for use from e.g. Python.

```sh
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make
cd ..
```

> Rust must be installed. See [rust](https://www.rust-lang.org/tools/install) or run `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`

### Converting models to gguf format

Converting models works much as it does in llama.cpp. Use `models/convert-to-gguf.py` to convert Hugging Face models into either f32 or f16 gguf models.
Then use `./build/bin/quantize` to turn those into Q4_0 models (4 bits per weight).

There is also `models/run_conversions.sh`, which creates all four versions (f32, f16, Q4_0, Q4_1) at once.

```sh
pip install -r requirements.txt
cd models
# Clone a model from hf
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
# Run conversions to 4 ggml formats (f32, f16, Q4_0, Q4_1)
sh run_conversions.sh all-MiniLM-L6-v2
```

## Acknowledgments

This project is a fork of [embedding.cpp](https://github.com/FFengIll/embedding.cpp) by FFengIll, which itself is a fork of [bert.cpp](https://github.com/skeskinen/bert.cpp) by skeskinen. Thank you to the original authors and contributors for the foundational work that made this possible.
