
https://github.com/phimage/embedder



# Generic Text Embedder

A generic C++ application for generating text embeddings using ONNX models.

This tool is intended for testing only. Because it loads the model on every invocation, a long-running service that keeps the model in memory is a better fit for regular use.

> [!CAUTION]
> The current results are incorrect — I’m just using this repository for learning purposes.

## Getting Models

You can download ONNX embedding models from Hugging Face using the `huggingface-cli` tool.

### Install Hugging Face CLI

```bash
pip install -U "huggingface_hub[cli]"
```

See the [official documentation](https://huggingface.co/docs/huggingface_hub/main/guides/cli) for more details.

### Download a Model

For example, to download the `nomic-embed-text-v1` model:

```bash
huggingface-cli download Xenova/nomic-embed-text-v1
```

This downloads the model to your local Hugging Face cache. You can then point the `EMBEDDING_MODEL_PATH` environment variable at the cached snapshot:

```bash
export EMBEDDING_MODEL_PATH=$HOME/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1/snapshots/0b85f78966a655763985a595b770f221374dda10
```

Note: The exact snapshot hash (the long string at the end) may vary depending on the model version.
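Rather than copying the hash by hand, it can usually be read from the cache itself. The sketch below assumes the standard Hugging Face cache layout, in which `refs/main` inside the model's cache folder stores the commit hash of the downloaded snapshot. `resolve_snapshot` is a hypothetical helper name, not part of `huggingface-cli` or of this project.

```bash
# Hypothetical helper: print the snapshot directory of a cached model.
# Assumes the standard cache layout:
#   <repo_dir>/refs/<ref>          contains the snapshot hash
#   <repo_dir>/snapshots/<hash>/   contains the model files
resolve_snapshot() {
  local repo_dir="${1%/}" ref="${2:-main}" hash
  [ -f "$repo_dir/refs/$ref" ] || { echo "no ref '$ref' in $repo_dir" >&2; return 1; }
  # Strip any trailing newline or whitespace around the hash.
  hash="$(tr -d '[:space:]' < "$repo_dir/refs/$ref")"
  [ -d "$repo_dir/snapshots/$hash" ] || { echo "snapshot $hash not found" >&2; return 1; }
  printf '%s\n' "$repo_dir/snapshots/$hash"
}

# Usage, with the default cache location shown above:
# export EMBEDDING_MODEL_PATH="$(resolve_snapshot "$HOME/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1")"
```

If the cache lives elsewhere (for example because `HF_HOME` is set), pass that location instead.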

## Building

Prerequisites:
- CMake 3.12+
- ONNX Runtime libraries
- C++17 compatible compiler

```bash
cmake .
make
```

## Usage

The embedder supports two modes: single-text processing, and batch processing, which performs better when embedding multiple texts.

### Single Text Processing

In single-text mode, the model path can be supplied in two ways:

#### Method 1: Specify model path as argument (traditional)

```bash
./embedder <model_path> <input_text> [--verbose]
```

#### Method 2: Use environment variable (new)

```bash
export EMBEDDING_MODEL_PATH=/path/to/model
./embedder <input_text> [--verbose]
```

### Batch Processing (NEW)

For better performance when processing multiple texts, use batch mode. **Important**: Batch mode now uses null bytes (`\0`) as the default delimiter to safely handle texts containing newlines.

```bash
# Batch processing with null delimiter (RECOMMENDED - safe for any text content)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch [--verbose]

# Batch processing with custom delimiter
echo "Text 1|||Text 2|||Text 3" | ./embedder --batch --delimiter="|||" [--verbose]

# Batch processing with explicit model path
printf "Text 1\0Text 2\0" | ./embedder <model_path> --batch [--verbose]

# From file with null-delimited content
cat null_delimited_texts.txt | ./embedder --batch [--verbose]

# UNSAFE: Line-based (only use if texts don't contain newlines)
echo -e "Text 1\nText 2\nText 3" | ./embedder --batch --delimiter="\n" [--verbose]
```

**Why null delimiter?** Text content often contains newlines, tabs, and other whitespace. Null bytes (`\0`) are the safest delimiter as they rarely appear in regular text content.
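One caveat with the `printf` examples above: when the texts are written directly into the format string, any `%` or backslash in a text is interpreted by `printf` itself. A minimal sketch of a safer pattern passes each text as an argument to a fixed `'%s\0'` format. `null_join` is a hypothetical helper name used only for illustration.

```bash
# Hypothetical helper: emit each argument followed by a null byte.
# The fixed '%s\0' format keeps '%' and backslashes in the texts literal.
null_join() {
  # With no arguments, emit nothing rather than a lone null byte.
  [ "$#" -gt 0 ] || return 0
  printf '%s\0' "$@"
}

# Usage: texts may contain newlines, percent signs, or backslashes.
# null_join "100% done" $'line one\nline two' 'C:\temp' | ./embedder --batch
```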

### Arguments

- `model_path`: Path to directory containing the model and vocabulary files (optional if `EMBEDDING_MODEL_PATH` is set)
- `input_text`: Text to generate embedding for (wrap in quotes if it contains spaces) - single mode only
- `--batch`: Enable batch processing mode (reads texts from stdin using delimiter)
- `--delimiter=DELIM`: Set custom delimiter for batch mode (default: `\0` null byte)
- `--verbose`: Optional flag to enable verbose output (shows model info and embedding dimension)

### Examples

```bash
# Traditional usage with explicit model path
./embedder ./model_directory "Hello world"

# Using environment variable
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world"

# With verbose output
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world" --verbose

# Batch processing examples (SAFE - handles texts with newlines)
export EMBEDDING_MODEL_PATH=./model_directory

# Process texts using null delimiter (recommended)
printf "Hello world\0Text with\nnewlines\0Third text\0" | ./embedder --batch

# Process texts using custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"

# From file with null-delimited content
printf "First text\0Second text\nwith newlines\0" > texts.dat
cat texts.dat | ./embedder --batch --verbose

# UNSAFE: Line-based (only if no newlines in text content)
echo -e "Simple1\nSimple2\nSimple3" | ./embedder --batch --delimiter="\n"

# Batch with explicit model path
printf "Text1\0Text2\0" | ./embedder ./model_directory --batch

# Mixing approaches (environment variable as fallback)
export EMBEDDING_MODEL_PATH=./default_model
./embedder ./specific_model "Hello world" # Uses ./specific_model
./embedder "Hello world" # Uses ./default_model
```

## Model Directory Structure

The embedder supports two directory structures:

### Option 1: Direct model placement

```
model_directory/
├── model.onnx
└── vocab.txt
```

### Option 2: ONNX subdirectory

```
model_directory/
├── onnx/
│ └── model.onnx
└── vocab.txt
```
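To check a directory against these two layouts before running the embedder, a small pre-flight function can help. This is only a sketch that mirrors the layouts documented above; it is not the embedder's own lookup code, `find_model_file` is a hypothetical name, and checking the top level before `onnx/` is an assumption.

```bash
# Hypothetical pre-flight check mirroring the two documented layouts.
# Prints the path of model.onnx, or fails if the directory is incomplete.
find_model_file() {
  local dir="${1%/}"
  [ -f "$dir/vocab.txt" ] || { echo "missing $dir/vocab.txt" >&2; return 1; }
  if [ -f "$dir/model.onnx" ]; then            # Option 1: direct placement
    printf '%s\n' "$dir/model.onnx"
  elif [ -f "$dir/onnx/model.onnx" ]; then     # Option 2: onnx/ subdirectory
    printf '%s\n' "$dir/onnx/model.onnx"
  else
    echo "no model.onnx under $dir" >&2
    return 1
  fi
}

# Usage:
# find_model_file "$EMBEDDING_MODEL_PATH" && ./embedder "Hello world"
```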

## Output

### Single Text Mode
Without `--verbose`: Outputs the full embedding as space-separated floating-point numbers.

With `--verbose`: Additionally shows:
- Model loading confirmation
- Input/output node information
- Vocabulary size
- Embedding dimension

### Batch Processing Mode
Without `--verbose`: Outputs one embedding per line, each as space-separated floating-point numbers.

With `--verbose`: Additionally shows:
- Batch processing information
- Number of texts processed
- Output tensor shape
- Model and vocabulary info
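Because the non-verbose output is plain text, it can be sanity-checked with standard tools. The sketch below assumes the format described above (one embedding per line, values separated by spaces) and reports how many embeddings were read and their dimension. `embedding_shape` is a hypothetical helper name.

```bash
# Hypothetical helper: read embeddings from stdin (one per line,
# space-separated floats) and print "<count> <dimension>".
# Fails if two lines have a different number of values.
embedding_shape() {
  awk '
    NF == 0   { next }                 # ignore blank lines
    dim == 0  { dim = NF }             # first embedding fixes the dimension
    NF != dim {
      printf "line %d has %d values, expected %d\n", NR, NF, dim > "/dev/stderr"
      bad = 1
      exit
    }
    { count++ }
    END { if (bad) exit 1; printf "%d %d\n", count, dim }
  '
}

# Usage:
# printf 'Text 1\0Text 2\0' | ./embedder --batch | embedding_shape
```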

## Performance Benefits

Batch processing provides significant performance improvements when processing multiple texts:

- **Model Loading**: The model is loaded only once for the entire batch
- **Memory Efficiency**: Better GPU/CPU memory utilization
- **Parallel Processing**: Takes advantage of vectorized operations
- **Reduced Overhead**: Eliminates per-text setup costs

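As an illustration (these are not measured figures), processing 100 texts individually might take around 10 seconds, while batch processing the same 100 texts could take only 2-3 seconds, since the model is loaded once rather than 100 times.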
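To measure the difference on your own machine, the two approaches can be wrapped in functions and timed. This is a rough sketch rather than a benchmark harness. `EMBEDDER` is a hypothetical override variable introduced here for illustration, and the function names are likewise made up.

```bash
# EMBEDDER is a hypothetical override hook; it defaults to the built binary.
EMBEDDER="${EMBEDDER:-./embedder}"

# One process (and therefore one model load) per text.
embed_one_by_one() {
  local text
  for text in "$@"; do
    "$EMBEDDER" "$text" || return 1
  done
}

# One process for all texts, fed through the null-delimited batch mode.
embed_batch() {
  printf '%s\0' "$@" | "$EMBEDDER" --batch
}

# Usage, with EMBEDDING_MODEL_PATH set:
# time embed_one_by_one "first text" "second text" "third text" > /dev/null
# time embed_batch      "first text" "second text" "third text" > /dev/null
```

Timings depend heavily on the model size and hardware, so run each variant a few times before comparing.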

## Important: Handling Texts with Newlines

**⚠️ Critical Issue**: The original implementation used newlines (`\n`) as delimiters, which breaks on texts that themselves contain newlines, a common case in real-world text data.

**✅ Solution**: This implementation now uses null bytes (`\0`) as the default delimiter, which safely handles texts containing newlines, tabs, and other whitespace characters.

**Examples of problematic texts** (that would break with line-based parsing):
- Multi-paragraph text
- Code snippets
- Formatted text with line breaks
- Text with embedded newlines

**Safe usage**:
```bash
# ✅ SAFE: Null-delimited (recommended)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch

# ✅ SAFE: Custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"

# ⚠️ UNSAFE: Line-based (only for simple texts without newlines)
echo -e "Text1\nText2\nText3" | ./embedder --batch --delimiter="\n"
```