https://github.com/phimage/embedder
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/phimage/embedder
- Owner: phimage
- Created: 2025-06-28T15:03:24.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-10-27T05:04:44.000Z (4 months ago)
- Last Synced: 2025-10-27T07:08:19.363Z (4 months ago)
- Language: C++
- Size: 676 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Generic Text Embedder
A generic C++ application for generating text embeddings using ONNX models.
This tool is intended for testing only. For regular use, a long-running service is a better fit, since it would not reload the model on every invocation.
> [!CAUTION]
> The current results are incorrect — I’m just using this repository for learning purposes.
## Getting Models
You can download ONNX embedding models from Hugging Face using the `huggingface-cli` tool.
### Install Hugging Face CLI
```bash
pip install -U "huggingface_hub[cli]"
```
See the [official documentation](https://huggingface.co/docs/huggingface_hub/main/guides/cli) for more details.
### Download a Model
For example, to download the `nomic-embed-text-v1` model:
```bash
huggingface-cli download Xenova/nomic-embed-text-v1
```
This will download the model to your local cache. You can then set the `EMBEDDING_MODEL_PATH` environment variable to point to the cached model:
```bash
export EMBEDDING_MODEL_PATH=$HOME/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1/snapshots/0b85f78966a655763985a595b770f221374dda10
```
Note: The exact snapshot hash (the long string at the end) may vary depending on the model version.
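Because the snapshot hash changes between model revisions, it can be easier to resolve the directory at run time than to hard-code it. The following is a minimal sketch, assuming the default Hugging Face cache location (`$HOME/.cache/huggingface/hub`) and the `models--<owner>--<name>/snapshots/<hash>` layout shown above. `resolve_snapshot` is a hypothetical helper name, and picking the most recently modified snapshot is an assumption that may not suit every setup.

```bash
# Hypothetical helper: print the most recently modified snapshot directory
# for a cached Hugging Face repo such as "Xenova/nomic-embed-text-v1".
# Assumes the layout <cache>/models--<owner>--<name>/snapshots/<hash>.
resolve_snapshot() {
    local repo="$1"
    local cache="${2:-$HOME/.cache/huggingface/hub}"
    local dir
    # "owner/name" becomes "models--owner--name" inside the cache
    dir="$cache/models--$(printf '%s' "$repo" | sed 's|/|--|g')/snapshots"
    # ls -td lists directories newest first; keep the first one
    # and strip its trailing slash
    ls -td "$dir"/*/ 2>/dev/null | head -n 1 | sed 's|/$||'
}

export EMBEDDING_MODEL_PATH="$(resolve_snapshot Xenova/nomic-embed-text-v1)"
```

If the model has not been downloaded yet, the helper prints nothing and `EMBEDDING_MODEL_PATH` ends up empty, so it is worth checking the variable before running the embedder.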
## Building
Prerequisites:
- CMake 3.12+
- ONNX Runtime libraries
- C++17 compatible compiler
```bash
cmake .
make
```
## Usage
The embedder supports single-text processing and, for better performance on multiple texts, batch processing:
### Single Text Processing
The embedder can be used in two ways for single texts:
### Method 1: Specify model path as argument (traditional)
```bash
./embedder <model_path> <input_text> [--verbose]
```
### Method 2: Use environment variable (new)
```bash
export EMBEDDING_MODEL_PATH=/path/to/model
./embedder <input_text> [--verbose]
```
### Batch Processing (NEW)
For better performance when processing multiple texts, use batch mode. **Important**: Batch mode now uses null bytes (`\0`) as the default delimiter to safely handle texts containing newlines.
```bash
# Batch processing with null delimiter (RECOMMENDED - safe for any text content)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch [--verbose]
# Batch processing with custom delimiter
echo "Text 1|||Text 2|||Text 3" | ./embedder --batch --delimiter="|||" [--verbose]
# Batch processing with explicit model path
printf "Text 1\0Text 2\0" | ./embedder <model_path> --batch [--verbose]
# From file with null-delimited content
cat null_delimited_texts.txt | ./embedder --batch [--verbose]
# UNSAFE: Line-based (only use if texts don't contain newlines)
echo -e "Text 1\nText 2\nText 3" | ./embedder --batch --delimiter="\n" [--verbose]
```
**Why null delimiter?** Text content often contains newlines, tabs, and other whitespace. Null bytes (`\0`) are the safest delimiter as they rarely appear in regular text content.
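When the texts live in shell variables rather than in a literal `printf` format string, one way to build the null-delimited stream is to let `printf` repeat its format for each argument. This is only a sketch; `nul_join` is a hypothetical helper name and not part of the embedder.

```bash
# Hypothetical helper: emit each argument followed by a null byte.
# printf reuses the format string for every argument, so any number of
# texts can be passed, including texts with embedded newlines.
nul_join() {
    printf '%s\0' "$@"
}

# Usage (assumes ./embedder is built and EMBEDDING_MODEL_PATH is set):
#   nul_join "Text 1" "$multi_line_text" "Text 3" | ./embedder --batch
```

Passing the texts as arguments rather than embedding them in the format string also avoids surprises when a text contains `%` or backslash sequences.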
### Arguments
- `model_path`: Path to directory containing the model and vocabulary files (optional if `EMBEDDING_MODEL_PATH` is set)
- `input_text`: Text to generate an embedding for (wrap it in quotes if it contains spaces); single-text mode only
- `--batch`: Enable batch processing mode (reads texts from stdin using delimiter)
- `--delimiter=DELIM`: Set custom delimiter for batch mode (default: `\0` null byte)
- `--verbose`: Optional flag to enable verbose output (shows model info and embedding dimension)
### Examples
```bash
# Traditional usage with explicit model path
./embedder ./model_directory "Hello world"
# Using environment variable
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world"
# With verbose output
export EMBEDDING_MODEL_PATH=./model_directory
./embedder "Hello world" --verbose
# Batch processing examples (SAFE - handles texts with newlines)
export EMBEDDING_MODEL_PATH=./model_directory
# Process texts using null delimiter (recommended)
printf "Hello world\0Text with\nnewlines\0Third text\0" | ./embedder --batch
# Process texts using custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"
# From file with null-delimited content
printf "First text\0Second text\nwith newlines\0" > texts.dat
cat texts.dat | ./embedder --batch --verbose
# UNSAFE: Line-based (only if no newlines in text content)
echo -e "Simple1\nSimple2\nSimple3" | ./embedder --batch --delimiter="\n"
# Batch with explicit model path
printf "Text1\0Text2\0" | ./embedder ./model_directory --batch
# Mixing approaches (environment variable as fallback)
export EMBEDDING_MODEL_PATH=./default_model
./embedder ./specific_model "Hello world" # Uses ./specific_model
./embedder "Hello world" # Uses ./default_model
```
## Model Directory Structure
The embedder supports two directory structures:
### Option 1: Direct model placement
```
model_directory/
├── model.onnx
└── vocab.txt
```
### Option 2: ONNX subdirectory
```
model_directory/
├── onnx/
│ └── model.onnx
└── vocab.txt
```
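The two layouts can be checked before invoking the embedder with a small pre-flight script. This is a sketch only: `find_model_file` is a hypothetical helper, and the order in which the two locations are tried is an assumption, since the README does not say which one takes precedence.

```bash
# Hypothetical pre-flight check mirroring the two documented layouts.
# Prints the path of model.onnx, or returns 1 if the directory is incomplete.
find_model_file() {
    local dir="$1"
    # vocab.txt sits at the top level in both layouts
    if [ ! -f "$dir/vocab.txt" ]; then
        echo "missing $dir/vocab.txt" >&2
        return 1
    fi
    if [ -f "$dir/model.onnx" ]; then
        echo "$dir/model.onnx"          # Option 1: direct placement
    elif [ -f "$dir/onnx/model.onnx" ]; then
        echo "$dir/onnx/model.onnx"     # Option 2: onnx/ subdirectory
    else
        echo "no model.onnx found under $dir" >&2
        return 1
    fi
}
```

For example, `find_model_file "$EMBEDDING_MODEL_PATH" > /dev/null || exit 1` at the top of a script would stop early when the directory is incomplete.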
## Output
### Single Text Mode
Without `--verbose`: Outputs the full embedding as space-separated floating-point numbers.
With `--verbose`: Additionally shows:
- Model loading confirmation
- Input/output node information
- Vocabulary size
- Embedding dimension
### Batch Processing Mode
Without `--verbose`: Outputs one embedding per line, each as space-separated floating-point numbers.
With `--verbose`: Additionally shows:
- Batch processing information
- Number of texts processed
- Output tensor shape
- Model and vocabulary info
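Since non-verbose batch output is one embedding per line with space-separated values, a quick sanity check is to count the fields on each line and confirm they all agree. The sketch below assumes output produced without `--verbose` (the extra informational lines would otherwise be counted too). `check_dims` is a hypothetical helper name.

```bash
# Hypothetical helper: read embeddings from stdin (one per line,
# space-separated floats) and print "<count> <dimension>".
# Exits with status 1 if any line has a different number of values
# than the first non-empty line.
check_dims() {
    awk 'NF { n++; if (!dim) dim = NF; else if (NF != dim) bad = 1 }
         END { print n + 0, dim + 0; exit bad }'
}

# Usage (assumes ./embedder is built and EMBEDDING_MODEL_PATH is set):
#   printf "Text 1\0Text 2\0" | ./embedder --batch | check_dims
```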
## Performance Benefits
Batch processing provides significant performance improvements when processing multiple texts:
- **Model Loading**: The model is loaded only once for the entire batch
- **Memory Efficiency**: Better GPU/CPU memory utilization
- **Parallel Processing**: Takes advantage of vectorized operations
- **Reduced Overhead**: Eliminates per-text setup costs
As a purely illustrative example (not a measured benchmark), processing 100 texts individually might take 10 seconds, while batch processing the same 100 texts could take only 2-3 seconds.
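Actual timings depend on the model and hardware, so it is worth measuring on your own machine. Below is a rough sketch of such a comparison. `elapsed_seconds`, `one_by_one`, and `batched` are hypothetical helper names, the sample texts are placeholders, and `date +%s` only gives whole-second resolution.

```bash
# Hypothetical helper: run a command, discard its stdout, and print the
# elapsed wall-clock time in whole seconds.
elapsed_seconds() {
    local start end
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo $((end - start))
}

# 100 separate processes: the model is loaded 100 times
one_by_one() {
    local i
    for i in $(seq 1 100); do ./embedder "Sample text $i"; done
}

# One process: the model is loaded once for all 100 texts
batched() {
    local i
    for i in $(seq 1 100); do printf 'Sample text %s\0' "$i"; done | ./embedder --batch
}

# With ./embedder built and EMBEDDING_MODEL_PATH set:
#   echo "individual: $(elapsed_seconds one_by_one)s"
#   echo "batch:      $(elapsed_seconds batched)s"
```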
## Important: Handling Texts with Newlines
**⚠️ Critical Issue**: The original implementation used newlines (`\n`) as delimiters, which breaks on texts that themselves contain newlines, a common occurrence in real-world text data.
**✅ Solution**: This implementation now uses null bytes (`\0`) as the default delimiter, which safely handles texts containing newlines, tabs, and other whitespace characters.
**Examples of problematic texts** (that would break with line-based parsing):
- Multi-paragraph text
- Code snippets
- Formatted text with line breaks
- Text with embedded newlines
**Safe usage**:
```bash
# ✅ SAFE: Null-delimited (recommended)
printf "Text 1\0Text with\nnewlines\0Text 3\0" | ./embedder --batch
# ✅ SAFE: Custom delimiter
echo "Text1|||Text2|||Text3" | ./embedder --batch --delimiter="|||"
# ⚠️ UNSAFE: Line-based (only for simple texts without newlines)
echo -e "Text1\nText2\nText3" | ./embedder --batch --delimiter="\n"
```
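Multi-line documents stored in files can be turned into a safe batch by writing a null byte after each file's contents. This is a minimal sketch; `files_to_batch` is a hypothetical helper name and the `docs/*.txt` path is only an example.

```bash
# Hypothetical helper: concatenate the given files into a null-delimited
# stream, one text per file. Newlines inside a file are preserved as part
# of that text, including any trailing newline.
files_to_batch() {
    local f
    for f in "$@"; do
        cat "$f"
        printf '\0'
    done
}

# Usage (assumes ./embedder is built and EMBEDDING_MODEL_PATH is set):
#   files_to_batch docs/*.txt | ./embedder --batch
```

The embeddings should then come back one per line, in the same order as the files were given.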