{"id":32649295,"url":"https://github.com/phimage/embedder","last_synced_at":"2025-10-31T06:53:47.220Z","repository":{"id":301785136,"uuid":"1010197850","full_name":"phimage/embedder","owner":"phimage","description":null,"archived":false,"fork":false,"pushed_at":"2025-10-27T05:04:44.000Z","size":692,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-27T07:08:19.363Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/phimage.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-28T15:03:24.000Z","updated_at":"2025-10-27T05:04:48.000Z","dependencies_parsed_at":null,"dependency_job_id":"d4fad9d8-917f-4c96-8b98-1f5896c3adea","html_url":"https://github.com/phimage/embedder","commit_stats":null,"previous_names":["phimage/embedder"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/phimage/embedder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phimage%2Fembedder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phimage%2Fembedder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phimage%2Fembedder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phimage%2Fembedder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ph
image","download_url":"https://codeload.github.com/phimage/embedder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phimage%2Fembedder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281946316,"owners_count":26587973,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-31T02:00:07.401Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-31T06:53:33.287Z","updated_at":"2025-10-31T06:53:47.215Z","avatar_url":"https://github.com/phimage.png","language":"C++","readme":"# Generic Text Embedder\n\nA generic C++ application for generating text embeddings using ONNX models.\n\nThis is just used for testing. 
For real use, a long-running service that loads the model once is a better fit than reloading it on every invocation.\n\n\u003e [!CAUTION]\n\u003e The current results are incorrect; I’m just using this repository for learning purposes.\n\n## Getting Models\n\nYou can download ONNX embedding models from Hugging Face using the `huggingface-cli` tool.\n\n### Install Hugging Face CLI\n\n```bash\npip install -U \"huggingface_hub[cli]\"\n```\n\nSee the [official documentation](https://huggingface.co/docs/huggingface_hub/main/guides/cli) for more details.\n\n### Download a Model\n\nFor example, to download the `nomic-embed-text-v1` model:\n\n```bash\nhuggingface-cli download Xenova/nomic-embed-text-v1\n```\n\nThis will download the model to your local cache. You can then set the `EMBEDDING_MODEL_PATH` environment variable to point to the cached snapshot directory:\n\n```bash\nexport EMBEDDING_MODEL_PATH=$HOME/.cache/huggingface/hub/models--Xenova--nomic-embed-text-v1/snapshots/0b85f78966a655763985a595b770f221374dda10\n```\n\nNote: The exact snapshot hash (the long string at the end) may vary depending on the model version.\n\n## Building\n\nPrerequisites:\n- CMake 3.12+\n- ONNX Runtime libraries\n- C++17 compatible compiler\n\n```bash\ncmake .\nmake\n```\n\n## Usage\n\nThe embedder supports single-text processing and, for better performance, batch processing:\n\n### Single Text Processing\n\nThe embedder can be used in two ways for single texts:\n\n#### Method 1: Specify model path as argument (traditional)\n\n```bash\n./embedder \u003cmodel_path\u003e \u003cinput_text\u003e [--verbose]\n```\n\n#### Method 2: Use environment variable (new)\n\n```bash\nexport EMBEDDING_MODEL_PATH=/path/to/model\n./embedder \u003cinput_text\u003e [--verbose]\n```\n\n### Batch Processing (NEW)\n\nFor better performance when processing multiple texts, use batch mode. 
**Important**: Batch mode now uses null bytes (`\\0`) as the default delimiter to safely handle texts containing newlines.\n\n```bash\n# Batch processing with null delimiter (RECOMMENDED - safe for any text content)\nprintf \"Text 1\\0Text with\\nnewlines\\0Text 3\\0\" | ./embedder --batch [--verbose]\n\n# Batch processing with custom delimiter\necho \"Text 1|||Text 2|||Text 3\" | ./embedder --batch --delimiter=\"|||\" [--verbose]\n\n# Batch processing with explicit model path\nprintf \"Text 1\\0Text 2\\0\" | ./embedder \u003cmodel_path\u003e --batch [--verbose]\n\n# From file with null-delimited content\ncat null_delimited_texts.txt | ./embedder --batch [--verbose]\n\n# UNSAFE: Line-based (only use if texts don't contain newlines)\necho -e \"Text 1\\nText 2\\nText 3\" | ./embedder --batch --delimiter=\"\\n\" [--verbose]\n```\n\n**Why null delimiter?** Text content often contains newlines, tabs, and other whitespace. Null bytes (`\\0`) are the safest delimiter as they rarely appear in regular text content.\n\n### Arguments\n\n- `model_path`: Path to directory containing the model and vocabulary files (optional if `EMBEDDING_MODEL_PATH` is set)\n- `input_text`: Text to generate embedding for (wrap in quotes if it contains spaces) - single mode only\n- `--batch`: Enable batch processing mode (reads texts from stdin using delimiter)\n- `--delimiter=DELIM`: Set custom delimiter for batch mode (default: `\\0` null byte)\n- `--verbose`: Optional flag to enable verbose output (shows model info and embedding dimension)\n\n### Examples\n\n```bash\n# Traditional usage with explicit model path\n./embedder ./model_directory \"Hello world\"\n\n# Using environment variable\nexport EMBEDDING_MODEL_PATH=./model_directory\n./embedder \"Hello world\"\n\n# With verbose output\nexport EMBEDDING_MODEL_PATH=./model_directory\n./embedder \"Hello world\" --verbose\n\n# Batch processing examples (SAFE - handles texts with newlines)\nexport EMBEDDING_MODEL_PATH=./model_directory\n\n# 
Process texts using null delimiter (recommended)\nprintf \"Hello world\\0Text with\\nnewlines\\0Third text\\0\" | ./embedder --batch\n\n# Process texts using custom delimiter\necho \"Text1|||Text2|||Text3\" | ./embedder --batch --delimiter=\"|||\"\n\n# From file with null-delimited content\nprintf \"First text\\0Second text\\nwith newlines\\0\" \u003e texts.dat\ncat texts.dat | ./embedder --batch --verbose\n\n# UNSAFE: Line-based (only if no newlines in text content)\necho -e \"Simple1\\nSimple2\\nSimple3\" | ./embedder --batch --delimiter=\"\\n\"\n\n# Batch with explicit model path\nprintf \"Text1\\0Text2\\0\" | ./embedder ./model_directory --batch\n\n# Mixing approaches (environment variable as fallback)\nexport EMBEDDING_MODEL_PATH=./default_model\n./embedder ./specific_model \"Hello world\"  # Uses ./specific_model\n./embedder \"Hello world\"                   # Uses ./default_model\n```\n\n## Model Directory Structure\n\nThe embedder supports two directory structures:\n\n### Option 1: Direct model placement\n\n```\nmodel_directory/\n├── model.onnx\n└── vocab.txt\n```\n\n### Option 2: ONNX subdirectory\n\n```\nmodel_directory/\n├── onnx/\n│   └── model.onnx\n└── vocab.txt\n```\n\n## Output\n\n### Single Text Mode\nWithout `--verbose`: Outputs the full embedding as space-separated floating-point numbers.\n\nWith `--verbose`: Additionally shows:\n- Model loading confirmation\n- Input/output node information  \n- Vocabulary size\n- Embedding dimension\n\n### Batch Processing Mode\nWithout `--verbose`: Outputs one embedding per line, each as space-separated floating-point numbers.\n\nWith `--verbose`: Additionally shows:\n- Batch processing information\n- Number of texts processed\n- Output tensor shape\n- Model and vocabulary info\n\n## Performance Benefits\n\nBatch processing provides significant performance improvements when processing multiple texts:\n\n- **Model Loading**: The model is loaded only once for the entire batch\n- **Memory Efficiency**: Better 
GPU/CPU memory utilization\n- **Parallel Processing**: Takes advantage of vectorized operations\n- **Reduced Overhead**: Eliminates per-text setup costs\n\nFor example, processing 100 texts individually might take 10 seconds, while batch processing the same 100 texts could take only 2-3 seconds.\n\n## Important: Handling Texts with Newlines\n\n**⚠️ Critical Issue**: The original implementation used newlines (`\\n`) as delimiters, which breaks when processing texts that contain newlines (which is common in real-world text data).\n\n**✅ Solution**: This implementation now uses null bytes (`\\0`) as the default delimiter, which safely handles texts containing newlines, tabs, and other whitespace characters.\n\n**Examples of problematic texts** (that would break with line-based parsing):\n- Multi-paragraph text\n- Code snippets  \n- Formatted text with line breaks\n- Text with embedded newlines\n\n**Safe usage**:\n```bash\n# ✅ SAFE: Null-delimited (recommended)\nprintf \"Text 1\\0Text with\\nnewlines\\0Text 3\\0\" | ./embedder --batch\n\n# ✅ SAFE: Custom delimiter\necho \"Text1|||Text2|||Text3\" | ./embedder --batch --delimiter=\"|||\"\n\n# ⚠️ UNSAFE: Line-based (only for simple texts without newlines)\necho -e \"Text1\\nText2\\nText3\" | ./embedder --batch --delimiter=\"\\n\"\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphimage%2Fembedder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphimage%2Fembedder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphimage%2Fembedder/lists"}