{"id":20668604,"url":"https://github.com/simonpierreboucher/embedding-generator","last_synced_at":"2026-05-06T08:36:04.597Z","repository":{"id":262707093,"uuid":"888100083","full_name":"simonpierreboucher/Embedding-generator","owner":"simonpierreboucher","description":"A robust Python tool for generating embeddings from text files using OpenAI's API. This tool processes text files, splits them into chunks while preserving context headers, and generates embeddings using OpenAI's models, saving both text and embeddings in structured formats.","archived":false,"fork":false,"pushed_at":"2024-11-18T19:29:28.000Z","size":29,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-10T15:18:33.496Z","etag":null,"topics":["embeddings","json","npy","openai","semantic-search","text-embedding"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonpierreboucher.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-13T20:17:01.000Z","updated_at":"2024-11-18T19:29:31.000Z","dependencies_parsed_at":"2024-11-13T21:24:30.895Z","dependency_job_id":"704b7f0e-3b67-4d4a-b3f6-db1e6f491159","html_url":"https://github.com/simonpierreboucher/Embedding-generator","commit_stats":null,"previous_names":["simonpierreboucher/embedding-generator"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FEmbedding-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FEmbedding-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FEmbedding-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FEmbedding-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonpierreboucher","download_url":"https://codeload.github.com/simonpierreboucher/Embedding-generator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonpierreboucher%2FEmbedding-generator/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259043761,"owners_count":22797159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","json","npy","openai","semantic-search","text-embedding"],"created_at":"2024-11-16T20:09:58.399Z","updated_at":"2026-05-06T08:36:04.558Z","avatar_url":"https://github.com/simonpierreboucher.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Document Embedding Generator\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python Version](https://img.shields.io/badge/python-3.7%2B-blue.svg)](https://www.python.org/downloads/)\n[![GitHub Issues](https://img.shields.io/github/issues/simonpierreboucher/llm-generate-function)](https://github.com/simonpierreboucher/llm-generate-function/issues)\n[![GitHub Forks](https://img.shields.io/github/forks/simonpierreboucher/llm-generate-function)](https://github.com/simonpierreboucher/llm-generate-function/network)\n[![GitHub Stars](https://img.shields.io/github/stars/simonpierreboucher/llm-generate-function)](https://github.com/simonpierreboucher/llm-generate-function/stargazers)\n\nA Python tool for generating embeddings from text documents using multiple providers, including OpenAI, Mistral AI, Voyage AI, and Cohere. It splits documents into configurable-sized chunks and generates embeddings for each chunk.\n\n**New Feature**: The script now adds a contextual description generated by a language model (LLM) such as GPT-4 for each chunk. This description helps situate the chunk within the overall context of the text, enhancing the quality of the embeddings.\n\n## Features\n\n- **Support for Multiple Embedding Providers**:\n  - OpenAI\n  - Mistral AI\n  - Voyage AI\n  - Cohere\n- **Contextual Embeddings**:\n  - Generation of contextual descriptions for each chunk using an LLM\n  - Combination of the chunk and its description to form a new chunk for embedding\n- **Processing of Multiple Text Files**\n- **Configurable Chunk Sizing**\n- **Document Header Management**\n- **Multiple Output Formats** (CSV, JSON, NPY)\n- **Error Handling and Retries**\n- **YAML-based Configuration**\n\n## Prerequisites\n\n```bash\npip install openai tiktoken numpy pandas tqdm pyyaml requests\n```\n\n## Configuration\n\nCreate a `config.yaml` file with the following structure:\n\n```yaml\napi:\n  provider:\n    name: \"openai\"  # Options: \"openai\", \"mistral\", \"voyage\", \"cohere\"\n    key: \"your-api-key\"\n    model: \"text-embedding-ada-002\"  # The model varies by provider\n  llm_model: \"gpt-4\"  # or \"gpt-3.5-turbo\" if you don't have access to GPT-4\n  llm_max_input_tokens: 8192\n  llm_max_output_tokens: 256\n  max_retries: 3\n  retry_delay: 2\n\npaths:\n  input_folder: \"path/to/text/files\"\n  output_base: \"output\"\n\nprocessing:\n  chunk_sizes: [400, 800, 1200]\n  header_lines: 2\n\noutput:\n  formats:\n    - csv\n    - json\n    - npy\n```\n\n### Configuration Parameters\n\n#### API Provider Settings\n\n- **provider.name**: Name of the embedding provider to use. Options:\n  - `\"openai\"`\n  - `\"mistral\"`\n  - `\"voyage\"`\n  - `\"cohere\"`\n\n- **provider.key**: Your API key for the selected provider.\n\n- **provider.model**: The embedding model to use. Model names vary by provider.\n\n- **llm_model**: The LLM model used to generate contextual descriptions. Examples:\n  - `\"gpt-4\"`\n  - `\"gpt-3.5-turbo\"`\n\n- **llm_max_input_tokens**: Maximum number of tokens for the LLM input (prompt).\n\n- **llm_max_output_tokens**: Maximum number of tokens for the LLM output (response).\n\n- **max_retries**: Maximum number of attempts in case of API call failures.\n\n- **retry_delay**: Delay between attempts (in seconds).\n\n#### Provider API Keys\n\n- **OpenAI**\n  - API Key: `OPENAI_API_KEY`\n  - Embedding Model: `\"text-embedding-ada-002\"`\n\n- **Mistral AI**\n  - API Key: `MISTRAL_API_KEY`\n  - Embedding Model: `\"mistral-embed\"`\n\n- **Voyage AI**\n  - API Key: `VOYAGE_API_KEY`\n  - Embedding Model: `\"voyage-large-2\"`\n\n- **Cohere**\n  - API Key: `CO_API_KEY`\n  - Embedding Model: `\"embed-english-v3.0\"`\n\n#### Other Parameters\n\n- **paths.input_folder**: The folder containing the text files to process.\n- **paths.output_base**: The folder where results will be saved.\n- **processing.chunk_sizes**: List of chunk sizes in tokens.\n- **processing.header_lines**: Number of header lines to include in each chunk.\n- **output.formats**: Desired output formats (`csv`, `json`, `npy`).\n\n## Usage\n\n1. **Install the required packages**:\n\n    ```bash\n    pip install openai tiktoken numpy pandas tqdm pyyaml requests\n    ```\n\n2. **Set up your API keys for the chosen provider(s)**:\n\n    ```bash\n    # For OpenAI\n    export OPENAI_API_KEY='your-api-key'\n    # For Mistral AI\n    export MISTRAL_API_KEY='your-api-key'\n    # For Voyage AI\n    export VOYAGE_API_KEY='your-api-key'\n    # For Cohere\n    export CO_API_KEY='your-api-key'\n    ```\n\n3. **Configure your provider and the LLM in `config.yaml`**.\n\n4. **Prepare your text files** in the input directory specified in `config.yaml`.\n\n5. **Run the script**:\n\n    ```bash\n    python embedding_generator.py\n    ```\n\n## Provider-Specific Features\n\n### OpenAI\n\n- High-quality embeddings\n- Extensive model options\n- Reliable API performance\n\n### Mistral AI\n\n- Competitive pricing\n- Good performance for multiple languages\n- Modern embedding models\n\n### Voyage AI\n\n- Specialized for specific use cases\n- Competitive pricing\n- Good documentation\n\n### Cohere\n\n- Multiple embedding types\n- Classification-specific embeddings\n- Extensive language support\n\n## Output Structure\n\nFor each configured chunk size, the script generates:\n\n### CSV (`embeddings_results_{size}tok.csv`)\n\n- `filename`: Source file name\n- `chunk_id`: Chunk identifier\n- `text`: Chunk content combined with its contextual description\n- `embedding`: Embedding vector\n\n### JSON (`chunks.json`)\n\n```json\n[\n  {\n    \"text\": \"Chunk content combined with its description\",\n    \"embedding\": [embedding vector],\n    \"metadata\": {\n      \"filename\": \"file name\",\n      \"chunk_id\": \"chunk identifier\"\n    }\n  },\n  ...\n]\n```\n\n### NPY (`embeddings.npy`)\n\nNumPy array containing all embedding vectors.\n\n## Error Handling\n\n- Provider-specific error handling\n- Automatic retry on API failure\n- Exponential backoff between attempts\n- Error and warning logging\n- Continues processing if a provider fails\n- Separate handling of errors related to the LLM\n\n## Methods Description\n\n### `EmbeddingGenerator` Class\n\n#### `clean_text(text: str) -\u003e str`\n\nCleans and normalizes text by removing extra whitespace and line breaks.\n\n#### `split_into_chunks(text: str, max_tokens: int) -\u003e List[str]`\n\nSplits text into chunks while preserving headers and respecting token limits.\n\n#### `get_chunk_context_description(chunk_text: str, full_text: str) -\u003e str`\n\nGenerates a brief description of the chunk's role in the text using an LLM.\n\n#### `get_embedding(text: str) -\u003e Optional[List[float]]`\n\nObtains embeddings from the selected provider with error handling and retries.\n\n#### `process_file(file_path: str, chunk_size: int) -\u003e List[Dict[str, Any]]`\n\nProcesses a single file by generating chunks, contextual descriptions, and embeddings.\n\n#### `save_results(results: List[Dict[str, Any]], chunk_size: int) -\u003e None`\n\nSaves the results in the configured output formats.\n\n## Limitations\n\n- Requires valid API keys for the embedding provider and the LLM (if different)\n- Different rate limits per provider\n- Varying embedding dimensions between providers\n- Provider-specific model limitations\n- Processes `.txt` files only\n- Using the LLM may increase processing time and costs\n\n## Best Practices\n\n1. **File Preparation**\n\n   - Ensure text files are properly encoded (UTF-8)\n   - Remove any binary or non-text content\n\n2. **Configuration**\n\n   - Adjust chunk sizes based on your needs\n   - Configure appropriate retry settings\n   - Set a reasonable number of header lines\n   - Choose an appropriate LLM model for your use case and budget\n\n3. **Resource Management**\n\n   - Monitor API usage\n   - Consider rate limiting for large datasets\n   - Regularly back up output files\n   - Be aware of LLM usage to manage costs\n\n4. **Provider and LLM Selection**\n\n   - Choose the embedding provider based on your needs:\n     - OpenAI for general purpose\n     - Mistral AI for multilingual support\n     - Voyage AI for specialized cases\n     - Cohere for classification tasks\n   - Select an LLM model based on accessibility and cost:\n     - Use `gpt-4` for better quality if accessible\n     - Use `gpt-3.5-turbo` for lower cost and wider availability\n\n5. **API Management**\n\n   - Monitor usage across all APIs used\n   - Consider provider-specific rate limits\n   - Keep API keys secure\n   - Plan for quota limits, especially when using LLMs\n\n## Provider Comparison\n\n| Provider | Strengths                      | Use Cases                |\n|----------|--------------------------------|--------------------------|\n| OpenAI   | High quality, reliable         | General purpose          |\n| Mistral  | Good multilingual support      | International content    |\n| Voyage   | Specialized features           | Domain-specific          |\n| Cohere   | Classification focus           | Text classification      |\n\n## Troubleshooting\n\nCommon issues and solutions:\n\n1. **API Errors**\n\n   - Verify API keys\n   - Check API rate limits and quotas\n   - Ensure network connectivity\n   - For LLM-related errors, check if the input exceeds token limits\n\n2. **File Processing Issues**\n\n   - Check file encoding\n   - Verify file permissions\n   - Ensure valid file content\n\n3. **Output Errors**\n\n   - Check disk space\n   - Verify write permissions\n   - Validate output directory structure\n\n4. **LLM Usage Issues**\n\n   - Monitor the number of tokens used in prompts and responses\n   - Adjust `llm_max_input_tokens` and `llm_max_output_tokens` if necessary\n   - Ensure the combined size of the chunk and context fits within the LLM's token limits\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\n[MIT License](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonpierreboucher%2Fembedding-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonpierreboucher%2Fembedding-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonpierreboucher%2Fembedding-generator/lists"}