{"id":29040446,"url":"https://github.com/devshero/db2vec","last_synced_at":"2025-06-26T14:06:08.666Z","repository":{"id":288717903,"uuid":"968963425","full_name":"DevsHero/db2vec","owner":"DevsHero","description":"db2vec: High-performance Rust CLI to parse database dumps (.sql, .surql), generate vector embeddings via Ollama, TEI, Gemini, and load into vector databases (Pinecone, Redis, Chroma, Milvus, Qdrant, SurrealDB). Optimized for speed on large datasets.","archived":false,"fork":false,"pushed_at":"2025-05-18T02:44:34.000Z","size":33265,"stargazers_count":20,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-26T14:05:30.397Z","etag":null,"topics":["chroma","cli","dump","embedding","migration","milvus","mssql","mysql","ollama","oracle","pinecone","qdrant","redis","rust","sqlite","surrealdb","tei","vector"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DevsHero.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-19T04:42:21.000Z","updated_at":"2025-05-18T02:43:41.000Z","dependencies_parsed_at":"2025-05-18T03:42:12.715Z","dependency_job_id":null,"html_url":"https://github.com/DevsHero/db2vec","commit_stats":null,"previous_names":["devshero/db2vec"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/DevsHero/db2vec","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DevsHero%2Fdb2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/DevsHero%2Fdb2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DevsHero%2Fdb2vec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DevsHero%2Fdb2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DevsHero","download_url":"https://codeload.github.com/DevsHero/db2vec/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DevsHero%2Fdb2vec/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262081117,"owners_count":23255662,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chroma","cli","dump","embedding","migration","milvus","mssql","mysql","ollama","oracle","pinecone","qdrant","redis","rust","sqlite","surrealdb","tei","vector"],"created_at":"2025-06-26T14:05:30.524Z","updated_at":"2025-06-26T14:06:08.646Z","avatar_url":"https://github.com/DevsHero.png","language":"Rust","readme":"# db2vec: From Database Dumps to Vector Search at Speed \n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nTired of waiting hours for Python scripts to embed large database exports, especially on machines without powerful GPUs? So was I. Processing millions of records demands performance, even on standard hardware. `db2vec` is a high‑performance Rust tool designed for efficient **CPU-based embedding generation**. 
It parses your database dumps, generates vector embeddings using local models (Ollama, Text Embeddings Inference (TEI)) or cloud APIs (Google Gemini), and loads them into your vector database of choice – all optimized for speed without requiring a dedicated GPU.\n\n![db2vec CLI running](assets/db2vec_screenshot.png)\n\n---\n\n## Core Features\n\n*   🚀 **Blazing Fast:** Built in Rust for maximum throughput on large datasets, optimized for CPU.\n*   🔄 **Parallel Processing:** Adjustable concurrency and batch size for embedding generation (`--num-threads`, `--embedding-concurrency`, `--embedding-batch-size`).\n*   📦 **Batch Inserts:** Configurable batch size (`-c, --chunk-size`) and payload limits (`-m, --max-payload-size-mb`) for efficient bulk loading into the target vector database.\n*   🛡️ **Data Filtering:** Exclude sensitive tables or fields via configuration for data privacy and reduced processing time.\n*   🔧 **Highly Configurable:** Fine-tune performance and behavior with extensive CLI arguments for embedding, database connections, batching, and more.\n*   📄 **Supported Dump Formats:**\n    *   `.sql` (MySQL, PostgreSQL, MSSQL, SQLite, Oracle)\n        *   **MSSQL:**\n            ```bash\n            sqlcmd -S server -U user -P pass -Q \"SET NOCOUNT ON; SELECT * FROM dbo.TableName;\" -o dump.sql\n            ```\n        *   *Oracle requires exporting via SQL Developer or a similar tool into standard SQL.*\n    *   `.surql` (SurrealDB)\n*   🧠 **Flexible Embeddings:** Supports multiple providers:\n    *   **Ollama** – best for local CPU/GPU, extremely fast.\n    *   **TEI** – CPU-only Text Embeddings Inference (v1.7.0), slower than Ollama but faster than cloud. See [docs/TEI.md](docs/TEI.md) for details.\n    *   **Google Gemini** – cloud API, ideal if you have very limited local resources. 
Beware of rate limits; use small batch sizes to avoid throttling.\n*   💾 **Vector DB Targets:** Inserts vectors + metadata into:\n    *   Chroma\n    *   Milvus\n    *   Pinecone (Cloud \u0026 Local Dev Image)\n    *   Qdrant\n    *   Redis Stack\n    *   SurrealDB\n*   ⚙️ **Pure Regex Parsing:** Fast, reliable record extraction (no AI).\n*   🔒 **Authentication:** Supports user/password, API key, tenants/namespaces per DB.\n*   ☁️ **Pinecone Cloud Support:** Automatically creates/describes indexes, uses namespaces.\n*   🐞 **Debug Mode:** `--debug` prints parsed JSON records before embedding.\n\n---\n\n## Requirements\n\n*   **Rust:** Latest stable (Edition 2021+).\n*   **Embedding Provider:** One of the following configured:\n    *   **Ollama:** Running locally with your desired model(s) pulled (e.g., `ollama pull nomic-embed-text`).\n    *   **TEI:** Requires TEI binary (`tei-metal`) and compatible model (e.g., `nomic-embed-text-v2-moe`). See [docs/TEI.md](docs/TEI.md) for setup.\n    *   **Google Gemini:** A valid Google Cloud API key (`--secret` or `EMBEDDING_API_KEY`) with the Generative Language API enabled for your project.\n*   **Target DB:** One of Chroma, Milvus, Pinecone, Qdrant, Redis Stack, SurrealDB (Docker recommended for local).\n*   **(Optional) `.env`:** For setting default configuration values.\n\n---\n\n## Configuration\n\nConfiguration can be set using CLI flags or by creating a `.env` file in the project root. CLI flags always override values set in the `.env` file.\n\nRefer to the `.env-example` file for a comprehensive list of available environment variables, their descriptions, and default values.\n\n---\n\n## How It Works\n\n1.  **Read \u0026 Detect:** Load dump (`.sql`/`.surql`), detect SQL dialect or SurrealDB.\n2.  **Parse (Regex):** Extract records and types.\n3.  **Apply Exclusions:** Skip tables or fields based on your exclusion rules (if enabled).\n4.  
**Embed:** Call the selected embedding provider (`ollama`, `tei` on CPU, `google`) to get vectors.\n5.  **Auto-Schema:** Automatically create:\n    *   Target database if it doesn't exist\n    *   Collections/indices from table names in the dump\n    *   Proper dimension settings based on your `--dimension` parameter\n    *   Distance metrics using your specified `--metric` value\n6.  **Store:** Insert into your vector DB with metadata.\n\n---\n\n## Data Exclusion\n\nThe exclusion feature allows you to skip entire tables or specific fields within records, which is useful for:\n\n* Protecting sensitive data (passwords, PII)\n* Improving performance by excluding large tables or fields not needed for search\n* Reducing storage costs in your vector database\n\n### How to Use Exclusions\n\n1. Create a `config/exclude.json` file with your exclusion rules\n2. Enable exclusions with the `--use-exclude` flag\n\n### Sample exclude.json\n\n```json\n[\n  {\n    \"table\": \"users\",\n    \"ignore_table\": false,\n    \"exclude_fields\": {\n      \"password\": true,\n      \"email\": true,\n      \"profile\": [\"ssn\", \"tax_id\"]\n    }\n  },\n  {\n    \"table\": \"audit_logs\",\n    \"ignore_table\": true\n  }\n]\n```\n\nThis configuration:\n\n* Keeps the \"users\" table but removes the `password` and `email` fields\n* For the \"profile\" object field, removes only the \"ssn\" and \"tax_id\" subfields\n* Completely skips the \"audit_logs\" table\n\n---\n\n## Automatic Collection Creation\n\nFor each table in your source data dump, `db2vec` automatically:\n\n*   Creates a corresponding collection/index in the target vector database\n*   Names the collection after the source table name\n*   Configures proper dimensions and metric type based on your CLI arguments\n*   Creates the database first if it doesn't exist\n\nThis zero-config schema creation means you don't need to manually set up your vector database structure before import.\n\n\u003e **Note:** When using Redis with `--group-redis`, 
collections aren't created in the traditional sense. Instead, records are grouped by table name into Redis data structures (e.g., `table:profile` → [records]). Without this flag, Redis stores each record as an individual entry with a table label in the metadata.\n\u003e\n\u003e **Warning:** If collections already exist, their dimension must match the `--dimension` parameter you provide. Some databases like Pinecone will reject vectors with mismatched dimensions, causing the import to fail.\n\n---\n\n## Quick Start\n\n1.  **Clone \u0026 build**\n    ```bash\n    git clone https://github.com/DevsHero/db2vec.git\n    cd db2vec\n    cargo build --release\n    ```\n2.  **Prepare your dump**\n    *   MySQL/Postgres/Oracle: export `.sql`\n    *   MSSQL: `sqlcmd … \u003e mssql_dump.sql`\n    *   SQLite: `sqlite3 mydb.db .dump \u003e sqlite_dump.sql`\n    *   SurrealDB: `.surql` file\n3.  **(Optional) Create `.env`:** Copy `.env-example` to `.env` and customize defaults.\n4.  **Run**\n    ```bash\n    # MySQL → Milvus (using Ollama)\n    ./target/release/db2vec \\\n      -f samples/mysql_sample.sql \\\n      -t milvus \\\n      --host http://127.0.0.1:19530 \\\n      --database mydb \\\n      --embedding-provider ollama \\\n      --embedding-model nomic-embed-text \\\n      --dimension 768 \\\n      -u root -p secret --use-auth \\\n      --debug\n\n    # SurrealDB → Pinecone (using TEI)\n    ./target/release/db2vec \\\n      -f samples/surreal_sample.surql \\\n      -t pinecone \\\n      --host https://index-123.svc.us-east-1.pinecone.io \\\n      --namespace myns \\\n      --embedding-provider tei \\\n      --tei-binary-path tei/tei-metal \\\n      --embedding-model nomic-embed-text-v2-moe \\\n      --dimension 768\n\n    # Oracle → Qdrant (using Google Gemini)\n    ./target/release/db2vec \\\n      -f samples/oracle_sample.sql \\\n      -t qdrant \\\n      --host http://localhost:6333 \\\n      --embedding-provider google \\\n      --embedding-model text-embedding-004 
\\\n      --dimension 768 \\\n      --embedding-api-key \u003cGOOGLE_API_KEY\u003e \\\n      --debug\n    ```\n\n---\n\n## Usage\n\n```bash\n# Cargo\ncargo run -- [OPTIONS]\n\n# Binary\n./target/release/db2vec [OPTIONS]\n\n# Logging\nRUST_LOG=info ./target/release/db2vec [OPTIONS]\nRUST_LOG=debug ./target/release/db2vec --debug [OPTIONS]\n```\n\n## Compatibility\n\nSee [docs/compatible.md](docs/compatible.md) for the full compatibility matrix of supported vector database versions and import file formats.\n\n---\n\n## Docker Setup\n\nRun supported vector DBs locally via Docker – see [DOCKER_SETUP.md](docs/DOCKER_SETUP.md) for commands.\n\n---\n\n## Target Environment\n\nPrimarily developed and tested against Docker-hosted or cloud vector databases via RESTful APIs. Ensure your target is reachable from where you run `db2vec`. **Designed to run efficiently even on standard CPU hardware.**\n\n---\n\n## Testing\n\n### Integration Tests\n\ndb2vec includes comprehensive integration tests that verify functionality across all supported database types and embedding providers.\n\n#### Prerequisites\n\n- **Docker**: Required to run containerized instances of all supported vector databases\n- **Embedding Provider**: At least one of the supported embedding providers (Ollama/TEI/Google)\n\n#### Running Integration Tests\n\nThe integration test suite will:\n\n1. Spin up Docker containers for each supported vector database\n2. Test all database import formats (MySQL, PostgreSQL, MSSQL, SQLite, Oracle, SurrealDB)\n3. Generate embeddings using the specified provider\n4. 
Verify proper storage and retrieval from each vector database\n\n```bash\n# Test with Ollama (fastest, requires Ollama running locally)\nEMBEDDING_PROVIDER=ollama cargo test --test integration_test -- --nocapture\n\n# Test with TEI (CPU-based, requires the TEI binary; no external service)\nEMBEDDING_PROVIDER=tei cargo test --test integration_test -- --nocapture\n\n# Test with mock embeddings (no external provider required)\nEMBEDDING_PROVIDER=mock cargo test --test integration_test -- --nocapture\n```\n\n---\n\n## Contributing\n\nIssues, PRs, and feedback welcome!\n\n---\n\n## License\n\nMIT – see [LICENSE](LICENSE).","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevshero%2Fdb2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevshero%2Fdb2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevshero%2Fdb2vec/lists"}