{"id":37064321,"url":"https://github.com/oceanbase/pyseekdb","last_synced_at":"2026-01-14T07:30:59.613Z","repository":{"id":322577604,"uuid":"1089950079","full_name":"oceanbase/pyseekdb","owner":"oceanbase","description":"The python sdk for OceanBase or OceanBase seekdb","archived":false,"fork":false,"pushed_at":"2026-01-14T03:19:23.000Z","size":1055,"stargazers_count":44,"open_issues_count":25,"forks_count":18,"subscribers_count":1,"default_branch":"develop","last_synced_at":"2026-01-14T07:09:57.590Z","etag":null,"topics":["ai-search","fulltext-search","python3","sdk","sdk-python","vector-sdk"],"latest_commit_sha":null,"homepage":"https://www.oceanbase.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oceanbase.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-05T03:07:17.000Z","updated_at":"2026-01-14T03:19:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/oceanbase/pyseekdb","commit_stats":null,"previous_names":["oceanbase/pyseekdb"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/oceanbase/pyseekdb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oceanbase%2Fpyseekdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oceanbase%2Fpyseekdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oceanbase%2Fpyseekdb/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oceanbase%2Fpyseekdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oceanbase","download_url":"https://codeload.github.com/oceanbase/pyseekdb/tar.gz/refs/heads/develop","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oceanbase%2Fpyseekdb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28413323,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T05:26:33.345Z","status":"ssl_error","status_checked_at":"2026-01-14T05:21:57.251Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-search","fulltext-search","python3","sdk","sdk-python","vector-sdk"],"created_at":"2026-01-14T07:30:58.949Z","updated_at":"2026-01-14T07:30:59.600Z","avatar_url":"https://github.com/oceanbase.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pyseekdb\n\npyseekdb aims to provide developers with simple and easy-to-use APIs, reducing the learning curve and entry barriers. Compared to relational databases, database systems like MongoDB, ES, and Milvus simplify their main operations into KV operations in their Client APIs, making them more beginner-friendly. 
This SDK provides efficient and easy-to-use APIs for applications to access seekdb and OceanBase's AI-related features. Advanced users can still use MySQL-compatible drivers to directly manipulate database objects in seekdb and OceanBase through SQL statements.\n\nTo achieve these design goals, the SDK follows these design principles:\n1. Fixed data model with schema-free interfaces. For beginners and application prototyping, users do not need to explicitly define relational table structures\n2. The main data managed is text (or text fragments) and its attributes\n3. Every data operation targets a single collection; cross-collection operations are not supported\n4. Each record stores and manages **one** text\n\n## Table of Contents\n\n1. [Installation](#installation)\n2. [Client Connection](#1-client-connection)\n3. [AdminClient Connection and Database Management](#2-adminclient-connection-and-database-management)\n4. [Collection (Table) Management](#3-collection-table-management)\n5. [DML Operations](#4-dml-operations)\n6. [DQL Operations](#5-dql-operations)\n7. [Embedding Functions](#6-embedding-functions)\n8. [RAG Demo](#rag-demo)\n9. [Testing](#testing)\n\n## Installation\n\n```bash\npip install -U pyseekdb\n```\n\n## 1. Client Connection\n\nThe `Client` class provides a unified interface for connecting to seekdb in different modes. 
It automatically selects the appropriate connection mode based on the parameters provided.\n\n### 1.1 Embedded seekdb Client\n\nConnect to a local embedded seekdb instance:\n\n```python\nimport pyseekdb\n\n# Create embedded client with explicit path\nclient = pyseekdb.Client(\n    path=\"./seekdb\",      # Path to seekdb data directory\n    database=\"demo\"        # Database name\n)\n\n# Create embedded client with default path (current working directory)\n# If path is not provided, uses seekdb.db in the current process working directory\nclient = pyseekdb.Client(\n    database=\"demo\"        # Database name (path defaults to current working directory/seekdb.db)\n)\n```\n\n### 1.2 Remote Server Client\n\nConnect to a remote server (supports both seekdb Server and OceanBase Server):\n\n```python\nimport pyseekdb\n\n# Create remote server client (seekdb Server)\nclient = pyseekdb.Client(\n    host=\"127.0.0.1\",      # Server host\n    port=2881,              # Server port (default: 2881)\n    database=\"demo\",        # Database name\n    user=\"root\",            # Username (default: \"root\")\n    password=\"\"             # Password (can be retrieved from SEEKDB_PASSWORD environment variable)\n)\n\n# Create remote server client (OceanBase Server)\nclient = pyseekdb.Client(\n    host=\"127.0.0.1\",      # Server host\n    port=2881,              # Server port (default: 2881)\n    tenant=\"sys\",          # Tenant name (default: sys)\n    database=\"demo\",       # Database name\n    user=\"root\",           # Username (default: \"root\")\n    password=\"\"             # Password (can be retrieved from SEEKDB_PASSWORD environment variable)\n)\n```\n\n**Note:** If the `password` parameter is not provided (empty string), the client will automatically retrieve it from the `SEEKDB_PASSWORD` environment variable. 
This is useful for keeping passwords out of your code:\n\n```bash\nexport SEEKDB_PASSWORD=\"your_password\"\n```\n\n```python\n# Password will be automatically retrieved from SEEKDB_PASSWORD environment variable\nclient = pyseekdb.Client(\n    host=\"127.0.0.1\",\n    port=2881,\n    database=\"demo\",\n    user=\"root\"\n    # password parameter omitted - will use SEEKDB_PASSWORD from environment\n)\n```\n\n### 1.3 Client Methods and Properties\n\n| Method / Property     | Description                                                    |\n|-----------------------|----------------------------------------------------------------|\n| `create_collection()`  | Create a new collection (see Collection Management)            |\n| `get_collection()`    | Get an existing collection object                              |\n| `delete_collection()` | Delete a collection                                            |\n| `list_collections()`  | List all collections in the current database                   |\n| `has_collection()`    | Check if a collection exists                                   |\n| `get_or_create_collection()` | Get an existing collection or create it if it doesn't exist |\n| `count_collection()`  | Count the number of collections in the current database         |\n\n**Note:** The `Client` factory function returns a proxy that only exposes collection operations. For database management operations, use `AdminClient` (see section 2).\n\n## 2. AdminClient Connection and Database Management\n\nThe `AdminClient` class provides database management operations. 
It uses the same connection modes as `Client` but only exposes database management methods.\n\n### 2.1 Embedded/Server AdminClient\n\n```python\nimport pyseekdb\n\n# Embedded mode - Database management\nadmin = pyseekdb.AdminClient(path=\"./seekdb\")\n\n# Remote server mode - Database management (seekdb Server)\nadmin = pyseekdb.AdminClient(\n    host=\"127.0.0.1\",\n    port=2881,\n    user=\"root\",\n    password=\"\"  # Can be retrieved from SEEKDB_PASSWORD environment variable\n)\n\n# Remote server mode - Database management (OceanBase Server)\nadmin = pyseekdb.AdminClient(\n    host=\"127.0.0.1\",\n    port=2881,\n    tenant=\"sys\",  # Default tenant for OceanBase\n    user=\"root\",\n    password=\"\"  # Can be retrieved from SEEKDB_PASSWORD environment variable\n)\n```\n\n### 2.2 AdminClient Methods\n\n| Method                    | Description                                        |\n|---------------------------|----------------------------------------------------|\n| `create_database(name, tenant=DEFAULT_TENANT)` | Create a new database (uses client's tenant for remote OceanBase server mode) |\n| `get_database(name, tenant=DEFAULT_TENANT)`    | Get database object with metadata (uses client's tenant for remote OceanBase server mode) |\n| `delete_database(name, tenant=DEFAULT_TENANT)`  | Delete a database (uses client's tenant for remote OceanBase server mode) |\n| `list_databases(limit=None, offset=None, tenant=DEFAULT_TENANT)` | List all databases with optional pagination (uses client's tenant for remote OceanBase server mode) |\n\n**Parameters:**\n- `name` (str): Database name\n- `tenant` (str, optional): Tenant name (uses client's tenant if different, ignored for seekdb)\n- `limit` (int, optional): Maximum number of results to return\n- `offset` (int, optional): Number of results to skip for pagination\n\n### 2.3 Database Object\n\nThe `get_database()` and `list_databases()` methods return `Database` objects with the following properties:\n\n- `name` 
(str): Database name\n- `tenant` (str, optional): Tenant name (None for embedded/server mode)\n- `charset` (str, optional): Character set\n- `collation` (str, optional): Collation\n- `metadata` (dict): Additional metadata\n\n## 3. Collection (Table) Management\n\nCollections are the primary data structures in pyseekdb, similar to tables in traditional databases. Each collection stores documents with vector embeddings, metadata, and full-text search capabilities.\n\n### 3.1 Creating a Collection\n\n```python\nimport pyseekdb\nfrom pyseekdb import (\n    DefaultEmbeddingFunction,\n    HNSWConfiguration,\n    Configuration,\n    FulltextParserConfig\n)\n\n# Create a client\nclient = pyseekdb.Client(host=\"127.0.0.1\", port=2881, database=\"test\")\n\n# Create a collection with default configuration\ncollection = client.create_collection(\n    name=\"my_collection\"\n    # embedding_function defaults to DefaultEmbeddingFunction() (384 dimensions)\n)\n\n# Create a collection with custom embedding function\n# Dimension will be automatically calculated from embedding function\nef = UserDefinedEmbeddingFunction(model_name='all-MiniLM-L6-v2')\ncollection = client.create_collection(\n    name=\"my_collection\",\n    embedding_function=ef\n)\n\n# Recommended: Create a collection with Configuration wrapper\n# Using IK parser (default for Chinese text)\nconfig = Configuration(\n    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),\n    fulltext_config=FulltextParserConfig(parser='ik')\n)\ncollection = client.create_collection(\n    name=\"my_collection\",\n    configuration=config,\n    embedding_function=ef\n)\n\n# Recommended: Create a collection with Configuration (only HNSW config, uses default parser)\nconfig = Configuration(\n    hnsw=HNSWConfiguration(dimension=384, distance='cosine')\n)\ncollection = client.create_collection(\n    name=\"my_collection\",\n    configuration=config,\n    embedding_function=ef\n)\n\n# Create a collection with Space parser (for 
space-separated languages)\nconfig = Configuration(\n    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),\n    fulltext_config=FulltextParserConfig(parser='space')\n)\ncollection = client.create_collection(\n    name=\"my_collection\",\n    configuration=config,\n    embedding_function=ef\n)\n\n# Create a collection with Ngram parser and custom parameters\nconfig = Configuration(\n    hnsw=HNSWConfiguration(dimension=384, distance='cosine'),\n    fulltext_config=FulltextParserConfig(parser='ngram', params={'ngram_token_size': 3})\n)\ncollection = client.create_collection(\n    name=\"my_collection\",\n    configuration=config,\n    embedding_function=ef\n)\n\n# Create a collection without embedding function (embeddings must be provided manually)\n# Recommended: Use Configuration wrapper\nconfig = Configuration(\n    hnsw=HNSWConfiguration(dimension=128, distance='cosine')\n)\ncollection = client.create_collection(\n    name=\"my_collection\",\n    configuration=config,\n    embedding_function=None  # Explicitly disable embedding function\n)\n\n# Get or create collection (creates if doesn't exist)\ncollection = client.get_or_create_collection(\n    name=\"my_collection\",\n)\n```\n\n**Parameters:**\n- `name` (str): Collection name (required). 
Must be non-empty, use only letters, digits, and underscores (`[a-zA-Z0-9_]`), and be at most 512 characters.\n- `configuration` (Configuration, HNSWConfiguration, or None, optional): Index configuration\n  - **Recommended:** `Configuration` - Wrapper class that can include both `HNSWConfiguration` and `FulltextParserConfig`\n    - Use `Configuration(hnsw=HNSWConfiguration(...))` even when only vector index config is needed\n    - Allows easy addition of fulltext parser config later\n  - `HNSWConfiguration`: Vector index configuration with `dimension` and `distance` metric (backward compatibility)\n  - If not provided, uses default (dimension=384, distance='cosine', parser='ik')\n  - If set to `None`, dimension will be calculated from `embedding_function`\n- `embedding_function` (EmbeddingFunction, optional): Function to convert documents to embeddings\n  - If not provided, uses `DefaultEmbeddingFunction()` (384 dimensions)\n  - If set to `None`, collection will not have an embedding function\n  - If provided, the dimension will be automatically calculated and validated against `configuration.dimension`\n\n**Fulltext Parser Options:**\n- `'ik'` (default): IK parser for Chinese text segmentation\n- `'space'`: Space-separated tokenizer for languages like English\n- `'ngram'`: N-gram tokenizer\n- `'ngram2'`: 2-gram tokenizer\n- `'beng'`: Bengali text parser\n\nFor more information about the available parsers, refer to [create_index section tokenizer_option](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000004479548#tokenizer_option).\n\n**Note:** When `embedding_function` is provided, the system will automatically calculate the vector dimension by calling the function. 
If `configuration.dimension` is also provided, it must match the embedding function's dimension, otherwise a `ValueError` will be raised.\n\n### 3.2 Getting a Collection\n\n```python\n# Get an existing collection (uses default embedding function if collection doesn't have one)\ncollection = client.get_collection(\"my_collection\")\n\n# Get collection with specific embedding function\nef = DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')\ncollection = client.get_collection(\"my_collection\", embedding_function=ef)\n\n# Get collection without embedding function\ncollection = client.get_collection(\"my_collection\", embedding_function=None)\n\n# Check if collection exists\nif client.has_collection(\"my_collection\"):\n    collection = client.get_collection(\"my_collection\")\n```\n\n**Parameters:**\n- `name` (str): Collection name (required)\n- `embedding_function` (EmbeddingFunction, optional): Embedding function to use for this collection\n  - If not provided, uses `DefaultEmbeddingFunction()` by default\n  - If set to `None`, collection will not have an embedding function\n  - **Important:** The embedding function set here will be used for all operations on this collection (add, upsert, update, query, hybrid_search) when documents/texts are provided without embeddings\n\n### 3.3 Listing Collections\n\n```python\n# List all collections\ncollections = client.list_collections()\nfor coll in collections:\n    print(f\"Collection: {coll.name}, Dimension: {coll.dimension}\")\n\n# Count collections in database\ncollection_count = client.count_collection()\nprint(f\"Database has {collection_count} collections\")\n```\n\n### 3.4 Deleting a Collection\n\n```python\n# Delete a collection\nclient.delete_collection(\"my_collection\")\n```\n\n### 3.5 Collection Properties\n\nEach `Collection` object has the following properties:\n\n- `name` (str): Collection name\n- `id` (str, optional): Collection unique identifier\n- `dimension` (int, optional): Vector dimension\n- 
`embedding_function` (EmbeddingFunction, optional): Embedding function associated with this collection\n- `distance` (str): Distance metric used by the index (e.g., 'l2', 'cosine', 'inner_product')\n- `metadata` (dict): Collection metadata\n\n**Accessing Embedding Function:**\n```python\ncollection = client.get_collection(\"my_collection\")\nif collection.embedding_function is not None:\n    print(f\"Collection uses embedding function: {collection.embedding_function}\")\n    print(f\"Embedding dimension: {collection.embedding_function.dimension}\")\n```\n\n## 4. DML Operations\n\nDML (Data Manipulation Language) operations allow you to insert, update, and delete data in collections.\n\n### 4.1 Add Data\n\nThe `add()` method inserts new records into a collection. If a record with the same ID already exists, an error will be raised.\n\n**Behavior with Embedding Function:**\n\n1. **If `embeddings` are provided:** Embeddings are used directly, `embedding_function` is NOT called (even if provided)\n2. **If `embeddings` are NOT provided but `documents` are provided:**\n   - If collection has an `embedding_function` (set during creation or retrieval), it will automatically generate embeddings from documents\n   - If collection does NOT have an `embedding_function`, a `ValueError` will be raised\n3. 
**If neither `embeddings` nor `documents` are provided:** A `ValueError` will be raised\n\n```python\n# Add single item with embeddings (embedding_function not used)\ncollection.add(\n    ids=\"item1\",\n    embeddings=[0.1, 0.2, 0.3],\n    documents=\"This is a document\",\n    metadatas={\"category\": \"AI\", \"score\": 95}\n)\n\n# Add multiple items with embeddings (embedding_function not used)\ncollection.add(\n    ids=[\"item1\", \"item2\", \"item3\"],\n    embeddings=[\n        [0.1, 0.2, 0.3],\n        [0.4, 0.5, 0.6],\n        [0.7, 0.8, 0.9]\n    ],\n    documents=[\n        \"Document 1\",\n        \"Document 2\",\n        \"Document 3\"\n    ],\n    metadatas=[\n        {\"category\": \"AI\", \"score\": 95},\n        {\"category\": \"ML\", \"score\": 88},\n        {\"category\": \"DL\", \"score\": 92}\n    ]\n)\n\n# Add with only embeddings (no documents)\ncollection.add(\n    ids=[\"vec1\", \"vec2\"],\n    embeddings=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]\n)\n\n# Add with only documents - embeddings auto-generated by embedding_function\n# Requires: collection must have embedding_function set\ncollection.add(\n    ids=[\"doc1\", \"doc2\"],\n    documents=[\"Text document 1\", \"Text document 2\"],\n    metadatas=[{\"tag\": \"A\"}, {\"tag\": \"B\"}]\n)\n# The collection's embedding_function will automatically convert documents to embeddings\n```\n\n**Parameters:**\n- `ids` (str or List[str]): Single ID or list of IDs (required)\n- `embeddings` (List[float] or List[List[float]], optional): Single embedding or list of embeddings\n  - If provided, used directly (embedding_function is ignored)\n  - If not provided, must provide `documents` and collection must have `embedding_function`\n- `documents` (str or List[str], optional): Single document or list of documents\n  - If `embeddings` not provided, `documents` will be converted to embeddings using collection's `embedding_function`\n- `metadatas` (dict or List[dict], optional): Single metadata dict or list of 
metadata dicts\n\n**Note:** The `embedding_function` used is the one associated with the collection (set during `create_collection()` or `get_collection()`). You cannot override it per-operation.\n\n### 4.2 Update Data\n\nThe `update()` method updates existing records in a collection. Records must exist, otherwise an error will be raised.\n\n**Behavior with Embedding Function:**\n\n1. **If `embeddings` are provided:** Embeddings are used directly, `embedding_function` is NOT called\n2. **If `embeddings` are NOT provided but `documents` are provided:**\n   - If collection has an `embedding_function`, it will automatically generate embeddings from documents\n   - If collection does NOT have an `embedding_function`, a `ValueError` will be raised\n3. **If neither `embeddings` nor `documents` are provided:** Only metadata will be updated (metadata-only update is allowed)\n\n```python\n# Update single item - metadata only (embedding_function not used)\ncollection.update(\n    ids=\"item1\",\n    metadatas={\"category\": \"AI\", \"score\": 98}  # Update metadata only\n)\n\n# Update multiple items with embeddings (embedding_function not used)\ncollection.update(\n    ids=[\"item1\", \"item2\"],\n    embeddings=[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]],  # Update embeddings\n    documents=[\"Updated document 1\", \"Updated document 2\"]  # Update documents\n)\n\n# Update with documents only - embeddings auto-generated by embedding_function\n# Requires: collection must have embedding_function set\ncollection.update(\n    ids=\"item1\",\n    documents=\"New document text\",  # Embeddings will be auto-generated\n    metadatas={\"category\": \"AI\"}\n)\n\n# Update specific fields - only document (embeddings auto-generated)\ncollection.update(\n    ids=\"item1\",\n    documents=\"New document text\"  # Only update document, embeddings auto-generated\n)\n```\n\n**Parameters:**\n- `ids` (str or List[str]): Single ID or list of IDs to update (required)\n- `embeddings` (List[float] or 
List[List[float]], optional): New embeddings\n  - If provided, used directly (embedding_function is ignored)\n  - If not provided, can provide `documents` to auto-generate embeddings\n- `documents` (str or List[str], optional): New documents\n  - If `embeddings` not provided, `documents` will be converted to embeddings using collection's `embedding_function`\n- `metadatas` (dict or List[dict], optional): New metadata\n\n**Note:** Metadata-only updates (no embeddings, no documents) are allowed. The `embedding_function` used is the one associated with the collection.\n\n### 4.3 Upsert Data\n\nThe `upsert()` method inserts new records or updates existing ones. If a record with the given ID exists, it will be updated; otherwise, a new record will be inserted.\n\n**Behavior with Embedding Function:**\n\n1. **If `embeddings` are provided:** Embeddings are used directly, `embedding_function` is NOT called\n2. **If `embeddings` are NOT provided but `documents` are provided:**\n   - If collection has an `embedding_function`, it will automatically generate embeddings from documents\n   - If collection does NOT have an `embedding_function`, a `ValueError` will be raised\n3. 
**If neither `embeddings` nor `documents` are provided:** Only metadata will be upserted (metadata-only upsert is allowed)\n\n```python\n# Upsert single item with embeddings (embedding_function not used)\ncollection.upsert(\n    ids=\"item1\",\n    embeddings=[0.1, 0.2, 0.3],\n    documents=\"Document text\",\n    metadatas={\"category\": \"AI\", \"score\": 95}\n)\n\n# Upsert multiple items with embeddings (embedding_function not used)\ncollection.upsert(\n    ids=[\"item1\", \"item2\", \"item3\"],\n    embeddings=[\n        [0.1, 0.2, 0.3],\n        [0.4, 0.5, 0.6],\n        [0.7, 0.8, 0.9]\n    ],\n    documents=[\"Doc 1\", \"Doc 2\", \"Doc 3\"],\n    metadatas=[\n        {\"category\": \"AI\"},\n        {\"category\": \"ML\"},\n        {\"category\": \"DL\"}\n    ]\n)\n\n# Upsert with documents only - embeddings auto-generated by embedding_function\n# Requires: collection must have embedding_function set\ncollection.upsert(\n    ids=[\"item1\", \"item2\"],\n    documents=[\"Document 1\", \"Document 2\"],\n    metadatas=[{\"category\": \"AI\"}, {\"category\": \"ML\"}]\n)\n# The collection's embedding_function will automatically convert documents to embeddings\n```\n\n**Parameters:**\n- `ids` (str or List[str]): Single ID or list of IDs (required)\n- `embeddings` (List[float] or List[List[float]], optional): Embeddings\n  - If provided, used directly (embedding_function is ignored)\n  - If not provided, can provide `documents` to auto-generate embeddings\n- `documents` (str or List[str], optional): Documents\n  - If `embeddings` not provided, `documents` will be converted to embeddings using collection's `embedding_function`\n- `metadatas` (dict or List[dict], optional): Metadata\n\n**Note:** Metadata-only upserts (no embeddings, no documents) are allowed. The `embedding_function` used is the one associated with the collection.\n\n### 4.4 Delete Data\n\nThe `delete()` method removes records from a collection. 
You can delete by IDs, metadata filters, or document filters.\n\n```python\n# Delete by IDs\ncollection.delete(ids=[\"item1\", \"item2\", \"item3\"])\n\n# Delete by single ID\ncollection.delete(ids=\"item1\")\n\n# Delete by metadata filter\ncollection.delete(where={\"category\": {\"$eq\": \"AI\"}})\n\n# Delete by comparison operator\ncollection.delete(where={\"score\": {\"$lt\": 50}})\n\n# Delete by document filter\ncollection.delete(where_document={\"$contains\": \"obsolete\"})\n\n# Delete with combined filters\ncollection.delete(\n    where={\"category\": {\"$eq\": \"AI\"}},\n    where_document={\"$contains\": \"deprecated\"}\n)\n```\n\n**Parameters:**\n- `ids` (str or List[str], optional): Single ID or list of IDs to delete\n- `where` (dict, optional): Metadata filter conditions (see Filter Operators section)\n- `where_document` (dict, optional): Document filter conditions\n\n**Note:** At least one of `ids`, `where`, or `where_document` must be provided.\n\n## 5. DQL Operations\n\nDQL (Data Query Language) operations allow you to retrieve data from collections using various query methods.\n\n### 5.1 Query (Vector Similarity Search)\n\nThe `query()` method performs vector similarity search to find the most similar documents to the query vector(s).\n\n**Behavior with Embedding Function:**\n\n1. **If `query_embeddings` are provided:** embeddings are used directly, `embedding_function` is NOT called\n2. **If `query_embeddings` are NOT provided but `query_texts` are provided:**\n   - If collection has an `embedding_function`, it will automatically generate query embeddings from texts\n   - If collection does NOT have an `embedding_function`, a `ValueError` will be raised\n3. 
**If neither `query_embeddings` nor `query_texts` are provided:** A `ValueError` will be raised\n\n```python\n# Basic vector similarity query (embedding_function not used)\nresults = collection.query(\n    query_embeddings=[1.0, 2.0, 3.0],\n    n_results=3\n)\n\n# Iterate over results\nfor i in range(len(results[\"ids\"][0])):\n    print(f\"ID: {results['ids'][0][i]}, Distance: {results['distances'][0][i]}\")\n    if results.get(\"documents\"):\n        print(f\"Document: {results['documents'][0][i]}\")\n    if results.get(\"metadatas\"):\n        print(f\"Metadata: {results['metadatas'][0][i]}\")\n\n# Query by texts - embeddings auto-generated by embedding_function\n# Requires: collection must have embedding_function set\nresults = collection.query(\n    query_texts=[\"my query text\"],\n    n_results=10\n)\n# The collection's embedding_function will automatically convert query_texts to query_embeddings\n\n# Query by multiple texts (batch query)\nresults = collection.query(\n    query_texts=[\"query text 1\", \"query text 2\"],\n    n_results=5\n)\n# Returns dict with lists of lists, one list per query text\nfor i in range(len(results[\"ids\"])):\n    print(f\"Query {i}: {len(results['ids'][i])} results\")\n\n# Query with metadata filter (using query_texts)\nresults = collection.query(\n    query_texts=[\"AI research\"],\n    where={\"category\": {\"$eq\": \"AI\"}},\n    n_results=5\n)\n\n# Query with comparison operator (using query_texts)\nresults = collection.query(\n    query_texts=[\"machine learning\"],\n    where={\"score\": {\"$gte\": 90}},\n    n_results=5\n)\n\n# Query with document filter (using query_texts)\nresults = collection.query(\n    query_texts=[\"neural networks\"],\n    where_document={\"$contains\": \"machine learning\"},\n    n_results=5\n)\n\n# Query with combined filters (using query_texts)\nresults = collection.query(\n    query_texts=[\"AI research\"],\n    where={\"category\": {\"$eq\": \"AI\"}, \"score\": {\"$gte\": 90}},\n    
where_document={\"$contains\": \"machine\"},\n    n_results=5\n)\n\n# Query with multiple embeddings (batch query)\nresults = collection.query(\n    query_embeddings=[[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]],\n    n_results=2\n)\n# Returns dict with lists of lists, one list per query embedding\nfor i in range(len(results[\"ids\"])):\n    print(f\"Query {i}: {len(results['ids'][i])} results\")\n\n# Query with specific fields\nresults = collection.query(\n    query_embeddings=[1.0, 2.0, 3.0],\n    include=[\"documents\", \"metadatas\", \"embeddings\"],\n    n_results=3\n)\n```\n\n**Parameters:**\n- `query_embeddings` (List[float] or List[List[float]], optional): Single embedding or list of embeddings for batch queries\n  - If provided, used directly (embedding_function is ignored)\n  - If not provided, must provide `query_texts` and collection must have `embedding_function`\n- `query_texts` (str or List[str], optional): Query text(s) to be embedded\n  - If `query_embeddings` not provided, `query_texts` will be converted to embeddings using collection's `embedding_function`\n- `n_results` (int, optional): Number of similar results to return (default: 10)\n- `where` (dict, optional): Metadata filter conditions (see Filter Operators section)\n- `where_document` (dict, optional): Document content filter\n- `include` (List[str], optional): List of fields to include: `[\"documents\", \"metadatas\", \"embeddings\"]`\n\n**Returns:**\nDict with keys (chromadb-compatible format):\n- `ids`: `List[List[str]]` - List of ID lists, one list per query\n- `documents`: `Optional[List[List[str]]]` - List of document lists, one list per query (if included)\n- `metadatas`: `Optional[List[List[Dict]]]` - List of metadata lists, one list per query (if included)\n- `embeddings`: `Optional[List[List[List[float]]]]` - List of embedding lists, one list per query (if included)\n- `distances`: `Optional[List[List[float]]]` - List of distance lists, one list per query\n\n**Usage:**\n```python\n# Single 
query\nresults = collection.query(query_embeddings=[0.1, 0.2, 0.3], n_results=5)\n# results[\"ids\"][0] contains IDs for the query\n# results[\"documents\"][0] contains documents for the query\n# results[\"distances\"][0] contains distances for the query\n\n# Multiple queries\nresults = collection.query(query_embeddings=[[0.1, 0.2], [0.3, 0.4]], n_results=5)\n# results[\"ids\"][0] contains IDs for first query\n# results[\"ids\"][1] contains IDs for second query\n```\n\n**Note:** The `embedding_function` used is the one associated with the collection. You cannot override it per-query.\n\n### 5.2 Get (Retrieve by IDs or Filters)\n\nThe `get()` method retrieves documents from a collection without vector similarity search. It supports filtering by IDs, metadata, and document content.\n\n```python\n# Get by single ID\nresults = collection.get(ids=\"123\")\n\n# Get by multiple IDs\nresults = collection.get(ids=[\"1\", \"2\", \"3\"])\n\n# Get by metadata filter (simplified equality - both forms are supported)\nresults = collection.get(\n    where={\"category\": \"AI\"},\n    limit=10\n)\n# Or use explicit $eq operator:\n# where={\"category\": {\"$eq\": \"AI\"}}\n\n# Get by comparison operator\nresults = collection.get(\n    where={\"score\": {\"$gte\": 90}},\n    limit=10\n)\n\n# Get by $in operator\nresults = collection.get(\n    where={\"tag\": {\"$in\": [\"ml\", \"python\"]}},\n    limit=10\n)\n\n# Get by logical operators ($or) - simplified equality\nresults = collection.get(\n    where={\n        \"$or\": [\n            {\"category\": \"AI\"},\n            {\"tag\": \"python\"}\n        ]\n    },\n    limit=10\n)\n\n# Get by document content filter\nresults = collection.get(\n    where_document={\"$contains\": \"machine learning\"},\n    limit=10\n)\n\n# Get with combined filters\nresults = collection.get(\n    where={\"category\": {\"$eq\": \"AI\"}},\n    where_document={\"$contains\": \"machine\"},\n    limit=10\n)\n\n# Get with pagination\nresults = 
collection.get(limit=2, offset=1)\n\n# Get with specific fields\nresults = collection.get(\n    ids=[\"1\", \"2\"],\n    include=[\"documents\", \"metadatas\", \"embeddings\"]\n)\n\n# Get all data (up to limit)\nresults = collection.get(limit=100)\n```\n\n**Parameters:**\n- `ids` (str or List[str], optional): Single ID or list of IDs to retrieve\n- `where` (dict, optional): Metadata filter conditions (see Filter Operators section)\n- `where_document` (dict, optional): Document content filter using `$contains` for full-text search\n- `limit` (int, optional): Maximum number of results to return\n- `offset` (int, optional): Number of results to skip for pagination\n- `include` (List[str], optional): List of fields to include: `[\"documents\", \"metadatas\", \"embeddings\"]`\n\n**Returns:**\nDict with keys (chromadb-compatible format):\n- `ids`: `List[str]` - List of IDs\n- `documents`: `Optional[List[str]]` - List of documents (if included)\n- `metadatas`: `Optional[List[Dict]]` - List of metadata dictionaries (if included)\n- `embeddings`: `Optional[List[List[float]]]` - List of embeddings (if included)\n\n**Usage:**\n```python\n# Get by single ID\nresults = collection.get(ids=\"123\")\n# results[\"ids\"] contains [\"123\"]\n# results[\"documents\"] contains document for ID \"123\"\n\n# Get by multiple IDs\nresults = collection.get(ids=[\"1\", \"2\", \"3\"])\n# results[\"ids\"] contains [\"1\", \"2\", \"3\"]\n# results[\"documents\"] contains documents for all IDs\n\n# Get by filter\nresults = collection.get(where={\"category\": {\"$eq\": \"AI\"}}, limit=10)\n# results[\"ids\"] contains all matching IDs\n# results[\"documents\"] contains all matching documents\n```\n\n**Note:** If no parameters are provided, `get()` returns all data (up to `limit`).\n\n### 5.3 Hybrid Search\n\n`collection.hybrid_search()` runs full-text/scalar queries and vector KNN search in parallel, then fuses the results (RRF is supported). 
You can pass raw dicts/lists or a `HybridSearch` builder (the builder can be given as the first argument or via `search=`; when present it overrides other parameters).\n\n**Parameters (dict mode)**\n- `query` (dict or List[dict], optional): full-text/scalar routes\n  - `where_document`: `$contains` / `$not_contains` plus `$and` / `$or` combinations of those clauses\n  - `where`: metadata filters (see 5.4) including logical operators and `#id`\n  - `boost`: weight for this text route when results are fused\n- `knn` (dict or List[dict], optional): vector routes\n  - `query_embeddings`: `List[float]` or `List[List[float]]`; validated against `collection.dimension` when present\n  - `query_texts`: str or List[str]; auto-embedded with the collection's `embedding_function` (missing function raises `ValueError`)\n  - `where`: metadata filters for this vector route\n  - `n_results`: candidates per vector route (k, default 10)\n  - `boost`: weight for this vector route\n- `rank` (dict, optional): ranking config; RRF tested via `{\"rrf\": {...}}` or `{}`. Omit to use single-route ordering.\n- `n_results` (int): final fused result count (default 10).\n- `include` (List[str], optional): fields to return. `ids`/`distances` are always returned; `documents`/`metadatas` are returned by default when `include` is `None`; add `\"embeddings\"` to fetch vectors.\n- `search` (`HybridSearch`, optional): fluent builder; overrides `query`/`knn`/`rank`/`include`/`n_results`.\n\n**Return format**\n- Query-compatible dict: `ids`, `distances`, optionally `documents` / `metadatas` / `embeddings`. 
Hybrid search returns a single outer list (one fused result set).\n\n**Examples**\n```python\n# Full-text + vector with rank fusion (dict style)\nresults = collection.hybrid_search(\n    query={\n        \"where_document\": {\"$contains\": \"machine learning\"},\n        \"where\": {\"category\": {\"$eq\": \"science\"}},\n        \"boost\": 0.5,\n    },\n    knn={\n        \"query_texts\": [\"AI research\"],  # auto-embedded via collection.embedding_function\n        \"where\": {\"year\": {\"$gte\": 2020}},\n        \"n_results\": 10,  # k per vector route\n        \"boost\": 0.8,\n    },\n    rank={\"rrf\": {\"rank_window_size\": 60, \"rank_constant\": 60}},\n    n_results=5,\n    include=[\"documents\", \"metadatas\", \"embeddings\"],\n)\n\n# Vector-only search using explicit embeddings (dimension is validated)\nresults = collection.hybrid_search(\n    knn={\"query_embeddings\": [[0.1, 0.2, 0.3]], \"n_results\": 8},\n    n_results=5,\n    include=[\"documents\", \"metadatas\"],\n)\n\n# Pass a HybridSearch builder (takes precedence over other args)\nfrom pyseekdb import (\n    HybridSearch,\n    DOCUMENT,\n    TEXT,\n    EMBEDDINGS,\n    K,\n    DOCUMENTS,\n    METADATAS,\n)\n\nsearch = (\n    HybridSearch()\n    .query(DOCUMENT.contains(\"machine learning\"), K(\"category\") == \"AI\", boost=0.6)\n    .knn(TEXT(\"AI research\"), K(\"year\") \u003e= 2020, n_results=10, boost=0.8)\n    .limit(5)\n    .select(DOCUMENTS, METADATAS, EMBEDDINGS)\n    .rank({\"rrf\": {}})\n)\nresults = collection.hybrid_search(search)\n```\n\n**HybridSearch builder tips**\n- Chain multiple `.query(...)` / `.knn(...)` calls to emit multiple routes; providing multiple `query_texts` / `query_embeddings` also expands KNN routes automatically.\n- `.limit(n)` sets the final fused `n_results`; `.select(...)` controls `include` (e.g., `DOCUMENTS`, `METADATAS`, `EMBEDDINGS`).\n- Handy builders for conditions: `DOCUMENT.contains(...)` / `DOCUMENT.not_contains(...)`, `TEXT(\"...\")`, 
`EMBEDDINGS([...])`, `K(\"field\")` with `==`, `!=`, `\u003c`, `\u003c=`, `\u003e`, `\u003e=`, `.in_`, `.nin`; combine document/metadata expressions with `\u0026` and `|`.\n- Embeddings supplied through `EMBEDDINGS(...)` are dimension-checked when the collection defines a dimension.\n\n#### Building a HybridSearch (builder how-to)\n1) Import \u0026 create\n```python\nfrom pyseekdb import HybridSearch, DOCUMENT, TEXT, EMBEDDINGS, K, DOCUMENTS, METADATAS\nhs = HybridSearch()\n```\n2) Add full-text / scalar routes (can be called multiple times)\n```python\nhs = hs.query(\n    DOCUMENT.contains(\"machine learning\") \u0026 DOCUMENT.not_contains(\"deprecated\"),\n    K(\"category\") == \"AI\",\n    K(\"year\") \u003e= 2020,\n    n_results=8,       # candidates per text route\n    boost=0.5          # weight for this text route\n)\n```\n3) Add vector routes (text or explicit embeddings; can be called multiple times)\n```python\n# Text-to-vec (requires collection.embedding_function)\nhs = hs.knn(TEXT([\"AI research\", \"deep learning\"]), K(\"score\") \u003e= 80, n_results=12, boost=1.0)\n\n# Direct embeddings (dimension-validated)\nhs = hs.knn(EMBEDDINGS([0.1, 0.2, 0.3]), K(\"tag\").is_in([\"ml\", \"python\"]), n_results=6, boost=0.7)\n\n# Or pass a ready-to-use knn dict\nhs = hs.knn({\"query_texts\": [\"semantic search\"], \"where\": {\"topic\": {\"$eq\": \"nlp\"}}, \"n_results\": 10, \"boost\": 0.9})\n```\n4) Ranking and final wiring\n```python\nhs = hs.rank()  # defaults to rrf; or hs.rank(\"rrf\", rank_window_size=60, rank_constant=60)\nhs = hs.limit(5)            # final fused result count\nhs = hs.select(DOCUMENTS, METADATAS, EMBEDDINGS)  # include embeddings explicitly when needed\n```\n5) Execute\n```python\nresults = collection.hybrid_search(hs)\n```\n6) Key behaviors \u0026 gotchas\n- Multiple `.query(...)` / `.knn(...)` calls produce multiple routes; `TEXT([...])` or multiple embeddings also auto-expand into multiple routes.\n- `.rank()` defaults to `rrf`; 
only `rrf` is supported, with optional `rank_window_size` and `rank_constant` keyword arguments. The dict form is still accepted but should not be mixed with keyword arguments.\n- `query_texts` requires the collection’s `embedding_function`; otherwise use `query_embeddings`.\n- Dimension mismatches (when `collection.dimension` is known) raise `ValueError`.\n- `ids`/`distances` are always returned; `documents`/`metadatas` are returned by default when `include=None`; add `embeddings` via `.select(...)` or `include` to fetch vectors.\n\n### 5.4 Filter Operators\n\n#### Metadata Filters (`where` parameter)\n- `$eq` (or direct equality) / `$ne` / `$gt` / `$gte` / `$lt` / `$lte`\n- `$in` / `$nin` for membership checks\n- `$or` / `$and` for logical composition\n- `$not` for negation\n- `#id` to filter by primary key (e.g., `{\"#id\": {\"$in\": [\"id1\", \"id2\"]}}`)\n\n#### Document Filters (`where_document` parameter)\n- `$contains`: full-text match\n- `$not_contains`: exclude matches\n- `$or` / `$and` combining multiple `$contains` clauses\n\n### 5.5 Collection Information Methods\n\n```python\n# Get item count\ncount = collection.count()\nprint(f\"Collection has {count} items\")\n\n# Preview first few items in collection (returns all columns by default)\npreview = collection.peek(limit=5)\nfor i in range(len(preview[\"ids\"])):\n    print(f\"ID: {preview['ids'][i]}, Document: {preview['documents'][i]}\")\n    print(f\"Metadata: {preview['metadatas'][i]}, Embedding: {preview['embeddings'][i]}\")\n\n# Count collections in database\ncollection_count = client.count_collection()\nprint(f\"Database has {collection_count} collections\")\n```\n\n**Methods:**\n- `collection.count()` - Get the number of items in the collection\n- `collection.peek(limit=10)` - Quickly preview the first few items in the collection\n- `client.count_collection()` - Count the number of collections in the current database\n\n## 6. 
Embedding Functions\n\nEmbedding functions convert text documents into vector embeddings for similarity search. pyseekdb supports both built-in and custom embedding functions.\n\n### 6.1 Default Embedding Function\n\nThe `DefaultEmbeddingFunction` uses the `all-MiniLM-L6-v2` model and is the default embedding function if none is specified.\n\n```python\nfrom pyseekdb import DefaultEmbeddingFunction\n\n# Use default model (all-MiniLM-L6-v2, 384 dimensions)\nef = DefaultEmbeddingFunction()\n\n# Specify the model explicitly (here the same model as the default)\nef = DefaultEmbeddingFunction(model_name='all-MiniLM-L6-v2')\n\n# Get embedding dimension\nprint(f\"Dimension: {ef.dimension}\")  # 384\n\n# Generate embeddings\nembeddings = ef([\"Hello world\", \"How are you?\"])\nprint(f\"Generated {len(embeddings)} embeddings, each with {len(embeddings[0])} dimensions\")\n```\n\n### 6.2 Creating Custom Embedding Functions\n\nYou can create custom embedding functions by implementing the `EmbeddingFunction` protocol. The function must:\n\n1. Implement a `__call__` method that accepts `Documents` (str or List[str]) and returns `Embeddings` (List[List[float]])\n2. 
Optionally implement a `dimension` property to return the vector dimension\n\n#### Example: Sentence-Transformer Custom Embedding Function\n\n```python\nfrom typing import List, Union\nfrom pyseekdb import EmbeddingFunction\n\nDocuments = Union[str, List[str]]\nEmbeddings = List[List[float]]\nEmbedding = List[float]\n\nclass SentenceTransformerCustomEmbeddingFunction(EmbeddingFunction[Documents]):\n    \"\"\"\n    A custom embedding function using sentence-transformers with a specific model.\n    \"\"\"\n\n    def __init__(self, model_name: str = \"all-MiniLM-L6-v2\", device: str = \"cpu\"):\n        \"\"\"\n        Initialize the sentence-transformer embedding function.\n\n        Args:\n            model_name: Name of the sentence-transformers model to use\n            device: Device to run the model on ('cpu' or 'cuda')\n        \"\"\"\n        self.model_name = model_name\n        self.device = device\n        self._model = None\n        self._dimension = None\n\n    def _ensure_model_loaded(self):\n        \"\"\"Lazy load the embedding model\"\"\"\n        if self._model is None:\n            try:\n                from sentence_transformers import SentenceTransformer\n                self._model = SentenceTransformer(self.model_name, device=self.device)\n                # Get dimension from model\n                test_embedding = self._model.encode([\"test\"], convert_to_numpy=True)\n                self._dimension = len(test_embedding[0])\n            except ImportError:\n                raise ImportError(\n                    \"sentence-transformers is not installed. 
\"\n                    \"Please install it with: pip install sentence-transformers\"\n                )\n\n    @property\n    def dimension(self) -\u003e int:\n        \"\"\"Get the dimension of embeddings produced by this function\"\"\"\n        self._ensure_model_loaded()\n        return self._dimension\n\n    def __call__(self, input: Documents) -\u003e Embeddings:\n        \"\"\"\n        Generate embeddings for the given documents.\n\n        Args:\n            input: Single document (str) or list of documents (List[str])\n\n        Returns:\n            List of embedding vectors, one per document\n        \"\"\"\n        self._ensure_model_loaded()\n\n        # Handle single string input\n        if isinstance(input, str):\n            input = [input]\n\n        # Handle empty input\n        if not input:\n            return []\n\n        # Generate embeddings\n        embeddings = self._model.encode(\n            input,\n            convert_to_numpy=True,\n            show_progress_bar=False\n        )\n\n        # Convert numpy arrays to lists\n        return [embedding.tolist() for embedding in embeddings]\n\n# Use the custom embedding function\nfrom pyseekdb import Configuration, HNSWConfiguration\nef = SentenceTransformerCustomEmbeddingFunction(\n    model_name='all-MiniLM-L6-v2',\n    device='cpu'\n)\ncollection = client.create_collection(\n    name=\"my_collection\",\n    configuration=Configuration(\n        hnsw=HNSWConfiguration(dimension=384, distance='cosine')\n    ),\n    embedding_function=ef\n)\n```\n\n#### Example: OpenAI Embedding Function\n\n```python\nfrom typing import List, Union\nimport os\nimport openai\nfrom pyseekdb import EmbeddingFunction\n\nDocuments = Union[str, List[str]]\nEmbeddings = List[List[float]]\nEmbedding = List[float]\n\nclass OpenAIEmbeddingFunction(EmbeddingFunction[Documents]):\n    \"\"\"\n    A custom embedding function using OpenAI's embedding API.\n    \"\"\"\n\n    def __init__(self, model_name: str = 
\"text-embedding-ada-002\", api_key: str = None):\n        \"\"\"\n        Initialize the OpenAI embedding function.\n\n        Args:\n            model_name: Name of the OpenAI embedding model\n            api_key: OpenAI API key (if not provided, uses OPENAI_API_KEY env var)\n        \"\"\"\n        self.model_name = model_name\n        self.api_key = api_key or os.environ.get('OPENAI_API_KEY')\n        if not self.api_key:\n            raise ValueError(\"OpenAI API key is required\")\n\n        # Dimension for text-embedding-ada-002 is 1536\n        self._dimension = 1536 if \"ada-002\" in model_name else None\n\n    @property\n    def dimension(self) -\u003e int:\n        \"\"\"Get the dimension of embeddings produced by this function\"\"\"\n        if self._dimension is None:\n            # Only text-embedding-ada-002 has a known dimension here;\n            # set self._dimension explicitly for other models\n            raise ValueError(\"Dimension not set for this model\")\n        return self._dimension\n\n    def __call__(self, input: Documents) -\u003e Embeddings:\n        \"\"\"\n        Generate embeddings using OpenAI API.\n\n        Args:\n            input: Single document (str) or list of documents (List[str])\n\n        Returns:\n            List of embedding vectors, one per document\n        \"\"\"\n        # Handle single string input\n        if isinstance(input, str):\n            input = [input]\n\n        # Handle empty input\n        if not input:\n            return []\n\n        # Call the OpenAI API (openai\u003e=1.0 client interface;\n        # the legacy openai.Embedding.create call was removed in 1.0)\n        openai_client = openai.OpenAI(api_key=self.api_key)\n        response = openai_client.embeddings.create(\n            model=self.model_name,\n            input=input\n        )\n\n        # Extract embeddings\n        embeddings = [item.embedding for item in response.data]\n        return embeddings\n\n# Use the custom embedding function\nfrom pyseekdb import Configuration, HNSWConfiguration\nef = OpenAIEmbeddingFunction(\n    model_name='text-embedding-ada-002',\n    api_key='your-api-key'\n)\ncollection = client.create_collection(\n    
name=\"my_collection\",\n    configuration=Configuration(\n        hnsw=HNSWConfiguration(dimension=1536, distance='cosine')\n    ),\n    embedding_function=ef\n)\n```\n\n### 6.3 Embedding Function Requirements\n\nWhen creating a custom embedding function, ensure:\n\n1. **Implement `__call__` method:**\n   - Accepts: `str` or `List[str]` (single document or list of documents)\n   - Returns: `List[List[float]]` (list of embeddings)\n   - Each vector must have the same dimension\n\n2. **Implement `dimension` property (recommended):**\n   - Returns: `int` (the dimension of embeddings produced by this function)\n   - This helps validate dimension consistency when creating collections\n\n3. **Handle edge cases:**\n   - Single string input should be converted to list\n   - Empty input should return empty list\n   - All embeddings in the output must have the same dimension\n\n### 6.4 Using Custom Embedding Functions\n\nOnce you've created a custom embedding function, use it when creating or getting collections:\n\n```python\nfrom pyseekdb import Configuration, HNSWConfiguration\n\n# Create collection with custom embedding function\nef = MyCustomEmbeddingFunction()\ncollection = client.create_collection(\n    name=\"my_collection\",\n    configuration=Configuration(\n        hnsw=HNSWConfiguration(dimension=ef.dimension, distance='cosine')\n    ),\n    embedding_function=ef\n)\n\n# Get collection with custom embedding function\ncollection = client.get_collection(\"my_collection\", embedding_function=ef)\n\n# Use the collection - documents will be automatically embedded\ncollection.add(\n    ids=[\"doc1\", \"doc2\"],\n    documents=[\"Document 1\", \"Document 2\"],  # Embeddings auto-generated\n    metadatas=[{\"tag\": \"A\"}, {\"tag\": \"B\"}]\n)\n\n# Query with texts - query embeddings auto-generated\nresults = collection.query(\n    query_texts=[\"my query\"],\n    n_results=10\n)\n```\n\n## RAG Demo\n\nWe provide a complete RAG (Retrieval-Augmented Generation) demo 
application that demonstrates how to build a hybrid search knowledge base using pyseekdb. The demo includes:\n\n- **Document Import**: Import Markdown files or directories into seekdb\n- **Vector Search**: Semantic search over imported documents\n- **RAG Interface**: Interactive Streamlit web interface for querying\n\nThe demo supports three embedding modes:\n\n- **`default`**: Uses pyseekdb's built-in `DefaultEmbeddingFunction` (ONNX-based, 384 dimensions). No API key is required; models are downloaded automatically on first use.\n- **`local`**: Uses sentence-transformers models (e.g., all-mpnet-base-v2, 768 dimensions). Requires the sentence-transformers library.\n- **`api`**: Uses OpenAI-compatible Embedding API services (e.g., DashScope, OpenAI). Requires API key configuration.\n\nFor detailed instructions, see [demo/rag/README.md](demo/rag/README.md).\n\n## Testing\n\n```bash\n# Run all tests (unit + integration)\npython3 -m pytest -v\n\n# Run tests with log output\npython3 -m pytest -v -s\n\n# Run unit tests only\npython3 -m pytest tests/unit_tests/ -v\n\n# Run integration tests only\npython3 -m pytest tests/integration_tests/ -v\n\n# Run integration tests for specific mode\npython3 -m pytest tests/integration_tests/ -v -k \"embedded\"   # embedded mode\npython3 -m pytest tests/integration_tests/ -v -k \"server\"     # server mode (requires seekdb server)\npython3 -m pytest tests/integration_tests/ -v -k \"oceanbase\"  # oceanbase mode (requires OceanBase)\n\n# Run specific test file\npython3 -m pytest tests/integration_tests/test_collection_query.py -v\n\n# Run specific test function\npython3 -m pytest tests/integration_tests/test_collection_query.py::TestCollectionQuery::test_collection_query -v\n```\n\n## License\n\nThis package is licensed under Apache 
2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foceanbase%2Fpyseekdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foceanbase%2Fpyseekdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foceanbase%2Fpyseekdb/lists"}