Vector IO



Comprehensive vector data tooling: the universal interface for all vector
databases, datasets, and RAG platforms. Easily export, import, back up,
re-embed (using any model), or access your vector data from any vector
database or repository.






This library uses a universal format for vector datasets to easily
export and import data from all vector databases.


Request support for a VectorDB by voting/commenting on this
poll


See the Contributing section to add
support for your favorite vector database.


Supported Vector Databases


Fully Supported

Both import and export are supported for:

Pinecone
Qdrant
Milvus
GCP Vertex AI Vector Search
KDB.AI
LanceDB
DataStax Astra DB
Chroma
Turbopuffer


Partial

None at present.

In Progress

Import and export support is in progress for:

pgvector
Azure AI Search
Weaviate
MongoDB Atlas
Apache Cassandra
txtai
SQLite-VSS


Not Supported

Not yet supported:

Vespa
AWS Neptune
Neo4j
Marqo
OpenSearch
Elasticsearch
Apache Solr
Redis Search
ClickHouse
USearch
Rockset
Epsilla
Activeloop Deep Lake
ApertureDB
CrateDB
Meilisearch
MyScale
Nuclia DB
OramaSearch
Typesense
Anari AI
Vald

Installation


Using pip



pip install vdf-io


From source



git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
pip install -r requirements.txt


Universal Vector Dataset Format (VDF) specification



  1. VDF_META.json: a JSON file following the VDFMeta schema defined in
    src/vdf_io/meta_types.py (an example instance is sketched after
    this list):



from typing import Any, Dict, List, Optional

from pydantic import BaseModel


class NamespaceMeta(BaseModel):
    namespace: str
    index_name: str
    total_vector_count: int
    exported_vector_count: int
    dimensions: int
    model_name: str | None = None
    vector_columns: List[str] = ["vector"]
    data_path: str
    metric: str | None = None
    index_config: Optional[Dict[Any, Any]] = None
    schema_dict: Optional[Dict[str, Any]] = None


class VDFMeta(BaseModel):
    version: str
    file_structure: List[str]
    author: str
    exported_from: str
    indexes: Dict[str, List[NamespaceMeta]]
    exported_at: str
    id_column: Optional[str] = None



  2. Parquet files/folders for metadata and vectors.
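
For illustration, here is a minimal sketch of how a VDF_META.json could be
built from the models above. All field values below are hypothetical, and
model_dump_json assumes Pydantic v2 (use .json() on Pydantic v1):

meta = VDFMeta(
    version="1.0",
    file_structure=["VDF_META.json", "my-index/i0.parquet"],
    author="jane-doe",
    exported_from="pinecone",
    indexes={
        "my-index": [
            NamespaceMeta(
                namespace="",
                index_name="my-index",
                total_vector_count=100_000,
                exported_vector_count=100_000,
                dimensions=1536,
                model_name="text-embedding-ada-002",
                vector_columns=["vector"],
                data_path="my-index/i0.parquet",
                metric="cosine",
            )
        ]
    },
    exported_at="2024-03-01T12:00:00Z",
    id_column="id",
)
# Serialize to the contents of VDF_META.json (Pydantic v2).
print(meta.model_dump_json(indent=2))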


Export Script



export_vdf --help

usage: export_vdf [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
                  [--push_to_hub | --no-push_to_hub]
                  [--public | --no-public]
                  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch} ...

Export data from various vector databases to the VDF format for vector datasets

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        Name of model used
  --max_file_size MAX_FILE_SIZE
                        Maximum file size in MB (default: 1024)
  --push_to_hub, --no-push_to_hub
                        Push to hub
  --public, --no-public
                        Make dataset public (default: False)

Vector Databases:
  Choose the vector database to export data from

  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
    pinecone            Export data from Pinecone
    qdrant              Export data from Qdrant
    kdbai               Export data from KDB.AI
    milvus              Export data from Milvus
    vertexai_vectorsearch
                        Export data from Vertex AI Vector Search


Import Script



import_vdf --help

usage: import_vdf [-h] [-d DIR] [-s | --subset | --no-subset]
                  [--create_new | --no-create_new]
                  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai} ...

Import data from VDF to a vector database

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory to import
  -s, --subset, --no-subset
                        Import a subset of data (default: False)
  --create_new, --no-create_new
                        Create a new index (default: False)

Vector Databases:
  Choose the vector database to import data into

  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
    milvus              Import data to Milvus
    pinecone            Import data to Pinecone
    qdrant              Import data to Qdrant
    vertexai_vectorsearch
                        Import data to Vertex AI Vector Search
    kdbai               Import data to KDB.AI


Re-embed Script


This script re-embeds a vector dataset. It takes a directory containing
a vector dataset in the VDF format and re-embeds it using a new model.
You can also specify the name of the column containing the text to be
embedded.



reembed_vdf --help

usage: reembed_vdf [-h] -d DIR [-m NEW_MODEL_NAME] [-t TEXT_COLUMN]

Reembed a vector dataset

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory of vector dataset in the VDF format
  -m NEW_MODEL_NAME, --new_model_name NEW_MODEL_NAME
                        Name of new model to be used
  -t TEXT_COLUMN, --text_column TEXT_COLUMN
                        Name of the column containing text to be embedded


Examples



export_vdf -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter


import_vdf -d /path/to/vdf/dataset milvus

reembed_vdf -d /path/to/vdf/dataset -m sentence-transformers/all-MiniLM-L6-v2 -t title


Follow the prompt to select the index and id range to export.
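
Because a VDF dataset is just parquet files plus VDF_META.json, you can
sanity-check an export directly from Python. A minimal sketch, assuming
pandas is installed; the dataset directory my-dataset and index name
my-index are hypothetical, and whether data_path is relative to the
dataset root may vary, so check your own VDF_META.json:

import json

import pandas as pd

# Hypothetical paths: adjust to your exported dataset.
with open("my-dataset/VDF_META.json") as f:
    meta = json.load(f)

ns = meta["indexes"]["my-index"][0]  # first namespace of the index
df = pd.read_parquet(f"my-dataset/{ns['data_path']}")

print(ns["exported_vector_count"], "vectors,", ns["dimensions"], "dimensions")
print(df.head())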


Contributing


Adding a new vector database


If you wish to add an import/export implementation for a new vector
database, you must also implement the other side of the import/export
for the same database. Please fork the repo and send a PR for both the
import and export scripts.


Steps to add a new vector database (ABC):



  1. Add your database name in src/vdf_io/names.py in the DBNames enum
    class.

  2. Create new files src/vdf_io/export_vdf/export_abc.py
    and src/vdf_io/import_vdf/import_abc.py for the new
    DB.


Export:



  1. In your export file, define a class ExportABC which inherits from
    ExportVDF.

  2. Specify a DB_NAME_SLUG for the class.

  3. The class should implement (see the sketch after this list):

    1. make_parser(): add database-specific arguments to the
      export_vdf CLI.

    2. export_vdb(): prompt the user for any info not provided on the
      CLI, then call get_data().

    3. get_data(): download points (in a batched manner) with all
      their metadata from the specified index of the vector database,
      and store them in a series of parquet files/folders. The dataset
      metadata should be stored in a JSON file (VDF_META.json) with the
      schema above.

  4. Use the script to export data from an example index of the vector
    database and verify that the data is exported correctly.
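
A minimal skeleton of what such an exporter might look like. The class
name, slug, and method names come from the steps above; the import path,
method signatures, and the --url flag are assumptions, so mirror an
existing exporter in src/vdf_io/export_vdf/ for the real ones:

from vdf_io.export_vdf import ExportVDF  # exact import path is an assumption


class ExportABC(ExportVDF):
    DB_NAME_SLUG = "abc"

    @classmethod
    def make_parser(cls, subparsers):
        # Register ABC-specific arguments on the export_vdf CLI.
        parser = subparsers.add_parser(cls.DB_NAME_SLUG, help="Export data from ABC")
        parser.add_argument("--url", help="URL of the ABC instance (hypothetical flag)")

    def export_vdb(self):
        # Prompt for anything not supplied on the CLI, then fetch the data.
        self.get_data()

    def get_data(self):
        # Download points in batches with all their metadata, write them to
        # parquet files/folders, and record the schema above in VDF_META.json.
        ...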


Import:



  1. In your import file, define a class ImportABC which inherits from
    ImportVDF.

  2. Specify a DB_NAME_SLUG for the class.

  3. The class should implement (see the sketch after this list):

    1. make_parser(): add database-specific arguments to the
      import_vdf CLI, such as the URL of the database, any
      authentication tokens, etc.

    2. import_vdb(): prompt the user for any info not provided on the
      CLI, then call upsert_data().

    3. upsert_data(): upload points from a VDF dataset (in a batched
      manner) with all their metadata to the specified index of the
      vector database. All metadata about the dataset should be read
      from the VDF_META.json file in the VDF folder.

  4. Use the script to import data from the example VDF dataset exported
    in the previous step and verify that the data is imported
    correctly.
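
The importer skeleton mirrors the exporter. As before, the import path,
signatures, and the --url flag are assumptions; check an existing importer
in src/vdf_io/import_vdf/ for the real interface:

from vdf_io.import_vdf import ImportVDF  # exact import path is an assumption


class ImportABC(ImportVDF):
    DB_NAME_SLUG = "abc"

    @classmethod
    def make_parser(cls, subparsers):
        # Register ABC-specific arguments on the import_vdf CLI.
        parser = subparsers.add_parser(cls.DB_NAME_SLUG, help="Import data to ABC")
        parser.add_argument("--url", help="URL of the ABC instance (hypothetical flag)")

    def import_vdb(self):
        # Prompt for anything not supplied on the CLI, then upsert the data.
        self.upsert_data()

    def upsert_data(self):
        # Read VDF_META.json from the dataset folder, then upload vectors
        # and metadata to the target index in batches.
        ...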


Changing the VDF specification


If you wish to change the VDF specification, please open an issue to
discuss the change before sending a PR.


Efficiency improvements


If you wish to improve the efficiency of the import/export scripts,
please fork the repo and send a PR.


Telemetry


Running the scripts in the repo will send anonymous usage data to AI
Northstar Tech to help improve the library.


You can opt out by setting the environment variable
DISABLE_TELEMETRY_VECTORIO to 1.


Questions


If you have any questions, please open an issue on the repo or
message Dhruv Anand on LinkedIn