https://github.com/danjamk/vector-database-loader

Utility library for curating and loading vector databases
https://github.com/danjamk/vector-database-loader

Last synced: 4 months ago
JSON representation

Utility library for curating and loading vector databases

Host: GitHub
URL: https://github.com/danjamk/vector-database-loader
Owner: danjamk
License: mit
Created: 2025-01-29T16:27:13.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-18T13:12:57.000Z (about 1 year ago)
Last Synced: 2025-09-12T22:08:06.113Z (9 months ago)
Language: Python
Size: 137 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README-PYPI.md
- License: LICENSE

Awesome Lists containing this project

README

          # vector-database-loader

Loading content info a vector database is relatively easy to do, especially with frameworks like [LangChain](https://www.langchain.com/).

However, the process of curating the content and loading it into the database can be a bit more complex.  If you are building 

a RAG application or similar, the quality and relevance of the content is critical.  This project is meant to help with that process.

A use case for this type of project is discussed in more depth in the blog [A Cost-Effective AI Chatbot Architecture with AWS Bedrock, Lambda, and Pinecone](https://medium.com/@dan.jam.kuhn/a-cost-effective-ai-chatbot-architecture-with-aws-bedrock-lambda-and-pinecone-40935b9ec361)

## Features

- **Vector Database Support** - The framework is built to support multiple vector databases but is currently implementing support for [Pinecone](https://www.pinecone.io/).

More vector databases will be added, but if needed you can fork the project and handle your own needs by extending the base class.  

- **Embedding Support** - You can use any embedding provided by [Langchain](https://python.langchain.com/docs/integrations/text_embedding/), which includes OpenAI, AWS Bedrock, HuggingFace, Cohere and much, much more.

- **Content Curation** - The framework is built configure some common content types and sources, but again is meant to be extended a needed.

  - Sources include websites, local folders and google drive

  - Types include PDF, Word, and Web content and google docs

- **Text Splitter** - This framework uses the [RecursiveCharacterTextSplitter](https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/recursive_text_splitter/) from LangChain.  

This is a powerful tool that can split text into chunks of a specified size, while maintaining the context of the text.  This is especially useful for long documents like web pages or PDFs.

## Example

**Install the package with pip:**

Note: The dependencies are kind of beefy!

```pip install vector-database-loader```

**Configuration details:**

Add your Pinecone API and OpenAI keys to your .env file, see sample.env for an examples.

**Code**

```python

import time

from dotenv import load_dotenv, find_dotenv

from langchain_openai import OpenAIEmbeddings

from vector_database_loader.pinecone_vector_db import PineconeVectorLoader, PineconeVectorQuery

# Define your content sources and add them to the array

web_page_content_source = {"name": "SpaceX", "type": "Website", "items": [

    "https://en.wikipedia.org/wiki/SpaceX"

], "chunk_size": 512}

content_sources = [web_page_content_source]

# Load into your vector database.  Be sure to add your Pinecone and OpenAI API keys to your .env file

load_dotenv(find_dotenv())

embedding_client = OpenAIEmbeddings()

index_name = "my-vectordb-index"

vector_db_loader = PineconeVectorLoader(index_name=index_name,

                                        embedding_client=embedding_client)

vector_db_loader.load_sources(content_sources, delete_index=True)

# Query your vector database

print("Waiting 30 seconds before running the query, to make sure the data is available")

time.sleep(30)  # This is needed because there is a latency in the data being available

vector_db_query = PineconeVectorQuery(index_name=index_name,

                                      embedding_client=embedding_client)

query = "What is SpaceX's most recent rocket model being tested?"

documents = vector_db_query.query(query)

print(f"Query: {query} returned {len(documents)} results")

for doc in documents:

    print(f"   {doc.metadata['title']}")

    print(f"   {doc.page_content}")

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/danjamk/vector-database-loader

Awesome Lists containing this project

README