https://github.com/apify/actor-vector-database-integrations
Transfer data from Apify Actors to vector databases (Chroma, Milvus, Pinecone, PostgreSQL (PG-Vector), Qdrant, and Weaviate)
- Host: GitHub
- URL: https://github.com/apify/actor-vector-database-integrations
- Owner: apify
- License: apache-2.0
- Created: 2024-05-09T07:06:29.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2025-04-07T09:51:26.000Z (about 1 year ago)
- Last Synced: 2025-04-11T22:11:25.759Z (about 1 year ago)
- Language: Python
- Homepage:
- Size: 105 MB
- Stars: 7
- Watchers: 9
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# Apify Vector Database Integrations
#### Vector database integrations (Actors)
| Actor | Actor badge |
|-----------------------------|---------------------|
| [Chroma](https://apify.com/apify/chroma-integration) | [](https://apify.com/apify/chroma-integration) |
| [Milvus](https://apify.com/apify/milvus-integration) | [](https://apify.com/apify/milvus-integration) |
| [OpenSearch](https://apify.com/apify/opensearch-integration) | [](https://apify.com/apify/opensearch-integration) |
| [PGVector](https://apify.com/apify/pgvector-integration) | [](https://apify.com/apify/pgvector-integration) |
| [Pinecone](https://apify.com/apify/pinecone-integration) | [](https://apify.com/apify/pinecone-integration) |
| [Qdrant](https://apify.com/apify/qdrant-integration) | [](https://apify.com/apify/qdrant-integration) |
| [Weaviate](https://apify.com/apify/weaviate-integration) | [](https://apify.com/apify/weaviate-integration) |
The Apify Vector Database Integrations facilitate the transfer of data from Apify Actors to a vector database.
This process includes data processing, optional splitting into chunks, embedding computation, and data storage.
These integrations support incremental updates, ensuring that only changed data is updated.
This reduces unnecessary embedding computation and storage operations, which makes the integrations well suited to search and retrieval-augmented generation (RAG) use cases.
This repository contains Actors for different vector databases.
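One common way to detect changed data is to compare a checksum of the freshly crawled text with a checksum stored alongside the existing record. The following is a minimal sketch of that idea only; the function names and the `id -> checksum` layout are hypothetical and are not taken from this repository's code.

```python
import hashlib


def checksum(text: str) -> str:
    """Return a stable fingerprint of the text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def split_by_change(
    crawled: dict[str, str], stored: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare fresh crawl results with checksums already in the database.

    `crawled` maps item id -> text, `stored` maps item id -> checksum.
    Returns (ids that need new embeddings, ids that are unchanged).
    """
    to_embed: list[str] = []
    unchanged: list[str] = []
    for item_id, text in crawled.items():
        if stored.get(item_id) == checksum(text):
            unchanged.append(item_id)  # same content, skip embedding
        else:
            to_embed.append(item_id)  # new or modified content
    return to_embed, unchanged


# Hypothetical state: two pages already stored, three pages in the new crawl.
stored = {
    "https://example.com/a": checksum("old text"),
    "https://example.com/b": checksum("same text"),
}
crawled = {
    "https://example.com/a": "new text",
    "https://example.com/b": "same text",
    "https://example.com/c": "brand new page",
}
to_embed, unchanged = split_by_change(crawled, stored)
print("needs embedding:", to_embed)
print("unchanged:", unchanged)
```

Only the ids in `to_embed` would be sent to the embedding provider, which is where the saving in computation and storage operations comes from.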
## How does it work?
1. Retrieve a dataset as output from an Actor.
2. _[Optional]_ Split text data into chunks using [langchain](https://python.langchain.com).
3. _[Optional]_ Update only changed data.
4. Compute embeddings, e.g. using [OpenAI](https://platform.openai.com/docs/guides/embeddings) or [Cohere](https://cohere.com/embeddings).
5. Save data into the database.
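The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation: `split_text` is a simple character-window splitter standing in for the langchain splitter, `toy_embed` is a placeholder for a real embedding model, the store is a dictionary, and step 3 (updating only changed data) is left out to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    text: str
    vector: list[float] = field(default_factory=list)


def split_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text: str) -> list[float]:
    """Placeholder embedding: NOT a real model, just two cheap text statistics."""
    return [float(len(text)), float(text.count(" "))]


def run_pipeline(items: list[dict], chunk_size: int = 20, overlap: int = 5) -> dict[str, Chunk]:
    """Steps 1, 2, 4 and 5: read items, chunk, embed, and save into a dict."""
    store: dict[str, Chunk] = {}
    for item in items:
        for n, piece in enumerate(split_text(item["text"], chunk_size, overlap)):
            chunk = Chunk(chunk_id=f"{item['url']}#{n}", text=piece, vector=toy_embed(piece))
            store[chunk.chunk_id] = chunk
    return store


# Hypothetical dataset item, shaped like a crawler result with `url` and `text` fields.
items = [{"url": "https://example.com", "text": "Apify Actors scrape the web and store results."}]
store = run_pipeline(items)
for chunk_id, chunk in store.items():
    print(chunk_id, repr(chunk.text), chunk.vector)
```

Overlapping windows keep some shared context between neighbouring chunks, at the cost of embedding a few characters twice.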
## Supported Vector Embeddings
- [OpenAI](https://platform.openai.com/docs/guides/embeddings)
- [Cohere](https://cohere.com/embeddings)
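Selecting between the two providers could look like the sketch below. It assumes the `langchain_openai` and `langchain_cohere` packages are installed; the function name `get_embeddings` and its parameters are hypothetical and do not come from this repository. The imports are deferred so that only the package for the chosen provider has to be present.

```python
def get_embeddings(provider: str, api_key: str, model: str):
    """Return a LangChain embeddings object for the given provider name.

    Assumption: provider names match the strings "OpenAI" and "Cohere".
    """
    if provider == "OpenAI":
        # Imported lazily so Cohere-only setups do not need this package.
        from langchain_openai import OpenAIEmbeddings

        return OpenAIEmbeddings(model=model, api_key=api_key)
    if provider == "Cohere":
        from langchain_cohere import CohereEmbeddings

        return CohereEmbeddings(model=model, cohere_api_key=api_key)
    raise ValueError(f"Unsupported embeddings provider: {provider!r}")
```

The returned object exposes `embed_documents` and `embed_query`, the interface LangChain vector stores expect from an embeddings implementation.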
## How to add a new integration (an example for PG-Vector)?
1. Add the database to [docker-compose.yaml](docker-compose.yaml) for local testing (if the database is available as a Docker image).
```
version: '3.8'
services:
  pgvector-container:
    image: pgvector/pgvector:pg16
    environment:
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=apify
    ports:
      - "5432:5432"
```
1. Add postgres dependency to `pyproject.toml`:
```bash
poetry add --group=pgvector "langchain_postgres"
```
and mark the group pgvector as optional (in `pyproject.toml`):
```toml
[tool.poetry.group.pgvector]
optional = true
```
1. Create a new Actor in the `actors` directory, e.g. `actors/pgvector` and add the following files:
- `README.md` - the Actor documentation
- `.actor/actor.json` - the Actor definition
- `.actor/input_schema.json` - the Actor input schema
1. Create a pydantic model for the Actor input schema. Edit the Makefile so that it generates the model from the input schema:
```bash
datamodel-codegen --input $(DIRS_WITH_ACTORS)/pgvector/.actor/input_schema.json --output $(DIRS_WITH_CODE)/src/models/pgvector_input_model.py --input-file-type jsonschema --field-constraints
```
and then run
```bash
make pydantic-model
```
1. Import the created model in `src/models/__init__.py`:
```python
from .pgvector_input_model import PgvectorIntegration
```
1. Create a new module (`pgvector.py`) in the `vector_stores` directory, e.g. `vector_stores/pgvector.py`, and implement the class `PGVectorDatabase` with all required methods.
1. Add PGVector to `SupportedVectorStores` in `constants.py`:
```python
class SupportedVectorStores(str, enum.Enum):
    pgvector = "pgvector"
```
1. Add PGVectorDatabase into `entrypoint.py`
```python
if actor_type == SupportedVectorStores.pgvector.value:
    await run_actor(PgvectorIntegration(**actor_input), actor_input)
```
1. Add `PGVectorDatabase` and `PgvectorIntegration` into `_types.py`
```python
ActorInputsDb: TypeAlias = ChromaIntegration | PgvectorIntegration | PineconeIntegration | QdrantIntegration
VectorDb: TypeAlias = ChromaDatabase | PGVectorDatabase | PineconeDatabase | QdrantDatabase
```
1. Add `PGVectorDatabase` into `vector_stores/vcs.py`
```python
if isinstance(actor_input, PgvectorIntegration):
    from .vector_stores.pgvector import PGVectorDatabase
    return PGVectorDatabase(actor_input, embeddings)
```
1. Add a `PGVectorDatabase` fixture to `tests/conftest.py`
```python
@pytest.fixture()
def db_pgvector(crawl_1: list[Document]) -> PGVectorDatabase:
    db = PGVectorDatabase(
        actor_input=PgvectorIntegration(
            postgresSqlConnectionStr=os.getenv("POSTGRESQL_CONNECTION_STR"),
            postgresCollectionName=INDEX_NAME,
            embeddingsProvider="OpenAI",
            embeddingsApiKey=os.getenv("OPENAI_API_KEY"),
            datasetFields=["text"],
        ),
        embeddings=embeddings,
    )
    db.unit_test_wait_for_index = 0
    db.delete_all()
    # Insert initially crawled objects
    db.add_documents(documents=crawl_1, ids=[d.metadata["id"] for d in crawl_1])
    yield db
    db.delete_all()
```
1. Add the `db_pgvector` fixture into `tests/test_vector_stores.py`
```python
DATABASE_FIXTURES = ["db_pinecone", "db_chroma", "db_qdrant", "db_pgvector"]
```
1. Update README.md in the `actors/pgvector` directory
1. Add the `pgvector` to the README.md in the root directory
1. Run tests
```bash
make test
```
1. Run the Actor locally
```bash
export ACTOR_PATH_IN_DOCKER_CONTEXT=actors/pgvector
apify run -p
```
1. Set up the Actor on the Apify platform at `https://console.apify.com`
Build configuration:
```
Git URL: https://github.com/apify/store-vector-db
Branch: master
Folder: actors/pgvector
```
1. Test the Actor on the Apify platform
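
For orientation when implementing the database class, the test fixture shown earlier calls `delete_all()` and `add_documents(documents=..., ids=...)` on the database object. The sketch below shows an in-memory class with only those two methods. It is an illustration, not the repository's base class: `InMemoryDatabase`, `count`, and the stand-in `Document` dataclass are hypothetical, and a real integration will need whatever further methods the project requires.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the Document type used in the fixture (assumption)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class InMemoryDatabase:
    """Hypothetical store exposing the two methods the fixture relies on."""

    def __init__(self) -> None:
        self._rows: dict[str, Document] = {}

    def add_documents(self, documents: list[Document], ids: list[str]) -> list[str]:
        """Insert or overwrite documents under the given ids."""
        if len(documents) != len(ids):
            raise ValueError("documents and ids must have the same length")
        for doc_id, doc in zip(ids, documents):
            self._rows[doc_id] = doc
        return ids

    def delete_all(self) -> None:
        """Remove every stored document."""
        self._rows.clear()

    def count(self) -> int:
        """Helper for this sketch only: number of stored documents."""
        return len(self._rows)


# Mirror the fixture lifecycle: clean, insert the crawled documents, clean up.
crawl_1 = [
    Document("first page", {"id": "doc-1"}),
    Document("second page", {"id": "doc-2"}),
]
db = InMemoryDatabase()
db.delete_all()
db.add_documents(documents=crawl_1, ids=[d.metadata["id"] for d in crawl_1])
inserted = db.count()
db.delete_all()
print(inserted, db.count())
```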