https://github.com/pingcap/tidb-vector-python
TiDB Vector SDK for Python, including code examples. Join our Discord: https://discord.gg/XzSW23Jg9p
https://github.com/pingcap/tidb-vector-python
python tidb-vector tidbcloud
Last synced: 8 months ago
JSON representation
TiDB Vector SDK for Python, including code examples. Join our Discord: https://discord.gg/XzSW23Jg9p
- Host: GitHub
- URL: https://github.com/pingcap/tidb-vector-python
- Owner: pingcap
- License: apache-2.0
- Created: 2024-01-12T05:54:21.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-11-15T11:21:09.000Z (about 1 year ago)
- Last Synced: 2025-05-05T19:59:34.389Z (8 months ago)
- Topics: python, tidb-vector, tidbcloud
- Language: Python
- Homepage: https://tidb.cloud/ai
- Size: 585 KB
- Stars: 56
- Watchers: 6
- Forks: 16
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# tidb-vector-python
Use TiDB Vector Search with Python.
## Usage
TiDB is a SQL database so that this package introduces Vector Search capability for Python ORMs:
- [#SQLAlchemy](#sqlalchemy)
- [#Peewee](#peewee)
- [#Django](#django)
Pick one that you are familiar with to get started. If you are not using any of them, we recommend [#SQLAlchemy](#sqlalchemy).
We also provide a Vector Search client for simple usage:
- [#TiDB Vector Client](#tidb-vector-client)
### SQLAlchemy
Install:
```bash
pip install tidb-vector sqlalchemy pymysql
```
Usage:
```python
from sqlalchemy import Integer, Column
from sqlalchemy import create_engine, select
from sqlalchemy.dialects.mysql import LONGTEXT
from sqlalchemy.orm import Session, declarative_base
import tidb_vector
from tidb_vector.sqlalchemy import VectorType, VectorAdaptor
engine = create_engine("mysql+pymysql://root@127.0.0.1:4000/test")
Base = declarative_base()
# Define table schema
class Doc(Base):
__tablename__ = "doc"
id = Column(Integer, primary_key=True)
embedding = Column(VectorType(dim=3))
content = Column(LONGTEXT)
# Create empty table
Base.metadata.drop_all(engine) # clean data from last run
Base.metadata.create_all(engine)
# Create index for L2 distance
VectorAdaptor(engine).create_vector_index(
Doc.embedding, tidb_vector.DistanceMetric.L2, skip_existing=True
# For cosine distance, use tidb_vector.DistanceMetric.COSINE
)
# Insert content with vectors
with Session(engine) as session:
session.add(Doc(id=1, content="dog", embedding=[1, 2, 1]))
session.add(Doc(id=2, content="fish", embedding=[1, 2, 4]))
session.add(Doc(id=3, content="tree", embedding=[1, 0, 0]))
session.commit()
# Perform Vector Search for Top K=1
with Session(engine) as session:
results = session.execute(
select(Doc.id, Doc.content)
.order_by(Doc.embedding.l2_distance([1, 2, 3]))
# For cosine distance, use Doc.embedding.cosine_distance(...)
.limit(1)
).all()
print(results)
# Perform filtered Vector Search by adding a Where Clause:
with Session(engine) as session:
results = session.execute(
select(Doc.id, Doc.content)
.where(Doc.content == "dog")
.order_by(Doc.embedding.l2_distance([1, 2, 3]))
.limit(1)
).all()
print(results)
```
### Peewee
Install:
```bash
pip install tidb-vector peewee pymysql
```
Usage:
```python
import tidb_vector
from peewee import Model, MySQLDatabase, IntegerField, TextField
from tidb_vector.peewee import VectorField, VectorAdaptor
db = MySQLDatabase(
database="test",
user="root",
password="",
host="127.0.0.1",
port=4000,
)
# Define table schema
class Doc(Model):
class Meta:
database = db
table_name = "peewee_test"
id = IntegerField(primary_key=True)
embedding = VectorField(3)
content = TextField()
# Create empty table and index for L2 distance
db.drop_tables([Doc]) # clean data from last run
db.create_tables([Doc])
# For cosine distance, use tidb_vector.DistanceMetric.COSINE
VectorAdaptor(db).create_vector_index(Doc.embedding, tidb_vector.DistanceMetric.L2)
# Insert content with vectors
Doc.insert_many(
[
{"id": 1, "content": "dog", "embedding": [1, 2, 1]},
{"id": 2, "content": "fish", "embedding": [1, 2, 4]},
{"id": 3, "content": "tree", "embedding": [1, 0, 0]},
]
).execute()
# Perform Vector Search for Top K=1
cursor = (
Doc.select(Doc.id, Doc.content)
# For cosine distance, use Doc.embedding.cosine_distance(...)
.order_by(Doc.embedding.l2_distance([1, 2, 3]))
.limit(1)
)
for row in cursor:
print(row.id, row.content)
# Perform filtered Vector Search by adding a Where Clause:
cursor = (
Doc.select(Doc.id, Doc.content)
.where(Doc.content == "dog")
.order_by(Doc.embedding.l2_distance([1, 2, 3]))
.limit(1)
)
for row in cursor:
print(row.id, row.content)
```
### Django
> [!TIP]
>
> Django is a full-featured web framework, not just an ORM. The following usage introducutions are provided for existing Django users.
>
> For new users to get started, consider using SQLAlchemy or Peewee.
Install:
```bash
pip install 'django-tidb[vector]~=5.0.0' 'django~=5.0.0' mysqlclient
```
Usage:
1\. Configure `django_tidb` as engine, like:
```python
DATABASES = {
'default': {
'ENGINE': 'django_tidb',
'NAME': 'django',
'USER': 'root',
'PASSWORD': '',
'HOST': '127.0.0.1',
'PORT': 4000,
},
}
```
2\. Define a model with a vector field and vector index:
```python
from django.db import models
from django_tidb.fields.vector import VectorField, VectorIndex, L2Distance
class Doc(models.Model):
id = models.IntegerField(primary_key=True)
embedding = VectorField(dimensions=3)
content = models.TextField()
class Meta:
indexes = [VectorIndex(L2Distance("embedding"), name="idx")]
```
3\. Insert data:
```python
Doc.objects.create(id=1, content="dog", embedding=[1, 2, 1])
Doc.objects.create(id=2, content="fish", embedding=[1, 2, 4])
Doc.objects.create(id=3, content="tree", embedding=[1, 0, 0])
```
4\. Perform Vector Search for Top K=1:
```python
queryset = (
Doc.objects
.order_by(L2Distance("embedding", [1, 2, 3]))
.values("id", "content")[:1]
)
print(queryset)
```
5\. Perform filtered Vector Search by adding a Where Clause:
```python
queryset = (
Doc.objects
.filter(content="dog")
.order_by(L2Distance("embedding", [1, 2, 3]))
.values("id", "content")[:1]
)
print(queryset)
```
For more details, see [django-tidb](https://github.com/pingcap/django-tidb?tab=readme-ov-file#vector-beta).
### TiDB Vector Client
Within the framework, you can directly utilize the built-in `TiDBVectorClient`, as demonstrated by integrations like [Langchain](https://python.langchain.com/docs/integrations/vectorstores/tidb_vector) and [Llama index](https://docs.llamaindex.ai/en/stable/community/integrations/vector_stores.html#using-a-vector-store-as-an-index), to seamlessly interact with TiDB Vector. This approach abstracts away the need to manage the underlying ORM, simplifying your interaction with the vector store.
We provide `TiDBVectorClient` which is based on sqlalchemy, you need to use `pip install tidb-vector[client]` to install it.
Create a `TiDBVectorClient` instance:
```python
from tidb_vector.integrations import TiDBVectorClient
TABLE_NAME = 'vector_test'
CONNECTION_STRING = 'mysql+pymysql://:@:4000/?ssl_verify_cert=true&ssl_verify_identity=true'
tidb_vs = TiDBVectorClient(
# the table which will store the vector data
table_name=TABLE_NAME,
# tidb connection string
connection_string=CONNECTION_STRING,
# the dimension of the vector, in this example, we use the ada model, which has 1536 dimensions
vector_dimension=1536,
# if recreate the table if it already exists
drop_existing_table=True,
)
```
Bulk insert:
```python
ids = [
"f8e7dee2-63b6-42f1-8b60-2d46710c1971",
"8dde1fbc-2522-4ca2-aedf-5dcb2966d1c6",
"e4991349-d00b-485c-a481-f61695f2b5ae",
]
documents = ["foo", "bar", "baz"]
embeddings = [
text_to_embedding("foo"),
text_to_embedding("bar"),
text_to_embedding("baz"),
]
metadatas = [
{"page": 1, "category": "P1"},
{"page": 2, "category": "P1"},
{"page": 3, "category": "P2"},
]
tidb_vs.insert(
ids=ids,
texts=documents,
embeddings=embeddings,
metadatas=metadatas,
)
```
Query:
```python
tidb_vs.query(text_to_embedding("foo"), k=3)
# query with filter
tidb_vs.query(text_to_embedding("foo"), k=3, filter={"category": "P1"})
```
Bulk delete:
```python
tidb_vs.delete(["f8e7dee2-63b6-42f1-8b60-2d46710c1971"])
# delete with filter
tidb_vs.delete(["f8e7dee2-63b6-42f1-8b60-2d46710c1971"], filter={"category": "P1"})
```
## Examples
There are some examples to show how to use the tidb-vector-python to interact with TiDB Vector in different scenarios.
- [OpenAI Embedding](./examples/openai_embedding/README.md): use the OpenAI embedding model to generate vectors for text data, store them in TiDB Vector, and search for similar text.
- [Image Search](./examples/image_search/README.md): use the OpenAI CLIP model to generate vectors for image and text, store them in TiDB Vector, and search for similar images.
- [LlamaIndex RAG with UI](./examples/llamaindex-tidb-vector-with-ui/README.md): use the LlamaIndex to build an [RAG(Retrieval-Augmented Generation)](https://docs.llamaindex.ai/en/latest/getting_started/concepts/) application.
- [Chat with URL](./llamaindex-tidb-vector/README.md): use LlamaIndex to build an [RAG(Retrieval-Augmented Generation)](https://docs.llamaindex.ai/en/latest/getting_started/concepts/) application that can chat with a URL.
- [GraphRAG](./examples/graphrag-demo/README.md): 20 lines code of using TiDB Serverless to build a Knowledge Graph based RAG application.
- [GraphRAG Step by Step Tutorial](./examples/graphrag-step-by-step-tutorial/README.md): Step by step tutorial to build a Knowledge Graph based RAG application with Colab notebook. In this tutorial, you will learn how to extract knowledge from a text corpus, build a Knowledge Graph, store the Knowledge Graph in TiDB Serverless, and search from the Knowledge Graph.
- [Vector Search Notebook with SQLAlchemy](https://colab.research.google.com/drive/1LuJn4mtKsjr3lHbzMa2RM-oroUvpy83y?usp=sharing): use [SQLAlchemy](https://www.sqlalchemy.org/) to interact with TiDB Serverless: connect db, index&store data and then search vectors.
- [Build RAG with Jina AI Embeddings](./examples/jina-ai-embeddings-demo/README.md): use Jina AI to generate embeddings for text data, store the embeddings in TiDB Vector Storage, and search for similar embeddings.
- [Semantic Cache](./examples/semantic-cache/README.md): build a semantic cache with Jina AI and TiDB Vector.
for more examples, see the [examples](./examples) directory.
## Contributing
Please feel free to reach out to the maintainers if you have any questions or need help with the project. Before contributing, please read the [CONTRIBUTING.md](./CONTRIBUTING.md) file.