https://github.com/cocoindex-io/cocoindex
ETL framework to make your data AI-ready, with real-time incremental updates and support for custom logic that composes like Lego blocks.
- Host: GitHub
- URL: https://github.com/cocoindex-io/cocoindex
- Owner: cocoindex-io
- License: apache-2.0
- Created: 2025-03-03T23:03:09.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-05-11T01:21:47.000Z (6 months ago)
- Last Synced: 2025-05-11T01:24:35.162Z (6 months ago)
- Topics: ai, change-data-capture, data, data-engineering, data-indexing, data-infrastructure, data-processing, dataflow, etl, help-wanted, indexing, knowledge-graph, llm, pipeline, python, rag, real-time, rust, semantic-search, streaming
- Language: Rust
- Homepage: https://cocoindex.io
- Size: 6.88 MB
- Stars: 1,088
- Watchers: 5
- Forks: 66
- Open Issues: 44
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- fucking-awesome-rust - cocoindex - ETL framework to build fresh index (Libraries / Data processing)
- awesome-rust - cocoindex - ETL framework to build fresh index (Libraries / Data processing)
- trackawesomelist - CocoIndex (⭐199) - ETL framework to build fresh index for AI, with realtime incremental updates. (Recently Updated / [Mar 15, 2025](/content/2025/03/15/README.md))
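- StarryDivineSky - cocoindex-io/cocoindex - cocoindex-io/cocoindex is a real-time data transformation framework designed for AI. It is ultra performant and supports incremental processing, meaning it processes only the parts of the data that changed instead of the whole dataset each time, which improves efficiency. The framework is mainly used to quickly transform and process the data that AI models need. For implementation details and usage, see the project documentation. The project aims to provide an efficient, real-time data processing solution that meets the speed and efficiency demands of AI applications. An ultra-performant data transformation framework built for AI, with a core engine written in Rust. Supports incremental processing and data lineage tracking out of the box. Offers excellent developer productivity and is production-ready from day 0. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)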
- awesome-data-engineering - CocoIndex - An open source ETL framework to build fresh index for AI. (Stream Processing)
- awesome-streaming - CocoIndex - ETL framework to build fresh index for AI, with realtime incremental updates. (Table of Contents / Streaming Engine)
- awesome-github-repos - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
- awesome - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
- Awesome-RAG - CocoIndex
- awesome-ccamel - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
- best-of-python - cocoindex-io/cocoindex (Data Pipelines & Streaming)
README
Extract, Transform, Index Data. Easy and Fresh. 🌴
CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
With CocoIndex, you declare the transformation. CocoIndex then creates and maintains the index, and keeps it up to date as the source data changes, with minimal recomputation and minimal changes to the index.
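To make the idea of incremental updates concrete, here is a minimal sketch in plain Python. It is not CocoIndex's implementation: the helper names `fingerprint` and `incremental_update` are hypothetical, and tracking changes with a content hash is an assumption chosen for illustration. The point is only that unchanged sources are skipped, changed ones are recomputed, and deleted ones are removed from the derived index.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a stable hash of the source content (hypothetical helper)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def incremental_update(sources, index, fingerprints, transform):
    """Sync `index` with `sources`, recomputing only entries that changed.

    Returns the list of keys that were (re)processed in this run.
    """
    processed = []
    for key, content in sources.items():
        fp = fingerprint(content)
        if fingerprints.get(key) != fp:
            # New or modified source: recompute its derived entry.
            index[key] = transform(content)
            fingerprints[key] = fp
            processed.append(key)
    # Drop derived entries whose source no longer exists.
    for key in list(index):
        if key not in sources:
            del index[key]
            fingerprints.pop(key, None)
    return processed


index, fingerprints = {}, {}
docs = {"a.md": "hello", "b.md": "world"}

# First run: everything is new, so both documents are processed.
first = incremental_update(docs, index, fingerprints, str.upper)

# Second run: one document changed and one was deleted.
docs["b.md"] = "world!"
del docs["a.md"]
second = incremental_update(docs, index, fingerprints, str.upper)
```

After the second run only `b.md` is reprocessed, and the entry derived from `a.md` is gone from `index`.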
## Quick Start
If you're new to CocoIndex 🤗, we recommend checking out the 📖 [Documentation](https://cocoindex.io/docs) and ⚡ [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart). We also have a ▶️ [quick start video tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT) to help you get started.
### Setup
1. Install the CocoIndex Python library:
```bash
pip install -U cocoindex
```
2. Set up Postgres with the pgvector extension, or bring up a Postgres database using Docker Compose:
- Make sure Docker Compose is installed: [docs](https://docs.docker.com/compose/install/)
- Start a Postgres database for CocoIndex using our Docker Compose config:
```bash
docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
```
### Start your first indexing flow!
Follow the [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow.
A typical indexing flow looks like this:
```python
import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```
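This defines an index flow that reads documents from a local directory, splits each document into chunks, embeds each chunk, and exports the collected rows to a Postgres vector index.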
### Try the existing examples and demos
Go to the [examples directory](examples) and try any of the examples, following the instructions in each example's directory.
| Example | Description |
|---------|-------------|
| [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |
| [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
| [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
| [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
| [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
| [Docs to Knowledge Graph](examples/docs_to_kg) | Extract relationships from Markdown documents and build a knowledge graph |
More are coming, so stay tuned! If there are specific examples you would like to see, please let us know in our [Discord community](https://discord.com/invite/zpA9S2DR7s) 🌱.
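Several of these examples index embeddings for semantic search, and the flow in the Quick Start exports its vector index with a cosine similarity metric. As a rough illustration of what that metric does at query time, here is a small pure-Python sketch. The functions `cosine_similarity` and `top_k` and the toy two-dimensional vectors are hypothetical. In a real deployment the ranking would be done by the vector index in the database, not in application code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 3):
    """Rank (text, embedding) rows by similarity to `query`, best first."""
    scored = [(cosine_similarity(query, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy rows standing in for the (text, embedding) pairs stored in the index.
rows = [
    ("cats", [1.0, 0.0]),
    ("dogs", [0.8, 0.6]),
    ("cars", [0.0, 1.0]),
]
results = top_k([1.0, 0.0], rows, k=2)
```

Here `results` ranks "cats" first and "dogs" second, because their vectors point in directions closest to the query vector.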
## 📖 Documentation
For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).
## 🤝 Contributing
We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).
## 👥 Community
Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited about community contributions of all kinds, whether code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.
Join our community here:
- 🌟 [Star us on GitHub](https://github.com/cocoindex-io/cocoindex)
- 💬 [Start a GitHub Discussion](https://github.com/cocoindex-io/cocoindex/discussions)
- 👋 [Join our Discord community](https://discord.com/invite/zpA9S2DR7s)
- 𝕏 [Follow us on X](https://x.com/cocoindex_io)
- 🐚 [Follow us on LinkedIn](https://www.linkedin.com/company/cocoindex/about/)
- ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
- 📜 [Read our blog posts](https://cocoindex.io/blogs/)
## License
CocoIndex is Apache 2.0 licensed.