https://github.com/cocoindex-io/cocoindex
ETL framework to make your data AI-ready, with real-time incremental updates and support for custom logic that snaps together like Lego.
- Host: GitHub
- URL: https://github.com/cocoindex-io/cocoindex
- Owner: cocoindex-io
- License: apache-2.0
- Created: 2025-03-03T23:03:09.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-05-11T01:21:47.000Z (9 months ago)
- Last Synced: 2025-05-11T01:24:35.162Z (9 months ago)
- Topics: ai, change-data-capture, data, data-engineering, data-indexing, data-infrastructure, data-processing, dataflow, etl, help-wanted, indexing, knowledge-graph, llm, pipeline, python, rag, real-time, rust, semantic-search, streaming
- Language: Rust
- Homepage: https://cocoindex.io
- Size: 6.88 MB
- Stars: 1,088
- Watchers: 5
- Forks: 66
- Open Issues: 44
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- fucking-awesome-rust - cocoindex - ETL framework to build fresh index (Libraries / Data processing)
- awesome-rust - cocoindex - ETL framework to build fresh index (Libraries / Data processing)
- trackawesomelist - CocoIndex (⭐199) - ETL framework to build fresh index for AI, with realtime incremental updates. (Recently Updated / [Mar 15, 2025](/content/2025/03/15/README.md))
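- StarryDivineSky - cocoindex-io/cocoindex - cocoindex-io/cocoindex is a real-time data transformation framework designed for AI. It is ultra performant and supports incremental processing, meaning it processes only the parts of the data that changed rather than the whole dataset each time, which improves efficiency. The framework is mainly used to quickly transform and process the data that AI models need. For implementation details and usage, see the project documentation. The project aims to provide an efficient, real-time data processing solution that meets the speed and efficiency demands of AI applications. An ultra-performant data transformation framework built for AI, with a core engine written in Rust. Supports incremental processing and data lineage tracking out of the box. Offers exceptional developer velocity and is production-ready from day 0. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)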
- AiTreasureBox - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Repos)
- awesome-streaming - CocoIndex - ETL framework to build fresh index for AI, with realtime incremental updates. (Table of Contents / Streaming Engine)
- awesome - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
- Awesome-RAG - CocoIndex
- best-of-python - cocoindex-io/cocoindex (Data Pipelines & Streaming)
- awesome-ccamel - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
- awesome-github-repos - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
- awesome-repositories - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
- awesome-data-engineering - CocoIndex - An open source ETL framework to build fresh index for AI. (Stream Processing)
- awesome - cocoindex-io/cocoindex - Data transformation framework for AI. Ultra performant, with incremental processing. 🌟 Star if you like it! (Rust)
README
Data transformation for AI
Ultra-performant data transformation framework for AI, with a core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready from day 0.
⭐ Drop a star to help us grow!
[Deutsch](https://readme-i18n.com/cocoindex-io/cocoindex?lang=de) |
[English](https://readme-i18n.com/cocoindex-io/cocoindex?lang=en) |
[Español](https://readme-i18n.com/cocoindex-io/cocoindex?lang=es) |
[français](https://readme-i18n.com/cocoindex-io/cocoindex?lang=fr) |
[日本語](https://readme-i18n.com/cocoindex-io/cocoindex?lang=ja) |
[한국어](https://readme-i18n.com/cocoindex-io/cocoindex?lang=ko) |
[Português](https://readme-i18n.com/cocoindex-io/cocoindex?lang=pt) |
[Русский](https://readme-i18n.com/cocoindex-io/cocoindex?lang=ru) |
[中文](https://readme-i18n.com/cocoindex-io/cocoindex?lang=zh)
CocoIndex makes it effortless to transform data with AI and to keep source data and target in sync. Whether you're building a vector index, creating knowledge graphs for context engineering, or performing any custom data transformation, CocoIndex goes beyond SQL.
## Exceptional velocity
Just declare the transformations in a dataflow with ~100 lines of Python:
```python
# import
data['content'] = flow_builder.add_source(...)
# transform
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)
# collect data
collector.collect(...)
# export to db, vector db, graph db ...
collector.export(...)
```
CocoIndex follows the [Dataflow](https://en.wikipedia.org/wiki/Dataflow_programming) programming model. Each transformation creates a new field based solely on its input fields, with no hidden state and no value mutation. All data before and after each transformation is observable, with lineage out of the box.
**In particular**, developers don't explicitly mutate data by creating, updating, and deleting records. They only need to define the transformation (formula) for a set of source data.
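To illustrate the dataflow idea in plain Python, here is a minimal sketch that does not use the CocoIndex API. The `derive` helper and the `lineage` bookkeeping are hypothetical names introduced only for this illustration: each step returns a new read-only row with one extra field computed purely from the named input fields, and records which inputs that field came from.

```python
from types import MappingProxyType
from typing import Any, Callable, Mapping

# Hypothetical helper for illustration only; not part of the CocoIndex API.
def derive(row: Mapping[str, Any], lineage: Mapping[str, tuple],
           out_field: str, fn: Callable[..., Any], *in_fields: str):
    """Return a new (row, lineage) pair with `out_field` computed from `in_fields`."""
    if out_field in row:
        raise ValueError(f"field {out_field!r} already exists; fields are never overwritten")
    # The new value depends only on the declared input fields.
    value = fn(*(row[name] for name in in_fields))
    # Build new read-only mappings instead of mutating the inputs.
    new_row = MappingProxyType({**row, out_field: value})
    new_lineage = MappingProxyType({**lineage, out_field: in_fields})
    return new_row, new_lineage

source = MappingProxyType({"content": "Hello Dataflow"})
row, lineage = derive(source, {}, "lowered", str.lower, "content")
row, lineage = derive(row, lineage, "word_count", lambda s: len(s.split()), "lowered")
```

Because every intermediate row is kept as its own value, the data before and after each step stays inspectable, and `lineage` answers which inputs produced a given field.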
## Plug-and-Play Building Blocks
Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.
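One way to picture the one-line switch is a shared interface that every target implements. The sketch below is plain Python with hypothetical names (`Target`, `InMemoryTarget`, `JsonLinesTarget`, `run_export`); it illustrates the general pattern and does not reflect CocoIndex's actual classes.

```python
import io
import json
from typing import Iterable, List, Protocol

class Target(Protocol):
    """Hypothetical common interface: accept rows, return how many were written."""
    def write(self, rows: Iterable[dict]) -> int: ...

class InMemoryTarget:
    def __init__(self) -> None:
        self.rows: List[dict] = []

    def write(self, rows: Iterable[dict]) -> int:
        before = len(self.rows)
        self.rows.extend(rows)
        return len(self.rows) - before

class JsonLinesTarget:
    def __init__(self, stream: io.TextIOBase) -> None:
        self.stream = stream

    def write(self, rows: Iterable[dict]) -> int:
        count = 0
        for row in rows:
            self.stream.write(json.dumps(row) + "\n")
            count += 1
        return count

def run_export(rows: Iterable[dict], target: Target) -> int:
    # The pipeline only depends on the interface, never on a concrete target.
    return target.write(rows)

rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
target = InMemoryTarget()  # swap this one line to change the destination
# target = JsonLinesTarget(io.StringIO())
written = run_export(rows, target)
```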
## Data Freshness
CocoIndex keeps source data and target in sync effortlessly.
It has out-of-the-box support for incremental indexing:
- Minimal recomputation when the source or the logic changes.
- Only the necessary portions are (re-)processed; cached results are reused when possible.
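A common way to get this behavior is to fingerprint each source item together with the transformation logic, and to recompute only when the fingerprint changes. The sketch below illustrates that general technique in plain Python. It is an assumption-level illustration, not CocoIndex's actual implementation, and `IncrementalIndex`, `logic_version`, and `sync` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Tuple

class IncrementalIndex:
    """Hypothetical sketch: recompute an item only when its content or the logic changes."""

    def __init__(self, transform: Callable[[str], str], logic_version: str) -> None:
        self.transform = transform
        self.logic_version = logic_version
        self.cache: Dict[str, Tuple[str, str]] = {}  # key -> (fingerprint, output)
        self.recomputed = 0

    def _fingerprint(self, content: str) -> str:
        # Mixing in the logic version invalidates cached results when the logic changes.
        payload = f"{self.logic_version}\0{content}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def sync(self, source: Dict[str, str]) -> Dict[str, str]:
        # Remove target entries whose source item no longer exists.
        for key in list(self.cache):
            if key not in source:
                del self.cache[key]
        # Recompute only new or changed items; reuse the cache otherwise.
        for key, content in source.items():
            fingerprint = self._fingerprint(content)
            cached = self.cache.get(key)
            if cached is None or cached[0] != fingerprint:
                self.cache[key] = (fingerprint, self.transform(content))
                self.recomputed += 1
        return {key: output for key, (_, output) in self.cache.items()}

index = IncrementalIndex(str.upper, logic_version="v1")
index.sync({"a.md": "alpha", "b.md": "beta"})    # both items are computed
index.sync({"a.md": "alpha", "b.md": "beta!"})   # only b.md is recomputed
target = index.sync({"b.md": "beta!"})           # a.md is dropped, nothing recomputed
```

The `recomputed` counter makes the saving visible: three syncs over two files trigger three transform calls in total, not five.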
## Quick Start
If you're new to CocoIndex, we recommend checking out:
- 📖 [Documentation](https://cocoindex.io/docs)
- ⚡ [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart)
- 🎬 [Quick Start Video Tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT)
### Setup
1. Install CocoIndex Python library
```sh
pip install -U cocoindex
```
2. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. CocoIndex uses it for incremental processing.
3. (Optional) Install Claude Code skill for enhanced development experience. Run these commands in [Claude Code](https://claude.com/claude-code):
```
/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex
```
## Define data flow
Follow the [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow. An example flow looks like this:
```python
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))
    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()
    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)
        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))
            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])
    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```
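To make the `chunk_size=2000` and `chunk_overlap=500` arguments concrete, the sketch below shows a simple fixed-size character splitter with overlap. This is not how `SplitRecursively` works internally (that function is language-aware, and its exact boundary rules are not described here); `split_with_overlap` is a hypothetical helper used only to show what size and overlap mean.

```python
from typing import List, Tuple

# Hypothetical helper for illustration; not the SplitRecursively algorithm.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[Tuple[int, str]]:
    """Return (start_offset, chunk_text) pairs; consecutive chunks share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[Tuple[int, str]] = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks

chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
# chunks == [(0, "abcd"), (2, "cdef"), (4, "efgh"), (6, "ghij")]
```

Overlap trades some extra storage and embedding work for context continuity: text near a chunk boundary appears in both neighboring chunks, so a query matching that region can still retrieve a chunk that contains it in full.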
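The `TextEmbedding` flow reads files from the `markdown_files` directory, splits each document into chunks, embeds each chunk, and exports the collected rows to Postgres with a cosine-similarity vector index on the `embedding` field.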
## 🚀 Examples and demo
| Example | Description |
|---------|-------------|
| [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |
| [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
| [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
| [PDF Elements Embedding](examples/pdf_elements_embedding) | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |
| [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
| [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |
| [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |
| [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
| [Meeting Notes to Knowledge Graph](examples/meeting_notes_graph) | Extract structured meeting info from Google Drive and build a knowledge graph |
| [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph |
| [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search |
| [Embeddings to LanceDB](examples/text_embedding_lancedb) | Index documents in a LanceDB collection for semantic search |
| [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a Dockerized FastAPI setup |
| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database|
| [Image Search with Vision API](examples/image_search) | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search via FastAPI with a React frontend |
| [Face Recognition](examples/face_recognition) | Recognize faces in images and build embedding index |
| [Paper Metadata](examples/paper_metadata) | Index papers in PDF files, and build metadata tables for each paper |
| [Multi Format Indexing](examples/multi_format_indexing) | Build visual document index from PDFs and images with ColPali for semantic search |
| [Custom Source HackerNews](examples/custom_source_hn) | Index HackerNews threads and comments, using *CocoIndex Custom Source* |
| [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |
| [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |
| [HackerNews Trending Topics](examples/hn_trending_topics) | Extract trending topics from HackerNews threads and comments, using *CocoIndex Custom Source* and LLM |
| [Patient Intake Form Extraction with BAML](examples/patient_intake_extraction_baml) | Extract structured data from patient intake forms using BAML |
| [Patient Intake Form Extraction with DSPy](examples/patient_intake_extraction_dspy) | Extract structured data from patient intake forms using DSPy |
More are coming, so stay tuned 👀!
## 📖 Documentation
For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).
## 🤝 Contributing
We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).
## 👥 Community
Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited about community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.
Join our community here:
- 🌟 [Star us on GitHub](https://github.com/cocoindex-io/cocoindex)
- 👋 [Join our Discord community](https://discord.com/invite/zpA9S2DR7s)
- ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
- 📜 [Read our blog posts](https://cocoindex.io/blogs/)
## Support us
We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our [GitHub repo](https://github.com/cocoindex-io/cocoindex) to stay tuned and help us grow.
## License
CocoIndex is Apache 2.0 licensed.