https://github.com/cocoindex-io/cocoindex
ETL framework to index data for AI, such as RAG, with real-time incremental updates and support for custom logic that composes like Lego.
- Host: GitHub
- URL: https://github.com/cocoindex-io/cocoindex
- Owner: cocoindex-io
- License: apache-2.0
- Created: 2025-03-03T23:03:09.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2025-04-01T20:02:28.000Z (11 days ago)
- Last Synced: 2025-04-01T21:22:22.987Z (11 days ago)
- Topics: ai, change-data-capture, data, data-engineering, data-indexing, data-infrastructure, data-processing, dataflow, etl, help-wanted, indexing, knowledge-graph, llm, pipeline, python, rag, real-time, rust, semantic-search, streaming
- Language: Rust
- Homepage: https://cocoindex.io
- Size: 3.31 MB
- Stars: 420
- Watchers: 4
- Forks: 28
- Open Issues: 25
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- fucking-awesome-rust - cocoindex - ETL framework to build fresh index (Libraries / Data processing)
- awesome-rust - cocoindex - ETL framework to build fresh index (Libraries / Data processing)
- trackawesomelist - CocoIndex (⭐199) - ETL framework to build fresh index for AI, with realtime incremental updates. (Recently Updated / [Mar 15, 2025](/content/2025/03/15/README.md))
- awesome-data-engineering - CocoIndex - An open source ETL framework to build fresh index for AI. (Stream Processing)
- awesome-streaming - CocoIndex - ETL framework to build fresh index for AI, with realtime incremental updates. (Table of Contents / Streaming Engine)
- awesome-github-repos - cocoindex-io/cocoindex - ETL framework to turn your data AI-ready - with realtime incremental updates and support custom logic like lego. (Rust)
- Awesome-RAG - CocoIndex
README
Extract, Transform, Index Data. Easy and Fresh.
CocoIndex is the world's first open-source engine that supports both custom transformation logic and incremental updates specialized for data indexing.
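As a rough illustration of what incremental updating means for a derived index, here is a minimal, self-contained sketch in plain Python. It is not CocoIndex's implementation and uses none of its API. The `IncrementalIndex` class and its `update` method are hypothetical names invented for this example. The sketch assumes change detection by content hash: a source is re-transformed only when its hash differs from the one recorded on the previous run.

```python
import hashlib
from typing import Callable


class IncrementalIndex:
    """Toy index that recomputes derived rows only for sources whose content changed.

    Hypothetical illustration only; not how CocoIndex is implemented.
    """

    def __init__(self, transform: Callable[[str], str]):
        self.transform = transform
        self.fingerprints: dict[str, str] = {}  # source key -> content hash
        self.rows: dict[str, str] = {}          # source key -> derived value

    def update(self, sources: dict[str, str]) -> dict[str, list[str]]:
        """Bring the index in line with `sources` and report what was done."""
        changes: dict[str, list[str]] = {"recomputed": [], "deleted": [], "unchanged": []}

        # Drop rows whose source no longer exists.
        for key in sorted(set(self.rows) - set(sources)):
            del self.rows[key]
            del self.fingerprints[key]
            changes["deleted"].append(key)

        # Recompute only new or modified sources.
        for key in sorted(sources):
            content = sources[key]
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(key) == digest:
                changes["unchanged"].append(key)
                continue
            self.rows[key] = self.transform(content)
            self.fingerprints[key] = digest
            changes["recomputed"].append(key)
        return changes


if __name__ == "__main__":
    index = IncrementalIndex(transform=str.upper)
    print(index.update({"a.md": "hello", "b.md": "world"}))
    print(index.update({"a.md": "hello", "b.md": "world!"}))
```

On the second call only `b.md` is transformed again, because the stored hash for `a.md` still matches.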
With CocoIndex, users declare the transformation; CocoIndex creates and maintains the index and keeps it up to date as the source data changes, with minimal recomputation and changes.

## Quick Start:
If you're new to CocoIndex, we recommend checking out the [Documentation](https://cocoindex.io/docs) and ⚡ [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart). We also have a ▶️ [quick start video tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT) to help you get started.

### Setup
1. Install the CocoIndex Python library:

   ```bash
   pip install -U cocoindex
   ```

2. Set up Postgres with the pgvector extension, or bring up a Postgres database using Docker Compose:
   - Make sure Docker Compose is installed: [docs](https://docs.docker.com/compose/install/)
   - Start a Postgres database for CocoIndex using our Docker Compose config:

   ```bash
   docker compose -f <(curl -L https://raw.githubusercontent.com/cocoindex-io/cocoindex/refs/heads/main/dev/postgres.yaml) up -d
   ```

### Start your first indexing flow!
Follow the [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow.
A common indexing flow looks like:

```python
import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```

This flow reads files from a local directory, splits each document into chunks, embeds each chunk, and exports the chunk embeddings to a Postgres vector index keyed by `filename` and `location`.
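To make the `chunk_size`, `chunk_overlap`, and cosine-similarity settings in the flow more concrete, here is a small plain-Python sketch of the two ideas. It deliberately simplifies: `split_with_overlap` is a hypothetical fixed-width character window, not the algorithm `SplitRecursively` actually uses, and `cosine_similarity` is the textbook formula rather than anything taken from CocoIndex or Postgres.

```python
import math


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_offset, chunk) pairs; consecutive chunks share `chunk_overlap` characters.

    Simplified stand-in for a real splitter: it cuts at fixed character offsets
    and ignores document structure.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    for offset, chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
        print(offset, chunk)
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With `chunk_size=2000` and `chunk_overlap=500`, as in the flow above, each window in this simplified model would start 1500 characters after the previous one.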
### Play with existing examples and demos
Go to the [examples directory](examples) to try any of the examples, following the instructions in each example's directory.

| Example | Description |
|---------|-------------|
| [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |
| [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |
| [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |
| [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |
| [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |
| [Docs to Knowledge Graph](examples/docs_to_kg) | Extract relationships from Markdown documents and build a knowledge graph |

More examples are coming, so stay tuned! If there are specific examples you would like to see, please let us know in our [Discord community](https://discord.com/invite/zpA9S2DR7s).

## Documentation
For detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).

## Contributing
We love contributions from our community ❤️. For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).

## Community
Welcome with a huge coconut hug 🥥. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:
- [Star us on GitHub](https://github.com/cocoindex-io/cocoindex)
- [Start a GitHub Discussion](https://github.com/cocoindex-io/cocoindex/discussions)
- [Join our Discord community](https://discord.com/invite/zpA9S2DR7s)
- [Follow us on X](https://x.com/cocoindex_io)
- [Follow us on LinkedIn](https://www.linkedin.com/company/cocoindex/about/)
- ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)
- [Read our blog posts](https://cocoindex.io/blogs/)

## License
CocoIndex is Apache 2.0 licensed.