{"id":25939088,"url":"https://github.com/cocoindex-io/cocoindex","last_synced_at":"2026-04-20T08:02:18.114Z","repository":{"id":280532531,"uuid":"942312530","full_name":"cocoindex-io/cocoindex","owner":"cocoindex-io","description":"Data transformation framework for AI. Ultra performant, with incremental processing.  🌟 Star if you like it!","archived":false,"fork":false,"pushed_at":"2026-04-18T17:13:03.000Z","size":108858,"stargazers_count":6900,"open_issues_count":53,"forks_count":500,"subscribers_count":41,"default_branch":"main","last_synced_at":"2026-04-18T19:16:46.902Z","etag":null,"topics":["agentic-data-framework","ai","ai-agents","change-data-capture","context-engineering","data","data-engineering","data-indexing","data-processing","etl","help-wanted","indexing","knowledge-graph","llm","long-horizon-agent","python","rag","real-time","rust","semantic-search"],"latest_commit_sha":null,"homepage":"https://cocoindex.io","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cocoindex-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-03T23:03:09.000Z","updated_at":"2026-04-18T14:35:16.000Z","dependencies_parsed_at":"2026-03-17T08:03:30.300Z","dependency_job_id":null,"html_url":"https://github.com/cocoindex-io/cocoindex","commit_stats":null,"previous_names":["cocoindex-io/cocoindex"],"tags_count":191,"template":false,"template_full_name":null,"purl":"pkg:github
/cocoindex-io/cocoindex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocoindex-io%2Fcocoindex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocoindex-io%2Fcocoindex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocoindex-io%2Fcocoindex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocoindex-io%2Fcocoindex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cocoindex-io","download_url":"https://codeload.github.com/cocoindex-io/cocoindex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cocoindex-io%2Fcocoindex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32038456,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-data-framework","ai","ai-agents","change-data-capture","context-engineering","data","data-engineering","data-indexing","data-processing","etl","help-wanted","indexing","knowledge-graph","llm","long-horizon-agent","python","rag","real-time","rust","semantic-search"],"created_at":"2025-03-04T04:15:41.620Z","updated_at":"2026-04-20T08:02:18.081Z","avatar_url":"https://github.com/cocoindex-io.png","language":"Rust
","funding_links":[],"categories":["Rust","Libraries","Graph ETL","Recently Updated","A01_文本生成_文本对话","Repos","Table of Contents","Stream Processing","🧰 Frameworks that Facilitate RAG","Python","Corporate and Analytical Applications","Vector Databases \u0026 Retrieval Platforms","\u003ca name=\"Rust\"\u003e\u003c/a\u003eRust","Data Pipelines \u0026 Streaming","Frameworks","开源工具","🤖 AI \u0026 Machine Learning"],"sub_categories":["Data processing","Triple Stores (RDF Databases)","[Mar 15, 2025](/content/2025/03/15/README.md)","大语言对话模型及数据","Streaming Engine","Data Integration and Specialized Solutions","RAG Survey 2024","好用工具"],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://cocoindex.io/images/github.svg\" alt=\"CocoIndex\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eData transformation for AI\u003c/h1\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n[![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex)\n[![Documentation](https://img.shields.io/badge/Documentation-394e79?logo=readthedocs\u0026logoColor=00B9FF)](https://cocoindex.io/docs/getting_started/quickstart)\n[![License](https://img.shields.io/badge/license-Apache%202.0-5B5BD6?logoColor=white)](https://opensource.org/licenses/Apache-2.0)\n[![PyPI version](https://img.shields.io/pypi/v/cocoindex?color=5B5BD6)](https://pypi.org/project/cocoindex/)\n\u003c!--[![PyPI - Downloads](https://img.shields.io/pypi/dm/cocoindex)](https://pypistats.org/packages/cocoindex) --\u003e\n[![PyPI 
Downloads](https://static.pepy.tech/badge/cocoindex/month)](https://pepy.tech/projects/cocoindex)\n[![CI](https://github.com/cocoindex-io/cocoindex/actions/workflows/CI.yml/badge.svg?event=push\u0026color=5B5BD6)](https://github.com/cocoindex-io/cocoindex/actions/workflows/CI.yml)\n[![release](https://github.com/cocoindex-io/cocoindex/actions/workflows/release.yml/badge.svg?event=push\u0026color=5B5BD6)](https://github.com/cocoindex-io/cocoindex/actions/workflows/release.yml)\n[![Link Check](https://github.com/cocoindex-io/cocoindex/actions/workflows/links.yml/badge.svg)](https://github.com/cocoindex-io/cocoindex/actions/workflows/links.yml)\n[![prek](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/j178/prek/master/docs/assets/badge-v0.json)](https://github.com/j178/prek)\n[![Discord](https://img.shields.io/discord/1314801574169673738?logo=discord\u0026color=5B5BD6\u0026logoColor=white)](https://discord.com/invite/zpA9S2DR7s)\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://trendshift.io/repositories/13939\" target=\"_blank\"\u003e\u003cimg src=\"https://trendshift.io/api/badge/repositories/13939\" alt=\"cocoindex-io%2Fcocoindex | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"/\u003e\u003c/a\u003e\n\u003c/div\u003e\n\nUltra performant data transformation framework for AI, with core engine written in Rust. Support incremental processing and data lineage out-of-box.  Exceptional developer velocity. Production-ready at day 0.\n\n⭐ Drop a star to help us grow!\n\n\u003cdiv align=\"center\"\u003e\n\n\u003c!-- Keep these links. Translations will automatically update with the README. 
--\u003e\n[Deutsch](https://readme-i18n.com/cocoindex-io/cocoindex?lang=de) |\n[English](https://readme-i18n.com/cocoindex-io/cocoindex?lang=en) |\n[Español](https://readme-i18n.com/cocoindex-io/cocoindex?lang=es) |\n[français](https://readme-i18n.com/cocoindex-io/cocoindex?lang=fr) |\n[日本語](https://readme-i18n.com/cocoindex-io/cocoindex?lang=ja) |\n[한국어](https://readme-i18n.com/cocoindex-io/cocoindex?lang=ko) |\n[Português](https://readme-i18n.com/cocoindex-io/cocoindex?lang=pt) |\n[Русский](https://readme-i18n.com/cocoindex-io/cocoindex?lang=ru) |\n[中文](https://readme-i18n.com/cocoindex-io/cocoindex?lang=zh)\n\n\u003c/div\u003e\n\n\u003c/br\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://cocoindex.io/images/transformation.svg\" alt=\"CocoIndex Transformation\"\u003e\n\u003c/p\u003e\n\n\u003c/br\u003e\n\nCocoIndex makes it effortless to transform data with AI, and keep source data and target in sync. Whether you’re building a vector index, creating knowledge graphs for context engineering or performing any custom data transformations — goes beyond SQL.\n\n\u003c/br\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg alt=\"CocoIndex Features\" src=\"https://cocoindex.io/images/venn2.svg\" /\u003e\n\u003c/p\u003e\n\n\u003c/br\u003e\n\n## Exceptional velocity\n\nJust declare transformation in dataflow with ~100 lines of python\n\n```python\n# import\ndata['content'] = flow_builder.add_source(...)\n\n# transform\ndata['out'] = data['content']\n    .transform(...)\n    .transform(...)\n\n# collect data\ncollector.collect(...)\n\n# export to db, vector db, graph db ...\ncollector.export(...)\n```\n\nCocoIndex follows the idea of [Dataflow](https://en.wikipedia.org/wiki/Dataflow_programming) programming model. Each transformation creates a new field solely based on input fields, without hidden states and value mutation. 
All data before/after each transformation is observable, with lineage out of the box.\n\n**In particular**, developers don't explicitly mutate data by creating, updating, or deleting records. They only need to define the transformation (formula) for a set of source data.\n\n## Plug-and-Play Building Blocks\n\nNative built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://cocoindex.io/images/components.svg\" alt=\"CocoIndex Features\"\u003e\n\u003c/p\u003e\n\n## Data Freshness\n\nCocoIndex keeps source data and target in sync effortlessly.\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://github.com/user-attachments/assets/f4eb29b3-84ee-4fa0-a1e2-80eedeeabde6\" alt=\"Incremental Processing\" width=\"700\"\u003e\n\u003c/p\u003e\n\nIt has out-of-the-box support for incremental indexing:\n\n- Minimal recomputation when the source or logic changes.\n- (Re-)processing of only the necessary portions, reusing the cache when possible.\n\n## Quick Start\n\nIf you're new to CocoIndex, we recommend checking out:\n\n- 📖 [Documentation](https://cocoindex.io/docs)\n- ⚡  [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart)\n- 🎬 [Quick Start Video Tutorial](https://youtu.be/gv5R8nOXsWU?si=9ioeKYkMEnYevTXT)\n\n### Setup\n\n1. Install the CocoIndex Python library:\n\n```sh\npip install -U cocoindex\n```\n\n2. [Install Postgres](https://cocoindex.io/docs/getting_started/installation#-install-postgres) if you don't have one. CocoIndex uses it for incremental processing.\n\n3. (Optional) Install the Claude Code skill for an enhanced development experience. 
Run these commands in [Claude Code](https://claude.com/claude-code):\n\n```\n/plugin marketplace add cocoindex-io/cocoindex-claude\n/plugin install cocoindex-skills@cocoindex\n```\n\n## Define data flow\n\nFollow [Quick Start Guide](https://cocoindex.io/docs/getting_started/quickstart) to define your first indexing flow. An example flow looks like:\n\n```python\n@cocoindex.flow_def(name=\"TextEmbedding\")\ndef text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):\n    # Add a data source to read files from a directory\n    data_scope[\"documents\"] = flow_builder.add_source(cocoindex.sources.LocalFile(path=\"markdown_files\"))\n\n    # Add a collector for data to be exported to the vector index\n    doc_embeddings = data_scope.add_collector()\n\n    # Transform data of each document\n    with data_scope[\"documents\"].row() as doc:\n        # Split the document into chunks, put into `chunks` field\n        doc[\"chunks\"] = doc[\"content\"].transform(\n            cocoindex.functions.SplitRecursively(),\n            language=\"markdown\", chunk_size=2000, chunk_overlap=500)\n\n        # Transform data of each chunk\n        with doc[\"chunks\"].row() as chunk:\n            # Embed the chunk, put into `embedding` field\n            chunk[\"embedding\"] = chunk[\"text\"].transform(\n                cocoindex.functions.SentenceTransformerEmbed(\n                    model=\"sentence-transformers/all-MiniLM-L6-v2\"))\n\n            # Collect the chunk into the collector.\n            doc_embeddings.collect(filename=doc[\"filename\"], location=chunk[\"location\"],\n                                   text=chunk[\"text\"], embedding=chunk[\"embedding\"])\n\n    # Export collected data to a vector index.\n    doc_embeddings.export(\n        \"doc_embeddings\",\n        cocoindex.targets.Postgres(),\n        primary_key_fields=[\"filename\", \"location\"],\n        vector_indexes=[\n            cocoindex.VectorIndexDef(\n                
field_name=\"embedding\",\n                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])\n```\n\nIt defines an index flow like this:\n\n\u003cp align=\"center\"\u003e\n    \u003cimg width=\"400\" alt=\"Data Flow\" src=\"https://github.com/user-attachments/assets/2ea7be6d-3d94-42b1-b2bd-22515577e463\" /\u003e\n\u003c/p\u003e\n\n## 🚀 Examples and demo\n\n| Example | Description |\n|---------|-------------|\n| [Text Embedding](examples/text_embedding) | Index text documents with embeddings for semantic search |\n| [Code Embedding](examples/code_embedding) | Index code embeddings for semantic search |\n| [PDF Embedding](examples/pdf_embedding) | Parse PDF and index text embeddings for semantic search |\n| [PDF Elements Embedding](examples/pdf_elements_embedding) | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search |\n| [Manuals LLM Extraction](examples/manuals_llm_extraction) | Extract structured information from a manual using LLM |\n| [Amazon S3 Embedding](examples/amazon_s3_embedding) | Index text documents from Amazon S3 |\n| [Azure Blob Storage Embedding](examples/azure_blob_embedding) | Index text documents from Azure Blob Storage |\n| [Google Drive Text Embedding](examples/gdrive_text_embedding) | Index text documents from Google Drive |\n| [Meeting Notes to Knowledge Graph](examples/meeting_notes_graph) | Extract structured meeting info from Google Drive and build a knowledge graph |\n| [Docs to Knowledge Graph](examples/docs_to_knowledge_graph) | Extract relationships from Markdown documents and build a knowledge graph |\n| [Embeddings to Qdrant](examples/text_embedding_qdrant) | Index documents in a Qdrant collection for semantic search |\n| [Embeddings to LanceDB](examples/text_embedding_lancedb) | Index documents in a LanceDB collection for semantic search |\n| [FastAPI Server with Docker](examples/fastapi_server_docker) | Run the semantic search server in a 
Dockerized FastAPI setup |\n| [Product Recommendation](examples/product_recommendation) | Build real-time product recommendations with LLM and graph database|\n| [Image Search with Vision API](examples/image_search) | Generates detailed captions for images using a vision model, embeds them, enables live-updating semantic search via FastAPI and served on a React frontend|\n| [Face Recognition](examples/face_recognition) | Recognize faces in images and build embedding index |\n| [Paper Metadata](examples/paper_metadata) | Index papers in PDF files, and build metadata tables for each paper |\n| [Multi Format Indexing](examples/multi_format_indexing) | Build visual document index from PDFs and images with ColPali for semantic search |\n| [Custom Source HackerNews](examples/custom_source_hn) | Index HackerNews threads and comments, using *CocoIndex Custom Source* |\n| [Custom Output Files](examples/custom_output_files) | Convert markdown files to HTML files and save them to a local directory, using *CocoIndex Custom Targets* |\n| [Patient intake form extraction](examples/patient_intake_extraction) | Use LLM to extract structured data from patient intake forms with different formats |\n| [HackerNews Trending Topics](examples/hn_trending_topics) | Extract trending topics from HackerNews threads and comments, using *CocoIndex Custom Source* and LLM |\n| [Patient Intake Form Extraction with BAML](examples/patient_intake_extraction_baml) | Extract structured data from patient intake forms using BAML |\n| [Patient Intake Form Extraction with DSPy](examples/patient_intake_extraction_dspy) | Extract structured data from patient intake forms using DSPy |\n\nMore coming and stay tuned 👀!\n\n## 📖 Documentation\n\nFor detailed documentation, visit [CocoIndex Documentation](https://cocoindex.io/docs), including a [Quickstart guide](https://cocoindex.io/docs/getting_started/quickstart).\n\n## 🤝 Contributing\n\nWe love contributions from our community ❤️. 
For details on contributing or running the project for development, check out our [contributing guide](https://cocoindex.io/docs/about/contributing).\n\n## 👥 Community\n\nWelcome with a huge coconut hug 🥥⋆｡˚🤗. We are super excited for community contributions of all kinds, whether code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.\n\nJoin our community here:\n\n- 🌟 [Star us on GitHub](https://github.com/cocoindex-io/cocoindex)\n- 👋 [Join our Discord community](https://discord.com/invite/zpA9S2DR7s)\n- ▶️ [Subscribe to our YouTube channel](https://www.youtube.com/@cocoindex-io)\n- 📜 [Read our blog posts](https://cocoindex.io/blogs/)\n\n## Support us\n\nWe are constantly improving, and more features and examples are coming soon. If you love this project, please give us a star ⭐ on our GitHub repo [![GitHub](https://img.shields.io/github/stars/cocoindex-io/cocoindex?color=5B5BD6)](https://github.com/cocoindex-io/cocoindex) to stay tuned and help us grow.\n\n## License\n\nCocoIndex is Apache 2.0 licensed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcocoindex-io%2Fcocoindex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcocoindex-io%2Fcocoindex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcocoindex-io%2Fcocoindex/lists"}