https://github.com/morphik-org/morphik-core

The most accurate document search and store for building AI apps
artificial-intelligence cache-augmented-generation colpali database litellm multimodal rag rules-based-ingestion

Host: GitHub
URL: https://github.com/morphik-org/morphik-core
Owner: morphik-org
License: other
Created: 2024-11-11T23:47:06.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2026-02-05T23:35:40.000Z (5 months ago)
Last Synced: 2026-02-06T08:50:13.680Z (5 months ago)
Topics: artificial-intelligence, cache-augmented-generation, colpali, database, litellm, multimodal, rag, rules-based-ingestion
Language: Python
Homepage: https://morphik.ai/docs
Size: 125 MB
Stars: 3,472
Watchers: 18
Forks: 288
Open Issues: 25
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

Awesome Lists containing this project

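awesome-ai-agents-2026 - Morphik - 🆕 A multimodal RAG engine for documents containing tables and figures. Rapidly emerged in 2026 as a LlamaIndex alternative for complex PDF processing. ![GitHub stars](https://img.shields.io/badge/dynamic/json?label=Stars&query=%24.stargazers_count&url=https%3A%2F%2Fapi.github.com%2Frepos%2Fmorphik-org%2Fmorphik-core&color=yellow&logo=github&logoColor=white&style=flat&cacheSeconds=300) (🔍 RAG and Knowledge / Other Standards)
StarryDivineSky - morphik-org/morphik-core - morphik-core is an open-source multimodal RAG (retrieval-augmented generation) framework designed to help developers build AI applications on private knowledge. It lets users retrieve and generate information across data modalities (such as text, images, and audio), strengthening an AI application's knowledge understanding and reasoning. The project provides a set of tools and components that simplify building and customizing RAG pipelines, so developers can integrate private knowledge into AI applications more efficiently. Morphik-core's core strengths are its multimodal support and flexible architecture, which developers can extend and customize to their needs. With it, users can build AI applications that understand and use multiple data types, enabling smarter and more personalized user experiences. In short, Morphik-core is a powerful open-source tool that helps developers build multimodal AI applications on private knowledge. (A01_Text Generation_Text Dialogue / Large Language Dialogue Models and Data)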
awesome - morphik-org/morphik-core - The most accurate document search and store for building AI apps (Python)
awesome-side-quests - morphik-org/morphik-core
awesome-opensource-ai - Morphik - Open-source multimodal RAG framework for building AI apps over private knowledge. Handles text, images, and documents with built-in embedding generation and vector search. MIT licensed. ![GitHub stars](https://img.shields.io/github/stars/morphik-org/morphik-core?style=social) (5. Retrieval-Augmented Generation (RAG) & Knowledge)
awesome-github-repos - morphik-org/morphik-core - The most accurate document search and store for building AI apps (Python)
my-awesome - morphik-org/morphik-core - intelligence,cache-augmented-generation,colpali,database,litellm,multimodal,rag,rules-based-ingestion pushed_at:2026-05 star:3.6k fork:0.3k The most accurate document search and store for building AI apps (Python)

README

![Morphik Logo](/morphik_no_pad.png)

# Morphik Core

**Note**: Morphik is launching a hosted service soon! Please sign up for the [waitlist](https://docs.google.com/forms/d/1gFoUKzECICugInLkRlAlgwrkRVorfNywAgkmcjmVGkE/edit).

[![License](https://img.shields.io/badge/license-MIT-blue)](https://github.com/morphik-org/morphik-core/tree/main?tab=License-1-ov-file#readme) [![PyPI - Version](https://img.shields.io/pypi/v/morphik)](https://pypi.org/project/morphik/) [![Discord](https://img.shields.io/discord/1336524712817332276?logo=discord&label=discord)](https://discord.gg/BwMtv3Zaju)

## What is Morphik?

Morphik is an open-source database designed for AI applications that simplifies working with unstructured data. It provides advanced RAG (Retrieval Augmented Generation) capabilities with multi-modal support, knowledge graphs, and intuitive APIs.

Built for scale and performance, Morphik can handle millions of documents while maintaining fast retrieval times. Whether you're prototyping a new AI application or deploying production-grade systems, Morphik provides the infrastructure you need.

## Features

- 📄 **First-class Support for Unstructured Data**

  - Ingest ANY file format (PDFs, videos, text) with intelligent parsing

  - Advanced retrieval with ColPali multi-modal embeddings

  - Automatic document chunking and embedding

- 🧠 **Knowledge Graph Integration**

  - Extract entities and relationships automatically

  - Graph-enhanced retrieval for more relevant results

  - Explore document connections visually

- 🔍 **Advanced RAG Capabilities**

  - Multi-stage retrieval with vector search and reranking

  - Fine-tuned similarity thresholds

  - Detailed metadata filtering

- 📏 **Natural Language Rules Engine**

  - Define schema-like rules for unstructured data

  - Extract structured metadata during ingestion

  - Transform documents with natural language instructions

- 💾 **Persistent KV-caching**

  - Pre-process and "freeze" document states

  - Reduce compute costs and response times

  - Cache selective document subsets

- 🔌 **MCP Support**

  - Model Context Protocol integration

  - Easy knowledge sharing with AI systems

- 🧩 **Extensible Architecture**

  - Support for custom parsers and embedding models

  - Multiple storage backends (S3, local)

  - Vector store integrations (PostgreSQL/pgvector, MongoDB)
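
The retrieval features listed above (metadata filtering, similarity thresholds, vector search, then reranking) can be pictured as a small pipeline. The following is a minimal, dependency-free sketch of that general idea, not Morphik's implementation: the function names, the chunk dictionary layout, and the `rerank` callback are all hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, filters=None, min_score=0.0, k=5, rerank=None):
    """Hypothetical multi-stage retrieval over in-memory chunks.

    Each chunk is assumed to be a dict with "embedding" and "metadata" keys.
    """
    filters = filters or {}

    # Stage 1: keep only chunks whose metadata matches every filter exactly.
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]

    # Stage 2: vector search, dropping anything below the similarity threshold.
    scored = [(cosine(query_vec, c["embedding"]), c) for c in candidates]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]

    # Stage 3: optionally reorder the shortlist with a second scoring function.
    if rerank is not None:
        top = sorted(top, key=lambda pair: rerank(pair[1]), reverse=True)

    return [chunk for _, chunk in top]
```

In a real deployment the second stage would typically run inside the vector store rather than in Python, and the reranker would be a model rather than a plain callback.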

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/morphik-org/morphik-core.git
cd morphik-core

# Create a virtual environment
python3.12 -m venv .venv
source .venv/bin/activate  # Linux/macOS

# Install dependencies
pip install -r requirements.txt

# Configure and start the server
python quick_setup.py
python start_server.py
```

### Using the Python SDK

```python
from morphik import Morphik

# Connect to Morphik server
db = Morphik("morphik://localhost:8000")

# Ingest a document
doc = db.ingest_text(
    "This is a sample document about AI technology.",
    metadata={"category": "tech", "author": "Morphik"},
)

# Ingest a file (PDF, DOCX, video, etc.)
doc = db.ingest_file("path/to/document.pdf", metadata={"category": "research"})

# Use ColPali for multi-modal documents (PDFs with images, charts, etc.)
doc = db.ingest_file("path/to/report_with_charts.pdf", use_colpali=True)

# Apply natural language rules during ingestion
rules = [
    {"type": "metadata_extraction", "schema": {"title": "string", "author": "string"}},
    {"type": "natural_language", "prompt": "Remove all personally identifiable information"},
]
doc = db.ingest_file("path/to/document.pdf", rules=rules)

# Retrieve relevant document chunks
chunks = db.retrieve_chunks(
    "What are the latest AI advancements?",
    filters={"category": "tech"},
    k=5,
)

# Generate a completion with context
response = db.query(
    "Explain the benefits of knowledge graphs in AI applications",
    filters={"category": "research"},
)
print(response.completion)

# Create and use a knowledge graph
db.create_graph("tech_graph", filters={"category": "tech"})
response = db.query(
    "How does AI relate to cloud computing?",
    graph_name="tech_graph",
    hop_depth=2,
)
```
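
The `hop_depth=2` argument in the final query above is not explained in this README. One plausible reading, stated here as an assumption rather than Morphik's documented behavior, is that retrieval starts from entities matched in the question and follows graph edges up to that many hops. The sketch below shows that bounded traversal on a plain adjacency dictionary; `expand_entities` and the sample graph are hypothetical.

```python
from collections import deque


def expand_entities(graph, seeds, hop_depth):
    """Return every entity reachable from `seeds` within `hop_depth` edges.

    `graph` is assumed to map an entity name to an iterable of neighbor names.
    """
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == hop_depth:
            continue  # Do not expand past the hop limit.
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Hypothetical entity graph extracted from documents.
graph = {
    "AI": ["GPU"],
    "GPU": ["cloud computing"],
    "cloud computing": ["billing"],
}
print(sorted(expand_entities(graph, {"AI"}, hop_depth=2)))
```

With a limit of two hops, a question about "AI" can reach "cloud computing" through "GPU" but stops before "billing", which keeps the added context bounded.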

### Batch Operations

```python
from morphik import Morphik

# Connect to the Morphik server (same client as in the SDK example)
db = Morphik("morphik://localhost:8000")

# Ingest multiple files
docs = db.ingest_files(
    ["doc1.pdf", "doc2.pdf"],
    metadata={"category": "research"},
    parallel=True,
)

# Ingest all PDFs in a directory
docs = db.ingest_directory(
    "data/documents",
    recursive=True,
    pattern="*.pdf",
)

# Batch retrieve documents
docs = db.batch_get_documents(["doc_id1", "doc_id2"])
```

### Multi-modal Retrieval (ColPali)

```python
from morphik import Morphik

# Connect to the Morphik server (same client as in the SDK example)
db = Morphik("morphik://localhost:8000")

# Ingest a PDF with charts and images
db.ingest_file("report_with_charts.pdf", use_colpali=True)

# Retrieve relevant chunks, including images
chunks = db.retrieve_chunks(
    "Show me the Q2 revenue chart",
    use_colpali=True,
    k=3,
)

# Process retrieved chunks: image content exposes .show(), text is printed
for chunk in chunks:
    if hasattr(chunk.content, "show"):  # If it's an image
        chunk.content.show()
    else:
        print(chunk.content)
```
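
For background on what `use_colpali=True` implies: the ColPali model scores a page by late interaction, where each query token embedding is matched against its best page patch embedding and the per-token maxima are summed (often called MaxSim). The sketch below illustrates that scoring rule on toy vectors. How Morphik wires this in is not shown in this README, so the function names and data here are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction score: for each query vector, take the best-matching
    page vector, then sum those maxima over the whole query."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages, k=3):
    """Return the ids of the k highest-scoring pages.

    `pages` is assumed to map a page id to its list of patch vectors.
    """
    ranked = sorted(
        pages.items(),
        key=lambda item: maxsim_score(query_vecs, item[1]),
        reverse=True,
    )
    return [page_id for page_id, _ in ranked[:k]]


# Toy data: two query token vectors, two pages with two patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "chart_page": [[1.0, 0.0], [0.0, 1.0]],
    "text_page": [[1.0, 0.0], [1.0, 0.0]],
}
print(rank_pages(query, pages))
```

Because every query token finds its own best patch, a page only needs one region that matches each part of the question, which is why this style of scoring suits pages that mix text with charts.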

## Why Choose Morphik?

| Feature | Morphik | Traditional Vector DBs | Document DBs | LLM Frameworks |
|---------|---------|------------------------|--------------|----------------|
| **Multi-modal Support** | ✅ Advanced ColPali embedding for text + images | ❌ or Limited | ❌ | ❌ |
| **Knowledge Graphs** | ✅ Automated extraction & enhanced retrieval | ❌ | ❌ | ❌ |
| **Rules Engine** | ✅ Natural language rules & schema definition | ❌ | ❌ | Limited |
| **Caching** | ✅ Persistent KV-caching with selective updates | ❌ | ❌ | Limited |
| **Scalability** | ✅ Millions of documents with PostgreSQL/MongoDB | ✅ | ✅ | Limited |
| **Video Content** | ✅ Native video parsing & transcription | ❌ | ❌ | ❌ |
| **Deployment Options** | ✅ Self-hosted, cloud, or hybrid | Varies | Varies | Limited |
| **Open Source** | ✅ MIT License | Varies | Varies | Varies |
| **API & SDK** | ✅ Clean Python SDK & RESTful API | Varies | Varies | Varies |

### Key Advantages

- **ColPali Multi-modal Embeddings**: Process and retrieve from documents based on both textual and visual content, maintaining the visual context that other systems miss.

- **Cache Augmented Retrieval**: Pre-process and "freeze" document states to reduce compute costs by up to 80% and drastically improve response times.

- **Schema-like Rules for Unstructured Data**: Define rules to extract consistent metadata from unstructured content, bringing database-like queryability to any document format.

- **Enterprise-grade Scalability**: Built on proven database technologies (PostgreSQL/MongoDB) that can scale to millions of documents while maintaining sub-second retrieval times.
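
The "pre-process and freeze" idea in the caching bullet above can be illustrated with a small on-disk cache keyed by document content. This is only an analogy under stated assumptions: Morphik's feature concerns model key-value state, whereas this sketch stores any JSON-serializable stand-in, and `FrozenStateCache`, its methods, and the `build` callback are hypothetical names.

```python
import hashlib
import json
from pathlib import Path


class FrozenStateCache:
    """Persist the result of an expensive preprocessing step per document set."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _key(self, documents):
        # Hash the sorted document texts so the same set maps to the same file.
        digest = hashlib.sha256()
        for doc in sorted(documents):
            digest.update(doc.encode("utf-8"))
            digest.update(b"\x00")  # Separator so ["ab", "c"] differs from ["a", "bc"].
        return digest.hexdigest()

    def get_or_build(self, documents, build):
        """Return the cached state for `documents`, calling `build` only on a miss."""
        path = self.cache_dir / f"{self._key(documents)}.json"
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        state = build(documents)
        path.write_text(json.dumps(state), encoding="utf-8")
        return state
```

A caller would pass the selected document subset and a function that performs the costly preprocessing; repeated queries over the same subset then skip that work, including across process restarts.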

## Documentation

For comprehensive documentation:

- [Installation Guide](https://docs.morphik.ai/getting-started)

- [Core Concepts](https://docs.morphik.ai/concepts/naive-rag)

- [Python SDK](https://docs.morphik.ai/python-sdk/morphik)

- [API Reference](https://docs.morphik.ai/api-reference/health-check)

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Community

- [Discord](https://discord.gg/BwMtv3Zaju) - Join our community

- [GitHub](https://github.com/morphik-org/morphik-core) - Contribute to development

---

Built with ❤️ by Morphik
