{"id":27226049,"url":"https://github.com/morphik-org/morphik-core","last_synced_at":"2026-03-06T07:14:32.573Z","repository":{"id":264717364,"uuid":"886966165","full_name":"morphik-org/morphik-core","owner":"morphik-org","description":"The most accurate document search and store for building AI apps","archived":false,"fork":false,"pushed_at":"2026-02-05T23:35:40.000Z","size":130814,"stargazers_count":3472,"open_issues_count":25,"forks_count":288,"subscribers_count":18,"default_branch":"main","last_synced_at":"2026-02-06T08:50:13.680Z","etag":null,"topics":["artificial-intelligence","cache-augmented-generation","colpali","database","litellm","multimodal","rag","rules-based-ingestion"],"latest_commit_sha":null,"homepage":"https://morphik.ai/docs","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/morphik-org.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-11-11T23:47:06.000Z","updated_at":"2026-02-06T08:23:23.000Z","dependencies_parsed_at":null,"dependency_job_id":"e112c20d-535a-4376-b45b-2adb0b19ff0f","html_url":"https://github.com/morphik-org/morphik-core","commit_stats":null,"previous_names":["databridge-org/databridge-core","morphik-org/morphik-core"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/morphik-org/morphik-core","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morphik-org%2Fmorphik-core","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morphik-org%2Fmorphik-core/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morphik-org%2Fmorphik-core/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morphik-org%2Fmorphik-core/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/morphik-org","download_url":"https://codeload.github.com/morphik-org/morphik-core/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/morphik-org%2Fmorphik-core/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30165139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T04:43:31.446Z","status":"ssl_error","status_checked_at":"2026-03-06T04:40:30.133Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","cache-augmented-generation","colpali","database","litellm","multimodal","rag","rules-based-ingestion"],"created_at":"2025-04-10T11:02:27.730Z","updated_at":"2026-03-06T07:14:32.546Z","avatar_url":"https://github.com/morphik-org.png","language":"Python","funding_links":[],"categories":["🔍 RAG とナレッジ","AI \u0026 LLM","Python","A01_文本生成_文本对话","5. Retrieval-Augmented Generation (RAG) \u0026 Knowledge"],"sub_categories":["その他の標準","RAG \u0026 Vector Search","大语言对话模型及数据"],"readme":"![Morphik Logo](/morphik_no_pad.png)\n\n# Morphik Core\n\n**Note**: Morphik is launching a hosted service soon! Please sign up for the [waitlist](https://docs.google.com/forms/d/1gFoUKzECICugInLkRlAlgwrkRVorfNywAgkmcjmVGkE/edit).\n\n[![License](https://img.shields.io/badge/license-MIT-blue)](https://github.com/morphik-org/morphik-core/tree/main?tab=License-1-ov-file#readme) [![PyPI - Version](https://img.shields.io/pypi/v/morphik)](https://pypi.org/project/morphik/) [![Discord](https://img.shields.io/discord/1336524712817332276?logo=discord\u0026label=discord)](https://discord.gg/BwMtv3Zaju)\n\n## What is Morphik?\n\nMorphik is an open-source database designed for AI applications that simplifies working with unstructured data. It provides advanced RAG (Retrieval Augmented Generation) capabilities with multi-modal support, knowledge graphs, and intuitive APIs.\n\nBuilt for scale and performance, Morphik can handle millions of documents while maintaining fast retrieval times. Whether you're prototyping a new AI application or deploying production-grade systems, Morphik provides the infrastructure you need.\n\n## Features\n\n- 📄 **First-class Support for Unstructured Data**\n  - Ingest ANY file format (PDFs, videos, text) with intelligent parsing\n  - Advanced retrieval with ColPali multi-modal embeddings\n  - Automatic document chunking and embedding\n\n- 🧠 **Knowledge Graph Integration**\n  - Extract entities and relationships automatically\n  - Graph-enhanced retrieval for more relevant results\n  - Explore document connections visually\n\n- 🔍 **Advanced RAG Capabilities**\n  - Multi-stage retrieval with vector search and reranking\n  - Fine-tuned similarity thresholds\n  - Detailed metadata filtering\n\n- 📏 **Natural Language Rules Engine**\n  - Define schema-like rules for unstructured data\n  - Extract structured metadata during ingestion\n  - Transform documents with natural language instructions\n\n- 💾 **Persistent KV-caching**\n  - Pre-process and \"freeze\" document states\n  - Reduce compute costs and response times\n  - Cache selective document subsets\n\n- 🔌 **MCP Support**\n  - Model Context Protocol integration\n  - Easy knowledge sharing with AI systems\n\n- 🧩 **Extensible Architecture**\n  - Support for custom parsers and embedding models\n  - Multiple storage backends (S3, local)\n  - Vector store integrations (PostgreSQL/pgvector, MongoDB)\n\n## Quick Start\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/morphik-org/morphik-core.git\ncd morphik-core\n\n# Create a virtual environment\npython3.12 -m venv .venv\nsource .venv/bin/activate  # Linux/macOS\n\n# Install dependencies\npip install -r requirements.txt\n\n# Configure and start the server\npython quick_setup.py\npython start_server.py\n```\n\n### Using the Python SDK\n\n```python\nfrom morphik import Morphik\n\n# Connect to Morphik server\ndb = Morphik(\"morphik://localhost:8000\")\n\n# Ingest a document\ndoc = db.ingest_text(\"This is a sample document about AI technology.\", \n                    metadata={\"category\": \"tech\", \"author\": \"Morphik\"})\n\n# Ingest a file (PDF, DOCX, video, etc.)\ndoc = db.ingest_file(\"path/to/document.pdf\", \n                    metadata={\"category\": \"research\"})\n\n# Use ColPali for multi-modal documents (PDFs with images, charts, etc.)\ndoc = db.ingest_file(\"path/to/report_with_charts.pdf\", use_colpali=True)\n\n# Apply natural language rules during ingestion\nrules = [\n    {\"type\": \"metadata_extraction\", \"schema\": {\"title\": \"string\", \"author\": \"string\"}},\n    {\"type\": \"natural_language\", \"prompt\": \"Remove all personally identifiable information\"}\n]\ndoc = db.ingest_file(\"path/to/document.pdf\", rules=rules)\n\n# Retrieve relevant document chunks\nchunks = db.retrieve_chunks(\"What are the latest AI advancements?\", \n                           filters={\"category\": \"tech\"}, \n                           k=5)\n\n# Generate a completion with context\nresponse = db.query(\"Explain the benefits of knowledge graphs in AI applications\",\n                   filters={\"category\": \"research\"})\nprint(response.completion)\n\n# Create and use a knowledge graph\ndb.create_graph(\"tech_graph\", filters={\"category\": \"tech\"})\nresponse = db.query(\"How does AI relate to cloud computing?\", \n                   graph_name=\"tech_graph\", \n                   hop_depth=2)\n```\n\n### Batch Operations\n\n```python\n# Ingest multiple files\ndocs = db.ingest_files(\n    [\"doc1.pdf\", \"doc2.pdf\"],\n    metadata={\"category\": \"research\"},\n    parallel=True\n)\n\n# Ingest all PDFs in a directory\ndocs = db.ingest_directory(\n    \"data/documents\",\n    recursive=True,\n    pattern=\"*.pdf\"\n)\n\n# Batch retrieve documents\ndocs = db.batch_get_documents([\"doc_id1\", \"doc_id2\"])\n```\n\n### Multi-modal Retrieval (ColPali)\n\n```python\n# Ingest a PDF with charts and images\ndb.ingest_file(\"report_with_charts.pdf\", use_colpali=True)\n\n# Retrieve relevant chunks, including images\nchunks = db.retrieve_chunks(\n    \"Show me the Q2 revenue chart\", \n    use_colpali=True, \n    k=3\n)\n\n# Process retrieved images\nfor chunk in chunks:\n    if hasattr(chunk.content, 'show'):  # If it's an image\n        chunk.content.show()\n    else:\n        print(chunk.content)\n```\n\n## Why Choose Morphik?\n\n| Feature | Morphik | Traditional Vector DBs | Document DBs | LLM Frameworks |\n|---------|-----------|---------------------|------------|---------------|\n| **Multi-modal Support** | ✅ Advanced ColPali embedding for text + images | ❌ or Limited | ❌ | ❌ |\n| **Knowledge Graphs** | ✅ Automated extraction \u0026 enhanced retrieval | ❌ | ❌ | ❌ |\n| **Rules Engine** | ✅ Natural language rules \u0026 schema definition | ❌ | ❌ | Limited |\n| **Caching** | ✅ Persistent KV-caching with selective updates | ❌ | ❌ | Limited |\n| **Scalability** | ✅ Millions of documents with PostgreSQL/MongoDB | ✅ | ✅ | Limited |\n| **Video Content** | ✅ Native video parsing \u0026 transcription | ❌ | ❌ | ❌ |\n| **Deployment Options** | ✅ Self-hosted, cloud, or hybrid | Varies | Varies | Limited |\n| **Open Source** | ✅ MIT License | Varies | Varies | Varies |\n| **API \u0026 SDK** | ✅ Clean Python SDK \u0026 RESTful API | Varies | Varies | Varies |\n\n### Key Advantages\n\n- **ColPali Multi-modal Embeddings**: Process and retrieve from documents based on both textual and visual content, maintaining the visual context that other systems miss.\n\n- **Cache Augmented Retrieval**: Pre-process and \"freeze\" document states to reduce compute costs by up to 80% and drastically improve response times.\n\n- **Schema-like Rules for Unstructured Data**: Define rules to extract consistent metadata from unstructured content, bringing database-like queryability to any document format.\n\n- **Enterprise-grade Scalability**: Built on proven database technologies (PostgreSQL/MongoDB) that can scale to millions of documents while maintaining sub-second retrieval times.\n\n## Documentation\n\nFor comprehensive documentation:\n\n- [Installation Guide](https://docs.morphik.ai/getting-started)\n- [Core Concepts](https://docs.morphik.ai/concepts/naive-rag)\n- [Python SDK](https://docs.morphik.ai/python-sdk/morphik)\n- [API Reference](https://docs.morphik.ai/api-reference/health-check)\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Community\n\n- [Discord](https://discord.gg/BwMtv3Zaju) - Join our community\n- [GitHub](https://github.com/morphik-org/morphik-core) - Contribute to development\n\n---\n\nBuilt with ❤️ by Morphik\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmorphik-org%2Fmorphik-core","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmorphik-org%2Fmorphik-core","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmorphik-org%2Fmorphik-core/lists"}