https://github.com/kinyugo/madin
An Agentic Framework for Document Retrieval
- Host: GitHub
- URL: https://github.com/kinyugo/madin
- Owner: Kinyugo
- License: mit
- Created: 2025-09-11T21:11:19.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-23T16:16:08.000Z (9 months ago)
- Last Synced: 2025-10-01T04:31:04.682Z (9 months ago)
- Topics: agentic-rag, ai, ai-agents, llm, rag
- Language: Python
- Homepage:
- Size: 719 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# madin: An Agentic Framework for Document Retrieval
madin is a Python framework for building sophisticated, agentic retrieval systems over structured and unstructured documents.
At its core, madin represents documents as structured trees, enabling LLM-powered agents to perform nuanced, fine-grained searches that go beyond simple vector-based retrieval.
## Key Features
- **Agentic Tree Search:** The primary retrieval mechanism is an agentic tree search, allowing an LLM-powered agent to navigate the document's hierarchical structure to find precise answers.
- **Intelligent Content Chunking:** The document processing pipeline splits each document's content into chunks. This step underpins the hybrid retrieval strategy, because the chunks are the document segments used for the initial vector search.
- **Rich Metadata Enrichment:** The document processing pipeline automatically enriches the nodes of the document tree with metadata, such as keywords and named entities. This provides the agent with valuable context for more effective and fine-grained tree traversal.
- **Flexible LLM Agent Support:** The framework is built on `pydantic-ai`, so it can be used with any Large Language Model (LLM) that `pydantic-ai` supports.
- **Internal Structuring of Documents:** madin can process both structured (like Markdown) and unstructured documents by imposing an internal tree-like structure, making any document amenable to agentic tree search.
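To make the tree idea concrete, here is a minimal, self-contained sketch. It does not use madin's API: `Node`, `parse_markdown`, `subtree_keywords`, and `search` are hypothetical names, and a keyword-overlap score stands in for the LLM agent's decision at each level of the tree.
```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node for illustration; not madin's actual schema."""

    id: str
    title: str
    level: int
    content: str = ""
    children: list[Node] = field(default_factory=list)

    @property
    def keywords(self) -> set[str]:
        # Toy "metadata enrichment": lowercase words of 4+ letters.
        return set(re.findall(r"[a-z]{4,}", f"{self.title} {self.content}".lower()))


def parse_markdown(text: str) -> Node:
    """Nest each heading under the nearest preceding heading of a shallower level."""
    root = Node(id="0", title="root", level=0)
    stack = [root]
    counter = 0
    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            level = len(match.group(1))
            while stack[-1].level >= level:
                stack.pop()
            counter += 1
            node = Node(id=str(counter), title=match.group(2).strip(), level=level)
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            # Body text belongs to the most recent heading.
            stack[-1].content += line.strip() + " "
    return root


def subtree_keywords(node: Node) -> set[str]:
    """Keywords of a node plus all of its descendants."""
    words = set(node.keywords)
    for child in node.children:
        words |= subtree_keywords(child)
    return words


def search(root: Node, query: str) -> Node:
    """Greedy descent: follow the child whose subtree overlaps the query most."""
    query_words = set(re.findall(r"[a-z]{4,}", query.lower()))
    node = root
    while node.children:
        best = max(node.children, key=lambda c: len(query_words & subtree_keywords(c)))
        if not query_words & subtree_keywords(best):
            break  # No child is relevant; stop at the current node.
        node = best
    return node


SAMPLE = "\n".join(
    [
        "# Guide",
        "## Installation",
        "Use uv to sync dependencies.",
        "## Retrieval",
        "The agent walks the tree to find answers.",
    ]
)

hit = search(parse_markdown(SAMPLE), "How does the agent find answers?")
print(hit.id, hit.title)
```
In an agentic setup, the scoring step inside `search` would be replaced by a model call that reads each child's title and metadata and chooses where to descend, which is why per-node keywords and entities matter.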
## Getting Started
### Installation
To get started with madin, clone the repository and install the dependencies using `uv`.
```bash
git clone https://github.com/Kinyugo/madin.git
cd madin
uv sync --all-extras
uv pip install -e .
```
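As an optional sanity check after installing, the following standard-library sketch reports whether the `madin` package can be found by the current interpreter. `is_importable` is a hypothetical helper name; the script would typically be run inside the project environment, for example with `uv run python`.
```python
from importlib.util import find_spec


def is_importable(package: str) -> bool:
    """Return True if a top-level package can be located on the import path."""
    return find_spec(package) is not None


print("madin importable:", is_importable("madin"))
```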
## Usage
The core workflow of madin involves processing a document into a structured tree, and then using an agent to retrieve information from it. Here is a complete example of how to perform agentic retrieval on a single document.
```python
import asyncio
import textwrap

import logfire
from dotenv import load_dotenv

from madin import (
    AgentConfig,
    Document,
    DocumentProcessingConfig,
    build_document_retrieval_agent,
    flat_document_to_tree,
    get_document_node_by_id,
    process_document,
    redact_document,
    retrieve_documents,
)

# --- Configuration ---
# Load environment variables from a .env file (e.g., for API keys)
load_dotenv("path/to/.env")

# Configure Logfire for observability
logfire.configure()
logfire.instrument_pydantic_ai()

# Define the model to be used by the agents for all tasks
AGENT_MODEL = "openai:gpt-5-mini"


async def main() -> None:
    """Runs the main madin demonstration workflow."""
    print("Starting the madin example workflow...")

    # --- 1. Create and Process a Document ---
    markdown_content = textwrap.dedent(
        """
        madin: An Agentic Framework for Document Retrieval
        ### Introduction
        madin is a Python framework for building sophisticated, agentic retrieval
        systems over structured and unstructured documents. It excels at understanding
        document hierarchy to provide precise answers.
        Key Features
        - **Agentic Tree Search**: Navigates the document structure like a human would, leading to more context-aware results.
        - **Hybrid Retrieval**: Combines semantic search with structured traversal for scalability and accuracy.
        - **Intelligent Content Chunking**: Dynamically breaks down content based on its semantic meaning and structural importance.
        """
    )
    doc = Document(id="madin-framework-doc", raw_content=markdown_content)

    print("\n[Step 1/3] Processing document into a structured tree...")
    processing_config = DocumentProcessingConfig(
        agent_config=AgentConfig(
            document_content_analysis=AGENT_MODEL,
            document_structure_editing=AGENT_MODEL,
            node_content_analysis=AGENT_MODEL,
            node_content_chunking=AGENT_MODEL,
        ),
    )
    processing_result = await process_document(doc, processing_config)
    processed_document = flat_document_to_tree(processing_result.document)

    print("Processed document structure:")
    print(
        redact_document(processed_document).model_dump_json(
            indent=2, exclude_none=True, exclude_unset=True
        )
    )

    # --- 2. Build a Retrieval Agent ---
    print("\n[Step 2/3] Building the retrieval agent...")
    retrieval_agent = build_document_retrieval_agent(model=AGENT_MODEL)

    # --- 3. Retrieve Relevant Content ---
    query = "What is agentic tree search?"
    print(f"\n[Step 3/3] Retrieving content for query: '{query}'")
    retrieval_results = await retrieve_documents(
        agent=retrieval_agent, documents=[processed_document], query=query
    )

    print("\nRetrieval Complete! Relevant Content Found:")
    if not retrieval_results.results or not retrieval_results.results[0].node_ids:
        print(" - No relevant information found for the query.")
    else:
        # Loop through results and print the content of each retrieved node
        for result in retrieval_results.results:
            for node_id in result.node_ids:
                node = get_document_node_by_id(processed_document, node_id)
                if node:
                    print("-" * 50)
                    print(f"Node ID: {node.id}")
                    print("Content:")
                    print(textwrap.indent(node.content, " > "))


if __name__ == "__main__":
    asyncio.run(main())
```
For a more comprehensive example, including demonstrations of hybrid retrieval strategies, see [notebooks/madin.ipynb](notebooks/madin.ipynb).
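The notebook is the reference for how madin implements hybrid retrieval. As a rough illustration of the general idea only (a cheap vector search over chunks first, then the more expensive agentic search on what survives), here is a self-contained sketch. Everything in it is an assumption: `embed`, `cosine`, and `shortlist` are hypothetical helpers, a bag-of-words vector stands in for a real embedding model, and shortlisting whole documents rather than individual nodes is a choice made for this example.
```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def shortlist(chunks: dict[str, list[str]], query: str, top_k: int = 1) -> list[str]:
    """Stage 1: rank documents by their best-matching chunk and keep the top_k."""
    q = embed(query)
    scores = {
        doc_id: max((cosine(q, embed(chunk)) for chunk in doc_chunks), default=0.0)
        for doc_id, doc_chunks in chunks.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical chunk store: document id -> chunks produced during processing.
CHUNKS = {
    "install": ["clone the repository and install dependencies"],
    "search": [
        "the agent navigates the document tree",
        "tree search finds precise answers",
    ],
}

candidates = shortlist(CHUNKS, "how does tree search work")
print(candidates)
# Stage 2 (not shown): pass only the shortlisted documents to the agentic
# tree search, so the LLM never has to traverse irrelevant documents.
```
The design trade-off is that stage 1 bounds the number of documents the agent must traverse, at the cost of possibly dropping a relevant document whose chunks score poorly against the query.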
## Project Structure
- `madin/`: The main Python library containing all core logic for document processing, schemas, and retrieval algorithms.
- `notebooks/`: Jupyter notebooks for demonstrating and evaluating retrieval strategies.
- `example_data/`: Sample data for running the examples.
- `pyproject.toml`: Project configuration and dependencies.