Python SDK for Deep-Research
https://github.com/rishang/deep-research
- Host: GitHub
- URL: https://github.com/rishang/deep-research
- Owner: Rishang
- License: apache-2.0
- Created: 2025-04-03T08:16:41.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-04-15T18:18:16.000Z (6 months ago)
- Last Synced: 2025-05-17T23:13:47.950Z (5 months ago)
- Topics: ai, deep-research, deepresearch, docling, litellm
- Language: Python
- Homepage:
- Size: 196 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 🧠 Deep Research

**An AI-powered research assistant that performs comprehensive, autonomous research on any topic**
[Features](#-key-features) •
[Installation](#-installation) •
[Quick Start](#-quick-start) •
[Examples](#-usage-examples) •
[Contribute](#-contributing)

## 🌟 Overview
DeepResearch SDK empowers developers with AI-driven research capabilities, enabling applications to conduct deep, iterative research autonomously. Inspired by products like Perplexity AI and Claude's web browsing, DeepResearch combines **web search, content extraction, and AI analysis** into a unified, easy-to-use API.
Ported from an original TypeScript implementation, this Python SDK adds parallel web scraping, multiple search providers, and a more developer-friendly interface.
```mermaid
graph TD
    subgraph "DeepResearch SDK"
        A[Deep Research] --> W[Web Clients]
        A --> C[LLM Integration]
        A --> D[Research Callbacks]

        W --> B1[DoclingClient]
        W --> B2[DoclingServerClient]
        W --> B3[FirecrawlClient]

        B1 --> E[Brave Search]
        B1 --> F[DuckDuckGo Search]
        B1 --> G[Document Extraction]

        B2 --> E
        B2 --> F
        B2 --> G2[Server-based Extraction]

        B3 --> E
        B3 --> F
        B3 --> G3[Firecrawl Extraction]

        C --> H[GPT Models]
        C --> I[Claude Models]
        C --> J[Other LLMs]

        D --> K[Progress Tracking]
        D --> L[Source Monitoring]
        D --> M[Activity Logging]
    end

    classDef primary fill:#4285F4,stroke:#333,stroke-width:2px,color:white
    classDef secondary fill:#34A853,stroke:#333,stroke-width:2px,color:white
    classDef tertiary fill:#FBBC05,stroke:#333,stroke-width:2px,color:white
    classDef quaternary fill:#EA4335,stroke:#333,stroke-width:2px,color:white

    class A primary
    class W,C,D secondary
    class B1,B2,B3 secondary
    class E,F,G,G2,G3,H,I,J tertiary
    class K,L,M quaternary
```

## 📋 Table of Contents
- [🧠 Deep Research](#-deep-research)
  - [🌟 Overview](#-overview)
  - [📋 Table of Contents](#-table-of-contents)
  - [🚀 Key Features](#-key-features)
  - [📦 Installation](#-installation)
    - [Using pip](#using-pip)
    - [Using Poetry](#using-poetry)
    - [From Source](#from-source)
  - [🏁 Quick Start](#-quick-start)
  - [🔍 Quick Demo](#-quick-demo)
  - [🔬 How It Works](#-how-it-works)
  - [📊 Usage Examples](#-usage-examples)
    - [Web Client Options](#web-client-options)
    - [Search Results with Metadata](#search-results-with-metadata)
    - [Using the Cache System](#using-the-cache-system)
      - [Advanced: Using Structured Models for Cache Keys](#advanced-using-structured-models-for-cache-keys)
    - [Custom Research Parameters](#custom-research-parameters)
    - [Using with Different LLM Providers](#using-with-different-llm-providers)
    - [Accessing Research Sources](#accessing-research-sources)
    - [Scheduling Research Tasks](#scheduling-research-tasks)
  - [🔄 Custom Callbacks](#-custom-callbacks)
  - [🤝 Contributing](#-contributing)
  - [📄 License](#-license)

## 🚀 Key Features
- **📊 Multiple Search Providers**
  - [Brave Search API](https://brave.com/search/api/) for high-quality results
  - [DuckDuckGo Search](https://duckduckgo.com/) as an automatic, API-key-free fallback
  - Fault-tolerant fallback system ensures searches still return results when a provider fails (see the sketch after this list)
  - Comprehensive source tracking with detailed metadata
- **⚡ Parallel Processing**
  - Extract content from multiple sources simultaneously
  - Control concurrency to balance speed and resource usage
- **🔄 Adaptive Research**
  - Automatic gap identification in research
  - Self-guided exploration of topics
  - Depth-first approach with backtracking
- **🧩 Modular Architecture**
  - Multiple web client options (Docling, Docling-Server, Firecrawl)
  - Easily extensible for custom search providers
  - Plug in different LLM backends through LiteLLM
  - Event-driven callback system for monitoring progress
- **🛠️ Developer-Friendly**
  - Async/await interface for integration with modern Python applications
  - Type hints throughout for IDE autocompletion
  - Comprehensive Pydantic models for structured data
  - Rich metadata for search results (provider, publication date, relevance)
  - Optional caching system for search and extraction results
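
The fallback behavior reduces to "try Brave first, fall back to DuckDuckGo on failure". A minimal sketch of the idea, with hypothetical stand-in provider functions (the SDK wires this up internally, so the real code differs):

```python
import asyncio


async def brave_search(query: str) -> list[dict]:
    """Hypothetical primary provider; simulates a failure here."""
    raise RuntimeError("Brave Search unavailable (e.g. missing API key)")


async def duckduckgo_search(query: str) -> list[dict]:
    """Hypothetical fallback provider that needs no API key."""
    return [{"title": "Example result", "url": "https://example.com"}]


async def search_with_fallback(query: str) -> list[dict]:
    # Try the primary provider first; fall back on any failure so a
    # search always returns results.
    try:
        return await brave_search(query)
    except Exception:
        return await duckduckgo_search(query)


print(asyncio.run(search_with_fallback("quantum computing")))
```

## 📦 Installation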
### Using pip
```bash
# Install from PyPI
pip install deep-research-sdk
```

### Using Poetry
```bash
# Add to your Poetry project
poetry add deep-research-sdk
```

### From Source
```bash
# Clone the repository
git clone https://github.com/Rishang/deep-research-sdk.git
cd deep-research-sdk

poetry install
```

## 🏁 Quick Start
```python
import asyncio
import os
import logging
from deep_research import DeepResearch
from deep_research.utils import DoclingClient, DoclingServerClient, FirecrawlClient
from deep_research.utils.cache import CacheConfig

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")


async def main():
    # Get API keys from environment variables
    openai_api_key = os.environ.get("OPENAI_API_KEY")
    brave_api_key = os.environ.get("BRAVE_SEARCH_API_KEY")  # Optional
    firecrawl_api_key = os.environ.get("FIRECRAWL_API_KEY")  # Optional

    if not openai_api_key:
        print("Error: OPENAI_API_KEY environment variable not set")
        return

    # Initialize the DeepResearch instance.
    # Optional: configure caching for search and extraction results.
    # If you don't want to use caching, simply set `cache_config=None`.
    cache_config = CacheConfig(
        enabled=True,  # Enable caching
        ttl_seconds=3600,  # Cache entries expire after 1 hour
        db_url="sqlite:///docling_cache.sqlite3",  # SQLite database for the cache
    )

    # OPTION 1: Use the standard Docling client
    researcher = DeepResearch(
        web_client=DoclingClient(
            cache_config=cache_config,
            brave_api_key=brave_api_key,  # Optional: if None, DuckDuckGo is used (no API key needed)
        ),
        llm_api_key=openai_api_key,
        research_model="gpt-4o-mini",  # Advanced research model
        reasoning_model="o3-mini",  # More efficient model for reasoning
        max_depth=3,  # Maximum research depth
        time_limit_minutes=2,  # Time limit in minutes
    )

    # OPTION 2: Use the Docling Server client
    # async with DoclingServerClient(
    #     server_url="http://localhost:8000",  # URL of your docling-serve instance
    #     brave_api_key=brave_api_key,
    #     cache_config=cache_config
    # ) as docling_server:
    #     researcher = DeepResearch(
    #         web_client=docling_server,
    #         llm_api_key=openai_api_key,
    #         research_model="gpt-4o-mini",
    #         reasoning_model="o3-mini",
    #         max_depth=3,
    #         time_limit_minutes=2
    #     )
    #     # ... rest of your code

    # OPTION 3: Use the Firecrawl client
    # if firecrawl_api_key:
    #     async with FirecrawlClient(
    #         api_key=firecrawl_api_key,
    #         cache_config=cache_config
    #     ) as firecrawl:
    #         researcher = DeepResearch(
    #             web_client=firecrawl,
    #             llm_api_key=openai_api_key,
    #             research_model="gpt-4o-mini",
    #             reasoning_model="o3-mini",
    #             max_depth=3,
    #             time_limit_minutes=2
    #         )
    #         # ... rest of your code

    # Perform research
    result = await researcher.research("The impact of quantum computing on cryptography")

    # Process results
    if result.success:
        print("\n==== RESEARCH SUCCESSFUL ====")
        print(f"Found {len(result.data['findings'])} pieces of information")
        print(f"Used {len(result.data['sources'])} sources")

        # Access the sources used in research
        for i, source in enumerate(result.data['sources']):
            print(f"Source {i + 1}: {source['title']} - {source['url']}")

        print(f"Analysis:\n{result.data['analysis']}")
    else:
        print(f"Research failed: {result.error}")


if __name__ == "__main__":
    asyncio.run(main())
```

## 🔍 Quick Demo
Want to see the DeepResearch SDK in action quickly? Follow these steps:
1. **Set up your environment variables**:
```bash
# [Optional] Brave Search API key; if not provided, DuckDuckGo is used (no API key required)
export BRAVE_SEARCH_API_KEY="your-brave-api-key"

# Set your LLM API key in your environment
export OPENAI_API_KEY="your-openai-api-key"
```

2. **Clone the repository**:
```bash
git clone https://github.com/Rishang/deep-research-sdk.git
cd deep-research-sdk
```

3. **Run the included demo script**:
```bash
python example.py
```

You'll see research progress in real time and, within about a minute, get an AI-generated analysis of the benefits of regular exercise.
## 🔬 How It Works
DeepResearch implements an iterative research process:
1. **Initialization**: Configure models, search providers and parameters
2. **Search**: Find relevant sources on the topic from multiple providers
3. **Extraction**: Process and extract content from top sources in parallel
4. **Analysis**: Analyze findings, identify knowledge gaps and plan next steps
5. **Iteration**: Continue research with refined focus based on identified gaps
6. **Synthesis**: Generate comprehensive analysis with citations

```mermaid
flowchart LR
A[Initialization] --> B[Search]
B --> C[Extraction]
C --> D[Analysis]
D --> E{Continue?}
E -->|Yes| F[Refine Focus]
F --> B
E -->|No| G[Final Synthesis]

style A fill:#4285F4,stroke:#333,stroke-width:2px,color:white
style B fill:#34A853,stroke:#333,stroke-width:2px,color:white
style C fill:#FBBC05,stroke:#333,stroke-width:2px,color:white
style D fill:#EA4335,stroke:#333,stroke-width:2px,color:white
style E fill:#673AB7,stroke:#333,stroke-width:2px,color:white
style F fill:#FF9800,stroke:#333,stroke-width:2px,color:white
style G fill:#2196F3,stroke:#333,stroke-width:2px,color:white
```
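
Conceptually, the loop above can be sketched in a few lines of async Python. This is an illustrative toy with stand-in helpers, not the SDK's internal implementation:

```python
import asyncio

# Toy stand-ins for the real pipeline stages, just to make the sketch run.
async def search(focus: str) -> list[str]:
    return [f"https://example.com/{focus.replace(' ', '-')}/{i}" for i in range(3)]

async def extract_parallel(urls: list[str]) -> list[str]:
    return await asyncio.gather(*(asyncio.sleep(0, result=f"content of {u}") for u in urls))

async def analyze(findings: list[str]) -> list[str]:
    # Pretend the LLM reports one knowledge gap until enough findings accumulate.
    return [] if len(findings) >= 6 else ["open subtopic to explore"]

async def synthesize(findings: list[str]) -> str:
    return f"analysis built from {len(findings)} findings"

async def research_loop(topic: str, max_depth: int = 3) -> str:
    findings: list[str] = []
    focus = topic
    for _ in range(max_depth):
        urls = await search(focus)                # 2. Search
        findings += await extract_parallel(urls)  # 3. Extraction (parallel)
        gaps = await analyze(findings)            # 4. Analysis: identify gaps
        if not gaps:                              # 5. Iterate only while gaps remain
            break
        focus = gaps[0]                           # refine focus for the next pass
    return await synthesize(findings)             # 6. Synthesis

print(asyncio.run(research_loop("renewable energy storage")))
```

## 📊 Usage Examples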
### Web Client Options
DeepResearch supports multiple web client implementations:
```python
from deep_research.utils import DoclingClient, DoclingServerClient, FirecrawlClient
from deep_research.utils.cache import CacheConfig

# 1. Standard Docling Client (local HTML parsing)
docling_client = DoclingClient(
    brave_api_key="your-brave-key",  # Optional
    max_concurrent_requests=8,
    cache_config=CacheConfig(enabled=True)
)

# 2. Docling Server Client (connects to a remote docling-serve instance)
docling_server_client = DoclingServerClient(
    server_url="http://localhost:8000",  # URL of your docling-serve instance
    brave_api_key="your-brave-key",  # Optional
    max_concurrent_requests=8,
    cache_config=CacheConfig(enabled=True)
)

# 3. Firecrawl Client (connects to the Firecrawl API)
firecrawl_client = FirecrawlClient(
    api_key="your-firecrawl-api-key",  # Required
    api_url="https://api.firecrawl.dev",  # Default Firecrawl API URL
    max_concurrent_requests=8,
    cache_config=CacheConfig(enabled=True)
)

# All clients implement the BaseWebClient interface
# and can be used interchangeably with DeepResearch
```
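
The `max_concurrent_requests` option caps how many extractions run at once. The underlying pattern is an `asyncio.Semaphore` bound, roughly like this sketch (illustrative only, with a fake `fetch` standing in for real HTTP work):

```python
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real HTTP fetch + parse
    return f"content of {url}"

async def extract_all(urls: list[str], max_concurrent: int = 8) -> list[str]:
    sem = asyncio.Semaphore(max_concurrent)  # at most `max_concurrent` in flight

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

print(asyncio.run(extract_all([f"https://example.com/page/{i}" for i in range(20)])))
```

### Search Results with Metadata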
The DeepResearch SDK returns search results with rich metadata, including:
```python
from deep_research.utils import DoclingClient

client = DoclingClient()

# Get search results
search_results = await client.search("artificial intelligence")

# Access metadata in search results
for result in search_results.data:
    print(f"Title: {result.title}")
    print(f"URL: {result.url}")
    print(f"Description: {result.description}")
    print(f"Provider: {result.provider}")  # Which search engine provided this result
    print(f"Date: {result.date}")  # Publication date, when available
    print(f"Relevance: {result.relevance}")
```

### Using the Cache System
```python
from deep_research.utils.cache import CacheConfig
from deep_research.utils import DoclingClient, DoclingServerClient, FirecrawlClient

# Configure the cache with SQLite (default)
cache_config = CacheConfig(
    enabled=True,  # Enable caching
    ttl_seconds=3600,  # Cache for 1 hour
    db_url="sqlite:///docling_cache.db",  # Use SQLite
    create_tables=True  # Create tables if they don't exist
)

# Initialize a client with caching (works with any client type)
client = DoclingClient(
    cache_config=cache_config
)

# Search results will be cached and reused for identical queries
search_result = await client.search("quantum computing")
# Second call uses cached data
search_result = await client.search("quantum computing")

# To disable caching completely, simply don't provide a cache_config:
client_no_cache = DoclingClient()  # No caching

# For a MySQL/MariaDB backend instead of SQLite,
# first install pymysql: pip install pymysql
mysql_config = CacheConfig(
    enabled=True,
    db_url="mysql+pymysql://username:password@localhost/docling_cache"
)
```
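
Conceptually, a TTL cache stores a timestamp alongside each entry, and a lookup older than `ttl_seconds` counts as a miss. A minimal in-memory sketch of the concept (the SDK persists entries in SQL via `db_url` rather than a dict, so this is illustrative only):

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def set(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic(), value)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: evict and treat as a miss
            return None
        return value

cache = TTLCache(ttl_seconds=3600)
cache.set("search:quantum computing", ["result1", "result2"])
print(cache.get("search:quantum computing"))  # hit within the TTL window
```

#### Advanced: Using Structured Models for Cache Keys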
```python
from pydantic import BaseModel
from deep_research.utils.cache import cache

# Define which parameters should be used for caching
class SearchParams(BaseModel):
    query: str
    max_results: int = 10

@cache(structure=SearchParams)
async def search_function(self, query: str, max_results: int = 10):
    # Only query and max_results will be used for the cache key;
    # other parameters are ignored for caching purposes
    return results
```
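
One plausible way such a decorator can derive a key: validate the call's arguments against the model, then serialize it deterministically. This is a hypothetical sketch of the idea, not the SDK's actual key derivation:

```python
from pydantic import BaseModel

class SearchParams(BaseModel):
    query: str
    max_results: int = 10

def cache_key(params: SearchParams) -> str:
    # Deterministic JSON of the validated fields makes identical calls collide
    # on the same key. Hypothetical; the real decorator may hash or prefix differently.
    return "search_function:" + params.model_dump_json()

print(cache_key(SearchParams(query="llms", max_results=5)))
# -> search_function:{"query":"llms","max_results":5}
```

### Custom Research Parameters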
```python
# Configure research with custom parameters
result = await researcher.research(
    topic="Emerging trends in renewable energy storage",
    max_tokens=3000,  # Control output length
    temperature=0.7  # Add more creativity to the analysis
)
```

### Using with Different LLM Providers
DeepResearch uses LiteLLM, which supports multiple LLM providers:
```python
# Use Anthropic Claude models
researcher = DeepResearch(
    # ... other parameters
    research_model="anthropic/claude-3-opus-20240229",
    reasoning_model="anthropic/claude-3-haiku-20240307",
)
```

### Accessing Research Sources
DeepResearch tracks all sources used in the research process and returns them in the result:
```python
# Run the research
result = await researcher.research("Advances in quantum computing")# Access the sources used in research
if result.success:
print(f"Research used {len(result.data['sources'])} sources:")
for i, source in enumerate(result.data['sources']):
print(f"Source {i+1}: {source['title']}")
print(f" URL: {source['url']}")
print(f" Relevance: {source['relevance']}")
if source['description']:
print(f" Description: {source['description']}")
print()
# The sources data includes structured information about all references
# used during the research process, complete with metadata
```

### Scheduling Research Tasks
```python
# Run multiple research tasks concurrently
async def research_multiple_topics():
    topics = ["AI safety", "Climate adaptation", "Future of work"]
    tasks = [researcher.research(topic, max_depth=2) for topic in topics]
    results = await asyncio.gather(*tasks)

    for topic, result in zip(topics, results):
        print(f"Research on {topic}: {'Success' if result.success else 'Failed'}")
```

## 🔄 Custom Callbacks
Monitor and track research progress by implementing custom callbacks:
```python
from deep_research.core.callbacks import ResearchCallback
from deep_research.models import ActivityItem, SourceItem

class MyCallback(ResearchCallback):
    async def on_activity(self, activity: ActivityItem) -> None:
        # Handle activity updates (search, extract, analyze)
        print(f"Activity: {activity.type} - {activity.message}")

    async def on_source(self, source: SourceItem) -> None:
        # Handle discovered sources
        print(f"Source: {source.title} ({source.url})")

    async def on_depth_change(self, current, maximum, completed_steps, total_steps) -> None:
        # Track research depth progress
        progress = int(completed_steps / total_steps * 100) if total_steps > 0 else 0
        print(f"Depth: {current}/{maximum} - Progress: {progress}%")

    async def on_progress_init(self, max_depth: int, total_steps: int) -> None:
        # Handle research initialization
        print(f"Initialized with max depth {max_depth} and {total_steps} steps")

    async def on_finish(self, content: str) -> None:
        # Handle research completion
        print(f"Research complete! Result length: {len(content)} characters")
```
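
The snippet above defines the callback but doesn't show how it gets attached. Assuming `DeepResearch` accepts it at construction time through a `callback` parameter (a hypothetical name; check the SDK's actual signature), wiring might look like:

```python
# Hypothetical wiring; the real parameter name may differ in the SDK.
researcher = DeepResearch(
    web_client=DoclingClient(),
    llm_api_key=openai_api_key,
    callback=MyCallback(),  # assumption: a callback instance passed to the constructor
)
```

## 🤝 Contributing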
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
---