https://github.com/rishang/deep-research

Python SDK for Deep-Research
https://github.com/rishang/deep-research
ai deep-research deepresearch docling litellm
Last synced: 4 months ago
JSON representation
Python SDK for Deep-Research
Host: GitHub
URL: https://github.com/rishang/deep-research
Owner: Rishang
License: apache-2.0
Created: 2025-04-03T08:16:41.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-04-15T18:18:16.000Z (6 months ago)
Last Synced: 2025-05-17T23:13:47.950Z (5 months ago)
Topics: ai, deep-research, deepresearch, docling, litellm
Language: Python
Homepage:
Size: 196 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # 🧠 Deep Research



![Python Version](https://img.shields.io/badge/python-3.11%2B-blue)

**An AI-powered research assistant that performs comprehensive, autonomous research on any topic**

[Features](#-key-features) •

[Installation](#-installation) •

[Quick Start](#-quick-start) •

[Examples](#-usage-examples) •

[Contribute](#-contributing)



## 🌟 Overview

DeepResearch SDK empowers developers with AI-driven research capabilities, enabling applications to conduct deep, iterative research autonomously. Inspired by products like Perplexity AI and Claude's web browsing, DeepResearch combines **web search, content extraction, and AI analysis** into a unified, easy-to-use API.

Originally a TypeScript implementation, this Python SDK adds parallel web scraping, multiple search providers, and a more developer-friendly interface.

```mermaid

graph TD

    subgraph "DeepResearch SDK"

        A[Deep Research] --> W[Web Clients]

        A --> C[LLM Integration]

        A --> D[Research Callbacks]

        W --> B1[DoclingClient]

        W --> B2[DoclingServerClient]

        W --> B3[FirecrawlClient]

        B1 --> E[Brave Search]

        B1 --> F[DuckDuckGo Search]

        B1 --> G[Document Extraction]

        B2 --> E

        B2 --> F

        B2 --> G2[Server-based Extraction]

        B3 --> E

        B3 --> F

        B3 --> G3[Firecrawl Extraction]

        C --> H[GPT Models]

        C --> I[Claude Models]

        C --> J[Other LLMs]

        D --> K[Progress Tracking]

        D --> L[Source Monitoring]

        D --> M[Activity Logging]

    end

    classDef primary fill:#4285F4,stroke:#333,stroke-width:2px,color:white

    classDef secondary fill:#34A853,stroke:#333,stroke-width:2px,color:white

    classDef tertiary fill:#FBBC05,stroke:#333,stroke-width:2px,color:white

    classDef quaternary fill:#EA4335,stroke:#333,stroke-width:2px,color:white

    class A primary

    class W,C,D secondary

    class B1,B2,B3 secondary

    class E,F,G,G2,G3,H,I,J tertiary

    class K,L,M quaternary

```

## 📋 Table of Contents

- [🧠 Deep Research](#-deep-research)

  - [🌟 Overview](#-overview)

  - [📋 Table of Contents](#-table-of-contents)

  - [🚀 Key Features](#-key-features)

  - [📦 Installation](#-installation)

    - [Using pip](#using-pip)

    - [Using Poetry](#using-poetry)

    - [From Source](#from-source)

  - [🏁 Quick Start](#-quick-start)

  - [🔍 Quick Demo](#-quick-demo)

  - [🔬 How It Works](#-how-it-works)

  - [📊 Usage Examples](#-usage-examples)

    - [Search Results with Metadata](#search-results-with-metadata)

    - [Using the Cache System](#using-the-cache-system)

      - [Advanced: Using Structured Models for Cache Keys](#advanced-using-structured-models-for-cache-keys)

    - [Custom Research Parameters](#custom-research-parameters)

    - [Using with Different LLM Providers](#using-with-different-llm-providers)

    - [Accessing Research Sources](#accessing-research-sources)

    - [Scheduling Research Tasks](#scheduling-research-tasks)

  - [🔄 Custom Callbacks](#-custom-callbacks)

  - [🤝 Contributing](#-contributing)

  - [📄 License](#-license)

## 🚀 Key Features

- **📊 Multiple Search Providers**

  - [Brave Search API](https://brave.com/search/api/) for high-quality results

  - [DuckDuckGo Search](https://duckduckgo.com/) as automatic API-key-free fallback

  - Fault-tolerant fallback system ensures searches always return results

  - Comprehensive source tracking with detailed metadata

- **⚡ Parallel Processing**

  - Extract content from multiple sources simultaneously

  - Control concurrency to balance speed and resource usage

- **🔄 Adaptive Research**

  - Automatic gap identification in research

  - Self-guided exploration of topics

  - Depth-first approach with backtracking

- **🧩 Modular Architecture**

  - Multiple web client options (Docling, Docling-Server, Firecrawl)

  - Easily extensible for custom search providers

  - Plug in different LLM backends through LiteLLM

  - Event-driven callback system for monitoring progress

- **🛠️ Developer-Friendly**

  - Async/await interface for integration with modern Python applications

  - Type hints throughout for IDE autocompletion

  - Comprehensive Pydantic models for structured data

  - Rich metadata for search results (provider, publication date, relevance)

  - Optional caching system for search and extraction results

## 📦 Installation

### Using pip

```bash

# Install from PyPI

pip install deep-research-sdk

```

### Using Poetry

```bash

# Add to your Poetry project

poetry add deep-research-sdk

```

### From Source

```bash

# Clone the repository

git clone https://github.com/Rishang/deep-research-sdk.git

cd deep-research-sdk

poetry install

```

## 🏁 Quick Start

```python

import asyncio

import os

import logging

from deep_research import DeepResearch

from deep_research.utils import DoclingClient, DoclingServerClient, FirecrawlClient

from deep_research.utils.cache import CacheConfig

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

async def main():

    # Get API keys from environment variables

    openai_api_key = os.environ.get("OPENAI_API_KEY")

    brave_api_key = os.environ.get("BRAVE_SEARCH_API_KEY")  # Optional

    firecrawl_api_key = os.environ.get("FIRECRAWL_API_KEY") # Optional

    if not openai_api_key:

        print("Error: OPENAI_API_KEY environment variable not set")

        return

    # Initialize the DeepResearch instance

    # Optional: Configure caching for search and extraction results

    # If you don't want to use caching, you can simply set `cache_config=None`

    cache_config = CacheConfig(

        enabled=True,                  # Enable caching

        ttl_seconds=3600,              # Cache entries expire after 1 hour

        db_url="sqlite:///docling_cache.sqlite3"  # SQLite database for cache

    )

    # OPTION 1: Use the standard Docling client

    researcher = DeepResearch(

        web_client=DoclingClient(

            cache_config=cache_config,

            brave_api_key=brave_api_key, # Optional: Brave Search API key if None it will use DuckDuckGo with no API key

        ),

        llm_api_key=openai_api_key,

        research_model="gpt-4o-mini",    # Advanced research model

        reasoning_model="o3-mini",       # More efficient model for reasoning

        max_depth=3,                     # Maximum research depth

        time_limit_minutes=2             # Time limit in minutes

    )

    # OPTION 2: Use the Docling Server client

    # async with DoclingServerClient(

    #     server_url="http://localhost:8000",  # URL of your docling-serve instance

    #     brave_api_key=brave_api_key,

    #     cache_config=cache_config

    # ) as docling_server:

    #     researcher = DeepResearch(

    #         web_client=docling_server,

    #         llm_api_key=openai_api_key,

    #         research_model="gpt-4o-mini",

    #         reasoning_model="o3-mini",

    #         max_depth=3,

    #         time_limit_minutes=2

    #     )

    #     # ... rest of your code

    # OPTION 3: Use the Firecrawl client

    # if firecrawl_api_key:

    #     async with FirecrawlClient(

    #         api_key=firecrawl_api_key,

    #         cache_config=cache_config

    #     ) as firecrawl:

    #         researcher = DeepResearch(

    #             web_client=firecrawl,

    #             llm_api_key=openai_api_key,

    #             research_model="gpt-4o-mini",

    #             reasoning_model="o3-mini",

    #             max_depth=3,

    #             time_limit_minutes=2

    #         )

    #         # ... rest of your code

    # Perform research

    result = await researcher.research("The impact of quantum computing on cryptography")

    # Process results

    if result.success:

        print("\n==== RESEARCH SUCCESSFUL ====")

        print(f"Found {len(result.data['findings'])} pieces of information")

        print(f"Used {len(result.data['sources'])} sources")

        

        # Access the sources used in research

        for i, source in enumerate(result.data['sources']):

            print(f"Source {i+1}: {source['title']} - {source['url']}")

            

        print(f"Analysis:\n{result.data['analysis']}")

    else:

        print(f"Research failed: {result.error}")

if __name__ == "__main__":

    asyncio.run(main())

```

## 🔍 Quick Demo

Want to see the DeepResearch SDK in action quickly? Follow these steps:

1. **Set up your environment variables**:

```bash

# [Optional] Brave Search API key if not provided it will use DuckDuckGo with no API key required

export BRAVE_SEARCH_API_KEY="your-brave-api-key"

# Set your LLM API keys in your environment

export OPENAI_API_KEY="your-openai-api-key"

```

1. **Clone the repository**:

```bash

git clone https://github.com/Rishang/deep-research-sdk.git

cd deep-research-sdk

```

2. **Run the included demo script**:

```bash

python example.py

```

You'll see research progress in real-time, and within about a minute, get an AI-generated analysis of the benefits of regular exercise.

## 🔬 How It Works

DeepResearch implements an iterative research process:

1. **Initialization**: Configure models, search providers and parameters

2. **Search**: Find relevant sources on the topic from multiple providers

3. **Extraction**: Process and extract content from top sources in parallel

4. **Analysis**: Analyze findings, identify knowledge gaps and plan next steps

5. **Iteration**: Continue research with refined focus based on identified gaps

6. **Synthesis**: Generate comprehensive analysis with citations

```mermaid

flowchart LR

    A[Initialization] --> B[Search]

    B --> C[Extraction]

    C --> D[Analysis]

    D --> E{Continue?}

    E -->|Yes| F[Refine Focus]

    F --> B

    E -->|No| G[Final Synthesis]

    style A fill:#4285F4,stroke:#333,stroke-width:2px,color:white

    style B fill:#34A853,stroke:#333,stroke-width:2px,color:white

    style C fill:#FBBC05,stroke:#333,stroke-width:2px,color:white

    style D fill:#EA4335,stroke:#333,stroke-width:2px,color:white

    style E fill:#673AB7,stroke:#333,stroke-width:2px,color:white

    style F fill:#FF9800,stroke:#333,stroke-width:2px,color:white

    style G fill:#2196F3,stroke:#333,stroke-width:2px,color:white

```

## 📊 Usage Examples

### Web Client Options

DeepResearch supports multiple web client implementations:

```python

from deep_research.utils import DoclingClient, DoclingServerClient, FirecrawlClient

from deep_research.utils.cache import CacheConfig

# 1. Standard Docling Client (local HTML parsing)

docling_client = DoclingClient(

    brave_api_key="your-brave-key",  # Optional

    max_concurrent_requests=8,

    cache_config=CacheConfig(enabled=True)

)

# 2. Docling Server Client (connects to remote docling-serve instance)

docling_server_client = DoclingServerClient(

    server_url="http://localhost:8000",  # URL of your docling-serve instance

    brave_api_key="your-brave-key",      # Optional

    max_concurrent_requests=8,

    cache_config=CacheConfig(enabled=True)

)

# 3. Firecrawl Client (connects to Firecrawl API)

firecrawl_client = FirecrawlClient(

    api_key="your-firecrawl-api-key",    # Required

    api_url="https://api.firecrawl.dev", # Default Firecrawl API URL

    max_concurrent_requests=8,

    cache_config=CacheConfig(enabled=True)

)

# All clients implement the BaseWebClient interface

# and can be used interchangeably with DeepResearch

```

### Search Results with Metadata

The DeepResearch SDK returns search results with rich metadata, including:

```python

from deep_research.utils import DoclingClient

client = DoclingClient()

# Get search results

search_results = await client.search("artificial intelligence")

# Access metadata in search results

for result in search_results.data:

    print(f"Title: {result.title}")

    print(f"URL: {result.url}")

    print(f"Description: {result.description}")

    print(f"Provider: {result.provider}")  # Which search engine provided this result

    print(f"Date: {result.date}")          # Publication date when available

    print(f"Relevance: {result.relevance}")

```

### Using the Cache System

```python

from deep_research.utils.cache import CacheConfig

from deep_research.utils import DoclingClient, DoclingServerClient, FirecrawlClient

# Configure the cache with SQLite (default)

cache_config = CacheConfig(

    enabled=True,                           # Enable caching

    ttl_seconds=3600,                       # Cache for 1 hour

    db_url="sqlite:///docling_cache.db",    # Use SQLite

    create_tables=True                      # Create tables if they don't exist

)

# Initialize client with caching (works with any client type)

client = DoclingClient(

    cache_config=cache_config

)

# Search results will be cached and reused for identical queries

search_result = await client.search("quantum computing")

# Second call uses cached data

search_result = await client.search("quantum computing")

# To disable caching completely, either:

# 1. Don't provide a cache_config:

client_no_cache = DoclingClient()  # No caching

# For MySQL/MariaDB backend instead of SQLite

# First install pymysql: pip install pymysql

mysql_config = CacheConfig(

    enabled=True,

    db_url="mysql+pymysql://username:password@localhost/docling_cache"

)

```

#### Advanced: Using Structured Models for Cache Keys

```python

from pydantic import BaseModel

from deep_research.utils.cache import cache

# Define which parameters should be used for caching

class SearchParams(BaseModel):

    query: str

    max_results: int = 10

@cache(structure=SearchParams)

async def search_function(self, query: str, max_results: int = 10):

    # Only query and max_results will be used for the cache key

    # Other parameters will be ignored for caching purposes

    return results

```

### Custom Research Parameters

```python

# Configure research with custom parameters

result = await researcher.research(

    topic="Emerging trends in renewable energy storage",

    max_tokens=3000,           # Control output length

    temperature=0.7            # Add more creativity to analysis

)

```

### Using with Different LLM Providers

DeepResearch uses LiteLLM, which supports multiple LLM providers:

```python

# Use Anthropic Claude models

researcher = DeepResearch(

    # ... other parameters

    research_model="anthropic/claude-3-opus-20240229",

    reasoning_model="anthropic/claude-3-haiku-20240307",

)

```

### Accessing Research Sources

DeepResearch tracks all sources used in the research process and returns them in the result:

```python

# Run the research

result = await researcher.research("Advances in quantum computing")

# Access the sources used in research

if result.success:

    print(f"Research used {len(result.data['sources'])} sources:")

    

    for i, source in enumerate(result.data['sources']):

        print(f"Source {i+1}: {source['title']}")

        print(f"  URL: {source['url']}")

        print(f"  Relevance: {source['relevance']}")

        if source['description']:

            print(f"  Description: {source['description']}")

        print()

        

    # The sources data includes structured information about all references

    # used during the research process, complete with metadata

```

### Scheduling Research Tasks

```python

# Run multiple research tasks concurrently

async def research_multiple_topics():

    topics = ["AI safety", "Climate adaptation", "Future of work"]

    tasks = [researcher.research(topic, max_depth=2) for topic in topics]

    results = await asyncio.gather(*tasks)

    for topic, result in zip(topics, results):

        print(f"Research on {topic}: {'Success' if result.success else 'Failed'}")

```

## 🔄 Custom Callbacks

Monitor and track research progress by implementing custom callbacks:

```python

from deep_research.core.callbacks import ResearchCallback

from deep_research.models import ActivityItem, SourceItem

class MyCallback(ResearchCallback):

    async def on_activity(self, activity: ActivityItem) -> None:

        # Handle activity updates (search, extract, analyze)

        print(f"Activity: {activity.type} - {activity.message}")

    async def on_source(self, source: SourceItem) -> None:

        # Handle discovered sources

        print(f"Source: {source.title} ({source.url})")

    async def on_depth_change(self, current, maximum, completed_steps, total_steps) -> None:

        # Track research depth progress

        progress = int(completed_steps / total_steps * 100) if total_steps > 0 else 0

        print(f"Depth: {current}/{maximum} - Progress: {progress}%")

    async def on_progress_init(self, max_depth: int, total_steps: int) -> None:

        # Handle research initialization

        print(f"Initialized with max depth {max_depth} and {total_steps} steps")

    async def on_finish(self, content: str) -> None:

        # Handle research completion

        print(f"Research complete! Result length: {len(content)} characters")

```

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository

2. Create your feature branch (`git checkout -b feature/amazing-feature`)

3. Commit your changes (`git commit -m 'Add some amazing feature'`)

4. Push to the branch (`git push origin feature/amazing-feature`)

5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

---



  If you find DeepResearch SDK useful, please consider giving it a star on GitHub!
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/rishang/deep-research

Awesome Lists containing this project

README