https://github.com/aneeshpatne/curiosity

Curiosity: Search Agent – Multi-agent system using LLMs (GPT, Gemini) with DuckDuckGo, Playwright, and LangChain for web search, scraping, and detailed summaries with follow-ups.
Host: GitHub
URL: https://github.com/aneeshpatne/curiosity
Owner: aneeshpatne
Created: 2025-02-07T15:54:24.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-02T07:23:14.000Z (over 1 year ago)
Last Synced: 2025-03-02T07:26:02.188Z (over 1 year ago)
Topics: ai, fastapi, nextjs, webscraping
Language: Python
Homepage:
Size: 3.56 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# 🔍 Curiosity

### AI-Powered Search & News Intelligence Platform

![Curiosity Bot](Frontend/curiosity/public/assets/bot.png)

[![Next.js](https://img.shields.io/badge/Next.js-15.1.7-black?style=flat&logo=next.js)](https://nextjs.org/)

[![React](https://img.shields.io/badge/React-19.0-61DAFB?style=flat&logo=react)](https://react.dev/)

[![Python](https://img.shields.io/badge/Python-3.9+-3776AB?style=flat&logo=python)](https://www.python.org/)

[![FastAPI](https://img.shields.io/badge/FastAPI-Latest-009688?style=flat&logo=fastapi)](https://fastapi.tiangolo.com/)

[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

**An intelligent search agent that combines real-time web scraping, LLM-powered analysis, and automated news digests to deliver comprehensive, cited answers to your queries.**

[Features](#-features) • [Installation](#-installation) • [Usage](#-usage) • [How It Works](#-how-it-works)



---

## 📋 Table of Contents

- [Overview](#-overview)

- [Features](#-features)

- [Technology Stack](#-technology-stack)

- [Installation](#-installation)

- [Usage](#-usage)

- [Project Structure](#-project-structure)

- [How It Works](#-how-it-works)

- [Components](#-components)

- [Configuration](#-configuration)

- [Contributing](#-contributing)

- [License](#-license)

---

## 🌟 Overview

**Curiosity** is an AI-powered search platform. Where a traditional search engine returns a list of links, Curiosity scrapes, analyzes, and synthesizes content from multiple sources and returns a comprehensive, citation-backed answer in real time.

The platform features two main components:

1. **🔎 Curiosity Search** - An interactive chat interface with multiple search modes

2. **📰 Curiosity Newsletter** - An automated daily news digest delivered to your inbox

---

## ✨ Features

### 🔍 Curiosity Search

#### Multiple Search Modes

- **Normal Search** - Quick searches analyzing 7 sources with standard depth

- **Pro Search** - Enhanced search examining 25 sources for comprehensive results

- **Deep Search** - Recursive multi-level search that:

  - Explores follow-up questions automatically

  - Synthesizes information from 100+ sources

  - Provides in-depth analysis from multiple perspectives

#### Intelligent Features

- **🔄 Real-time Updates** - Live status indicators showing search, scraping, and analysis progress

- **📚 Source Citations** - Every claim is backed by numbered citations linking to original sources

- **🎯 Smart Follow-ups** - AI-generated follow-up questions to explore topics deeper

- **💬 Conversational Memory** - Maintains context across multiple queries

- **⚡ Live Source Display** - See sources as they're discovered with favicon previews

- **📱 Responsive UI** - Modern, dark-mode interface built with shadcn/ui

### 📰 Curiosity Newsletter

#### Automated News Intelligence

- **🌍 Global News Coverage** - Automatically fetches top stories from multiple sources

- **🤖 AI Summarization** - Condenses 20+ articles into structured, readable summaries

- **📧 Email Delivery** - Beautiful HTML-formatted newsletters sent daily

- **🔄 Deep Analysis** - Uses recursive search to provide context and depth

- **⏰ Scheduled Execution** - Automated via cron jobs for daily delivery

- **🎨 Rich Formatting** - Professionally styled email templates with responsive design

---

## 🛠 Technology Stack

### Frontend

| Technology                                       | Version | Purpose                               |
| ------------------------------------------------ | ------- | ------------------------------------- |
| [Next.js](https://nextjs.org/)                   | 15.1.7  | React framework with App Router       |
| [React](https://react.dev/)                      | 19.0    | UI library                            |
| [Socket.io Client](https://socket.io/)           | 4.8.1   | Real-time bidirectional communication |
| [Tailwind CSS](https://tailwindcss.com/)         | 3.4.1   | Utility-first CSS framework           |
| [shadcn/ui](https://ui.shadcn.com/)              | Latest  | High-quality UI components            |
| [Marked](https://marked.js.org/)                 | 15.0.7  | Markdown parser and renderer          |
| [DOMPurify](https://github.com/cure53/DOMPurify) | 3.2.4   | XSS sanitizer for HTML                |
| [Lucide React](https://lucide.dev/)              | 0.475.0 | Icon library                          |

### Backend

| Technology                                                       | Purpose                     |
| ---------------------------------------------------------------- | --------------------------- |
| [Python](https://www.python.org/)                                | Core backend language       |
| [FastAPI](https://fastapi.tiangolo.com/)                         | Modern async web framework  |
| [Socket.io](https://socket.io/)                                  | Real-time server            |
| [Playwright](https://playwright.dev/)                            | Headless browser automation |
| [DuckDuckGo Search](https://github.com/deedy5/duckduckgo_search) | Privacy-focused search API  |
| [LangChain](https://python.langchain.com/)                       | LLM orchestration framework |
| [Pydantic](https://docs.pydantic.dev/)                           | Data validation             |

### AI Models

- **OpenAI GPT-4o-mini** - Fast summarization and agent reasoning

- **OpenAI o1-mini** - Deep reasoning for complex queries

- **Google Gemini 2.0 Flash** - High-speed content analysis

- **Meta LLaMA 3.3** (via OpenRouter) - Alternative model support

---

## 🚀 Installation

### Prerequisites

- **Node.js** 18+ and npm/yarn

- **Python** 3.9+

- **OpenAI API Key**

- **Google Gemini API Key** (optional)

- **OpenRouter API Key** (optional)

### Step 1: Clone the Repository

```bash

git clone https://github.com/aneeshpatne/curiosity.git

cd curiosity

```

### Step 2: Backend Setup

#### Install Python Dependencies

```bash

# Install required packages

pip install fastapi uvicorn python-socketio playwright pydantic

pip install duckduckgo-search langchain langchain-openai langchain-google-genai

pip install python-dotenv markdown

# Install Playwright browsers

playwright install chromium

```

#### Configure Environment Variables

Create a `.env` file in the root directory:

```env

# Required

OPENAI_API_KEY=your_openai_api_key_here

# Optional (for alternative models)

GEMINI_API_KEY=your_gemini_api_key_here

OPEN_ROUTER_KEY=your_openrouter_key_here

# For Newsletter (Optional)

SMTP_SERVER=smtp.gmail.com

SMTP_PORT=587

EMAIL_SENDER=your_email@gmail.com

EMAIL_PASSWORD=your_app_password

EMAIL_RECEIVER=recipient@email.com

```
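
The variables above can be checked once at startup so that a missing key fails early rather than in the middle of a search. The following is a minimal sketch using only the standard library; `Settings` and `load_settings` are hypothetical names, not taken from the repository, and the sketch assumes the `.env` file has already been loaded into the process environment (for example with `python-dotenv`).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed in .env."""
    openai_api_key: str
    gemini_api_key: Optional[str] = None
    open_router_key: Optional[str] = None
    smtp_server: str = "smtp.gmail.com"
    smtp_port: int = 587


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, failing fast if the required key is absent."""
    openai_key = env.get("OPENAI_API_KEY", "").strip()
    if not openai_key:
        raise RuntimeError("OPENAI_API_KEY is required but not set")
    return Settings(
        openai_api_key=openai_key,
        # Optional keys: treat an empty string the same as a missing variable
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        open_router_key=env.get("OPEN_ROUTER_KEY") or None,
        smtp_server=env.get("SMTP_SERVER", "smtp.gmail.com"),
        smtp_port=int(env.get("SMTP_PORT", "587")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise in tests.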

#### Start the Backend Server

```bash

# From the Search directory

cd Search

python search-agent.py

# Server will start on http://localhost:4000

```

### Step 3: Frontend Setup

```bash

cd Frontend/curiosity

# Install dependencies

npm install

# Start development server

npm run dev

# Frontend will start on http://localhost:3000

```

### Step 4: Newsletter Setup (Optional)

```bash

cd News

# Make the shell script executable

chmod +x run_news_agent.sh

# Run manually

python news-agent.py

# Or set up a cron job for daily execution

crontab -e

# Add: 0 8 * * * /path/to/Curiosity/News/run_news_agent.sh

```

---

## 📖 Usage

### Starting the Application

1. **Start Backend**:

```bash

cd Search

python search-agent.py

```

2. **Start Frontend**:

```bash

cd Frontend/curiosity

npm run dev

```

3. **Access the Application**:

   - Open your browser to `http://localhost:3000`

### Using Different Search Modes

#### Normal Search

```

1. Select "Normal Search" from dropdown

2. Enter your query: "What is quantum computing?"

3. Get results from ~7 sources with citations

```

#### Pro Search

```

1. Select "Pro Search" from dropdown

2. Enter your query: "Latest developments in AI research"

3. Get comprehensive results from ~25 sources

```

#### Deep Search

```

1. Select "Deep Search" from dropdown

2. Enter complex query: "Impact of climate change on global economy"

3. System will:

   - Search initial query

   - Generate 20 follow-up questions

   - Recursively search each follow-up

   - Synthesize 100+ sources into comprehensive answer

```

### Newsletter Usage

```bash

# Manual execution

python News/news-agent.py

# Automated daily execution (8 AM)

# Add to crontab:

# 0 8 * * * /path/to/Curiosity/News/run_news_agent.sh

```

---

## 📂 Project Structure

```

Curiosity/

├── Frontend/

│   └── curiosity/

│       ├── src/

│       │   ├── app/

│       │   │   ├── layout.js          # Root layout

│       │   │   ├── page.js            # Home page

│       │   │   └── globals.css        # Global styles

│       │   ├── components/

│       │   │   ├── chat.jsx           # Main chat interface

│       │   │   └── ui/                # shadcn/ui components

│       │   │       ├── button.jsx

│       │   │       ├── input.jsx

│       │   │       └── select.jsx

│       │   └── lib/

│       │       └── utils.js           # Utility functions

│       ├── public/

│       │   └── assets/                # Static assets

│       ├── package.json

│       ├── next.config.mjs

│       ├── tailwind.config.mjs

│       └── components.json

│

├── Search/

│   ├── search-agent.py                # Main search agent with FastAPI server

│   ├── deep-search.py                 # Standalone deep search implementation

│   ├── combined_sources.txt           # Debug output (generated)

│   └── Deprecated/                    # Legacy implementations

│       ├── search.py

│       ├── search-new.py

│       ├── search_local.py

│       └── test.py

│

├── News/

│   ├── news-agent.py                  # Automated news summarization

│   ├── run_news_agent.sh              # Shell script for cron execution

│   ├── deepSearch.py                  # News-specific deep search (deprecated)

│   ├── example.py

│   ├── simple.py

│   └── test.py

│

├── README.md

└── .env                               # Environment variables (create this)

```

---

## 🔬 How It Works

### Search Flow

```mermaid

sequenceDiagram

    participant User

    participant Frontend

    participant Backend

    participant Scraper

    participant LLM

    User->>Frontend: Enter query

    Frontend->>Backend: Send via WebSocket

    Backend->>Backend: Emit "waiting" status

    Backend->>DuckDuckGo: Search query

    DuckDuckGo-->>Backend: Return URLs

    Backend->>Frontend: Emit sources

    Backend->>Backend: Emit "scraping" status

    par Parallel Scraping

        Backend->>Scraper: Scrape URL 1

        Backend->>Scraper: Scrape URL 2

        Backend->>Scraper: Scrape URL N

    end

    Scraper-->>Backend: Return content

    Backend->>Backend: Emit "thinking" status

    Backend->>LLM: Summarize with citations

    LLM-->>Backend: Return summary + follow-ups

    Backend->>Frontend: Emit final response

    Frontend->>User: Display with citations

```
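
The sequence above can be sketched as a single coroutine. This is an illustration under stated assumptions, not the code in `search-agent.py`: `run_search` and its callback parameters (`emit`, `search`, `scrape`, `summarize`) are hypothetical stand-ins for the Socket.io emitter, the DuckDuckGo lookup, the Playwright scraper, and the LLM call.

```python
import asyncio
from typing import Awaitable, Callable, List

Emit = Callable[[str, dict], Awaitable[None]]


async def run_search(
    query: str,
    emit: Emit,
    search: Callable[[str], Awaitable[List[str]]],
    scrape: Callable[[str], Awaitable[str]],
    summarize: Callable[[str, List[str]], Awaitable[str]],
) -> str:
    """Search, scrape in parallel, then summarize, emitting a status before each stage."""
    await emit("status", {"status": "waiting"})
    urls = await search(query)
    await emit("sources", {"sources": urls})

    await emit("status", {"status": "scraping"})
    # return_exceptions=True keeps one failed page from cancelling the others
    results = await asyncio.gather(*(scrape(u) for u in urls), return_exceptions=True)
    pages = [r for r in results if isinstance(r, str) and r]

    await emit("status", {"status": "thinking"})
    summary = await summarize(query, pages)
    await emit("message", {"text": summary, "status": "finished"})
    return summary
```

Injecting the four stages as callables keeps the orchestration testable without a browser, a network connection, or an API key.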

### Deep Search Flow

```mermaid

graph TD

    A[User Query] --> B[Initial Search]

    B --> C[Scrape 5 URLs]

    C --> D[Summarize]

    D --> E[Generate 20 Follow-ups]

    E --> F1[Follow-up 1]

    E --> F2[Follow-up 2]

    E --> F20[Follow-up 20]

    F1 --> G1[Scrape 5 URLs]

    F2 --> G2[Scrape 5 URLs]

    F20 --> G20[Scrape 5 URLs]

    G1 --> H1[Summarize]

    G2 --> H2[Summarize]

    G20 --> H20[Summarize]

    H1 --> I[Combine All Summaries]

    H2 --> I

    H20 --> I

    I --> J[Final LLM Synthesis]

    J --> K[Comprehensive Answer]

```
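
The fan-out in the diagram can be expressed as a short recursive coroutine. The sketch below is an assumption-laden illustration rather than the repository's `deep_search()`: the `answer` callback is a hypothetical function that searches, scrapes, and summarizes one query and returns the summary together with its follow-up questions.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical signature: answer one query, return (summary, follow-up questions)
Answer = Callable[[str], Awaitable[Tuple[str, List[str]]]]


async def deep_search(
    query: str, answer: Answer, depth: int, max_follow_ups: int = 20
) -> List[str]:
    """Collect the summary for `query` and, while depth > 0, for its follow-ups."""
    summary, follow_ups = await answer(query)
    summaries = [summary]
    if depth > 0:
        # Explore every follow-up concurrently, one level deeper
        nested = await asyncio.gather(
            *(
                deep_search(q, answer, depth - 1, max_follow_ups)
                for q in follow_ups[:max_follow_ups]
            )
        )
        for branch in nested:
            summaries.extend(branch)
    return summaries
```

The returned list corresponds to the "Combine All Summaries" node; a final LLM call would then synthesize it into one answer.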

### Component Details

#### 1. Web Scraping

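```python
import asyncio

semaphore = asyncio.Semaphore(7)  # Limit to 7 concurrent page loads

async def block_requests(route):
    # Assumed implementation: skip images, stylesheets, and fonts for speed
    if route.request.resource_type in {"image", "stylesheet", "font"}:
        await route.abort()
    else:
        await route.continue_()

# Concurrent scraping with semaphore control
async def scrape_page(context, url: str) -> str:
    async with semaphore:
        page = await context.new_page()
        try:
            await page.route("**/*", block_requests)
            await page.goto(url, wait_until="domcontentloaded")
            # Extract text content from paragraphs and headings
            text_blocks = await page.locator("body p, h1, h2, h3").all_text_contents()
            cleaned_text = " ".join(" ".join(text_blocks).split())  # Collapse whitespace
            return cleaned_text[:5000]  # First 5,000 characters of content
        finally:
            await page.close()  # Release the page even if navigation fails
```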
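
How the semaphore caps concurrency can be checked without a browser. The sketch below replaces the page load with `asyncio.sleep` and records how many tasks are inside the semaphore at once; `fake_scrape` and `scrape_all` are illustrative names, not functions from the repository.

```python
import asyncio
from typing import List

LIMIT = 7    # same cap as the scraper's semaphore
running = 0  # tasks currently inside the semaphore
peak = 0     # highest value of `running` observed


async def fake_scrape(semaphore: asyncio.Semaphore, url: str) -> str:
    """Stand-in for scrape_page: sleeps instead of loading a page."""
    global running, peak
    async with semaphore:
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        return f"content of {url}"


async def scrape_all(urls: List[str]) -> List[str]:
    semaphore = asyncio.Semaphore(LIMIT)  # created inside the running event loop
    return await asyncio.gather(*(fake_scrape(semaphore, u) for u in urls))


pages = asyncio.run(scrape_all([f"https://example.com/{i}" for i in range(25)]))
print(len(pages), "pages scraped; peak concurrency:", peak)
```

All 25 tasks are started at once, but no more than `LIMIT` of them hold the semaphore at any moment, which is what keeps a Pro Search from opening 25 browser pages simultaneously.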

#### 2. LLM Summarization

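```python
from langchain.output_parsers import RetryWithErrorOutputParser
from langchain_core.output_parsers import PydanticOutputParser, StrOutputParser
from pydantic import BaseModel

# Structured output with citations and follow-ups
class SummaryFormat(BaseModel):
    content: str        # Markdown summary with [1] [2] citations
    moreQtn: list[str]  # 5-20 follow-up questions

parser = PydanticOutputParser(pydantic_object=SummaryFormat)

# Chain: Prompt → LLM → Parser → Retry on Error
# `prompt` and `llm` are assumed to be the prompt template and chat model defined earlier
chain = prompt | llm | StrOutputParser()
# from_llm builds the retry chain that re-prompts the model when parsing fails
retry_parser = RetryWithErrorOutputParser.from_llm(parser=parser, llm=llm, max_retries=3)
```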
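
Because `content` carries numbered markers such as `[1]`, the markers have to be resolved against the list of scraped sources before they can be rendered as links. A minimal sketch of that step follows, assuming citations are 1-based indexes into the source list; `cited_sources` and `dangling_citations` are hypothetical helpers, not part of the repository.

```python
import re
from typing import Dict, List

CITATION = re.compile(r"\[(\d+)\]")


def cited_sources(content: str, sources: List[str]) -> Dict[int, str]:
    """Map each [n] marker in `content` to the n-th source URL (1-based)."""
    cited: Dict[int, str] = {}
    for match in CITATION.finditer(content):
        n = int(match.group(1))
        if 1 <= n <= len(sources):
            cited[n] = sources[n - 1]
    return cited


def dangling_citations(content: str, sources: List[str]) -> List[int]:
    """Return citation numbers that do not correspond to any source."""
    numbers = {int(m.group(1)) for m in CITATION.finditer(content)}
    return sorted(n for n in numbers if not 1 <= n <= len(sources))
```

A non-empty result from `dangling_citations` indicates that the model cited a source it was never given, which is a reasonable trigger for a retry.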

#### 3. Real-time Communication

```javascript

// Frontend emits query

socket.emit("message", { id, text: query, searchType });

// Backend emits updates (Python with python-socketio; payloads shown in JavaScript shorthand)

await sio.emit("status", { id, status: "searching" });

await sio.emit("sources", { id, sources: urls });

await sio.emit("message", { id, text: summary, status: "finished" });

```

---

## 🧩 Components

### Backend Components

#### `search-agent.py`

The main FastAPI server that orchestrates the entire search process:

- **FastAPI Server** - Handles HTTP and WebSocket connections

- **Socket.io Integration** - Real-time bidirectional communication

- **Search Orchestration** - Manages search, scrape, summarize pipeline

- **LLM Chain Management** - Coordinates multiple LLM calls with retry logic

- **Memory Management** - Maintains conversation context

- **Deep Search Engine** - Recursive multi-level search implementation

Key Functions:

- `follow_up()` - Main query handler with search type routing

- `deep_search()` - Recursive search with depth control

- `scrape_contents()` - Parallel web scraping

- `summarize()` - LLM-powered summarization with citations

- `generate_final_summary()` - Deep search synthesis

#### `deep-search.py`

Standalone implementation of deep search for testing and development:

- Source tracking with global citation counter

- Recursive question exploration

- Citation preservation across levels

- Final synthesis from all sources

#### `news-agent.py`

Automated news aggregation and email delivery:

- Global news search

- Recursive deep search for context

- HTML email generation with styling

- SMTP email delivery

- Browser preview for testing
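
The HTML generation and SMTP delivery steps can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `news-agent.py`: `build_newsletter` and `send_newsletter` are hypothetical names, and the server, port, and credentials are expected to come from the `SMTP_*` and `EMAIL_*` variables in `.env`.

```python
import smtplib
from email.message import EmailMessage


def build_newsletter(
    sender: str, receiver: str, subject: str, text_body: str, html_body: str
) -> EmailMessage:
    """Build a multipart/alternative message with a plain-text fallback and an HTML part."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = receiver
    msg["Subject"] = subject
    msg.set_content(text_body)                      # text/plain part
    msg.add_alternative(html_body, subtype="html")  # text/html part
    return msg


def send_newsletter(msg: EmailMessage, server: str, port: int, password: str) -> None:
    """Send over STARTTLS, logging in as the message's From address."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.starttls()
        smtp.login(msg["From"], password)
        smtp.send_message(msg)
```

Keeping message construction separate from delivery means the HTML can be written to a file and previewed in a browser without contacting the mail server.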

### Frontend Components

#### `chat.jsx`

Main chat interface with real-time updates:

- **Message Management** - State handling for sent/received messages

- **Socket.io Integration** - Event listeners for status, sources, messages

- **Search Type Selection** - Dropdown for Normal/Pro/Deep modes

- **Real-time Status** - Loading indicators and progress updates

- **Source Display** - Live URL cards with favicons

- **Markdown Rendering** - Safe HTML rendering with DOMPurify

- **Citation Linking** - Interactive superscript citations

- **Follow-up Questions** - Clickable suggestions

Components:

- `Chat` - Main container component

- `SentMessage` - User query display

- `ReceivedMessage` - AI response with sources and citations

- `MarkdownRenderer` - Safe markdown to HTML conversion

- `Citation` - Interactive citation superscripts

- `Sources` - URL preview cards

- `FollowUp` - Follow-up question suggestions

---

## ⚙ Configuration

### LLM Model Selection

Edit the model configuration in `search-agent.py`:

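```python
import os

from langchain_openai import ChatOpenAI
from pydantic import SecretStr

# Assumes the .env file has already been loaded into the environment
api_key = os.environ["OPENAI_API_KEY"]
openRouterKey = os.environ.get("OPEN_ROUTER_KEY", "")

# For faster, cheaper responses
agent_llm = ChatOpenAI(model='gpt-5-mini', api_key=SecretStr(api_key))
summary_llm = ChatOpenAI(model='gpt-5-mini', api_key=SecretStr(api_key))

# For higher quality, deeper reasoning
deep_search_llm = ChatOpenAI(model='gpt-5', api_key=SecretStr(api_key))

# For alternative providers (use in place of the summary_llm defined above)
summary_llm = ChatOpenAI(
    base_url='https://openrouter.ai/api/v1',
    model='meta-llama/llama-3.3-70b-instruct:nitro',
    api_key=SecretStr(openRouterKey)
)
```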

### Search Parameters

Customize search depth and source count:

```python

# Number of concurrent scraping tasks

semaphore = asyncio.Semaphore(7)  # Adjust based on system resources

# Search result counts

normal_search_results = 7

pro_search_results = 25

deep_search_results = 5  # Per query level

# Deep search recursion depth

deep_search_depth = 2  # Levels of follow-up questions

# Number of follow-up questions

follow_up_questions = 20  # For deep search

```
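
These parameters determine how many pages a query touches. The sketch below groups them into a dataclass and derives two numbers from them: the result count for a given search mode and an upper bound on pages scraped by a deep search. `SearchConfig`, `results_for`, `deep_search_page_budget`, and the mode strings `"normal"`, `"pro"`, and `"deep"` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConfig:
    normal_search_results: int = 7
    pro_search_results: int = 25
    deep_search_results: int = 5   # per query level
    follow_up_questions: int = 20  # deep search only


def results_for(search_type: str, cfg: SearchConfig = SearchConfig()) -> int:
    """Return how many search results to scrape for one query in the given mode."""
    counts = {
        "normal": cfg.normal_search_results,
        "pro": cfg.pro_search_results,
        "deep": cfg.deep_search_results,
    }
    if search_type not in counts:
        raise ValueError(f"unknown search type: {search_type!r}")
    return counts[search_type]


def deep_search_page_budget(cfg: SearchConfig, levels: int = 1) -> int:
    """Upper bound on pages scraped if every query yields the full set of follow-ups."""
    queries = sum(cfg.follow_up_questions ** level for level in range(levels + 1))
    return queries * cfg.deep_search_results
```

With one level of follow-ups the bound is (1 + 20) × 5 = 105 pages, in line with the "100+ sources" figure. A second full level would raise it to (1 + 20 + 400) × 5 = 2,105, so lowering `follow_up_questions` is the most effective way to control cost if recursion goes deeper.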

### Frontend Configuration

Edit Socket.io connection in `chat.jsx`:

```javascript

// Use the backend URL from the environment in production; fall back to the local server
const socket = io(process.env.NEXT_PUBLIC_BACKEND_URL ?? "http://localhost:4000");

```

---



### ⭐ Star this repository if you find it helpful!

**Made with ❤️ and curiosity**