https://github.com/aneeshpatne/curiosity
Curiosity: Search Agent – Multi-agent system using LLMs (GPT, Gemini) with DuckDuckGo, Playwright, and LangChain for web search, scraping, and detailed summaries with follow-ups.
ai fastapi nextjs webscraping
- Host: GitHub
- URL: https://github.com/aneeshpatne/curiosity
- Owner: aneeshpatne
- Created: 2025-02-07T15:54:24.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-02T07:23:14.000Z (over 1 year ago)
- Last Synced: 2025-03-02T07:26:02.188Z (over 1 year ago)
- Topics: ai, fastapi, nextjs, webscraping
- Language: Python
- Homepage:
- Size: 3.56 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 🔍 Curiosity
### AI-Powered Search & News Intelligence Platform

**An intelligent search agent that combines real-time web scraping, LLM-powered analysis, and automated news digests to deliver comprehensive, cited answers to your queries.**
[Features](#-features) • [Installation](#-installation) • [Usage](#-usage) • [How It Works](#-how-it-works)
---
## 📋 Table of Contents
- [Overview](#-overview)
- [Features](#-features)
- [Technology Stack](#-technology-stack)
- [Installation](#-installation)
- [Usage](#-usage)
- [Project Structure](#-project-structure)
- [How It Works](#-how-it-works)
- [Components](#-components)
- [Configuration](#-configuration)
- [Contributing](#-contributing)
- [License](#-license)
---
## 🌟 Overview
**Curiosity** is an AI-powered search platform. Where a traditional search engine returns a list of links, Curiosity scrapes, analyzes, and synthesizes content from multiple sources and returns a citation-backed answer in real time.
The platform features two main components:
1. **🔎 Curiosity Search** - An interactive chat interface with multiple search modes
2. **📰 Curiosity Newsletter** - An automated daily news digest delivered to your inbox
---
## ✨ Features
### 🔍 Curiosity Search
#### Multiple Search Modes
- **Normal Search** - Quick searches analyzing 7 sources with standard depth
- **Pro Search** - Enhanced search examining 25 sources for comprehensive results
- **Deep Search** - Recursive multi-level search that:
  - Explores follow-up questions automatically
  - Synthesizes information from 100+ sources
  - Provides in-depth analysis from multiple perspectives
#### Intelligent Features
- **🔄 Real-time Updates** - Live status indicators showing search, scraping, and analysis progress
- **📚 Source Citations** - Every claim is backed by numbered citations linking to original sources
- **🎯 Smart Follow-ups** - AI-generated follow-up questions to explore topics deeper
- **💬 Conversational Memory** - Maintains context across multiple queries
- **⚡ Live Source Display** - See sources as they're discovered with favicon previews
- **📱 Responsive UI** - Modern, dark-mode interface built with shadcn/ui
### 📰 Curiosity Newsletter
#### Automated News Intelligence
- **🌍 Global News Coverage** - Automatically fetches top stories from multiple sources
- **🤖 AI Summarization** - Condenses 20+ articles into structured, readable summaries
- **📧 Email Delivery** - Beautiful HTML-formatted newsletters sent daily
- **🔄 Deep Analysis** - Uses recursive search to provide context and depth
- **⏰ Scheduled Execution** - Automated via cron jobs for daily delivery
- **🎨 Rich Formatting** - Professionally styled email templates with responsive design
---
## 🛠 Technology Stack
### Frontend
| Technology | Version | Purpose |
| ------------------------------------------------ | ------- | ------------------------------------- |
| [Next.js](https://nextjs.org/) | 15.1.7 | React framework with App Router |
| [React](https://react.dev/) | 19.0 | UI library |
| [Socket.io Client](https://socket.io/) | 4.8.1 | Real-time bidirectional communication |
| [Tailwind CSS](https://tailwindcss.com/) | 3.4.1 | Utility-first CSS framework |
| [shadcn/ui](https://ui.shadcn.com/) | Latest | High-quality UI components |
| [Marked](https://marked.js.org/) | 15.0.7 | Markdown parser and renderer |
| [DOMPurify](https://github.com/cure53/DOMPurify) | 3.2.4 | XSS sanitizer for HTML |
| [Lucide React](https://lucide.dev/) | 0.475.0 | Icon library |
### Backend
| Technology | Purpose |
| ---------------------------------------------------------------- | --------------------------- |
| [Python](https://www.python.org/) | Core backend language |
| [FastAPI](https://fastapi.tiangolo.com/) | Modern async web framework |
| [Socket.io](https://socket.io/) | Real-time server |
| [Playwright](https://playwright.dev/) | Headless browser automation |
| [DuckDuckGo Search](https://github.com/deedy5/duckduckgo_search) | Privacy-focused search API |
| [LangChain](https://python.langchain.com/) | LLM orchestration framework |
| [Pydantic](https://docs.pydantic.dev/) | Data validation |
### AI Models
- **OpenAI GPT-4o-mini** - Fast summarization and agent reasoning
- **OpenAI o1-mini** - Deep reasoning for complex queries
- **Google Gemini 2.0 Flash** - High-speed content analysis
- **Meta LLaMA 3.3** (via OpenRouter) - Alternative model support
---
## 🚀 Installation
### Prerequisites
- **Node.js** 18.18+ (the minimum supported by Next.js 15) and npm/yarn
- **Python** 3.9+
- **OpenAI API Key**
- **Google Gemini API Key** (optional)
- **OpenRouter API Key** (optional)
### Step 1: Clone the Repository
```bash
git clone https://github.com/aneeshpatne/curiosity.git
cd curiosity
```
### Step 2: Backend Setup
#### Install Python Dependencies
```bash
# Install required packages
pip install fastapi uvicorn python-socketio playwright pydantic
pip install duckduckgo-search langchain langchain-openai langchain-google-genai
pip install python-dotenv markdown
# Install Playwright browsers
playwright install chromium
```
#### Configure Environment Variables
Create a `.env` file in the root directory:
```env
# Required
OPENAI_API_KEY=your_openai_api_key_here
# Optional (for alternative models)
GEMINI_API_KEY=your_gemini_api_key_here
OPEN_ROUTER_KEY=your_openrouter_key_here
# For Newsletter (Optional)
SMTP_SERVER=smtp.gmail.com
SMTP_PORT=587
EMAIL_SENDER=your_email@gmail.com
EMAIL_PASSWORD=your_app_password
EMAIL_RECEIVER=recipient@email.com
```
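A startup check can catch a missing key before the first LLM call. The following is a minimal sketch, not code from the repository: the `Settings` dataclass and `load_settings` function are hypothetical names, and only the environment variable names are taken from the template above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the API keys listed in the .env template."""
    openai_api_key: str
    gemini_api_key: Optional[str] = None
    open_router_key: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, failing fast if the required key is absent."""
    openai_key = env.get("OPENAI_API_KEY", "").strip()
    if not openai_key:
        raise RuntimeError("OPENAI_API_KEY is required but not set")
    return Settings(
        openai_api_key=openai_key,
        # Optional keys fall back to None when unset or empty.
        gemini_api_key=env.get("GEMINI_API_KEY") or None,
        open_router_key=env.get("OPEN_ROUTER_KEY") or None,
    )
```

Passing the environment as a mapping keeps the function easy to test; in the application it would presumably be called once after `python-dotenv` has loaded the `.env` file.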
#### Start the Backend Server
```bash
# From the Search directory
cd Search
python search-agent.py
# Server will start on http://localhost:4000
```
### Step 3: Frontend Setup
```bash
cd Frontend/curiosity
# Install dependencies
npm install
# Start development server
npm run dev
# Frontend will start on http://localhost:3000
```
### Step 4: Newsletter Setup (Optional)
```bash
cd News
# Make the shell script executable
chmod +x run_news_agent.sh
# Run manually
python news-agent.py
# Or set up a cron job for daily execution
crontab -e
# Add: 0 8 * * * /path/to/Curiosity/News/run_news_agent.sh
```
---
## 📖 Usage
### Starting the Application
1. **Start Backend**:
   ```bash
   cd Search
   python search-agent.py
   ```
2. **Start Frontend**:
   ```bash
   cd Frontend/curiosity
   npm run dev
   ```
3. **Access the Application**:
   - Open your browser to `http://localhost:3000`
### Using Different Search Modes
#### Normal Search
```
1. Select "Normal Search" from dropdown
2. Enter your query: "What is quantum computing?"
3. Get results from ~7 sources with citations
```
#### Pro Search
```
1. Select "Pro Search" from dropdown
2. Enter your query: "Latest developments in AI research"
3. Get comprehensive results from ~25 sources
```
#### Deep Search
```
1. Select "Deep Search" from dropdown
2. Enter complex query: "Impact of climate change on global economy"
3. System will:
   - Search initial query
   - Generate 20 follow-up questions
   - Recursively search each follow-up
   - Synthesize 100+ sources into comprehensive answer
```
### Newsletter Usage
```bash
# Manual execution
python News/news-agent.py
# Automated daily execution (8 AM)
# Run `crontab -e` and add this line to the crontab (it is not a shell command):
# 0 8 * * * /path/to/Curiosity/News/run_news_agent.sh
```
---
## 📂 Project Structure
```
Curiosity/
├── Frontend/
│   └── curiosity/
│       ├── src/
│       │   ├── app/
│       │   │   ├── layout.js        # Root layout
│       │   │   ├── page.js          # Home page
│       │   │   └── globals.css      # Global styles
│       │   ├── components/
│       │   │   ├── chat.jsx         # Main chat interface
│       │   │   └── ui/              # shadcn/ui components
│       │   │       ├── button.jsx
│       │   │       ├── input.jsx
│       │   │       └── select.jsx
│       │   └── lib/
│       │       └── utils.js         # Utility functions
│       ├── public/
│       │   └── assets/              # Static assets
│       ├── package.json
│       ├── next.config.mjs
│       ├── tailwind.config.mjs
│       └── components.json
│
├── Search/
│   ├── search-agent.py              # Main search agent with FastAPI server
│   ├── deep-search.py               # Standalone deep search implementation
│   ├── combined_sources.txt         # Debug output (generated)
│   └── Deprecated/                  # Legacy implementations
│       ├── search.py
│       ├── search-new.py
│       ├── search_local.py
│       └── test.py
│
├── News/
│   ├── news-agent.py                # Automated news summarization
│   ├── run_news_agent.sh            # Shell script for cron execution
│   ├── deepSearch.py                # News-specific deep search (deprecated)
│   ├── example.py
│   ├── simple.py
│   └── test.py
│
├── README.md
└── .env                             # Environment variables (create this)
```
---
## 🔬 How It Works
### Search Flow
```mermaid
sequenceDiagram
participant User
participant Frontend
participant Backend
participant Scraper
participant LLM
User->>Frontend: Enter query
Frontend->>Backend: Send via WebSocket
Backend->>Backend: Emit "waiting" status
Backend->>DuckDuckGo: Search query
DuckDuckGo-->>Backend: Return URLs
Backend->>Frontend: Emit sources
Backend->>Backend: Emit "scraping" status
par Parallel Scraping
Backend->>Scraper: Scrape URL 1
Backend->>Scraper: Scrape URL 2
Backend->>Scraper: Scrape URL N
end
Scraper-->>Backend: Return content
Backend->>Backend: Emit "thinking" status
Backend->>LLM: Summarize with citations
LLM-->>Backend: Return summary + follow-ups
Backend->>Frontend: Emit final response
Frontend->>User: Display with citations
```
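The sequence above can be sketched as a single coroutine. Everything in this example is a stand-in: `search`, `scrape`, and `summarize` are hypothetical stubs replacing DuckDuckGo, Playwright, and the LLM chain, and `emit` appends to a list instead of sending over Socket.io. It only illustrates the ordering of events and where the parallel step sits.

```python
import asyncio
from typing import Awaitable, Callable

Emit = Callable[[str, dict], Awaitable[None]]


async def search(query: str, max_results: int) -> list[str]:
    """Stub for the search step; returns placeholder URLs."""
    return [f"https://example.com/{i}" for i in range(1, max_results + 1)]


async def scrape(url: str) -> str:
    """Stub for the scraper; returns placeholder page text."""
    await asyncio.sleep(0)
    return f"content of {url}"


async def summarize(query: str, pages: list[str]) -> str:
    """Stub for the LLM call; reports how many pages it received."""
    return f"Summary of {len(pages)} pages for: {query}"


async def handle_query(msg_id: str, query: str, emit: Emit, max_results: int = 7) -> str:
    """Run search -> parallel scrape -> summarize, emitting progress along the way."""
    await emit("status", {"id": msg_id, "status": "waiting"})
    urls = await search(query, max_results)
    await emit("sources", {"id": msg_id, "sources": urls})

    await emit("status", {"id": msg_id, "status": "scraping"})
    pages = await asyncio.gather(*(scrape(u) for u in urls))  # all URLs at once

    await emit("status", {"id": msg_id, "status": "thinking"})
    summary = await summarize(query, list(pages))
    await emit("message", {"id": msg_id, "text": summary, "status": "finished"})
    return summary


def run_demo() -> list[tuple[str, dict]]:
    """Run the pipeline once and return the emitted events in order."""
    events: list[tuple[str, dict]] = []

    async def emit(event: str, payload: dict) -> None:
        events.append((event, payload))

    asyncio.run(handle_query("1", "What is quantum computing?", emit))
    return events
```

Calling `run_demo()` yields five events in the order status, sources, status, status, message, which mirrors the arrows in the diagram.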
### Deep Search Flow
```mermaid
graph TD
A[User Query] --> B[Initial Search]
B --> C[Scrape 5 URLs]
C --> D[Summarize]
D --> E[Generate 20 Follow-ups]
E --> F1[Follow-up 1]
E --> F2[Follow-up 2]
E --> F20[Follow-up 20]
F1 --> G1[Scrape 5 URLs]
F2 --> G2[Scrape 5 URLs]
F20 --> G20[Scrape 5 URLs]
G1 --> H1[Summarize]
G2 --> H2[Summarize]
G20 --> H20[Summarize]
H1 --> I[Combine All Summaries]
H2 --> I
H20 --> I
I --> J[Final LLM Synthesis]
J --> K[Comprehensive Answer]
```
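The diagram implies a simple source count: one initial query plus 20 follow-ups, each scraping 5 URLs, gives (1 + 20) × 5 = 105 sources, which matches the "100+ sources" figure. The sketch below reproduces that fan-out for a single level of follow-ups. The helpers `search_and_summarize` and `make_follow_ups` are hypothetical stubs, not functions from the repository.

```python
import asyncio


async def search_and_summarize(query: str, urls_per_query: int) -> tuple[str, list[str]]:
    """Stub for search + scrape + summarize on one query (hypothetical)."""
    slug = query.replace(" ", "-")
    urls = [f"https://example.com/{slug}/{i}" for i in range(urls_per_query)]
    return f"summary({query})", urls


def make_follow_ups(query: str, count: int) -> list[str]:
    """Stub for the LLM follow-up generator (hypothetical)."""
    return [f"{query} follow-up {i}" for i in range(1, count + 1)]


async def deep_search(query: str, follow_ups: int = 20, urls_per_query: int = 5):
    """One level of fan-out: the initial query, then every follow-up in parallel."""
    first_summary, first_urls = await search_and_summarize(query, urls_per_query)
    questions = make_follow_ups(query, follow_ups)
    results = await asyncio.gather(
        *(search_and_summarize(q, urls_per_query) for q in questions)
    )
    summaries = [first_summary] + [summary for summary, _ in results]
    sources = first_urls + [url for _, urls in results for url in urls]
    # A real implementation would pass `summaries` to a final LLM synthesis step.
    return summaries, sources
```

With the defaults, `asyncio.run(deep_search("climate economy"))` returns 21 summaries and 105 source URLs.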
### Component Details
#### 1. Web Scraping
```python
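import asyncio

semaphore = asyncio.Semaphore(7)  # Limit to 7 concurrent requests

async def block_requests(route):
    # Block images, stylesheets, fonts for speed (assumed implementation)
    if route.request.resource_type in {"image", "stylesheet", "font"}:
        await route.abort()
    else:
        await route.continue_()

# Concurrent scraping with semaphore control
async def scrape_page(context, url: str) -> str:
    async with semaphore:
        page = await context.new_page()
        await page.route("**/*", block_requests)
        await page.goto(url, wait_until="domcontentloaded")
        # Extract text content from semantic elements
        text_blocks = await page.locator("body p, h1, h2, h3").all_text_contents()
        cleaned_text = " ".join(" ".join(text_blocks).split())  # collapse whitespace (assumed)
        await page.close()
        return cleaned_text[:5000]  # First 5,000 characters of content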
```
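To see how the semaphore bounds concurrency when many URLs are gathered at once, here is a small self-contained sketch. It swaps Playwright for a hypothetical `fake_scrape` coroutine that just sleeps, and records the peak number of tasks inside the semaphore.

```python
import asyncio


async def scrape_all(urls: list[str], limit: int = 7) -> tuple[list[str], int]:
    """Scrape every URL with at most `limit` tasks in flight.

    Returns the results in input order plus the peak concurrency observed.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_scrape(url: str) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for page load + extraction
            in_flight -= 1
            return f"text from {url}"

    results = await asyncio.gather(*(fake_scrape(u) for u in urls))
    return list(results), peak
```

All 25 tasks of a Pro Search could be created up front this way; the semaphore, not the caller, decides how many run at a time, and `asyncio.gather` preserves input order so results can be matched back to their URLs.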
#### 2. LLM Summarization
```python
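from pydantic import BaseModel
from langchain_core.output_parsers import PydanticOutputParser, StrOutputParser
from langchain.output_parsers import RetryWithErrorOutputParser

# Structured output with citations and follow-ups
class SummaryFormat(BaseModel):
    content: str        # Markdown summary with [1] [2] citations
    moreQtn: list[str]  # 5-20 follow-up questions

# Chain: Prompt → LLM → Parser → Retry on Error
# (`prompt` and `llm` are defined elsewhere in search-agent.py)
parser = PydanticOutputParser(pydantic_object=SummaryFormat)
chain = prompt | llm | StrOutputParser()
retry_parser = RetryWithErrorOutputParser.from_llm(parser=parser, llm=llm, max_retries=3)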
```
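Because `content` carries numeric markers such as `[1]` and `[2]`, a post-processing check can confirm that every marker points at a source that was actually scraped. This is a sketch under assumptions: the helper names are hypothetical, and it assumes citations are 1-based indices into the source list.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def cited_indices(markdown: str) -> list[int]:
    """Return the distinct citation numbers in order of first appearance."""
    seen: list[int] = []
    for match in CITATION.finditer(markdown):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def invalid_citations(markdown: str, sources: list[str]) -> list[int]:
    """Return citation numbers that do not map to a source (1-based indexing)."""
    return [n for n in cited_indices(markdown) if not 1 <= n <= len(sources)]
```

A non-empty result from `invalid_citations` could be treated like a parse failure and fed back into the retry step.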
#### 3. Real-time Communication
```javascript
// Frontend emits query
socket.emit("message", { id, text: query, searchType });
// Backend emits updates (pseudocode; the server side is Python)
await sio.emit("status", { id, status: "searching" });
await sio.emit("sources", { id, sources: urls });
await sio.emit("message", { id, text: summary, status: "finished" });
```
---
## 🧩 Components
### Backend Components
#### `search-agent.py`
The main FastAPI server that orchestrates the entire search process:
- **FastAPI Server** - Handles HTTP and WebSocket connections
- **Socket.io Integration** - Real-time bidirectional communication
- **Search Orchestration** - Manages search, scrape, summarize pipeline
- **LLM Chain Management** - Coordinates multiple LLM calls with retry logic
- **Memory Management** - Maintains conversation context
- **Deep Search Engine** - Recursive multi-level search implementation
Key Functions:
- `follow_up()` - Main query handler with search type routing
- `deep_search()` - Recursive search with depth control
- `scrape_contents()` - Parallel web scraping
- `summarize()` - LLM-powered summarization with citations
- `generate_final_summary()` - Deep search synthesis
#### `deep-search.py`
Standalone implementation of deep search for testing and development:
- Source tracking with global citation counter
- Recursive question exploration
- Citation preservation across levels
- Final synthesis from all sources
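One way to keep citation numbers stable across recursion levels is a single registry that hands out one number per URL. The class below is a hypothetical sketch of that idea, not the code in `deep-search.py`.

```python
class CitationRegistry:
    """Assigns one global citation number per URL across all search levels."""

    def __init__(self) -> None:
        self._numbers: dict[str, int] = {}

    def cite(self, url: str) -> int:
        """Return the existing number for `url`, or allocate the next one."""
        if url not in self._numbers:
            self._numbers[url] = len(self._numbers) + 1
        return self._numbers[url]

    def references(self) -> list[str]:
        """Render a numbered reference list in allocation order."""
        return [f"[{n}] {url}" for url, n in self._numbers.items()]
```

Because a repeated URL gets its original number back, summaries produced at different levels can be merged without renumbering their citations.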
#### `news-agent.py`
Automated news aggregation and email delivery:
- Global news search
- Recursive deep search for context
- HTML email generation with styling
- SMTP email delivery
- Browser preview for testing
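The email steps can be sketched with the standard library alone. The function names here are hypothetical; the server, port, sender, and password would come from the `SMTP_SERVER`, `SMTP_PORT`, `EMAIL_SENDER`, and `EMAIL_PASSWORD` variables described under Installation.

```python
import smtplib
from email.message import EmailMessage


def build_newsletter(sender: str, receiver: str, subject: str, html_body: str) -> EmailMessage:
    """Build a multipart message with a plain-text fallback and an HTML part."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = receiver
    message["Subject"] = subject
    message.set_content("This newsletter is best viewed in an HTML-capable mail client.")
    message.add_alternative(html_body, subtype="html")
    return message


def send_newsletter(message: EmailMessage, server: str, port: int, password: str) -> None:
    """Send over SMTP with STARTTLS (port 587 in the .env template)."""
    with smtplib.SMTP(server, port) as smtp:
        smtp.starttls()
        smtp.login(message["From"], password)
        smtp.send_message(message)
```

Keeping message construction separate from delivery makes it possible to preview or test the generated HTML without contacting a mail server.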
### Frontend Components
#### `chat.jsx`
Main chat interface with real-time updates:
- **Message Management** - State handling for sent/received messages
- **Socket.io Integration** - Event listeners for status, sources, messages
- **Search Type Selection** - Dropdown for Normal/Pro/Deep modes
- **Real-time Status** - Loading indicators and progress updates
- **Source Display** - Live URL cards with favicons
- **Markdown Rendering** - Safe HTML rendering with DOMPurify
- **Citation Linking** - Interactive superscript citations
- **Follow-up Questions** - Clickable suggestions
Components:
- `Chat` - Main container component
- `SentMessage` - User query display
- `ReceivedMessage` - AI response with sources and citations
- `MarkdownRenderer` - Safe markdown to HTML conversion
- `Citation` - Interactive citation superscripts
- `Sources` - URL preview cards
- `FollowUp` - Follow-up question suggestions
---
## ⚙ Configuration
### LLM Model Selection
Edit the model configuration in `search-agent.py`:
```python
from langchain_openai import ChatOpenAI
from pydantic import SecretStr

# For faster, cheaper responses
agent_llm = ChatOpenAI(model='gpt-5-mini', api_key=SecretStr(api_key))
summary_llm = ChatOpenAI(model='gpt-5-mini', api_key=SecretStr(api_key))
# For higher quality, deeper reasoning
deep_search_llm = ChatOpenAI(model='gpt-5', api_key=SecretStr(api_key))
# For alternative providers (replaces the summary_llm defined above)
summary_llm = ChatOpenAI(
    base_url='https://openrouter.ai/api/v1',
    model='meta-llama/llama-3.3-70b-instruct:nitro',
    api_key=SecretStr(openRouterKey),
)
```
### Search Parameters
Customize search depth and source count:
```python
# Number of concurrent scraping tasks
semaphore = asyncio.Semaphore(7) # Adjust based on system resources
# Search result counts
normal_search_results = 7
pro_search_results = 25
deep_search_results = 5 # Per query level
# Deep search recursion depth
deep_search_depth = 2 # Levels of follow-up questions
# Number of follow-up questions
follow_up_questions = 20 # For deep search
```
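If these values are spread across the code, one option is to group them per search mode so the query handler can look them up by the `searchType` it receives. The sketch below is illustrative only: `SearchProfile`, `PROFILES`, and `profile_for` are hypothetical names, and the numbers are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchProfile:
    """Hypothetical bundle of the per-mode settings listed above."""
    results_per_query: int
    follow_up_questions: int = 0  # 0 means no recursive follow-up pass


PROFILES = {
    "normal": SearchProfile(results_per_query=7),
    "pro": SearchProfile(results_per_query=25),
    "deep": SearchProfile(results_per_query=5, follow_up_questions=20),
}


def profile_for(search_type: str) -> SearchProfile:
    """Look up a profile, falling back to normal search for unknown types."""
    return PROFILES.get(search_type.strip().lower(), PROFILES["normal"])
```

Falling back to the normal profile is a design choice for this sketch; rejecting unknown search types with an error would be equally reasonable.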
### Frontend Configuration
Edit Socket.io connection in `chat.jsx`:
```javascript
// Change backend URL (local development)
const socket = io("http://localhost:4000");
// For production, read the URL from the environment instead:
// const socket = io(process.env.NEXT_PUBLIC_BACKEND_URL);
```
---
### ⭐ Star this repository if you find it helpful!
**Made with ❤️ and curiosity**