An open API service indexing awesome lists of open source software.

https://github.com/balaji1233/deep_research_agent

A Deep research agent that will pull live data from sources like Google, Bing, and Reddit for answering user queries
https://github.com/balaji1233/deep_research_agent

brightdata langchain langgraph-agents

Last synced: 13 days ago
JSON representation

A Deep research agent that will pull live data from sources like Google, Bing, and Reddit for answering user queries

Awesome Lists containing this project

README

          

# DEEP_RESEARCH_AGENT
A Deep research agent that will pull live data from sources like Google, Bing, and Reddit for answering user queries
# Multi-Source Research AI Agent β€” LangGraph + Bright Data + OpenAI

**One-line:** I built a production-grade, multi-step AI research agent in Python that runs parallel live searches (Google, Bing, Reddit), scrapes and normalizes results via Bright Data, analyzes them with OpenAI/GPT, and returns a structured, cited synthesis β€” all orchestrated as a LangGraph workflow.

---

## Why this project (problem β†’ solution)
**Problem:** Most LLM assistants rely on single APIs or cached corpora and thus miss fresh, diverse, real-world evidence (SERPs, social sentiment, niche forum discussions). This leads to incomplete, biased, or outdated answers for real research tasks (e.g., competitive analysis, market research, relocation decisions).

**My solution:** a graph-based research agent that:
- queries multiple live sources in parallel,
- reliably scrapes and normalizes raw results (including Reddit posts/comments),
- applies layered LLM analysis (source-level β†’ synthesis),
- enforces typed outputs and citations for traceability.

This is framed as a **research assistant** (not a toy chat), designed to reduce manual research time and increase answer fidelity.

---

## Highlights / Key capabilities
- πŸš€ **Parallel multi-source search:** Google, Bing, Reddit (via Bright Data).
- πŸ”„ **Asynchronous snapshot handling:** trigger/crawl β†’ poll for readiness β†’ download parsed results.
- 🧠 **Layered LLM analysis:** per-source analysis + cross-source synthesis using OpenAI.
- 🧩 **LangGraph orchestration:** nodes with typed inputs/outputs, deterministic wiring, retries and partial state updates.
- πŸ“ **Structured outputs:** Pydantic (typed) schemas ensure consistent, machine-readable results (URLs, snippets, sentiment).
- πŸ” **Citations & provenance:** every claim links back to source results.
- πŸ› οΈ **Production-grade scaffolding:** modular functions, prompt templates, logging, error handling.

---

## Business value / Impact
- **Faster, better research:** replaces hours of manual searching with a ~minute response (depends on crawl latency), increasing analyst throughput Γ—5–10.
- **Broader coverage:** combining search engines + Reddit reduces blind spots and improves recall for niche topics.
- **Auditability:** citations & typed outputs reduce hallucination risk and improve stakeholder trust.
- **Operational readiness:** modular design makes it easy to extend to other sources (Twitter, news APIs) and integrate into production pipelines.

---

## Architecture (high level)
User Query
β”‚
β–Ό
LangGraph Controller (State)
β”œβ”€ Parallel Nodes: GoogleSearch, BingSearch, RedditSearch (calls Bright Data)
β”œβ”€ Fetch / Snapshot Node: polls until data ready, downloads parsed JSON
β”œβ”€ SourceAnalysis Node(s): LLM analyzes each source separately (prompt templates)
β”œβ”€ Synthesizer Node: LLM synthesizes final answer, merging source analyses
└─ Output Formatter: enforces Pydantic schema + citations

## Architecture β€” Description

### User Query
**User-provided question or prompt** that starts the pipeline.

### LangGraph Controller (State)
Central orchestrator that holds the typed state object (Pydantic), schedules nodes (parallel & sequential), manages retries/backoff, and collects partial outputs.

### Parallel Nodes: `GoogleSearch`, `BingSearch`, `RedditSearch` (calls Bright Data)
Nodes that simultaneously issue search/scrape requests via Bright Data to gather SERPs and social content. Each node writes raw results into the shared state.

### Fetch / Snapshot Node
For asynchronous sources (e.g., Reddit snapshots), this node polls for crawl readiness, downloads parsed JSON results, and stores them in state.

### SourceAnalysis Node(s)
Per-source LLM analysis nodes that consume raw results, run source-specific prompt templates, and extract summaries, ranked URLs, and metadata (sentiment, notable quotes).

### Synthesizer Node
Aggregates the per-source analyses and runs a higher-capacity LLM to merge insights, resolve contradictions, and produce the consolidated answer.

### Output Formatter
Validates and formats the final output using Pydantic schemas, attaches citations and confidence scores, and serializes the structured JSON result for downstream use.

---

### Compact flow (pasteable)

Key design points:
- **State object:** dict-like (user_query, raw_results, filtered_urls, reddit_posts, analyses, final_answer).
- **LangGraph nodes:** pure functions that `read β†’ update β†’ return` state slices.
- **Bright Data:** unified API for SERP & social scraping; handles CAPTCHA/IP issues at scale.
- **Pedantic/Pydantic models:** enforce output structure (lists of URLs, score fields, summaries).

---

## Tradeoffs & Challenges
- **Latency vs Coverage:** deep scraping + Reddit snapshots increase latency. I mitigate with parallelism and partial early results, but there’s an inherent tradeoff when you need exhaustive retrieval.
- **Cost vs Freshness:** Bright Data + high-capacity LLM calls incur costs β€” tune sampling depth and model choice for use-case (GPT-4 for final synth, GPT-3.5 for source summaries).
- **Hallucination risk:** even with RAG, LLMs can hallucinate. I reduce this via explicit citation requirements, structured outputs, and a hallucination detector/QA stage.
- **Scraping reliability:** Bright Data helps, but snapshots and polling are required (complexity in async handling).
- **Rate limits & retries:** robust error handling and backoff strategies are needed for production.

---

## Metrics I track
- **Recall of relevant sources (%):** manual sampling vs agent results.
- **Precision / factuality score:** human-annotated ranking of claims.
- **End-to-end latency:** median and 95th percentile.
- **Cost per query:** API + scraping cost.
- **User satisfaction / accuracy lift:** AB tests vs human-only research.

---

## Quickstart (run locally)
> **Prereqs:** Python 3.10+, virtualenv, Bright Data account (API key), OpenAI API key.

```bash
git clone https://github.com/yourname/langgraph-research-agent.git
cd langgraph-research-agent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# edit .env to add OPENAI_API_KEY, BRIGHTDATA_API_KEY, etc.
python run_agent.py --query "Should I move to Lisbon in 2025?"
```
## How I built it β€” Implementation notes & best practices

### Node design
- Keep nodes **deterministic** and **side-effect free**: accept state in β†’ return updated state out.
- Benefits: easier unit testing, retries, and reproducibility.

### Prompt layering
- **Summarize each source first**, then run a separate synthesis prompt over those summaries.
- This reduces LLM cognitive load and improves factual grounding.

### Typed outputs
- Use **Pydantic models** for outputs such as `SelectedURLs`, `SourceSummary`, `FinalSynthesis`.
- Validate LLM outputs against schemas; **retry or re-prompt** when structure is violated.

### Snapshot polling
- For crawls/snapshots implement robust polling with:
- exponential backoff,
- sensible timeouts,
- support for **partial retrievals** when some snapshots finish earlier.

### Cost control
- Use smaller, cheaper models for intermediate summarization tasks.
- Reserve higher-cost models for the **final synthesis** step only.

### Logging & observability
- Emit **structured JSON logs** and request traces.
- Expose metrics/dashboards for latency, error rates, and cost-per-query.

---

## Tests & validation
- **Unit tests** for every node (mock Bright Data and OpenAI responses).
- **Integration smoke tests** to validate end-to-end flow (use canned datasets).
- **Human evaluation pipeline** to measure factuality, precision, and recall; use these metrics to iterate on prompts and filters.

---

## Roadmap / Extensions
- Add more sources: **Twitter/X**, news APIs, academic sources (CrossRef, Semantic Scholar).
- Support **multi-turn conversation** and incremental updating of the state.
- Add a **knowledge-update node** to feed validated claims into a short-term cache or knowledge store.
- Build a **lightweight UI** for running queries, inspecting state & sources, and approving final outputs.

---

## Contribution
I welcome PRs that:
- add new source integrations,
- improve prompt templates,
- harden snapshot/poll handling, or
- add CI tests and sample datasets.

Please open issues for feature requests or bugs.