https://github.com/balaji1233/deep_research_agent
A Deep research agent that will pull live data from sources like Google, Bing, and Reddit for answering user queries
https://github.com/balaji1233/deep_research_agent
brightdata langchain langgraph-agents
Last synced: 13 days ago
JSON representation
A Deep research agent that will pull live data from sources like Google, Bing, and Reddit for answering user queries
- Host: GitHub
- URL: https://github.com/balaji1233/deep_research_agent
- Owner: balaji1233
- Created: 2025-09-12T15:31:51.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-29T15:23:32.000Z (8 months ago)
- Last Synced: 2025-09-29T17:26:28.078Z (8 months ago)
- Topics: brightdata, langchain, langgraph-agents
- Language: Python
- Homepage:
- Size: 74.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# DEEP_RESEARCH_AGENT
A Deep research agent that will pull live data from sources like Google, Bing, and Reddit for answering user queries
# Multi-Source Research AI Agent β LangGraph + Bright Data + OpenAI
**One-line:** I built a production-grade, multi-step AI research agent in Python that runs parallel live searches (Google, Bing, Reddit), scrapes and normalizes results via Bright Data, analyzes them with OpenAI/GPT, and returns a structured, cited synthesis β all orchestrated as a LangGraph workflow.
---
## Why this project (problem β solution)
**Problem:** Most LLM assistants rely on single APIs or cached corpora and thus miss fresh, diverse, real-world evidence (SERPs, social sentiment, niche forum discussions). This leads to incomplete, biased, or outdated answers for real research tasks (e.g., competitive analysis, market research, relocation decisions).
**My solution:** a graph-based research agent that:
- queries multiple live sources in parallel,
- reliably scrapes and normalizes raw results (including Reddit posts/comments),
- applies layered LLM analysis (source-level β synthesis),
- enforces typed outputs and citations for traceability.
This is framed as a **research assistant** (not a toy chat), designed to reduce manual research time and increase answer fidelity.
---
## Highlights / Key capabilities
- π **Parallel multi-source search:** Google, Bing, Reddit (via Bright Data).
- π **Asynchronous snapshot handling:** trigger/crawl β poll for readiness β download parsed results.
- π§ **Layered LLM analysis:** per-source analysis + cross-source synthesis using OpenAI.
- π§© **LangGraph orchestration:** nodes with typed inputs/outputs, deterministic wiring, retries and partial state updates.
- π **Structured outputs:** Pydantic (typed) schemas ensure consistent, machine-readable results (URLs, snippets, sentiment).
- π **Citations & provenance:** every claim links back to source results.
- π οΈ **Production-grade scaffolding:** modular functions, prompt templates, logging, error handling.
---
## Business value / Impact
- **Faster, better research:** replaces hours of manual searching with a ~minute response (depends on crawl latency), increasing analyst throughput Γ5β10.
- **Broader coverage:** combining search engines + Reddit reduces blind spots and improves recall for niche topics.
- **Auditability:** citations & typed outputs reduce hallucination risk and improve stakeholder trust.
- **Operational readiness:** modular design makes it easy to extend to other sources (Twitter, news APIs) and integrate into production pipelines.
---
## Architecture (high level)
User Query
β
βΌ
LangGraph Controller (State)
ββ Parallel Nodes: GoogleSearch, BingSearch, RedditSearch (calls Bright Data)
ββ Fetch / Snapshot Node: polls until data ready, downloads parsed JSON
ββ SourceAnalysis Node(s): LLM analyzes each source separately (prompt templates)
ββ Synthesizer Node: LLM synthesizes final answer, merging source analyses
ββ Output Formatter: enforces Pydantic schema + citations
## Architecture β Description
### User Query
**User-provided question or prompt** that starts the pipeline.
### LangGraph Controller (State)
Central orchestrator that holds the typed state object (Pydantic), schedules nodes (parallel & sequential), manages retries/backoff, and collects partial outputs.
### Parallel Nodes: `GoogleSearch`, `BingSearch`, `RedditSearch` (calls Bright Data)
Nodes that simultaneously issue search/scrape requests via Bright Data to gather SERPs and social content. Each node writes raw results into the shared state.
### Fetch / Snapshot Node
For asynchronous sources (e.g., Reddit snapshots), this node polls for crawl readiness, downloads parsed JSON results, and stores them in state.
### SourceAnalysis Node(s)
Per-source LLM analysis nodes that consume raw results, run source-specific prompt templates, and extract summaries, ranked URLs, and metadata (sentiment, notable quotes).
### Synthesizer Node
Aggregates the per-source analyses and runs a higher-capacity LLM to merge insights, resolve contradictions, and produce the consolidated answer.
### Output Formatter
Validates and formats the final output using Pydantic schemas, attaches citations and confidence scores, and serializes the structured JSON result for downstream use.
---
### Compact flow (pasteable)
Key design points:
- **State object:** dict-like (user_query, raw_results, filtered_urls, reddit_posts, analyses, final_answer).
- **LangGraph nodes:** pure functions that `read β update β return` state slices.
- **Bright Data:** unified API for SERP & social scraping; handles CAPTCHA/IP issues at scale.
- **Pedantic/Pydantic models:** enforce output structure (lists of URLs, score fields, summaries).
---
## Tradeoffs & Challenges
- **Latency vs Coverage:** deep scraping + Reddit snapshots increase latency. I mitigate with parallelism and partial early results, but thereβs an inherent tradeoff when you need exhaustive retrieval.
- **Cost vs Freshness:** Bright Data + high-capacity LLM calls incur costs β tune sampling depth and model choice for use-case (GPT-4 for final synth, GPT-3.5 for source summaries).
- **Hallucination risk:** even with RAG, LLMs can hallucinate. I reduce this via explicit citation requirements, structured outputs, and a hallucination detector/QA stage.
- **Scraping reliability:** Bright Data helps, but snapshots and polling are required (complexity in async handling).
- **Rate limits & retries:** robust error handling and backoff strategies are needed for production.
---
## Metrics I track
- **Recall of relevant sources (%):** manual sampling vs agent results.
- **Precision / factuality score:** human-annotated ranking of claims.
- **End-to-end latency:** median and 95th percentile.
- **Cost per query:** API + scraping cost.
- **User satisfaction / accuracy lift:** AB tests vs human-only research.
---
## Quickstart (run locally)
> **Prereqs:** Python 3.10+, virtualenv, Bright Data account (API key), OpenAI API key.
```bash
git clone https://github.com/yourname/langgraph-research-agent.git
cd langgraph-research-agent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# edit .env to add OPENAI_API_KEY, BRIGHTDATA_API_KEY, etc.
python run_agent.py --query "Should I move to Lisbon in 2025?"
```
## How I built it β Implementation notes & best practices
### Node design
- Keep nodes **deterministic** and **side-effect free**: accept state in β return updated state out.
- Benefits: easier unit testing, retries, and reproducibility.
### Prompt layering
- **Summarize each source first**, then run a separate synthesis prompt over those summaries.
- This reduces LLM cognitive load and improves factual grounding.
### Typed outputs
- Use **Pydantic models** for outputs such as `SelectedURLs`, `SourceSummary`, `FinalSynthesis`.
- Validate LLM outputs against schemas; **retry or re-prompt** when structure is violated.
### Snapshot polling
- For crawls/snapshots implement robust polling with:
- exponential backoff,
- sensible timeouts,
- support for **partial retrievals** when some snapshots finish earlier.
### Cost control
- Use smaller, cheaper models for intermediate summarization tasks.
- Reserve higher-cost models for the **final synthesis** step only.
### Logging & observability
- Emit **structured JSON logs** and request traces.
- Expose metrics/dashboards for latency, error rates, and cost-per-query.
---
## Tests & validation
- **Unit tests** for every node (mock Bright Data and OpenAI responses).
- **Integration smoke tests** to validate end-to-end flow (use canned datasets).
- **Human evaluation pipeline** to measure factuality, precision, and recall; use these metrics to iterate on prompts and filters.
---
## Roadmap / Extensions
- Add more sources: **Twitter/X**, news APIs, academic sources (CrossRef, Semantic Scholar).
- Support **multi-turn conversation** and incremental updating of the state.
- Add a **knowledge-update node** to feed validated claims into a short-term cache or knowledge store.
- Build a **lightweight UI** for running queries, inspecting state & sources, and approving final outputs.
---
## Contribution
I welcome PRs that:
- add new source integrations,
- improve prompt templates,
- harden snapshot/poll handling, or
- add CI tests and sample datasets.
Please open issues for feature requests or bugs.