https://github.com/mihirchhiber/topicresearcherbot

AI-powered topic research assistant that automates article discovery, extraction, and summarization. Built with FastAPI, LangChain, Ollama (Llama 3), and MongoDB, it finds topic-specific articles with Google Custom Search, scrapes them, cleans the noisy text with an LLM, and generates concise summaries with a second LLM pass. Ideal for analysts and researchers seeking fast, structured insights.
agentic-ai ai fastapi langchain large-language-models llama3 llm mongodb ollama python summarization webscraping


Host: GitHub
URL: https://github.com/mihirchhiber/topicresearcherbot
Owner: mihirchhiber
Created: 2025-04-24T15:28:34.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-22T15:58:54.000Z (about 1 year ago)
Last Synced: 2025-06-15T16:46:13.628Z (about 1 year ago)
Topics: agentic-ai, ai, fastapi, langchain, large-language-models, llama3, llm, mongodb, ollama, python, summarization, webscraping
Language: Python
Homepage:
Size: 93.8 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Topic Researcher Bot

**Topic Researcher Bot** is an intelligent assistant designed to streamline the process of researching and analyzing online content. It uses LLMs (via LangChain and Ollama) to automate topic-based article discovery, content cleaning, and summarization. This tool is ideal for researchers, analysts, and content teams who want fast, structured insights from recent web articles.

---

## Features

- **Automated Web Search:** Uses the Google Custom Search API to find recent articles for specific topics on specific sites.
- **Article Scraping:** Extracts content from article pages using BeautifulSoup and custom rules to avoid noisy or irrelevant data.
- **LLM-based Cleaning:** Leverages LLMs to isolate and retain only the core article body, removing ads, UI elements, and boilerplate.
- **Summarization:** Uses a second LLM pass to generate concise summaries of the cleaned content.
- **Structured Storage:** Articles are stored in a MongoDB database for easy retrieval, filtering, and deletion.
- **Batch Article Retrieval:** Supports multi-week historical searches and batch retrieval per topic-site pair.
- **Backend:** Built with FastAPI.
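
As a rough illustration of the search step, the sketch below builds a Google Custom Search JSON API request for one topic-site pair and a look-back window in weeks. The function name `build_search_url` and the placeholder credentials are assumptions made for this example; the project's actual query construction is not shown in this README.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(topic: str, site: str, weeks: int, api_key: str, engine_id: str) -> str:
    """Build a Custom Search request URL for one topic-site pair (hypothetical helper)."""
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    params = {
        "key": api_key,               # API key (placeholder value in this sketch)
        "cx": engine_id,              # Programmable Search Engine ID (placeholder)
        "q": topic,                   # the topic is used directly as the query
        "siteSearch": site,           # restrict results to one site
        "siteSearchFilter": "i",      # "i" = include only results from that site
        "dateRestrict": f"w{weeks}",  # only results from the last N weeks
        "num": 10,                    # the API returns at most 10 results per request
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


url = build_search_url("quantum computing", "example.com", 2, "YOUR_API_KEY", "YOUR_ENGINE_ID")
```

Fetching the resulting URL with any HTTP client returns JSON whose `items` entries carry the `title` and `link` of each hit, which can then be handed to the scraper.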

---

## Components

### `article.py`
Defines the `Article` class with fields like:
- `id`
- `topics`
- `sites`
- `title`
- `url`
- `source`
- `content`
- `clean_content`
- `summary`
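
For orientation, a record with these fields might look like the dataclass below. The field names come from the list above; the types, defaults, and the `to_document` helper are assumptions, not the repository's actual definition.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Article:
    """Illustrative record mirroring the fields listed above (types are assumed)."""
    title: str
    url: str
    source: str
    content: str                          # raw scraped text
    topics: List[str] = field(default_factory=list)
    sites: List[str] = field(default_factory=list)
    clean_content: Optional[str] = None   # filled in by the cleaning pass
    summary: Optional[str] = None         # filled in by the summarization pass
    id: Optional[str] = None              # assigned once the article is stored

    def to_document(self) -> dict:
        """Return a plain dict suitable for insertion into a document store."""
        doc = asdict(self)
        if doc["id"] is None:
            doc.pop("id")  # assumption: the database assigns the identifier
        return doc
```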

### `database.py`
Handles MongoDB operations for:
- Storing new articles
- Retrieving and deleting existing ones
- Removing duplicates based on title and content hash
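
One way to implement duplicate removal by title and content hash is sketched below. It is a minimal in-memory version under stated assumptions: the normalization rules (collapsing whitespace, lowercasing) and the helper names `article_fingerprint` and `drop_duplicates` are hypothetical, and the project's own hashing scheme may differ.

```python
import hashlib
from typing import Dict, Iterable, List


def article_fingerprint(title: str, content: str) -> str:
    """Hash the normalized title and content so whitespace or case changes still match."""
    normalized = " ".join(title.split()).lower() + "\n" + " ".join(content.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(articles: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first article for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for article in articles:
        key = article_fingerprint(article["title"], article["content"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

With MongoDB, the same fingerprint could instead be stored on each document and backed by a unique index, so that duplicates are rejected at insert time.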

### `llmarticles.py`
Core script that:
- Searches and scrapes articles via Google CSE
- Cleans noisy text using LLM prompts
- Summarizes content with LangChain + Ollama
- Manages LLM memory and prompts for efficiency
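
The two LLM passes can be pictured as the chain below: the cleaned text from the first pass, not the raw scrape, is what gets summarized. This is a sketch only. The prompt wording, the `max_chars` guard, and the function name `clean_then_summarize` are assumptions, and the model is injected as a plain callable so the example runs without LangChain or Ollama installed.

```python
from typing import Callable

# Hypothetical prompt templates; the project's real prompts are not shown in this README.
CLEAN_PROMPT = (
    "Extract only the main article body from the text below. "
    "Remove ads, navigation, UI elements, and boilerplate.\n\n{text}"
)
SUMMARY_PROMPT = "Summarize the following article in a few sentences.\n\n{text}"

LLM = Callable[[str], str]  # any function that maps a prompt to a completion


def clean_then_summarize(raw_text: str, llm: LLM, max_chars: int = 8000) -> dict:
    """Run the cleaning pass, then summarize the cleaned text rather than the raw scrape."""
    truncated = raw_text[:max_chars]  # crude guard against overlong prompts
    clean = llm(CLEAN_PROMPT.format(text=truncated)).strip()
    summary = llm(SUMMARY_PROMPT.format(text=clean)).strip()
    return {"clean_content": clean, "summary": summary}


def fake_llm(prompt: str) -> str:
    # Stand-in model for offline runs: echoes the last line of the prompt.
    return prompt.splitlines()[-1]


result = clean_then_summarize("Subscribe now!\nThe actual story.", fake_llm)
```

In a real run, `llm` would wrap a call to the configured model, for example a LangChain chat model's `invoke` method.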

---

## Usage

1. **Define Topics & Sites**
Configure your topics and target news sites.

2. **Run Retrieval**
Use `get_recent_articles(topics, sites)` to perform the search and scraping.

3. **Clean & Summarize**
Each article is passed through `clean_article_text()` and then `summarize_article_text()` for post-processing.

4. **View/Manage Data**
Use database functions to filter or delete articles as needed.
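
The steps above can be sketched as a small driver loop. Everything here is illustrative: `run_pipeline` is a hypothetical wrapper, the `fetch`, `clean`, and `summarize` callables stand in for `get_recent_articles()`, `clean_article_text()`, and `summarize_article_text()`, and their exact signatures and return types are assumptions.

```python
from itertools import product
from typing import Callable, Dict, List

ArticleDict = Dict[str, str]


def run_pipeline(
    topics: List[str],
    sites: List[str],
    fetch: Callable[[str, str], List[ArticleDict]],
    clean: Callable[[str], str],
    summarize: Callable[[str], str],
) -> List[ArticleDict]:
    """Fetch articles for every topic-site pair, then clean and summarize each one."""
    processed = []
    for topic, site in product(topics, sites):
        for article in fetch(topic, site):
            cleaned = clean(article["content"])
            processed.append({
                **article,
                "topic": topic,
                "site": site,
                "clean_content": cleaned,
                "summary": summarize(cleaned),
            })
    return processed


# Stand-in callables so the sketch runs offline.
def fake_fetch(topic: str, site: str) -> List[ArticleDict]:
    return [{"title": f"{topic} on {site}", "content": f"  News about {topic}.  "}]


results = run_pipeline(["ai", "chips"], ["example.com"], fake_fetch, str.strip, lambda text: text[:12])
```

The processed records can then be written to the database and filtered or deleted by topic or site as described in step 4.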

---

## Tech Stack

- FastAPI
- LangChain
- Ollama or Groq AI (LLM backend)
- MongoDB
- Google Custom Search API
- BeautifulSoup
