https://github.com/mihirchhiber/topicresearcherbot
AI-powered topic research assistant that automates article discovery, extraction, and summarization. Built with FastAPI, LangChain, Ollama (Llama 3), and MongoDB, it scrapes topic-specific content using Google Custom Search, cleans noisy text with an LLM, and generates concise summaries with a second LLM pass. Ideal for analysts and researchers seeking fast.
agentic-ai ai fastapi langchain large-language-models llama3 llm mongodb ollama python summarization webscraping
- Host: GitHub
- URL: https://github.com/mihirchhiber/topicresearcherbot
- Owner: mihirchhiber
- Created: 2025-04-24T15:28:34.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-22T15:58:54.000Z (about 1 year ago)
- Last Synced: 2025-06-15T16:46:13.628Z (about 1 year ago)
- Topics: agentic-ai, ai, fastapi, langchain, large-language-models, llama3, llm, mongodb, ollama, python, summarization, webscraping
- Language: Python
- Homepage:
- Size: 93.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Topic Researcher Bot
**Topic Researcher Bot** is an intelligent assistant designed to streamline the process of researching and analyzing online content. It uses LLMs (via LangChain and Ollama) to automate topic-based article discovery, content cleaning, and summarization. This tool is ideal for researchers, analysts, and content teams who want fast, structured insights from recent web articles.
---
## Features
- **Automated Web Search:** Uses Google Custom Search API to find recent articles related to specific topics and sites.
- **Article Scraping:** Extracts content from article pages using BeautifulSoup and custom rules to avoid noisy or irrelevant data.
- **LLM-based Cleaning:** Leverages LLMs to isolate and retain only the core article body, removing ads, UI elements, and boilerplate.
- **Summarization:** Uses a second LLM pass to generate concise summaries of the cleaned content.
- **Structured Storage:** Articles are stored in a MongoDB database for easy retrieval, filtering, and deletion.
- **Batch Article Retrieval:** Supports multi-week historical searches and batch retrieval per topic-site pair.
- **Backend:** Developed using FastAPI.
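
As a rough illustration of the search and batch-retrieval features above, the sketch below builds one Google Custom Search JSON API request URL per topic-site pair and limits results to the last few weeks. The function name `build_search_urls` and its parameters are hypothetical and not taken from the project; only the endpoint and the query parameters (`key`, `cx`, `q`, `siteSearch`, `siteSearchFilter`, `dateRestrict`, `num`) come from the public API. No network request is made.

```python
from itertools import product
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_urls(topics, sites, api_key, engine_id, weeks=1):
    """Return one request URL per (topic, site) pair.

    `weeks` maps to the API's `dateRestrict=w<N>` parameter, which limits
    results to the last N weeks. This helper is a hypothetical sketch.
    """
    urls = []
    for topic, site in product(topics, sites):
        params = {
            "key": api_key,            # API key (placeholder value in this sketch)
            "cx": engine_id,           # Programmable Search Engine ID
            "q": topic,                # search terms
            "siteSearch": site,        # site to filter on
            "siteSearchFilter": "i",   # "i" = include only results from that site
            "dateRestrict": f"w{weeks}",
            "num": 10,                 # the API returns at most 10 results per request
        }
        urls.append(f"{CSE_ENDPOINT}?{urlencode(params)}")
    return urls
```

Calling `build_search_urls(["ai safety"], ["example.com"], "KEY", "CX", weeks=2)` yields a single URL; fetching it and parsing the JSON response is left to whichever HTTP client the project uses.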
---
## Components
### `article.py`
Defines the `Article` class with fields like:
- `id`
- `topics`
- `sites`
- `title`
- `url`
- `source`
- `content`
- `clean_content`
- `summary`
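
A minimal sketch of what such a class might look like, written as a dataclass. The field names come from the list above; the types, defaults, and the `to_document` helper are assumptions, and the project's actual class may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Article:
    # Field names follow the README; all types and defaults are assumptions.
    title: str
    url: str
    source: str
    content: str
    topics: List[str] = field(default_factory=list)
    sites: List[str] = field(default_factory=list)
    clean_content: Optional[str] = None  # filled in by the cleaning pass
    summary: Optional[str] = None        # filled in by the summarization pass
    id: Optional[str] = None             # e.g. the database key once stored

    def to_document(self) -> dict:
        """Convert to a plain dict, dropping `id` when it is unset (hypothetical helper)."""
        doc = asdict(self)
        if doc["id"] is None:
            del doc["id"]
        return doc
```

Keeping `clean_content` and `summary` optional lets the same object travel through the pipeline, with each stage filling in its own field.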
### `database.py`
Handles MongoDB operations for:
- Storing new articles
- Retrieving and deleting existing ones
- Removing duplicates based on title and content hash
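
The duplicate-removal step could be sketched as follows. This version works on plain dictionaries in memory rather than on MongoDB, and it assumes that "title and content hash" means a single SHA-256 digest over the normalized title and content; the function names `dedup_key` and `remove_duplicates` are hypothetical.

```python
import hashlib


def dedup_key(title: str, content: str) -> str:
    """Build a stable key from the normalized title and content."""
    # Lowercase the title and collapse whitespace so trivial differences
    # do not produce distinct keys.
    normalized = " ".join(title.lower().split()) + "\n" + " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_duplicates(articles):
    """Keep the first article seen for each key.

    Each item is expected to be a dict with 'title' and 'content' entries.
    """
    seen = set()
    unique = []
    for article in articles:
        key = dedup_key(article["title"], article["content"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```

In a database-backed version, the same key could be stored alongside each document so that duplicates are detected at insert time instead of in a separate pass.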
### `llmarticles.py`
Core script that:
- Searches and scrapes articles via Google CSE
- Cleans noisy text using LLM prompts
- Summarizes content with LangChain + Ollama
- Manages LLM memory and prompts for efficiency
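
To illustrate the scraping step, here is a dependency-free sketch of rule-based text extraction. Note that it uses the standard library's `html.parser` rather than BeautifulSoup, which the project itself uses, and the skip rules (`SKIP_TAGS`) and names (`ParagraphExtractor`, `extract_article_text`) are assumptions chosen for the example, not the project's actual rules.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as noise in this sketch (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class ParagraphExtractor(HTMLParser):
    """Collect text from <p> elements that are not inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside any SKIP_TAGS element
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def _flush(self):
        # Collapse whitespace and keep the paragraph if anything is left.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            if self.in_paragraph:  # previous <p> was never closed
                self._flush()
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self._flush()

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def extract_article_text(html: str) -> str:
    """Return the retained paragraphs joined by blank lines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Rules like these only remove the most obvious noise, which is presumably why the extracted text is then passed to an LLM for a further cleaning pass.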
---
## Usage
1. **Define Topics & Sites:** Configure your topics and target news sites.
2. **Run Retrieval:** Call `get_recent_articles(topics, sites)` to perform the search and scraping.
3. **Clean & Summarize:** Each article passes through `clean_article_text()` and then `summarize_article_text()` for post-processing.
4. **View/Manage Data:** Use the database functions to filter or delete articles as needed.
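
The clean-then-summarize step might look roughly like the sketch below. The function names `clean_article_text` and `summarize_article_text` come from this README, but their signatures here are assumptions: in particular, the `llm` parameter (any callable that takes a prompt string and returns the model's reply), the prompt wording, and the `process_article` helper are all hypothetical.

```python
from typing import Callable

# Prompt wording is illustrative, not taken from the project.
CLEAN_PROMPT = (
    "Return only the main article body from the text below. "
    "Remove ads, navigation, and other boilerplate.\n\n{text}"
)
SUMMARY_PROMPT = "Summarize the following article in 3-5 sentences.\n\n{text}"

# Assumed interface: takes a prompt string, returns the model's reply.
LLM = Callable[[str], str]


def clean_article_text(raw_text: str, llm: LLM) -> str:
    """First pass: ask the model to keep only the article body."""
    return llm(CLEAN_PROMPT.format(text=raw_text)).strip()


def summarize_article_text(clean_text: str, llm: LLM) -> str:
    """Second pass: summarize the already-cleaned text."""
    return llm(SUMMARY_PROMPT.format(text=clean_text)).strip()


def process_article(article: dict, llm: LLM) -> dict:
    """Run both passes and return a copy with the new fields filled in."""
    cleaned = clean_article_text(article["content"], llm)
    summary = summarize_article_text(cleaned, llm)
    return {**article, "clean_content": cleaned, "summary": summary}
```

Passing the model in as a plain callable keeps the sketch independent of any particular backend: a wrapper around an Ollama or Groq client would fit the same interface, and a stub function can stand in for the model during testing.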
---
## Tech Stack
- FastAPI
- LangChain
- Ollama or Groq AI (LLM backend)
- MongoDB
- Google Custom Search API
- BeautifulSoup