https://github.com/muhd-umer/news-llm
NewsLLM is a RAG-based LLM application designed to analyze and summarize news articles
https://github.com/muhd-umer/news-llm
gemini generative-ai llm news rag
Last synced: 3 months ago
JSON representation
NewsLLM is a RAG-based LLM application designed to analyze and summarize news articles
- Host: GitHub
- URL: https://github.com/muhd-umer/news-llm
- Owner: muhd-umer
- License: mit
- Created: 2024-08-01T14:41:01.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-12-10T11:44:22.000Z (over 1 year ago)
- Last Synced: 2025-04-15T08:53:03.763Z (about 1 year ago)
- Topics: gemini, generative-ai, llm, news, rag
- Language: Python
- Homepage: https://news-llm.streamlit.app
- Size: 22.8 MB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
NewsLLM is a RAG-based LLM application designed to analyze and summarize news articles. It uses [Groq](https://groq.com/) API to generate summaries and insights for different countries and different topics. The application is built using Streamlit and Python.
## How NewsLLM Works
NewsLLM follows a robust workflow to provide accurate and insightful news summaries:
1. **Data Acquisition**
- NewsLLM utilizes a custom web scraper (`scraper.py`) to gather news articles from various sources.
- The scraper targets specific countries and topics.
- For each topic and country combination, it scrapes up to _10 relevant news articles_ from Google Search results to ensure a diverse range of perspectives are considered.
2. **Database** (ChromaDB)
- Scraped news articles are processed and stored in a vector database using the `Chroma` library.
- This database allows for efficient storage and retrieval of news articles based on their semantic similarity.
- Each article is represented as a vector embedding, capturing its core meaning and context.
- Database is updated automatically every 24 hours to ensure the information remains current.
3. **Context Retrieval**
- When a user submits a query (country and topic), NewsLLM retrieves the most relevant articles from the database.
- User query is transformed into a vector embedding and compared against the embeddings of the stored articles.
- Top _k=7_ most similar articles are selected as the context for generating the summary and analysis.
4. **Output Generation**
- NewsLLM utilizes the `Groq` API and uses the `llama-3.1-70b-versatile` model for output generation.
- Formed context is fed to the LLM, along with carefully constructed prompts designed to elicit a comprehensive summary and analysis.
- LLM generates a structured response covering key points, trends, context, impacts, controversies, statistical insights, global relevance, and future outlook.
- Users are also giving the option to continue asking follow-up questions, if desired.
## Installation
1. Clone the repository:
```bash
git clone https://github.com/muhd-umer/news-llm.git
```
2. Install dependencies:
```bash
pip install -r requirements.txt
```
3. Obtain an API key for Groq and store it in a `.env` file as `GROQ_API_KEY=""`.
## Usage
1. Create initial database as:
```bash
python database.py
```
2. Run the application using:
```bash
streamlit run app.py
```
3. Select a country and topic from the sidebar.
4. Click _"Analyze news"_ to generate a summary and analysis.
5. Ask follow-up questions in the chat interface.
## Tech Stack
- **Streamlit:** For building the user interface.
- **Python:** The primary programming language.
- **Groq API:** For LLM-powered summary and analysis generation.
- **Chroma:** For vector database storage and retrieval.
- **Beautiful Soup:** For web scraping.
- **Google Search:** For finding relevant news articles.
- **LangChain:** For LLM orchestration and prompt management.
## Contributing
Contributions are welcome! Please feel free to open issues and submit pull requests.