An open API service indexing awesome lists of open source software.

https://github.com/balaji1233/web_master

AI tool to transforms any URL into a structured knowledge source by: extracting content using Crawl4AI ,vectorizing and summarizing data , running Retrieval-Augmented Generation (RAG) for deep information discovery, enabling a smart chatbot for interactive Q&A.
https://github.com/balaji1233/web_master

crawl4ai deepseek-r1 docker faiss-vector-database ollama rag streamlit

Last synced: 7 months ago
JSON representation

AI tool to transforms any URL into a structured knowledge source by: extracting content using Crawl4AI ,vectorizing and summarizing data , running Retrieval-Augmented Generation (RAG) for deep information discovery, enabling a smart chatbot for interactive Q&A.

Awesome Lists containing this project

README

          

# WEB_MASTER
AI tool transforms any URL into a structured knowledge source by: extracting content using Crawl4AI ,vectorizing and summarizing data , running Retrieval-Augmented Generation (RAG) for deep information discovery, enabling a smart chatbot for interactive Q&A.

**WebMaster** is a powerful AI-driven tool that transforms any URL into a structured knowledge source. Built using [Crawl4AI](#), [Ollama](#), [DeepSeek](#), and [Streamlit](#), it enables you to extract, vectorize, and summarize web content—and interact with it through a smart chatbot. Perfect for researchers, analysts, and AI enthusiasts, WebMaster isn’t just another coding exercise; it’s a real-world solution to information overload.

---

## 🚀 Why WebMaster?

### The Problem

In today’s fast-paced digital age, countless websites contain valuable data and insights—but manually extracting and understanding this content is time-consuming and error-prone.
- **For researchers and analysts:** Sifting through lengthy articles and disparate data is inefficient.
- **For businesses:** Making sense of scattered online information can hinder strategic decisions.

### Our Solution

WebMaster addresses these challenges by:
- **Extracting Web Content:** Automatically crawling and gathering text from any URL.
- **Structuring Information:** Vectorizing and summarizing data to present clear, concise insights.
- **Deep Information Discovery:** Employing Retrieval-Augmented Generation (RAG) to uncover deeper, contextual details.
- **Interactive Q&A:** Offering a chatbot interface that lets you query and interact with the extracted content in real time.

---

## 🔑 Key Features

- **Website Extraction:**
Uses Crawl4AI to efficiently crawl and extract content from web pages.

- **Summarization:**
Generates detailed summaries of the extracted content—ideal for long articles or complex websites.

- **Embeddings & Retrieval:**
Creates embeddings using FAISS for intelligent document retrieval, overcoming open-source context window limitations.

- **Chatbot Interface:**
Provides a conversational agent for interactive Q&A, letting you explore your content seamlessly.

- **Dual AI Engine Support:**
Choose between Closed Source (OpenAI) and Open Source (Ollama) engines for both summarization and conversation to suit your needs.

---

## 🎯 Impact & Value

- **Real-World Problem Solving:**
Rather than being just a coding exercise, WebSage is designed as a business tool—for instance, helping freelancers manage data or enabling researchers to efficiently analyze academic content.

- **Quantifiable Benefits:**
- **Time Savings:** Automates extraction and summarization, potentially reducing manual analysis time by up to 35%.
- **Enhanced Insight:** The RAG approach enables deeper, context-aware retrieval of information.
- **Flexibility & Cost-Efficiency:** Supports both open and closed source AI engines, allowing for tailored, budget-friendly solutions.

---

## 🛠️ How to Use WebMaster

### Prerequisites

- **Python 3.8+**
- Required packages as listed in `requirements.txt`
- API keys or access tokens for AI engines (if using Closed Source models)

### Installation

Clone the repository and install dependencies:

```bash
git clone https://github.com/yourusername/websage.git
cd webmaster
pip install -r requirements.txt

```
## Configuration

Edit the `config.yaml` file to set your preferred options:

- **AI Engine Selection:**
Choose between OpenAI (Closed Source) and Ollama (Open Source) for summarization and chat.

- **FAISS Vector Database:**
Configure local vector database settings.

- **Other Parameters:**
Set URL input, output format, etc.

---

## Running the Application

Launch the Streamlit interface to start using WebSage:

```bash
streamlit run app.py
```

This opens a browser window where you can:

- **Enter a URL:** Trigger content extraction.
- **View Summaries:** Read concise, AI-generated summaries.
- **Chat with the Bot:** Ask follow-up questions and explore your content interactively.

## Project Structure


websage/
├── app.py # Streamlit web app entry point
├── config.yaml # Configuration file for API keys, DB settings, etc.
├── crawlers/ # Content extraction using Crawl4AI
├── summarizer/ # Modules for text summarization and embeddings creation
├── chatbot/ # Chatbot interface using RAG for Q&A
├── requirements.txt # Python dependencies
└── README.md # Project documentation

## 💬 Contributing
We welcome contributions! To get involved:

- Fork the repository.
- Create a feature branch.
- Submit a pull request with your changes.

For major contributions, please open an issue to discuss your ideas first.

## 📄 License
This project is open-source and available under the MIT License.

## 🙌 Final Thoughts
For early-career developers aiming to add meaningful projects to your GitHub profile, WebMaster demonstrates not only coding ability but also strong problem-solving skills. Focus on impact, not just output—one impactful project can be far more valuable than hundreds of clone apps.

## References
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)
- [Benchmarking DeepSeek R1 for Text Classification and Summarization](https://www.daniweb.com/programming/computer-science/tutorials/542973/benchmarking-deepseek-r1-for-text-classification-and-summarization)
- [FinGPT-Forecaster Model Comparison: Llama-3.1-8B vs DeepSeek-R1-Distill-Llama-8B](https://medium.com/%40zhutiancheng0611/fingpt-forecaster-model-comparison-llama-3-1-8b-vs-deepseek-r1-distill-llama-8b-682682f71d14)

```bash

Feel free to modify any section to suit your project's specifics or update links and images as needed.

```