https://github.com/balaji1233/web_master
AI tool that transforms any URL into a structured knowledge source by extracting content with Crawl4AI, vectorizing and summarizing the data, running Retrieval-Augmented Generation (RAG) for deep information discovery, and enabling a smart chatbot for interactive Q&A.
crawl4ai deepseek-r1 docker faiss-vector-database ollama rag streamlit
Last synced: 7 months ago
- Host: GitHub
- URL: https://github.com/balaji1233/web_master
- Owner: balaji1233
- Created: 2025-03-05T18:58:42.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-03-10T18:21:49.000Z (7 months ago)
- Last Synced: 2025-03-10T19:25:34.884Z (7 months ago)
- Topics: crawl4ai, deepseek-r1, docker, faiss-vector-database, ollama, rag, streamlit
- Language: Python
- Homepage:
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# WEB_MASTER
**WebMaster** is an AI-driven tool that transforms any URL into a structured knowledge source. Built using [Crawl4AI](#), [Ollama](#), [DeepSeek](#), and [Streamlit](#), it extracts web content, vectorizes and summarizes it, runs Retrieval-Augmented Generation (RAG) for deep information discovery, and lets you explore the result through a chatbot for interactive Q&A. Aimed at researchers, analysts, and AI enthusiasts, WebMaster is not just another coding exercise; it is a practical answer to information overload.
---
## 🚀 Why WebMaster?
### The Problem
In today’s fast-paced digital age, countless websites contain valuable data and insights—but manually extracting and understanding this content is time-consuming and error-prone.
- **For researchers and analysts:** Sifting through lengthy articles and disparate data is inefficient.
- **For businesses:** Making sense of scattered online information can hinder strategic decisions.

### Our Solution
WebMaster addresses these challenges by:
- **Extracting Web Content:** Automatically crawling and gathering text from any URL.
- **Structuring Information:** Vectorizing and summarizing data to present clear, concise insights.
- **Deep Information Discovery:** Employing Retrieval-Augmented Generation (RAG) to uncover deeper, contextual details.
- **Interactive Q&A:** Offering a chatbot interface that lets you query and interact with the extracted content in real time.

---
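The extraction step could look roughly like the sketch below. It is not taken from the repository: `normalize_url` and `fetch_markdown` are hypothetical helper names, and the Crawl4AI calls (`AsyncWebCrawler`, `arun`, and the `success`, `error_message`, and `markdown` fields on the result) are assumed to match the version you have installed.

```python
import asyncio
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Add a default https:// scheme and reject anything that is not http(s)."""
    url = url.strip()
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a crawlable URL: {url!r}")
    return url


async def fetch_markdown(url: str) -> str:
    """Crawl a single page and return its content as Markdown text."""
    # Imported lazily so normalize_url works without Crawl4AI installed.
    # Assumption: the installed Crawl4AI exposes AsyncWebCrawler with arun().
    from crawl4ai import AsyncWebCrawler

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=normalize_url(url))
    if not result.success:
        raise RuntimeError(f"Crawl failed: {result.error_message}")
    return str(result.markdown)
```

With Crawl4AI installed and network access available, `asyncio.run(fetch_markdown("example.com"))` would return the page text that the later summarization and retrieval steps work on.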
## 🔑 Key Features
- **Website Extraction:**
Uses Crawl4AI to efficiently crawl and extract content from web pages.
- **Summarization:**
Generates detailed summaries of the extracted content, ideal for long articles or complex websites.
- **Embeddings & Retrieval:**
Indexes content embeddings in FAISS for intelligent document retrieval, so that only the most relevant passages are sent to the model. This helps work around the limited context windows of open-source models.
- **Chatbot Interface:**
Provides a conversational agent for interactive Q&A, letting you explore your content seamlessly.
- **Dual AI Engine Support:**
Choose between Closed Source (OpenAI) and Open Source (Ollama) engines for both summarization and conversation to suit your needs.

---
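To illustrate the retrieval idea behind these features, here is a minimal, dependency-free sketch of overlapping chunking, brute-force cosine-similarity search, and prompt assembly. It is a stand-in, not the project's code: the word-count `embed` function replaces a real embedding model, the linear scan in `top_k` replaces a FAISS index, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real setup would call a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query (exhaustive search)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a prompt that contains only the retrieved chunks as context."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Because only the top-ranked chunks go into the prompt, the model sees a few relevant passages instead of the whole page, which is how retrieval keeps requests within a small context window.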
## 🎯 Impact & Value
- **Real-World Problem Solving:**
Rather than being just a coding exercise, WebMaster is designed as a business tool, for instance helping freelancers manage data or enabling researchers to efficiently analyze academic content.
- **Quantifiable Benefits:**
  - **Time Savings:** Automates extraction and summarization, potentially reducing manual analysis time by up to 35%.
  - **Enhanced Insight:** The RAG approach enables deeper, context-aware retrieval of information.
  - **Flexibility & Cost-Efficiency:** Supports both open and closed source AI engines, allowing for tailored, budget-friendly solutions.

---
## 🛠️ How to Use WebMaster
### Prerequisites
- **Python 3.8+**
- Required packages as listed in `requirements.txt`
- API keys or access tokens for AI engines (if using Closed Source models)

### Installation
Clone the repository and install dependencies:
```bash
git clone https://github.com/balaji1233/web_master.git
cd web_master
pip install -r requirements.txt
```
## Configuration
Edit the `config.yaml` file to set your preferred options:
- **AI Engine Selection:**
Choose between OpenAI (Closed Source) and Ollama (Open Source) for summarization and chat.
- **FAISS Vector Database:**
Configure local vector database settings.
- **Other Parameters:**
Set URL input, output format, etc.

---
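The README does not show the schema of `config.yaml`, so the following is only a sketch of how such options might be validated once the file has been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). Every key name and default here (`summary_engine`, `chat_engine`, `faiss_index_path`, `top_k`) is hypothetical.

```python
from dataclasses import dataclass

# The two engine families named in this README.
SUPPORTED_ENGINES = {"openai", "ollama"}


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings; defaults favor the local, open-source path."""
    summary_engine: str = "ollama"
    chat_engine: str = "ollama"
    faiss_index_path: str = "index.faiss"
    top_k: int = 4


def load_settings(raw: dict) -> Settings:
    """Validate a parsed config mapping and fill in defaults for missing keys."""
    settings = Settings(**raw)  # unknown keys raise TypeError
    for name in ("summary_engine", "chat_engine"):
        engine = getattr(settings, name)
        if engine not in SUPPORTED_ENGINES:
            raise ValueError(f"{name} must be one of {sorted(SUPPORTED_ENGINES)}")
    if settings.top_k < 1:
        raise ValueError("top_k must be at least 1")
    return settings
```

Validating once at startup means a typo in the engine name fails immediately with a clear message rather than midway through a crawl.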
## Running the Application
Launch the Streamlit interface to start using WebMaster:
```bash
streamlit run app.py
```
This opens a browser window where you can:
- **Enter a URL:** Trigger content extraction.
- **View Summaries:** Read concise, AI-generated summaries.
- **Chat with the Bot:** Ask follow-up questions and explore your content interactively.

## Project Structure
web_master/
├── app.py # Streamlit web app entry point
├── config.yaml # Configuration file for API keys, DB settings, etc.
├── crawlers/ # Content extraction using Crawl4AI
├── summarizer/ # Modules for text summarization and embeddings creation
├── chatbot/ # Chatbot interface using RAG for Q&A
├── requirements.txt # Python dependencies
└── README.md # Project documentation

## 💬 Contributing
We welcome contributions! To get involved:

- Fork the repository.
- Create a feature branch.
- Submit a pull request with your changes.

For major contributions, please open an issue to discuss your ideas first.
## 📄 License
This project is open-source and available under the MIT License.

## 🙌 Final Thoughts
For early-career developers aiming to add meaningful projects to their GitHub profiles, WebMaster demonstrates not only coding ability but also strong problem-solving skills. Focus on impact, not just output: one impactful project can be far more valuable than hundreds of clone apps.

## References
- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/abs/2501.12948)
- [Benchmarking DeepSeek R1 for Text Classification and Summarization](https://www.daniweb.com/programming/computer-science/tutorials/542973/benchmarking-deepseek-r1-for-text-classification-and-summarization)
- [FinGPT-Forecaster Model Comparison: Llama-3.1-8B vs DeepSeek-R1-Distill-Llama-8B](https://medium.com/%40zhutiancheng0611/fingpt-forecaster-model-comparison-llama-3-1-8b-vs-deepseek-r1-distill-llama-8b-682682f71d14)