https://github.com/tejas-130704/webscraperai
WebScraperAI is a powerful tool that enables users to perform question-answering on website content using web scraping and retrieval-augmented generation (RAG) with LlamaIndex. It supports multiple LLMs, including OpenAI GPT-3.5, GPT-4, Gemini Pro, Gemini Ultra, and DeepSeek.
ai llms open-source python rag-pipeline streamlit web-scraping web-scraping-ai
Last synced: 12 months ago
- Host: GitHub
- URL: https://github.com/tejas-130704/webscraperai
- Owner: tejas-130704
- Created: 2025-02-23T07:35:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-25T04:21:31.000Z (over 1 year ago)
- Last Synced: 2025-03-20T18:46:15.693Z (over 1 year ago)
- Topics: ai, llms, open-source, python, rag-pipeline, streamlit, web-scraping, web-scraping-ai
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# WebScraperAI
## Overview
WebScraperAI is a web-based tool that allows users to perform question-answering on a given website URL. It supports multiple LLMs and has two modes of operation:
1. **Page-Specific Q&A**: Extracts information only from the given webpage.
2. **Deep Analysis Q&A**: Extracts information from the given page and all its linked pages (use cautiously, as it may take a long time for large websites).
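The difference between the two modes comes down to whether the links found on the page are followed. The sketch below illustrates this under stated assumptions: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup (which the project itself uses), and `PageParser` and `parse_page` are hypothetical names, not taken from the repository.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (text, same_site_links) for a single page."""
    parser = PageParser(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    same_site = [u for u in parser.links if urlparse(u).netloc == host]
    return " ".join(parser.text_parts), same_site


html = """<html><body><h1>Docs</h1><script>var x = 1;</script>
<a href="/about">About</a> <a href="https://other.example/x">Other</a></body></html>"""
text, links = parse_page(html, "https://example.com/index.html")
```

In this sketch, Page-Specific Q&A would index only `text`, while Deep Analysis would also fetch and parse each URL in `links`.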
The project uses **BeautifulSoup** for web scraping and a **RAG pipeline built with LlamaIndex and HuggingFace** to ground responses in the scraped content. Supported LLMs and providers include:
- OpenAI GPT-3.5
- OpenAI GPT-4
- Gemini Pro
- Gemini Ultra
- DeepSeek
- Groq
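To make the retrieval step of a RAG pipeline concrete, here is a toy sketch. It is not the project's LlamaIndex code: the bag-of-words vectors stand in for a real embedding model, and every function name (`chunk_text`, `vectorize`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, size=40):
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def vectorize(text):
    """Bag-of-words term counts; a toy stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the `top_k` chunks most similar to the question."""
    q = vectorize(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Assemble the text that would be sent to the selected LLM."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


chunks = [
    "Streamlit renders the user interface.",
    "BeautifulSoup parses the scraped HTML.",
    "Deep analysis follows linked pages.",
]
question = "Which library parses HTML?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
```

The point of the retrieval step is that only the most relevant chunks reach the model, which keeps the prompt short even when many pages have been scraped.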
## Features
- Extracts and analyzes website content for Q&A.
- Offers two modes: specific page analysis and deep analysis.
- Supports multiple LLMs for flexibility.
- Built with Streamlit for an interactive UI.
## Installation
Follow these steps to set up the project on your local machine:
### 1. Clone the Repository
```sh
git clone https://github.com/tejas-130704/WebScraperAI.git
cd WebScraperAI
```
### 2. Create a Virtual Environment
```sh
python -m venv venv
```
### 3. Activate the Virtual Environment
- **Windows:**
```sh
venv\Scripts\activate
```
- **Mac/Linux:**
```sh
source venv/bin/activate
```
### 4. Install Dependencies
```sh
pip install -r requirements.txt
```
## Usage
### 1. Run the Streamlit App
```sh
streamlit run app.py
```
### 2. Enter Details
- **Select Model**: Choose an LLM for processing.
- **Enter API Key**: Provide the API key for the selected LLM.
- **Enter Website URL**: Input the URL to analyze.
- **Choose Deep Analysis (Optional)**: Check this box if you want to analyze linked pages.
### 3. Load the Website and Ask a Question
- Click **Load Website & LLM** to start the process.
- After processing, enter a question related to the webpage and click **Ask Question**.
## Caution ⚠️
- **Use Deep Analysis Only on Limited-Scope Websites**: Avoid using it on large websites like Wikipedia, as following a high number of linked pages may cause extreme delays or failures.
- **Respect Website Policies**: Some sites may have anti-scraping policies. Always ensure compliance.
- **API Limits**: LLM responses are subject to API limits and costs depending on the provider.
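The first two cautions can be enforced in code rather than left to the user. The following is a minimal sketch, assuming a page cap and a `robots.txt` check are acceptable safeguards; the repository does not state that it implements either. `bounded_crawl`, `make_robots`, and the `get_links` callable are hypothetical, while `urllib.robotparser` is part of the Python standard library.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def make_robots(rules_text):
    """Build a robots.txt checker from already-downloaded rules text."""
    robots = RobotFileParser()
    robots.parse(rules_text.splitlines())
    return robots


def bounded_crawl(start_url, get_links, robots, max_pages=20, user_agent="*"):
    """Breadth-first crawl that stops after `max_pages` allowed pages.

    `get_links(url)` is a hypothetical callable returning the URLs linked
    from `url`; in a real scraper it would fetch and parse the page.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # skip pages the site asks crawlers not to fetch
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "site" so the sketch runs without network access.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/private/x"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
    "https://example.com/private/x": [],
}
robots = make_robots("User-agent: *\nDisallow: /private/")
visited = bounded_crawl(
    "https://example.com/", lambda u: site.get(u, []), robots, max_pages=3
)
```

With a cap like `max_pages`, the cost of Deep Analysis stays predictable even when the starting page links to a very large site.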
## Future Enhancements
- Implement caching to improve deep analysis speed.
- Add support for multi-threaded scraping.
- Introduce a ranking system for LLM performance comparison.
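As an illustration of what the first two enhancements might look like, here is a sketch that combines an in-memory cache with a thread pool. It is one possible design, not a planned implementation from the repository; `CachingFetcher`, `fetch_all`, and the injected `fetch` callable are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CachingFetcher:
    """Wraps a fetch function so each URL is downloaded at most once."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable: url -> page content
        self._cache = {}
        self._lock = threading.Lock()
        self.fetch_count = 0  # number of real (uncached) fetches

    def get(self, url):
        with self._lock:
            if url in self._cache:
                return self._cache[url]
        page = self._fetch(url)  # the slow call happens outside the lock
        with self._lock:
            self._cache[url] = page
            self.fetch_count += 1
        return page


def fetch_all(urls, fetcher, max_workers=4):
    """Fetch URLs concurrently and return a {url: page} mapping."""
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(fetcher.get, unique))
    return dict(zip(unique, pages))


def fake_fetch(url):
    """Stand-in for a network request so the sketch runs offline."""
    return f"<html>{url}</html>"


fetcher = CachingFetcher(fake_fetch)
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
pages = fetch_all(urls, fetcher)
pages_again = fetch_all(urls, fetcher)  # served entirely from the cache
```

Threads suit this workload because scraping is dominated by waiting on network responses rather than by computation.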
## Contributing
Pull requests are welcome! If you find any issues, feel free to open an issue in the repository.