https://github.com/cecile-hirschauer/ai_web_scraper

ollama-api python3 scraping-websites selenium-python streamlit


Host: GitHub
URL: https://github.com/cecile-hirschauer/ai_web_scraper
Owner: Cecile-Hirschauer
License: mit
Created: 2025-03-20T20:58:24.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-24T10:47:44.000Z (about 1 year ago)
Last Synced: 2025-08-08T21:53:07.713Z (11 months ago)
Topics: ollama-api, python3, scraping-websites, selenium-python, streamlit
Language: Python
Homepage:
Size: 30.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# AI Web Scraper

An intelligent web scraping solution that combines automated browser-based scraping with AI-powered content parsing and Google Sheets integration.

## Overview

AI Web Scraper is a Streamlit-based application that handles the complete web scraping workflow:

1. **Automated Scraping**: Drives a remote browser through Selenium and Bright Data's Scraping Browser, which handles anti-bot measures and CAPTCHAs
2. **Intelligent Parsing**: Uses a local LLM (large language model) served by Ollama to extract specific information from the scraped content
3. **Result Storage**: Stores results in both local cache and Google Sheets for easy access and sharing
4. **User-Friendly Interface**: Provides a clean web interface for all scraping operations
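
The parsing step can be sketched as follows. This is a minimal illustration, not the code in `parse.py`: it assumes Ollama is listening on its default local port and talks to its `/api/generate` endpoint with the standard library. The function names, the prompt wording, and the 6000-character chunk size are all hypothetical choices made for the example.

```python
import json
import urllib.request

# Assumption: Ollama's default local endpoint; adjust if your server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def split_into_chunks(text, max_chars=6000):
    """Split text into consecutive pieces of at most max_chars characters."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_payload(chunk, description, model="llama3"):
    """Build the JSON body for one non-streaming generate request."""
    # Hypothetical prompt wording; the repository's actual prompt may differ.
    prompt = (
        "Extract only the information that matches this description: "
        f"{description}\n"
        "Return an empty string if nothing matches.\n\n"
        f"Content:\n{chunk}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def parse_with_ollama(text, description):
    """Send each chunk to Ollama and join the non-empty answers."""
    answers = []
    for chunk in split_into_chunks(text):
        body = json.dumps(build_payload(chunk, description)).encode("utf-8")
        request = urllib.request.Request(
            OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(request) as response:
            answers.append(json.loads(response.read())["response"].strip())
    return "\n".join(answer for answer in answers if answer)
```

Splitting the page into chunks keeps each prompt within the model's context window, at the cost of the model never seeing the whole page at once.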

## Features

- **CAPTCHA Handling**: Delegates CAPTCHA solving to Bright Data's Scraping Browser
- **Intelligent Content Extraction**: Uses a local LLM to extract the information you describe from the scraped content
- **Caching System**: Caches scraped content locally to avoid redundant requests
- **Google Sheets Integration**: Stores and indexes scraped data for collaborative access
- **Search & Retrieve**: Finds previously parsed content through text search

## Requirements

- Python 3.8+
- Ollama installed locally, with the Llama 3 model pulled (`ollama pull llama3`)
- Bright Data account with Scraping Browser access
- Google Cloud Platform account (for Google Sheets integration)

## Installation

1. Clone this repository:

```
git clone https://github.com/cecile-hirschauer/ai_web_scraper.git
cd ai_web_scraper
```

2. Install dependencies:

```
pip install -r requirements.txt
```

3. Create a `.env` file in the project root with the following variables:

```
BRD_AUTH=your_bright_data_auth_key
GOOGLE_CREDENTIALS_FILE=credentials.json
```

4. For Google Sheets integration:
   - Create a service account in the Google Cloud Console
   - Download its credentials JSON file and save it as `credentials.json` in the project root
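
As a rough illustration of how the two `.env` variables might be read at startup, the sketch below parses the file with the standard library and fails early if either value is missing. The project may well rely on a library such as python-dotenv instead; `read_env_file` and `load_settings` are hypothetical names used only for this example.

```python
import os
from pathlib import Path

# The two variables named in the installation steps.
REQUIRED_KEYS = ("BRD_AUTH", "GOOGLE_CREDENTIALS_FILE")


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_settings(path=".env"):
    """Merge the file with the process environment and fail early on gaps."""
    # Variables already set in the environment take precedence over the file.
    settings = {**read_env_file(path), **os.environ}
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise RuntimeError(f"Missing settings: {', '.join(missing)}")
    return {key: settings[key] for key in REQUIRED_KEYS}
```

Checking for both keys before the app starts turns a missing credential into one clear error message rather than a failure midway through a scrape.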

## Usage

1. Start the Streamlit app:

```
streamlit run main.py
```

2. Open the web interface at `http://localhost:8501` (Streamlit's default port)

3. Enter a URL to scrape

4. Describe the information you want to extract

5. View and search parsed results in the app or Google Sheets

## Project Structure

- `main.py`: Streamlit web interface
- `scrape.py`: Web scraping functionality using Selenium
- `parse.py`: Content parsing using Ollama LLM
- `cache_manager.py`: Local caching system
- `gsheets_storage.py`: Google Sheets integration
- `find_sheet.py`: Utility to find available Google Sheets

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Uses [Bright Data](https://brightdata.com/) for CAPTCHA solving and proxy services
- Powered by [Ollama](https://ollama.ai/) for local LLM inference
- Built with [Streamlit](https://streamlit.io/) for the web interface
- Integrates with [Google Sheets API](https://developers.google.com/sheets/api) for data storage
