https://github.com/umutkayash/ai-scraper
AI Scraper
beautifulsoup4 html5lib langchain langchain-ollama lxml ollama ollama-app python python-dotenv python3 selenium selenium-webdriver streamlit
- Host: GitHub
- URL: https://github.com/umutkayash/ai-scraper
- Owner: umutkayash
- License: mit
- Created: 2025-01-01T00:48:44.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-01T00:52:19.000Z (over 1 year ago)
- Last Synced: 2025-02-22T02:26:45.374Z (over 1 year ago)
- Topics: beautifulsoup4, html5lib, langchain, langchain-ollama, lxml, ollama, ollama-app, python, python-dotenv, python3, selenium, selenium-webdriver, streamlit
- Language: Python
- Homepage:
- Size: 8.88 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# AI Scraper Project
Welcome to the AI Scraper project repository! This project uses Python, Selenium, BeautifulSoup, and a language model served through Ollama to scrape, parse, and extract information from web pages.
## Project Overview
The AI Scraper is designed to handle complex web scraping tasks, including captcha solving, HTML parsing, and structured data extraction driven by a language model.
## Prerequisites
Before you begin, ensure you have the following installed:
- Python 3.8 or higher
- pip (Python package installer)
## Installation
1. **Clone the Repository:**
```bash
git clone https://github.com/umutkayash/AI-Scraper.git
cd AI-Scraper
```
2. **Set Up a Virtual Environment (recommended):**
```bash
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
3. **Install Dependencies:**
```bash
pip install -r requirements.txt
```
## Configuration
- Set the `WEBDRIVER_URL` environment variable to the URL of your Selenium WebDriver. One way to do this is to create a `.env` file in the project root with the following content:
```
WEBDRIVER_URL="your_webdriver_url_here"
```
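As a rough illustration of how a script might pick this value up, here is a minimal sketch. It assumes the variable is loaded with `python-dotenv` (listed among the project's topics); the helper name `get_webdriver_url` is hypothetical and does not come from the repository.

```python
import os
from typing import Mapping, Optional

try:
    # python-dotenv is optional here; without it, only real environment
    # variables are consulted.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def get_webdriver_url(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return WEBDRIVER_URL, failing early with a clear message if it is unset.

    Hypothetical helper: pass a mapping to bypass .env loading (useful in tests).
    """
    if environ is None:
        if load_dotenv is not None:
            load_dotenv()  # reads a .env file found near the working directory
        environ = os.environ
    url = environ.get("WEBDRIVER_URL", "").strip()
    if not url:
        raise RuntimeError(
            "WEBDRIVER_URL is not set; add it to .env or export it in the shell"
        )
    return url
```

Failing early with an explicit message is a design choice of this sketch; it tends to be easier to debug than a connection error raised later by Selenium.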
## Usage
To run the scraper:
1. **Activate your virtual environment if not already activated:**
```bash
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
2. **Run the Scraper:**
```bash
python main.py
```
Replace `main.py` with the script you wish to run.
## How It Works
The AI Scraper performs the following steps:
- Connects to a web page using Selenium.
- Handles any captchas using configured settings.
- Extracts HTML content and parses it using BeautifulSoup.
- Segments the HTML content if necessary.
- Uses a model served by Ollama to extract specific information based on user-defined criteria.
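
The steps above could be sketched roughly as follows. This is not the repository's actual code: the function names, the prompt wording, the `llama3` model name, the 6000-character chunk size, and the use of a remote Chrome session are all assumptions made for illustration. Captcha handling is left out because its configuration is not described here.

```python
from typing import List

# Hypothetical prompt; the project's real wording may differ.
PROMPT_TEMPLATE = (
    "Extract only the information matching this description: "
    "{parse_description}\n"
    "Return an empty string if nothing matches.\n\n"
    "Page content:\n{dom_content}"
)


def fetch_html(url: str, webdriver_url: str) -> str:
    """Load a page through a remote WebDriver and return its HTML source."""
    from selenium import webdriver  # imported lazily; assumes Selenium 4

    driver = webdriver.Remote(
        command_executor=webdriver_url,
        options=webdriver.ChromeOptions(),  # assumption: a Chrome session
    )
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser session


def clean_html(html: str) -> str:
    """Drop scripts and styles, then return the visible text line by line."""
    from bs4 import BeautifulSoup  # imported lazily; from beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def split_into_chunks(text: str, max_length: int = 6000) -> List[str]:
    """Segment text into pieces of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]


def extract_with_ollama(
    chunks: List[str], parse_description: str, model: str = "llama3"
) -> str:
    """Run each chunk through a local Ollama model and join the answers."""
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_ollama import OllamaLLM

    chain = ChatPromptTemplate.from_template(PROMPT_TEMPLATE) | OllamaLLM(model=model)
    answers = [
        chain.invoke({"dom_content": chunk, "parse_description": parse_description})
        for chunk in chunks
    ]
    return "\n".join(answers)
```

A caller would chain these in order: `fetch_html`, then `clean_html`, then `split_into_chunks`, then `extract_with_ollama`. Splitting on a fixed character count is the simplest way to stay within a model's context window, though it can cut a sentence in half at a chunk boundary.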
## Contributing
Contributions are welcome! Please fork the repository and create a pull request with your changes.
## License
This project is licensed under the MIT License - see the LICENSE file for details.