https://github.com/drankrock/autoscrape
A 10-in-one scraper with parsers as plugins
- Host: GitHub
- URL: https://github.com/drankrock/autoscrape
- Owner: DrankRock
- License: mit
- Created: 2025-03-23T15:22:00.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-31T09:24:08.000Z (11 months ago)
- Last Synced: 2025-10-14T18:39:34.932Z (4 months ago)
- Topics: parsing, playwright, python3, scraping, selenium, seleniumbase
- Language: Python
- Homepage:
- Size: 224 KB
- Stars: 10
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

Scrape a list of URLs with 10 different technologies, automatically, and parse the results with your own custom plugin.
## Disclaimer
### About this project
This project was made using the infamous method of **vibe coding**, with Claude 3.7. I understand most of it, but the JavaScript using [Ulixee Hero](https://github.com/ulixee/hero) is not something I'm comfortable with. I highly encourage anyone who wants to modify it to do so. If anyone wants to fork it and maintain it regularly, be my guest, and I'll reference your fork here.
### About the testing environment
All testing was performed on Windows 11 with Python 3.12.4 and the latest versions of the supported browsers. Performance and compatibility with other operating systems cannot be guaranteed; users on Linux or macOS may need to modify certain components to get things working.
### About liability
This software is provided for personal use in a protected environment only. I cannot and will not be held responsible for any misuse, illegal use, or any damages that may occur from using this software. Users are solely responsible for ensuring they comply with all applicable laws, terms of service, and policies when using this tool. By downloading or using this software, you acknowledge that you assume all risks associated with its use.
## Software View

## Example
Comparison of headless vs. not-headless mode (screen recordings):
- Hero + Stealth, not headless
- Standard Selenium, headless
## Setup
Run the setup.bat script to:
1) Install Python 3 if it is not already installed
2) Install all the Python dependencies
3) Install Node.js / npm if not already installed
4) Install the npm dependencies
## Usage
To launch AutoScrape, either run autoscrape.bat or go into the Backend folder and run `python autoscrape.py`.
### Simple plug and play
1) Enter a URL in the URL list textbox
2) Choose your technology. Standard Selenium is fast but not very discreet; Ulixee Hero Stealth is very slow but very hard to detect. Every option has upsides and downsides.
3) Choose whether to run headless (in the background) or not (a browser window will open). Headless mode is easier for anti-bot technologies to detect.
4) Click Run
5) If the page was accessed and no Cloudflare page was detected, the HTML is saved in `Backend/scraped_html/`
### Advanced Usage
#### Input
A list of URLs can be used instead of a single URL, either by pasting it or by loading a .txt file with the Load URLs button. All URLs are *consumed* by the execution: each time one is scraped, it is removed from the list.
#### Scraping technologies
[Selenium](https://github.com/SeleniumHQ/selenium) is a WebDriver-based framework for browser automation.
* Selenium Standard is the default. It is very fast but not very stealthy: easily detected, but fine for basic usage or unprotected websites.
* Selenium Stealth is like Selenium Standard, but uses the [selenium_stealth](https://github.com/fedorenko22116/selenium-stealth) module, which makes it a bit harder to detect. selenium-stealth has not been updated in four years, though.
* Selenium Undetected uses a different chromedriver, [undetected-chromedriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver), made for stealth. It has not been updated in a year.
* Selenium Base uses [SeleniumBase](https://github.com/seleniumbase/SeleniumBase), a framework built on top of Selenium for scraping, and it is much better than plain Selenium. AutoScrape is not using SeleniumBase to its fullest; I might update this in the future.
[Ulixee Hero](https://github.com/ulixee/hero) is a browser built specifically for scraping. It fills the same role as Selenium here, but is far less detectable.
* Hero Standard is the default. It runs with normal settings.
* Hero Puppeteer runs Hero alongside [puppeteer](https://github.com/puppeteer/puppeteer), a Node.js API for controlling Chrome and Firefox that works well for scraping.
* Hero Extra runs Hero alongside [puppeteer-extra](https://github.com/berstend/puppeteer-extra), which enables plugin usage. This option automatically uses the [puppeteer-extra-plugin-stealth](https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth) plugin for basic undetectability.
* Hero Stealth is an enhanced version that uses multiple undetection plugins including:
* puppeteer-extra-plugin-stealth
* puppeteer-extra-plugin-anonymize-ua
* puppeteer-extra-plugin-block-resources
* puppeteer-extra-plugin-user-preferences
* puppeteer-extra-plugin-user-data-dir
* puppeteer-extra-plugin-font-size
* puppeteer-extra-plugin-click-and-wait
* puppeteer-extra-plugin-proxy (if configured)
* puppeteer-extra-plugin-random-user-agent
[Playwright](https://github.com/microsoft/playwright) is a framework made by Microsoft for web testing and automation.
* Playwright Standard is the basic experience.
* Playwright Puppeteer+Stealth is similar to Hero Extra, but uses [playwright-extra](https://github.com/berstend/puppeteer-extra/tree/master/packages/playwright-extra) instead of puppeteer-extra.
#### Human Behavior
Human Behavior is a tweak that adds some scrolling, clicking, etc. to make the session look more human, with a low-to-high setting. I have not tested it much; I advise against using it, and it is useless in headless mode.
#### Headless
Headless mode runs the browser without showing a window.
- **Headless (True)**: Way faster, and lets you use your computer while scraping happens in the background. Easier for websites to detect as a bot, though.
- **Not Headless (False)**: A browser window opens and takes over your screen, making it unusable while scraping. Slower, but much harder to detect. Use this for heavily protected websites.
#### Using Plugins
By default, AutoScrape only saves the raw HTML of scraped pages to the `Backend/scraped_html/` directory. To extract structured data:
1. Select a plugin from the dropdown menu in the interface
2. When you run the scraper, it will process the HTML with your selected plugin
3. Extracted data is saved as CSV files in `Backend/scraped_data/`
This lets you automatically extract specific information like prices, product details, or other structured data from the scraped websites.
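As a quick sanity check after a run, the resulting CSVs can be inspected with the standard library. This is a minimal sketch only: the file name below is a placeholder, and the column layout is assumed to follow the fields your plugin defines.

```python
import csv
from pathlib import Path

# Placeholder file name; actual CSVs are written by AutoScrape into Backend/scraped_data/.
csv_path = Path("Backend/scraped_data") / "example_output.csv"

with csv_path.open(newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Each column is assumed to correspond to a ScrapedField name (e.g. "title", "price").
        print(row)
```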
## Creating Custom Plugins
AutoScrape supports custom parser plugins that can extract specific data from the scraped HTML. These plugins process websites you scrape and output structured data.
### Plugin Structure
Plugins are Python classes with a specific interface. Each plugin must:
1. Be placed in the `Backend/plugins` directory
2. Import the necessary modules (`ScrapedField` and `DataType` from `templated_plugin`)
3. Implement all required interface methods
### Basic Plugin Template
```python
from dataclasses import dataclass
from typing import Any, List, Optional, Type, Union

from bs4 import BeautifulSoup

from templated_plugin import ScrapedField, DataType


class MyCustomPlugin:
    """Plugin that extracts specific data from a website."""

    def get_name(self) -> str:
        """Return the name of the plugin."""
        return "My Custom Plugin"

    def get_description(self) -> str:
        """Return a description of what the plugin extracts."""
        return "Extracts important data from my favorite website"

    def get_version(self) -> str:
        """Return the version of the plugin."""
        return "1.0.0"

    def get_available_fields(self) -> List[ScrapedField]:
        """Returns all possible fields this plugin can extract, with default values."""
        return [
            ScrapedField(
                name="title",
                value="Example Title",
                field_type=DataType.STRING,
                description="The title of the page",
                accumulate=True
            ),
            ScrapedField(
                name="price",
                value="$19.99",
                field_type=DataType.STRING,
                description="The price of the item",
                accumulate=True
            )
        ]

    def parse(self, html: str) -> List[ScrapedField]:
        """Parse HTML content and extract data."""
        soup = BeautifulSoup(html, 'html.parser')
        results = []

        # Extract title
        title_element = soup.select_one('h1.product-title')
        if title_element:
            results.append(ScrapedField(
                name="title",
                value=title_element.get_text().strip(),
                field_type=DataType.STRING,
                description="The title of the page",
                accumulate=True
            ))

        # Extract price
        price_element = soup.select_one('span.price')
        if price_element:
            results.append(ScrapedField(
                name="price",
                value=price_element.get_text().strip(),
                field_type=DataType.STRING,
                description="The price of the item",
                accumulate=True
            ))

        return results
```
### The ScrapedField Class
The `ScrapedField` class defines the data fields your plugin extracts:
- **name**: Identifier for the field
- **value**: The extracted value
- **field_type**: Data type (STRING, INTEGER, FLOAT, BOOLEAN, etc.)
- **description**: Human-readable description of the field
- **accumulate**: Whether to collect multiple values for this field across scrapes
### Data Types
Available data types from the `DataType` enum:
- `DataType.STRING`: For text values
- `DataType.INTEGER`: For whole numbers
- `DataType.FLOAT`: For decimal numbers
- `DataType.BOOLEAN`: For true/false values
- `DataType.JSON`: For structured data
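The actual `ScrapedField` and `DataType` definitions ship with AutoScrape in `templated_plugin`. As a rough mental model only, and not the real source, they can be pictured like this minimal sketch:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any


class DataType(Enum):
    # Assumed members, matching the names used throughout this README.
    STRING = auto()
    INTEGER = auto()
    FLOAT = auto()
    BOOLEAN = auto()
    JSON = auto()


@dataclass
class ScrapedField:
    name: str                 # identifier for the field
    value: Any                # extracted (or default) value
    field_type: DataType      # one of the DataType members above
    description: str = ""     # human-readable description
    accumulate: bool = False  # collect values for this field across scrapes
```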
### Advanced Plugin Example
```python
from typing import List

from bs4 import BeautifulSoup

from templated_plugin import ScrapedField, DataType


class CardmarketPricePlugin:
    """Plugin that extracts price information from Cardmarket pages."""

    # Global configuration flag controlling whether prices are stored as
    # floats or as formatted strings.
    STORE_PRICES_AS_FLOAT = False  # Set to True to store prices as float values without currency symbols

    def get_name(self) -> str:
        """Return the name of the plugin."""
        return "Cardmarket Price Plugin"

    def get_description(self) -> str:
        """Return a description of what the plugin extracts."""
        return "Extracts price information from Cardmarket product pages across different games and languages"

    def get_version(self) -> str:
        """Return the version of the plugin."""
        return "1.0.0"

    def get_available_fields(self) -> List[ScrapedField]:
        """Returns all possible fields this plugin can extract, with default values."""
        return [
            ScrapedField(
                name="card_name",
                value="Example Card",
                field_type=DataType.STRING,
                description="Name of the card",
                accumulate=True
            ),
            ScrapedField(
                name="card_set",
                value="Example Set",
                field_type=DataType.STRING,
                description="Set/expansion the card belongs to",
                accumulate=True
            ),
            ScrapedField(
                name="available_items",
                value=500,
                field_type=DataType.INTEGER,
                description="Number of available items for sale",
                accumulate=True
            ),
            ScrapedField(
                name="lowest_price",
                value=1.00 if self.STORE_PRICES_AS_FLOAT else "1,00 €",
                field_type=DataType.FLOAT if self.STORE_PRICES_AS_FLOAT else DataType.STRING,
                description="Lowest price available for the card",
                accumulate=True
            ),
            ScrapedField(
                name="card_rarity",
                value="Uncommon",
                field_type=DataType.STRING,
                description="Rarity of the card",
                accumulate=True
            )
        ]

    def _clean_price_string(self, price_string: str) -> str:
        """Clean and fix encoding issues in price strings."""
        if not price_string:
            return ""
        # Handle common encoding issues (mojibake from misdecoded currency symbols)
        cleaned = price_string.replace("â‚¬", "€")
        cleaned = cleaned.replace("Â£", "£")
        cleaned = cleaned.replace("Â$", "$")
        # Remove any extra whitespace
        cleaned = cleaned.strip()
        return cleaned

    def _parse_price_to_float(self, price_string: str) -> float:
        """Parse a price string into a float value, removing currency symbols."""
        if not price_string:
            return 0.0
        try:
            # Remove currency symbols and other non-numeric characters
            cleaned = ''.join(c for c in price_string if c.isdigit() or c in ',.').strip()
            # Handle European number format (comma as decimal separator)
            if ',' in cleaned and '.' in cleaned:
                # If both are present, assume European format with thousand separators
                cleaned = cleaned.replace('.', '')   # Remove thousand separators
                cleaned = cleaned.replace(',', '.')  # Convert decimal separator
            elif ',' in cleaned:
                # Only comma present, assume it's a decimal separator
                cleaned = cleaned.replace(',', '.')
            return float(cleaned)
        except ValueError:
            return 0.0

    def parse(self, html: str) -> List[ScrapedField]:
        """Parse HTML content and extract Cardmarket price information."""
        soup = BeautifulSoup(html, 'html.parser')
        results = []

        # Extract card name and set
        try:
            title_container = soup.select_one('.page-title-container')
            if title_container:
                h1 = title_container.select_one('h1')
                if h1:
                    # Extract main card name (text before the span)
                    card_name = h1.get_text().strip()
                    set_span = h1.select_one('span')
                    if set_span:
                        card_name = card_name.replace(set_span.get_text(), '').strip()
                        card_set = set_span.get_text().strip()
                        results.append(ScrapedField(
                            name="card_name",
                            value=card_name,
                            field_type=DataType.STRING,
                            description="Name of the card",
                            accumulate=True
                        ))
                        results.append(ScrapedField(
                            name="card_set",
                            value=card_set,
                            field_type=DataType.STRING,
                            description="Set/expansion the card belongs to",
                            accumulate=True
                        ))
        except Exception:
            # Continue even if card name extraction fails
            pass

        # Find the info container
        container = soup.select_one('.info-list-container')
        if not container:
            return results

        # Process prices, rarity, etc.
        # ... (additional extraction code)

        return results
```
### Using Your Plugin
Once you've created your plugin:
1. Place the Python file in the `Backend/plugins` directory
2. Restart AutoScrape
3. Your plugin will be automatically loaded and available for use
4. When scraping a website, your plugin will process the HTML and save structured data
The extracted data from plugins is saved in the `Backend/scraped_data/` directory in CSV format.
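Before dropping a new plugin into `Backend/plugins`, it can be handy to exercise it by hand against an HTML file saved by an earlier run. A minimal sketch, assuming the `MyCustomPlugin` class from the basic template above was saved as `my_custom_plugin.py` (a hypothetical file name), with a placeholder HTML path:

```python
from pathlib import Path

from my_custom_plugin import MyCustomPlugin  # hypothetical file name for the template above

# Any page previously saved by AutoScrape will do; this path is a placeholder.
html = Path("Backend/scraped_html/example_page.html").read_text(encoding="utf-8")

plugin = MyCustomPlugin()
print(plugin.get_name(), plugin.get_version())
for field in plugin.parse(html):
    print(f"{field.name}: {field.value!r}")
```

If the printed fields look right, the same file should load cleanly once it is placed in `Backend/plugins` and AutoScrape is restarted.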
## Why the cat
Isn't she adorable?