Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/D4Vinci/Scrapling

Lightning-Fast, Adaptive Web Scraping for Python
https://github.com/D4Vinci/Scrapling

automation crawler crawling crawling-python css dom-manipulation hacktoberfest lxml playwright python python3 scraping selectors selenium stealth web-scraper web-scraping web-scraping-python webscraping xpath

Last synced: 2 months ago
JSON representation

Lightning-Fast, Adaptive Web Scraping for Python

Awesome Lists containing this project

README

        

# πŸ•·οΈ Scrapling: Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python
[![Tests](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml/badge.svg)](https://github.com/D4Vinci/Scrapling/actions/workflows/tests.yml) [![PyPI version](https://badge.fury.io/py/Scrapling.svg)](https://badge.fury.io/py/Scrapling) [![Supported Python versions](https://img.shields.io/pypi/pyversions/scrapling.svg)](https://pypi.org/project/scrapling/) [![PyPI Downloads](https://static.pepy.tech/badge/scrapling)](https://pepy.tech/project/scrapling)

Dealing with failing web scrapers due to anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity.

```python
>> from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher
# Fetch websites' source under the radar!
>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>> print(page.status)
200
>> products = page.css('.product', auto_save=True) # Scrape data that survives website design changes!
>> # Later, if the website structure changes, pass `auto_match=True`
>> products = page.css('.product', auto_match=True) # and Scrapling still finds them!
```

# Sponsors

[Evomi](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling) is your Swiss Quality Proxy Provider, starting at **$0.49/GB**

- πŸ‘©β€πŸ’» **$0.49 per GB Residential Proxies**: Our price is unbeatable
- πŸ‘©β€πŸ’» **24/7 Expert Support**: We will join your Slack Channel
- 🌍 **Global Presence**: Available in 150+ Countries
- ⚑ **Low Latency**
- πŸ”’ **Swiss Quality and Privacy**
- 🎁 **Free Trial**
- πŸ›‘οΈ **99.9% Uptime**
- 🀝 **Special IP Pool selection**: Optimize for fast, quality or quantity of ips
- πŸ”§ **Easy Integration**: Compatible with most software and programming languages

[![Evomi Banner](https://my.evomi.com/images/brand/cta.png)](https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling)
---

## Table of content
* [Key Features](#key-features)
* [Fetch websites as you prefer](#fetch-websites-as-you-prefer)
* [Adaptive Scraping](#adaptive-scraping)
* [Performance](#performance)
* [Developing Experience](#developing-experience)
* [Getting Started](#getting-started)
* [Parsing Performance](#parsing-performance)
* [Text Extraction Speed Test (5000 nested elements).](#text-extraction-speed-test-5000-nested-elements)
* [Extraction By Text Speed Test](#extraction-by-text-speed-test)
* [Installation](#installation)
* [Fetching Websites Features](#fetching-websites-features)
* [Fetcher](#fetcher)
* [StealthyFetcher](#stealthyfetcher)
* [PlayWrightFetcher](#playwrightfetcher)
* [Advanced Parsing Features](#advanced-parsing-features)
* [Smart Navigation](#smart-navigation)
* [Content-based Selection & Finding Similar Elements](#content-based-selection--finding-similar-elements)
* [Handling Structural Changes](#handling-structural-changes)
* [Real World Scenario](#real-world-scenario)
* [Find elements by filters](#find-elements-by-filters)
* [Is That All?](#is-that-all)
* [More Advanced Usage](#more-advanced-usage)
* [⚑ Enlightening Questions and FAQs](#-enlightening-questions-and-faqs)
* [How does auto-matching work?](#how-does-auto-matching-work)
* [How does the auto-matching work if I didn't pass a URL while initializing the Adaptor object?](#how-does-the-auto-matching-work-if-i-didnt-pass-a-url-while-initializing-the-adaptor-object)
* [If all things about an element can change or get removed, what are the unique properties to be saved?](#if-all-things-about-an-element-can-change-or-get-removed-what-are-the-unique-properties-to-be-saved)
* [I have enabled the `auto_save`/`auto_match` parameter while selecting and it got completely ignored with a warning message](#i-have-enabled-the-auto_saveauto_match-parameter-while-selecting-and-it-got-completely-ignored-with-a-warning-message)
* [I have done everything as the docs but the auto-matching didn't return anything, what's wrong?](#i-have-done-everything-as-the-docs-but-the-auto-matching-didnt-return-anything-whats-wrong)
* [Can Scrapling replace code built on top of BeautifulSoup4?](#can-scrapling-replace-code-built-on-top-of-beautifulsoup4)
* [Can Scrapling replace code built on top of AutoScraper?](#can-scrapling-replace-code-built-on-top-of-autoscraper)
* [Is Scrapling thread-safe?](#is-scrapling-thread-safe)
* [More Sponsors!](#more-sponsors)
* [Contributing](#contributing)
* [Disclaimer for Scrapling Project](#disclaimer-for-scrapling-project)
* [License](#license)
* [Acknowledgments](#acknowledgments)
* [Thanks and References](#thanks-and-references)
* [Known Issues](#known-issues)

## Key Features

### Fetch websites as you prefer
- **HTTP requests**: Stealthy and fast HTTP requests with `Fetcher`
- **Stealthy fetcher**: Annoying anti-bot protection? No problem! Scrapling can bypass almost all of them with `StealthyFetcher` with default configuration!
- **Your preferred browser**: Use your real browser with CDP, [NSTbrowser](https://app.nstbrowser.io/r/1vO5e5)'s browserless, PlayWright with stealth mode, or even vanilla PlayWright - All is possible with `PlayWrightFetcher`!

### Adaptive Scraping
- πŸ”„ **Smart Element Tracking**: Locate previously identified elements after website structure changes, using an intelligent similarity system and integrated storage.
- 🎯 **Flexible Querying**: Use CSS selectors, XPath, Elements filters, text search, or regex - chain them however you want!
- πŸ” **Find Similar Elements**: Automatically locate elements similar to the element you want on the page (Ex: other products like the product you found on the page).
- 🧠 **Smart Content Scraping**: Extract data from multiple websites without specific selectors using Scrapling powerful features.

### Performance
- πŸš€ **Lightning Fast**: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries (outperforming BeautifulSoup in parsing by up to 620x in our tests).
- πŸ”‹ **Memory Efficient**: Optimized data structures for minimal memory footprint.
- ⚑ **Fast JSON serialization**: 10x faster JSON serialization than the standard json library with more options.

### Developing Experience
- πŸ› οΈ **Powerful Navigation API**: Traverse the DOM tree easily in all directions and get the info you want (parent, ancestors, sibling, children, next/previous element, and more).
- 🧬 **Rich Text Processing**: All strings have built-in methods for regex matching, cleaning, and more. All elements' attributes are read-only dictionaries that are faster than standard dictionaries with added methods.
- πŸ“ **Automatic Selector Generation**: Create robust CSS/XPath selectors for any element.
- πŸ”Œ **API Similar to Scrapy/BeautifulSoup**: Familiar methods and similar pseudo-elements for Scrapy and BeautifulSoup users.
- πŸ“˜ **Type hints and test coverage**: Complete type coverage and almost full test coverage for better IDE support and fewer bugs, respectively.

## Getting Started

```python
from scrapling import Fetcher

fetcher = Fetcher(auto_match=False)

# Fetch a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all strings in the full page
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes, any of these methods will return a list of strings (TextHandlers)
quotes = page.css('.quote .text::text') # CSS selector
quotes = page.xpath('//span[@class="text"]/text()') # XPath
quotes = page.css('.quote').css('.text::text') # Chained selectors
quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above

# Get the first quote element
quote = page.css_first('.quote') # / page.css('.quote').first / page.css('.quote')[0]

# Tired of selectors? Use find_all/find
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...

# Working with elements
quote.html_content # Inner HTML
quote.prettify() # Prettified version of Inner HTML
quote.attrib # Element attributes
quote.path # DOM path to element (List)
```
To keep it simple, all methods can be chained on top of each other!

## Parsing Performance

Scrapling isn't just powerful - it's also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of seconds. All of that while focusing exclusively on parsing HTML documents.
Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

### Text Extraction Speed Test (5000 nested elements).

| # | Library | Time (ms) | vs Scrapling |
|---|:-----------------:|:---------:|:------------:|
| 1 | Scrapling | 5.44 | 1.0x |
| 2 | Parsel/Scrapy | 5.53 | 1.017x |
| 3 | Raw Lxml | 6.76 | 1.243x |
| 4 | PyQuery | 21.96 | 4.037x |
| 5 | Selectolax | 67.12 | 12.338x |
| 6 | BS4 with Lxml | 1307.03 | 240.263x |
| 7 | MechanicalSoup | 1322.64 | 243.132x |
| 8 | BS4 with html5lib | 3373.75 | 620.175x |

As you see, Scrapling is on par with Scrapy and slightly faster than Lxml which both libraries are built on top of. These are the closest results to Scrapling. PyQuery is also built on top of Lxml but still, Scrapling is 4 times faster.

### Extraction By Text Speed Test

| Library | Time (ms) | vs Scrapling |
|:-----------:|:---------:|:------------:|
| Scrapling | 2.51 | 1.0x |
| AutoScraper | 11.41 | 4.546x |

Scrapling can find elements with more methods and it returns full element `Adaptor` objects not only the text like AutoScraper. So, to make this test fair, both libraries will extract an element with text, find similar elements, and then extract the text content for all of them. As you see, Scrapling is still 4.5 times faster at the same task.

> All benchmarks' results are an average of 100 runs. See our [benchmarks.py](https://github.com/D4Vinci/Scrapling/blob/main/benchmarks.py) for methodology and to run your comparisons.

## Installation
Scrapling is a breeze to get started with - Starting from version 0.2, we require at least Python 3.8 to work.
```bash
pip3 install scrapling
```
- For using the `StealthyFetcher`, go to the command line and download the browser with
Windows OS

```bash
camoufox fetch --browserforge
```

MacOS

```bash
python3 -m camoufox fetch --browserforge
```

Linux

```bash
python -m camoufox fetch --browserforge
```
On a fresh installation of Linux, you may also need the following Firefox dependencies:
- Debian-based distros
```bash
sudo apt install -y libgtk-3-0 libx11-xcb1 libasound2
```
- Arch-based distros
```bash
sudo pacman -S gtk3 libx11 libxcb cairo libasound alsa-lib
```

See the official Camoufox documentation for more info on installation

- If you are going to use the `PlayWrightFetcher` options, then install Playwright's Chromium browser with:
```commandline
playwright install chromium
```
- If you are going to use normal requests only with the `Fetcher` class then update the fingerprints files with:
```commandline
python -m browserforge update
```

## Fetching Websites Features
You might be a little bit confused by now so let me clear things up. All fetcher-type classes are imported in the same way
```python
from scrapling import Fetcher, StealthyFetcher, PlayWrightFetcher
```
And all of them can take these initialization arguments: `auto_match`, `huge_tree`, `keep_comments`, `storage`, `storage_args`, and `debug` which are the same ones you give to the `Adaptor` class.

If you don't want to pass arguments to the generated `Adaptor` object and want to use the default values, you can use this import instead for cleaner code:
```python
from scrapling.default import Fetcher, StealthyFetcher, PlayWrightFetcher
```
then use it right away without initializing like:
```python
page = StealthyFetcher.fetch('https://example.com')
```

Also, the `Response` object returned from all fetchers is the same as `Adaptor` object except it has these added attributes: `status`, `reason`, `cookies`, `headers`, and `request_headers`. All `cookies`, `headers`, and `request_headers` are always of type `dictionary`.
> [!NOTE]
> The `auto_match` argument is enabled by default which is the one you should care about the most as you will see later.
### Fetcher
This class is built on top of [httpx](https://www.python-httpx.org/) with additional configuration options, here you can do `GET`, `POST`, `PUT`, and `DELETE` requests.

For all methods, you have `stealth_headers` which makes `Fetcher` create and use real browser's headers then create a referer header as if this request came from Google's search of this URL's domain. It's enabled by default.
```python
>> page = Fetcher().get('https://httpbin.org/get', stealth_headers=True, follow_redirects=True)
>> page = Fetcher().post('https://httpbin.org/post', data={'key': 'value'})
>> page = Fetcher().put('https://httpbin.org/put', data={'key': 'value'})
>> page = Fetcher().delete('https://httpbin.org/delete')
```
### StealthyFetcher
This class is built on top of [Camoufox](https://github.com/daijro/camoufox) which by default bypasses most of the anti-bot protections. Scrapling adds extra layers of flavors and configurations to increase performance and undetectability even further.
```python
>> page = StealthyFetcher().fetch('https://www.browserscan.net/bot-detection') # Running headless by default
>> page.status == 200
True
```
> Note: all requests done by this fetcher is waiting by default for all JS to be fully loaded and executed so you don't have to :)

For the sake of simplicity, expand this for the complete list of arguments

| Argument | Description | Optional |
|:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | βœ”οΈ |
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | βœ”οΈ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | βœ”οΈ |
| block_webrtc | Blocks WebRTC entirely. | βœ”οΈ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | βœ”οΈ |
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | βœ”οΈ |
| humanize | Humanize the cursor movement. Takes either True or the MAX duration in seconds of the cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | βœ”οΈ |
| allow_webgl | Whether to allow WebGL. To prevent leaks, only use this for special cases. | βœ”οΈ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | βœ”οΈ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | βœ”οΈ |
| wait_selector | Wait for a specific css selector to be in a specific state. | βœ”οΈ |
| proxy | The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | βœ”οΈ |
| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS. | βœ”οΈ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | βœ”οΈ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!

### PlayWrightFetcher
This class is built on top of [Playwright](https://playwright.dev/python/) which currently provides 4 main run options but they can be mixed as you want.
```python
>> page = PlayWrightFetcher().fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option
>> page.css_first("#search a::attr(href)")
'https://github.com/D4Vinci/Scrapling'
```
> Note: all requests done by this fetcher is waiting by default for all JS to be fully loaded and executed so you don't have to :)

Using this Fetcher class, you can make requests with:
1) Vanilla Playwright without any modifications other than the ones you chose.
2) Stealthy Playwright with the stealth mode I wrote for it. It's still a WIP but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Some of the things this fetcher's stealth mode does include:
* Patching the CDP runtime fingerprint.
* Mimics some of the real browsers' properties by injecting several JS files and using custom options.
* Using custom flags on launch to hide Playwright even more and make it faster.
* Generates real browser's headers of the same type and same user OS then append it to the request's headers.
3) Real browsers by passing the CDP URL of your browser to be controlled by the Fetcher and most of the options can be enabled on it.
4) [NSTBrowser](https://app.nstbrowser.io/r/1vO5e5)'s [docker browserless](https://hub.docker.com/r/nstbrowser/browserless) option by passing the CDP URL and enabling `nstbrowser_mode` option.

Add that to a lot of controlling/hiding options as you will see in the arguments list below.

Expand this for the complete list of arguments

| Argument | Description | Optional |
|:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**), or `False` for headful/visible mode. | βœ”οΈ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends but it made requests ~25% faster in my tests for some websites.
Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage but be careful with this option as it makes some websites never finish loading._ | βœ”οΈ |
| useragent | Pass a useragent string to be used. **Otherwise the fetcher will generate a real Useragent of the same browser and use it.** | βœ”οΈ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | βœ”οΈ |
| timeout | The timeout in milliseconds that is used in all operations and waits through the page. The default is 30000. | βœ”οΈ |
| page_action | Added for automation. A function that takes the `page` object, does the automation you need, then returns `page` again. | βœ”οΈ |
| wait_selector | Wait for a specific css selector to be in a specific state. | βœ”οΈ |
| wait_selector_state | The state to wait for the selector given with `wait_selector`. _Default state is `attached`._ | βœ”οΈ |
| google_search | Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search for this website's domain name. | βœ”οΈ |
| extra_headers | A dictionary of extra headers to add to the request. The referer set by the `google_search` argument takes priority over the referer set here if used together. | βœ”οΈ |
| proxy | The proxy to be used with requests, it can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | βœ”οΈ |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | βœ”οΈ |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | βœ”οΈ |
| stealth | Enables stealth mode, always check the documentation to see what stealth mode does currently. | βœ”οΈ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers/NSTBrowser through CDP. | βœ”οΈ |
| nstbrowser_mode | Enables NSTBrowser mode, **it have to be used with `cdp_url` argument or it will get completely ignored.** | βœ”οΈ |
| nstbrowser_config | The config you want to send with requests to the NSTBrowser. _If left empty, Scrapling defaults to an optimized NSTBrowser's docker browserless config._ | βœ”οΈ |

This list isn't final so expect a lot more additions and flexibility to be added in the next versions!

## Advanced Parsing Features
### Smart Navigation
```python
>>> quote.tag
'div'

>>> quote.parent

...'>

>>> quote.parent.tag
'div'

>>> quote.children
[β€œThe...' parent='

,
Tags: