
# Using Undetected ChromeDriver for Web Scraping

[![Promo](https://github.com/luminati-io/LinkedIn-Scraper/raw/main/Proxies%20and%20scrapers%20GitHub%20bonus%20banner.png)](https://brightdata.com/)

This guide explains how to use the Undetected ChromeDriver library for Python to bypass anti-bot systems for web scraping.

- [What Is Undetected ChromeDriver?](#what-is-undetected-chromedriver)
- [How It Works](#how-it-works)
- [Using Undetected ChromeDriver for Web Scraping: Step-by-Step Guide](#using-undetected-chromedriver-for-web-scraping-step-by-step-guide)
- [Advanced Usage of `undetected_chromedriver`](#advanced-usage-of-undetected_chromedriver)
- [Limitations of the `undetected_chromedriver` Library](#limitations-of-the-undetected_chromedriver-library)

## What Is Undetected ChromeDriver?

[Undetected ChromeDriver](https://github.com/ultrafunkamsterdam/undetected-chromedriver) is a Python library that offers a modified version of Selenium’s ChromeDriver. It minimizes browser "leaks" to reduce detection by anti-bot services like Imperva, DataDome, and Distil Networks, and can also help bypass some Cloudflare protections. This makes it especially useful for web scraping on sites with robust anti-scraping measures.

## How It Works

Undetected ChromeDriver minimizes detection by Cloudflare, Imperva, DataDome, and similar solutions through several techniques:

- **Variable Renaming**: It renames Selenium variables to mirror those used by genuine browsers.
- **Authentic User-Agent Strings**: It employs real-world User-Agent strings to avoid being flagged.
- **Simulated Human Interaction**: It allows for natural, human-like interactions.
- **Cookie & Session Management**: It properly manages cookies and sessions during browsing.
- **Proxy Support**: It enables the use of proxies to bypass IP blocking and rate limiting.

These strategies work together to help the browser controlled by the library effectively bypass anti-scraping defenses.
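To illustrate the kind of server-side fingerprinting these techniques are meant to defeat, here is a hypothetical, deliberately simplified check that flags requests whose User-Agent string contains a known automation marker. Real anti-bot vendors combine many more signals (TLS fingerprints, JavaScript challenges, mouse telemetry), and the marker list below is illustrative, not any vendor's actual logic:

```python
# Hypothetical server-side bot check. Real anti-bot solutions use far more
# signals; this only shows why realistic User-Agent strings matter.
AUTOMATION_MARKERS = ("HeadlessChrome", "PhantomJS", "python-requests")

def looks_automated(user_agent: str) -> bool:
    """Return True if the User-Agent contains a known automation marker."""
    return any(marker in user_agent for marker in AUTOMATION_MARKERS)

# A default headless Chrome UA exposes the "HeadlessChrome" token...
print(looks_automated("Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/120.0"))  # True

# ...while a realistic UA, like those undetected_chromedriver sets, does not
print(looks_automated("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))  # False
```

This is one reason swapping in an authentic User-Agent string, as the library does, reduces the chance of being flagged by simple header checks.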

## Using Undetected ChromeDriver for Web Scraping: Step-by-Step Guide

Many websites implement sophisticated anti-bot measures that are highly effective at detecting and blocking automated scripts, including web scraping bots.

Let's scrape the title and description from the following [GoDaddy product page](https://www.godaddy.com/hosting/wordpress-hosting):

![The GoDaddy target page](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-54-1024x494.png)

With plain Selenium in Python, your scraping script will look like this:

```python
# pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# configure a Chrome instance to start in headless mode
options = Options()
options.add_argument("--headless")

# create a Chrome web driver instance
driver = webdriver.Chrome(service=Service(), options=options)

# connect to the target page
driver.get("https://www.godaddy.com/hosting/wordpress-hosting")

# scraping logic...

# close the browser
driver.quit()
```

Running this script will fail because it will be blocked by an anti-bot solution (Akamai, in this case):

![An "Access Denied" page from GoDaddy](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-55.png)

To work around that, you need to use the `undetected_chromedriver` Python library.

### Step #1: Prerequisites and Project Setup

Undetected ChromeDriver has the following prerequisites:

- **Latest version of Chrome**
- **Python 3.6+**: If Python 3.6 or later is not installed on your machine, [download it from the official site](https://www.python.org/downloads/) and follow the installation instructions.

> **Note**:
>
> The library automatically downloads and patches the driver binary for you, so there is no need to manually download [`ChromeDriver`](https://developer.chrome.com/docs/chromedriver/downloads).

Now, use the following command to create a directory for your project:

```bash
mkdir undetected-chromedriver-scraper
```

The `undetected-chromedriver-scraper` directory will serve as the project folder for your Python scraper.

Navigate into it and initialize a [virtual environment](https://docs.python.org/3/library/venv.html):

```bash
cd undetected-chromedriver-scraper
python -m venv env
```

Open the project folder in your preferred Python IDE and create a `scraper.py` file inside it, following the structure shown below:

![scraper.py in the project folder](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-56.png)

Activate the virtual environment. On Linux or macOS, use:

```bash
source env/bin/activate
```

For Windows, run:

```bash
env\Scripts\activate
```

### Step #2: Install Undetected ChromeDriver

In an activated virtual environment, install Undetected ChromeDriver:

```bash
pip install undetected_chromedriver
```

### Step #3: Initial Setup

Import `undetected_chromedriver`:

```python
import undetected_chromedriver as uc
```

Initialize a Chrome WebDriver:

```python
driver = uc.Chrome()
```

Like Selenium, this tool launches a browser window that you can control using the Selenium API. The `driver` object supports all standard Selenium methods, plus some extra features.

> **Important**:
>
> The main distinction is that this patched Chrome driver is engineered to bypass certain anti-bot solutions.

Call the `quit()` method to close the driver:

```python
driver.quit()
```

Here is a basic Undetected ChromeDriver setup:

```python
import undetected_chromedriver as uc

# Initialize a Chrome instance
driver = uc.Chrome()

# Scraping logic...

# Close the browser and release its resources
driver.quit()
```

### Step #4: Use It for Web Scraping

Use the `get()` method to navigate the browser to your target page:

```python
driver.get("https://www.godaddy.com/hosting/wordpress-hosting")
```

Next, visit the page in incognito mode in your browser and inspect the element you want to scrape:

![The DevTools inspection of the HTML elements to scrape data with](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-57-1024x287.png)

Let's extract the product title, tagline, and description. Here is how you can scrape all of these:

```python
headline_element = driver.find_element(By.CSS_SELECTOR, "[data-cy=\"headline\"]")

title_element = headline_element.find_element(By.CSS_SELECTOR, "h1")
title = title_element.text

tagline_element = headline_element.find_element(By.CSS_SELECTOR, "h2")
tagline = tagline_element.text

description_element = headline_element.find_element(By.CSS_SELECTOR, "[data-cy=\"description\"]")
description = description_element.text
```

Import `By` from Selenium to make the above code work:

```python
from selenium.webdriver.common.by import By
```
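The selector logic above can be prototyped offline against a hypothetical HTML snippet using only the standard library. The markup below is a simplified stand-in for the real page, which is more complex:

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup mirroring the structure inspected above
SAMPLE_HTML = """
<div data-cy="headline">
  <h1>Managed WordPress Hosting</h1>
  <h2>Get WordPress hosting</h2>
  <p data-cy="description">We make it easier to create your site</p>
</div>
"""

class HeadlineParser(HTMLParser):
    """Collect text from h1, h2, and [data-cy="description"] elements
    found inside the [data-cy="headline"] container."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.current = None  # dictionary key currently being captured
        self.data = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("data-cy") == "headline":
            self.in_headline = True
        elif self.in_headline:
            if tag == "h1":
                self.current = "title"
            elif tag == "h2":
                self.current = "tagline"
            elif attrs.get("data-cy") == "description":
                self.current = "description"

    def handle_data(self, data):
        if self.current and data.strip():
            self.data[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None

parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
print(parser.data["title"])  # Managed WordPress Hosting
```

In the real script, the browser handles parsing and CSS selection for you; this sketch only shows the extraction logic in isolation.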

Store the scraped data in a Python dictionary:

```python
product = {
    "title": title,
    "tagline": tagline,
    "description": description
}
```

Finally, export the data to a JSON file:

```python
with open("product.json", "w") as json_file:
    json.dump(product, json_file, indent=4)
```

Import `json` from the Python standard library:

```python
import json
```

### Step #5: Put It All Together

This is the final scraping script:

```python
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import json

# Create a Chrome web driver instance
driver = uc.Chrome()

# Connect to the target page
driver.get("https://www.godaddy.com/hosting/wordpress-hosting")

# Scraping logic
headline_element = driver.find_element(By.CSS_SELECTOR, "[data-cy=\"headline\"]")

title_element = headline_element.find_element(By.CSS_SELECTOR, "h1")
title = title_element.text

tagline_element = headline_element.find_element(By.CSS_SELECTOR, "h2")
tagline = tagline_element.text

description_element = headline_element.find_element(By.CSS_SELECTOR, "[data-cy=\"description\"]")
description = description_element.text

# Populate a dictionary with the scraped data
product = {
    "title": title,
    "tagline": tagline,
    "description": description
}

# Export the scraped data to JSON
with open("product.json", "w") as json_file:
    json.dump(product, json_file, indent=4)

# Close the browser and release its resources
driver.quit()
```

Execute it:

```bash
python3 scraper.py
```

Or, on Windows:

```bash
python scraper.py
```

This will open a browser showing the target web page:

![a browser showing the target web page](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-58-1024x547.png)

The script will extract data from the page and produce the following `product.json` file:

```json
{
    "title": "Managed WordPress Hosting",
    "tagline": "Get WordPress hosting — simplified",
    "description": "We make it easier to create, launch, and manage your WordPress site"
}
```
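If you want to sanity-check the export logic without launching a browser, you can round-trip a sample dictionary through the `json` module (the values below are the ones scraped above):

```python
import json

product = {
    "title": "Managed WordPress Hosting",
    "tagline": "Get WordPress hosting — simplified",
    "description": "We make it easier to create, launch, and manage your WordPress site",
}

# Serialize the same way the scraper does, then parse it back
serialized = json.dumps(product, indent=4)
restored = json.loads(serialized)

assert restored == product
```

Note that by default `json.dump` escapes non-ASCII characters (the em dash in the tagline becomes `\u2014` in the file); pass `ensure_ascii=False` if you want the literal characters written out.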

## Advanced Usage of `undetected_chromedriver`

### Choosing a Specific Chrome Version

You can specify a particular version of Chrome for the library to use by setting the `version_main` argument:

```python
import undetected_chromedriver as uc

# Specify the target version of Chrome
driver = uc.Chrome(version_main=105)
```

The library also works with other Chromium-based browsers, but that requires some additional tweaking.

### The `with` Syntax

Use the [`with`](https://docs.python.org/3/reference/compound_stmts.html#with) syntax to avoid manually calling the `quit()` method when you no longer need the driver:

```python
import undetected_chromedriver as uc

with uc.Chrome() as driver:
    driver.get("https://www.godaddy.com/hosting/wordpress-hosting")
```

When the code inside the `with` block completes, Python will automatically close the browser for you.

> **Note**:
>
> This syntax is supported starting from version 3.1.0.
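The guarantee here is the standard Python context-manager protocol: `__exit__` runs even if the block raises. A minimal stand-in class (not the library's actual code) makes the behavior easy to verify without launching a browser:

```python
class FakeDriver:
    """Minimal stand-in for a driver that must be quit when done."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.quit()
        return False  # do not swallow exceptions raised in the block

    def quit(self):
        self.closed = True

with FakeDriver() as driver:
    pass  # scraping logic would go here

print(driver.closed)  # True: quit() ran automatically on block exit
```

The same cleanup happens on errors, which is why the `with` form is safer than remembering to call `quit()` manually.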

### Proxy Integration

The syntax for adding a proxy to Undetected ChromeDriver is similar to regular Selenium. Simply pass your proxy URL to the `--proxy-server` flag as shown below:

```python
import undetected_chromedriver as uc

proxy_url = ""  # your proxy URL

options = uc.ChromeOptions()
options.add_argument(f"--proxy-server={proxy_url}")

# Pass the configured options to the driver
driver = uc.Chrome(options=options)
```

> **Note**:
>
> Chrome does not support authenticated proxies through the `--proxy-server` flag.
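If your proxy URL embeds credentials (e.g. `http://user:pass@host:port`), you need to strip them before passing the URL to `--proxy-server`, since Chrome ignores the `user:pass` part. A small helper using only the standard library can split the URL; handling the credentials themselves then requires a separate mechanism, such as a browser extension or a proxy-aware tool:

```python
from urllib.parse import urlparse

def split_proxy_url(proxy_url: str):
    """Split an authenticated proxy URL into a Chrome-friendly
    --proxy-server value plus its username and password."""
    parsed = urlparse(proxy_url)
    server = f"{parsed.scheme}://{parsed.hostname}:{parsed.port}"
    return server, parsed.username, parsed.password

# Hypothetical proxy URL, for illustration only
server, user, password = split_proxy_url("http://myuser:mypass@proxy.example.com:8080")
print(server)  # http://proxy.example.com:8080
```

The `server` value is what you would pass to `--proxy-server`; `user` and `password` must be supplied through whatever authentication mechanism your setup supports.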

### Extended API

The `undetected_chromedriver` library has some extra methods that extend regular Selenium functionality:

- `WebElement.click_safe()`: Use it when clicking a link causes detection.
- `WebElement.children(tag=None, recursive=False)`: Use it to easily find child elements. For example:

```python
# Take the body child at index 6 (any tag), then collect all <img> elements inside it, recursively
images = body.children()[6].children("img", True)
```

## Limitations of the `undetected_chromedriver` Library

While `undetected_chromedriver` is a powerful Python library, it has some known limitations. Here are the most important ones to be aware of.

### IP Blocks

The GitHub page for the library clearly states: "This package does not hide your IP address". Running your script from a datacenter may still result in detection, and a poorly regarded home IP can also lead to blocks.

![IP Blocks Warning on GitHub](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-59.png)

To hide your IP, you must integrate the controlled browser with a proxy server, as demonstrated earlier.

### No Support for GUI Navigation

Due to how the module works, you need to navigate programmatically using the `get()` method. Avoid manual navigation through the browser GUI, as using your keyboard or mouse increases the risk of detection.

This rule also applies when managing new tabs. If you require multiple tabs, open a new one with a blank page by using the URL `data:,` (including the comma), which the driver accepts. Then, continue with your normal automation workflow.

Following these guidelines will help reduce detection and ensure smoother web scraping sessions.

### Limited Support for Headless Mode

Since version 3.4.5, the `undetected_chromedriver` library has included an experimental (read: not guaranteed) headless mode. Try it like this:

```python
driver = uc.Chrome(headless=True)
```

### Stability Issues

As noted on the package’s PyPI page, outcomes can vary due to many factors. While there's no guarantee of success, the developers continually work to understand and counter detection algorithms.

![Alert about unpredictable results on PyPI](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-60-1024x244.png)

This means a script that bypasses anti-bot measures like Distil, Cloudflare, Imperva, DataDome, or hCaptcha today might fail if these defenses are updated tomorrow:

![CAPTCHA triggered by Undetected ChromeDriver](https://github.com/luminati-io/undetected-chromedriver-web-scraping/blob/main/Images/image-61-1024x547.png)

The image above, taken from the official documentation, shows that even developer-provided scripts can sometimes trigger a CAPTCHA, potentially halting your automation.

## Conclusion

While Undetected ChromeDriver provides a patched ChromeDriver for web scraping, advanced anti-bot systems like Cloudflare can still block your scripts. The issue isn't with Selenium's API but with the browser's fingerprint. The more robust solution is a cloud-based, always-updated, scalable browser with built-in anti-bot capabilities, such as [Scraping Browser](https://brightdata.com/products/scraping-browser).

Create a free Bright Data account today to try out our scraping browser or test our proxies.