https://github.com/luminati-io/complex-navigation-scraping
Selenium for web scraping on sites with complex navigation patterns like dynamic pagination, infinite scrolling, and 'Load More' buttons.
https://github.com/luminati-io/complex-navigation-scraping
html pagination python selenium web-scraping
Last synced: about 1 month ago
JSON representation
Selenium for web scraping on sites with complex navigation patterns like dynamic pagination, infinite scrolling, and 'Load More' buttons.
- Host: GitHub
- URL: https://github.com/luminati-io/complex-navigation-scraping
- Owner: luminati-io
- Created: 2025-01-28T12:05:19.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-28T12:08:27.000Z (over 1 year ago)
- Last Synced: 2025-08-07T18:07:38.381Z (11 months ago)
- Topics: html, pagination, python, selenium, web-scraping
- Homepage: https://brightdata.com/blog/web-data/scraping-complex-navigation
- Size: 206 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Scraping Websites With Complex Navigation
[](https://brightdata.com/)
This guide explains how to use Selenium and browser automation to scrape websites with complex navigation patterns, such as dynamic pagination, infinite scrolling, and ‘Load More’ buttons, using Selenium and browser automation.
- [What Is Considered Complex Navigation?](#what-is-considered-complex-navigation)
- [Tools to Handle Complex Navigation Websites](#tools-to-handle-complex-navigation-websites)
- [Scraping Common Complex Navigation Patterns](#scraping-common-complex-navigation-patterns)
- [Dynamic Pagination](#dynamic-pagination)
- [‘Load More’ Button](#load-more-button)
- [Infinite Scrolling](#infinite-scrolling)
- [Conclusion](#conclusion)
## What Is Considered Complex Navigation?
In web scraping, complex navigation refers to website structures where content or pages are not easily accessible. Complex navigation scenarios often involve dynamic elements, asynchronous data loading, or user-driven interactions. These aspects may enhance user experiences, but they also significantly complicate data extraction. Here are some very common examples:
- **JavaScript-rendered navigation**: Websites that rely on JavaScript frameworks to generate content directly in the browser.
- **Paginated content**: Sites with data spread across multiple pages where pagination is loaded dynamically via AJAX.
- **Infinite scrolling**: Pages that load additional content dynamically as users scroll, typical for social media feeds, Discourse-based forums, and news websites.
- **Multi-level menus**: Sites with nested menus requiring multiple clicks or hover actions to reveal deeper layers of navigation, common for product category trees on marketplaces.
- **Interactive map interfaces**: Websites displaying data on maps or graphs, where information is dynamically loaded as users pan or zoom.
- **Tabs or accordions**: Pages with content hidden under dynamically rendered tabs or collapsible accordions that are not directly embedded in the HTML returned by the server.
- **Dynamic filters and sorting options**: Sites with complex filtering systems where applying multiple filters reloads the item listing dynamically without altering the URL structure.
## Tools to Handle Complex Navigation Websites
Many of the complex interactions listed above need JavaScript execution, something only a browser can do. This means you cannot rely on simple [HTML parsers](https://brightdata.com/blog/web-data/best-html-parsers) for such pages. Instead, you must use a browser automation tool like Selenium, Playwright, or Puppeteer. These solutions allow you to programmatically instruct a browser to perform specific actions on a web page, mimicking user behavior.
## Scraping Common Complex Navigation Patterns
This guides covers three specific types of complex navigation patterns:
- **Dynamic pagination**: Sites with paginated data loaded dynamically via AJAX.
- **‘Load More’ button**: A common JavaScript-based navigation example.
- **Infinite scrolling**: A page that continuously loads data as the user scrolls down.
We will use Selenium in Python, but the logic can be adapted to Playwright, Puppeteer, or any other browser automation tools. The guide also assumes that you are already familiar with the basics of [web scraping using Selenium](https://brightdata.com/blog/how-tos/using-selenium-for-web-scraping).
### Dynamic Pagination
We will use the “[Oscar Winning Films: AJAX and Javascript](https://www.scrapethissite.com/pages/ajax-javascript/#2014)” scraping sandbox:

This site dynamically loads Oscar-winning film data, paginated by year.
To navigate and scrape such a page effectively, you need to follow the following steps:
1. Click on a new year to trigger data loading (a loader element will appear).
2. Wait for the loader element to disappear (indicating the data has fully loaded).
3. Verify that the table with the data has been properly rendered on the page.
4. Scrape the data once it becomes available.
Below is an example of how to implement this logic using Selenium in Python:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
# Set up Chrome options for headless mode
options = Options()
options.add_argument("--headless")
# Create a Chrome web driver instance
driver = webdriver.Chrome(service=Service(), options=options)
# Connect to the target page
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
# Click the "2012" pagination button
element = driver.find_element(By.ID, "2012")
element.click()
# Wait until the loader is no longer visible
WebDriverWait(driver, 10).until(
lambda d: d.find_element(By.CSS_SELECTOR, "#loading").get_attribute("style") == "display: none;"
)
# Data should now be loaded...
# Wait for the table to be present on the page
WebDriverWait(driver, 10).until(
EC.presence_of_element_located((By.CSS_SELECTOR, ".table"))
)
# Where to store the scraped data
films = []
# Scrape data from the table
table_body = driver.find_element(By.CSS_SELECTOR, "#table-body")
rows = table_body.find_elements(By.CSS_SELECTOR, ".film")
for row in rows:
title = row.find_element(By.CSS_SELECTOR, ".film-title").text
nominations = row.find_element(By.CSS_SELECTOR, ".film-nominations").text
awards = row.find_element(By.CSS_SELECTOR, ".film-awards").text
best_picture_icon = row.find_element(By.CSS_SELECTOR, ".film-best-picture").find_elements(By.TAG_NAME, "i")
best_picture = True if best_picture_icon else False
# Store the scraped data
films.append({
"title": title,
"nominations": nominations,
"awards": awards,
"best_picture": best_picture
})
# Data export logic...
# Close the browser driver
driver.quit()
```
Here is the breakdown of that code snippet:
1. The code sets up a headless Chrome instance.
2. The script opens the target page and clicks the “2012” pagination button to trigger data loading.
3. Selenium waits for the loader to disappear using [`WebDriverWait()`](https://selenium-python.readthedocs.io/waits.html).
4. After the loader disappears, the script waits for the table to appear.
5. After the data is fully loaded, the script extracts details such as film titles, nominations, awards, and whether the film won Best Picture. The extracted information is then stored in a list of dictionaries.
The result will be:
```json
[
{
"title": "Argo",
"nominations": "7",
"awards": "3",
"best_picture": true
},
// ...
{
"title": "Curfew",
"nominations": "1",
"awards": "1",
"best_picture": false
}
]
```
Keep in mind that there isn’t always a single best approach to handling this navigation pattern. Alternative methods may be necessary depending on the page's behavior. Here are some examples:
* Use `WebDriverWait()` in combination with expected conditions to wait for specific HTML elements to appear or disappear.
* Monitor traffic for AJAX requests to detect when new content is fetched. This may involve using browser logging.
* Identify the API request triggered by pagination and make direct requests to fetch the data programmatically (e.g., using the [`requests` library](https://brightdata.com/blog/web-data/python-requests-guide)).
### ‘Load More’ Button
To illustrate JavaScript-based complex navigation scenarios involving user interactions, let's use an example of a 'Load More' button. The concept is straightforward: a list of items is displayed, and clicking the button loads additional items.
This time, the target site will be the [‘Load More’ example](https://www.scrapingcourse.com/button-click) page from the Scraping Course:

To handle this complex navigation scraping pattern, follow these steps:
1. Find the ‘Load More’ button and click it.
2. Wait for the new elements to load onto the page.
Here is the code to use with Selenium:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
# Set up Chrome options for headless mode
options = Options()
options.add_argument("--headless")
# Create a Chrome web driver instance
driver = webdriver.Chrome(options=options)
# Connect to the target page
driver.get("https://www.scrapingcourse.com/button-click")
# Collect the initial number of products
initial_product_count = len(driver.find_elements(By.CSS_SELECTOR, ".product-item"))
# Locate the "Load More" button and click it
load_more_button = driver.find_element(By.CSS_SELECTOR, "#load-more-btn")
load_more_button.click()
# Wait until the number of product items on the page has increased
WebDriverWait(driver, 10).until(lambda driver: len(driver.find_elements(By.CSS_SELECTOR, ".product-item")) > initial_product_count)
# Where to store the scraped data
products = []
# Scrape product details
product_elements = driver.find_elements(By.CSS_SELECTOR, ".product-item")
for product_element in product_elements:
# Extract product details
name = product_element.find_element(By.CSS_SELECTOR, ".product-name").text
image = product_element.find_element(By.CSS_SELECTOR, ".product-image").get_attribute("src")
price = product_element.find_element(By.CSS_SELECTOR, ".product-price").text
url = product_element.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
# Store the scraped data
products.append({
"name": name,
"image": image,
"price": price,
"url": url
})
# Data export logic...
# Close the browser driver
driver.quit()
```
To handle the 'Load More' button navigation pattern, the script:
1. Records the initial number of products on the page
2. Clicks the “Load More” button
3. Waits until the product count increases, confirming that new items have been added
This approach is both efficient and versatile, as it eliminates the need to know the exact number of elements to be loaded. However, alternative methods can also achieve similar results.
### Infinite Scrolling
Infinite scrolling is a popular interaction widely used on social media and e-commerce platforms to enhance user engagement. In this case, the target will be the same page as above but with [infinite scrolling instead of a ‘Load More’ button](https://www.scrapingcourse.com/infinite-scrolling):

Most browser automation tools do not provide a direct method for scrolling down or up a page, and Selenium is not an exception. Instead, you need to execute a JavaScript script on the page to perform the scrolling operation.
The solution is to write a custom JavaScript script that scrolls down:
1. A specified number of times, or
2. Until no more data is available to load.
> **Note**:\
> Each scroll loads new data and increments the number of elements on the page.
After that, you can scrape the newly loaded content.
Here is the code to use infinite scrolling in Selenium:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
# Set up Chrome options for headless mode
options = Options()
# options.add_argument("--headless")
# Create a Chrome web driver instance
driver = webdriver.Chrome(options=options)
# Connect to the target page with infinite scrolling
driver.get("https://www.scrapingcourse.com/infinite-scrolling")
# Current page height
scroll_height = driver.execute_script("return document.body.scrollHeight")
# Number of products on the page
product_count = len(driver.find_elements(By.CSS_SELECTOR, ".product-item"))
# Max number of scrolls
max_scrolls = 10
scroll_count = 1
# Limit the number of scrolls to 10
while scroll_count < max_scrolls:
# Scroll down
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
# Wait until the number of product items on the page has increased
WebDriverWait(driver, 10).until(lambda driver: len(driver.find_elements(By.CSS_SELECTOR, ".product-item")) > product_count)
# Update the product count
product_count = len(driver.find_elements(By.CSS_SELECTOR, ".product-item"))
# Get the new page height
new_scroll_height = driver.execute_script("return document.body.scrollHeight")
# If no new content has been loaded
if new_scroll_height == scroll_height:
break
# Update scroll height and increment scroll count
scroll_height = new_scroll_height
scroll_count += 1
# Scrape product details after infinite scrolling
products = []
product_elements = driver.find_elements(By.CSS_SELECTOR, ".product-item")
for product_element in product_elements:
# Extract product details
name = product_element.find_element(By.CSS_SELECTOR, ".product-name").text
image = product_element.find_element(By.CSS_SELECTOR, ".product-image").get_attribute("src")
price = product_element.find_element(By.CSS_SELECTOR, ".product-price").text
url = product_element.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
# Store the scraped data
products.append({
"name": name,
"image": image,
"price": price,
"url": url
})
# Export to CSV/JSON...
# Close the browser driver
driver.quit()
```
This script handles infinite scrolling by first identifying the current page height and product count. It limits the scrolling process to a maximum of 10 iterations. During each iteration, it:
1. Scrolls down to the bottom
2. Waits for the product count to increase (indicating new content has loaded)
3. Compares the page height to detect whether further content is available
If the page height remains unchanged after a scroll, the loop terminates, signaling that there is no more data to load.
## Conclusion
Web scraping can be challenging when complex navigation patterns are involved, businesses can make it even more difficult by employing anti-scraping measures to block automated scripts. Browser automation tools, like Selenium, cannot bypass those restrictions.
The solution is to use a cloud-based browser like [Scraping Browser](https://brightdata.com/products/scraping-browser) which integrates with Playwright, Puppeteer, Selenium, and other tools, automatically rotating IPs with each request. It can manage browser fingerprinting, retries, CAPTCHA solving, and more. Say goodbye to getting blocked when navigating complex sites!