https://github.com/luminati-io/seleniumbase-web-scraping
Simplify web scraping with SeleniumBase, leveraging its user-friendly framework and advanced automation features to extract data seamlessly.
- Host: GitHub
- URL: https://github.com/luminati-io/seleniumbase-web-scraping
- Owner: luminati-io
- Created: 2025-01-08T10:01:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-01-15T13:46:48.000Z (over 1 year ago)
- Last Synced: 2025-03-22T07:02:03.781Z (about 1 year ago)
- Topics: python, selenium, seleniumbase, web-scraping
- Homepage: https://brightdata.com/blog/web-data/web-scraping-with-seleniumbase
- Size: 12.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Web Scraping With SeleniumBase
This step-by-step guide shows how to scrape websites with SeleniumBase and how to use its more advanced features. Interested in web scraping with plain Selenium? Check out [this guide](https://brightdata.com/blog/how-tos/using-selenium-for-web-scraping).
## What Is SeleniumBase?
SeleniumBase is a Python framework for browser automation, built on top of Selenium/WebDriver APIs. It supports tasks from testing to scraping and includes features like CAPTCHA bypassing and bot-detection avoidance.
## SeleniumBase vs Selenium: Feature and API Comparison
| Feature | SeleniumBase | Selenium |
|--------------------------|---------------------------------------------------|---------------------------------------------|
| Built-in test runners | Integrates with pytest, pynose, and behave | Requires manual setup for test integration |
| Driver management | Auto-downloads matching browser driver | Manual download and configuration |
| Web automation logic | Combines steps into single method call | Requires multiple lines of code |
| Selector handling | Auto-detects CSS or XPath selectors | Requires explicit selector types |
| Timeout handling | Default timeouts to prevent failures | Immediate failures without explicit timeouts|
| Error outputs | Clean, readable error messages | Verbose, less interpretable error logs |
| Dashboards and reports | Built-in dashboards, reports, and screenshots | No built-in dashboards or reporting |
| Desktop GUI applications | Visual tools for test running | Lacks desktop GUI tools |
| Test recorder | Built-in test recorder | Requires manual script writing |
| Test case management | Provides CasePlans | No built-in test case management |
| Data app support | Includes ChartMaker for data apps | No additional tools for data apps |
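
The "Selector handling" row can be illustrated with a small sketch. The heuristic below is an assumption made for illustration, not SeleniumBase's actual implementation: it treats a selector that begins with `/`, `./`, or `(` as XPath and everything else as CSS. The function name `guess_selector_type` is hypothetical.

```python
def guess_selector_type(selector: str) -> str:
    """Guess whether a selector string is XPath or CSS (illustrative heuristic)."""
    # XPath expressions typically start with "/", "./", or "(".
    if selector.startswith(("/", "./", "(")):
        return "xpath"
    # Anything else is assumed to be a CSS selector.
    return "css selector"


if __name__ == "__main__":
    for candidate in (".quote", "//div[@class='quote']", "(//a)[1]"):
        print(candidate, "->", guess_selector_type(candidate))
```

With plain Selenium, the caller makes this choice explicitly by passing `By.CSS_SELECTOR` or `By.XPATH` to `find_element`.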
## Using SeleniumBase for Web Scraping: Step-By-Step Guide
### Step #1: Project Initialization
```bash
mkdir seleniumbase-scraper
cd seleniumbase-scraper
python -m venv env
```
Activate the virtual environment:
- On Linux/macOS: `source env/bin/activate`
- On Windows: `env\Scripts\activate`
Install SeleniumBase:
```bash
pip install seleniumbase
```
### Step #2: SeleniumBase Test Setup
```python
from seleniumbase import SB

with SB() as sb:
    # Scraping logic goes here
    pass
```
Save the code as `scraper.py` and run it in headless mode:
```bash
python3 scraper.py --headless
```
### Step #3: Connect to the Target Page
```python
sb.open("https://quotes.toscrape.com/")
```
### Step #4: Select the Quote Elements
```python
quote_elements = sb.find_elements(".quote")
```
### Step #5: Scrape Quote Data
```python
from selenium.webdriver.common.by import By

for quote_element in quote_elements:
    # Quote text, without the surrounding curly quotation marks
    text_element = quote_element.find_element(By.CSS_SELECTOR, ".text")
    text = text_element.text.replace("“", "").replace("”", "")
    # Author name
    author_element = quote_element.find_element(By.CSS_SELECTOR, ".author")
    author = author_element.text
    # All tags attached to the quote
    tags = [tag.text for tag in quote_element.find_elements(By.CSS_SELECTOR, ".tag")]
```
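
The text-cleaning step can also be isolated in a small helper. This is a minimal sketch; the name `clean_quote_text` is hypothetical and not part of SeleniumBase. Unlike the chained `replace()` calls, it only trims curly quotes at the two ends of the string, which may be preferable when a quote contains nested quotation marks.

```python
def clean_quote_text(raw: str) -> str:
    """Remove the curly quotation marks that wrap a scraped quote."""
    # strip() only trims the listed characters from both ends,
    # so curly quotes inside the sentence are left untouched.
    return raw.strip().strip("“”")


if __name__ == "__main__":
    print(clean_quote_text("“A sample quote.”"))
```

Inside the loop, it would be called as `text = clean_quote_text(text_element.text)`.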
### Step #6: Populate the Quotes Array
```python
quotes.append({"text": text, "author": author, "tags": tags})
```
### Step #7: Implement Crawling Logic
```python
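# Scrape the first page before this loop, then follow "next" until it disappears
while sb.is_element_present(".next"):
    sb.click(".next a")
    # ... scrape the newly loaded page here ...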
```
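
One detail to watch is that the last page has no `.next` element but still contains quotes, so each page should be scraped before checking whether to continue. The sketch below separates that loop from the browser so it can be checked on its own. The function `crawl_pages`, its callback parameters, and the `max_pages` cap are all hypothetical names introduced for illustration.

```python
from typing import Callable, List


def crawl_pages(
    scrape_page: Callable[[], list],
    has_next: Callable[[], bool],
    go_next: Callable[[], None],
    max_pages: int = 100,
) -> List:
    """Scrape the current page, then keep following "next" until it is gone."""
    items: List = []
    for _ in range(max_pages):  # hard cap guards against endless loops
        items.extend(scrape_page())
        if not has_next():
            break  # the last page was scraped just above
        go_next()
    return items


if __name__ == "__main__":
    # Simulated three-page site, used only to exercise the loop.
    pages = [[{"n": 1}], [{"n": 2}], [{"n": 3}]]
    state = {"index": 0}

    def advance() -> None:
        state["index"] += 1

    result = crawl_pages(
        scrape_page=lambda: pages[state["index"]],
        has_next=lambda: state["index"] < len(pages) - 1,
        go_next=advance,
    )
    print(result)
```

With SeleniumBase, `has_next` could be `lambda: sb.is_element_present(".next")` and `go_next` could be `lambda: sb.click(".next a")`.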
### Step #8: Export the Scraped Data
```python
import csv

with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    for quote in quotes:
        # Join the tags into a single ";"-separated cell
        writer.writerow({"text": quote["text"], "author": quote["author"], "tags": ";".join(quote["tags"])})
```
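
Because the tags are joined with `;` on export, reading the file back requires splitting them again. The following is a minimal sketch of that reverse step using only the standard library; the helper name `read_quotes` is hypothetical.

```python
import csv
import io


def read_quotes(csv_file) -> list:
    """Load rows written by the export step and split the tags back into lists."""
    rows = []
    for row in csv.DictReader(csv_file):
        # An empty "tags" cell means the quote had no tags.
        row["tags"] = row["tags"].split(";") if row["tags"] else []
        rows.append(row)
    return rows


if __name__ == "__main__":
    # In-memory sample standing in for quotes.csv
    sample = io.StringIO(
        "text,author,tags\r\n"
        "A sample quote,Jane Doe,life;humor\r\n"
    )
    print(read_quotes(sample))
```

To load the real export, pass a file opened with `open("quotes.csv", newline="", encoding="utf-8")`.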
### Step #9: Put It All Together
```python
from seleniumbase import SB
from selenium.webdriver.common.by import By
import csv

with SB() as sb:
    sb.open("https://quotes.toscrape.com/")
    quotes = []

    while True:
        # Scrape every quote on the current page
        quote_elements = sb.find_elements(".quote")
        for quote_element in quote_elements:
            text_element = quote_element.find_element(By.CSS_SELECTOR, ".text")
            text = text_element.text.replace("“", "").replace("”", "")
            author_element = quote_element.find_element(By.CSS_SELECTOR, ".author")
            author = author_element.text
            tags = [tag.text for tag in quote_element.find_elements(By.CSS_SELECTOR, ".tag")]
            quotes.append({"text": text, "author": author, "tags": tags})

        # The last page has no "next" button, so stop after scraping it
        if not sb.is_element_present(".next"):
            break
        sb.click(".next a")

# Export the scraped quotes to CSV
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    for quote in quotes:
        writer.writerow({"text": quote["text"], "author": quote["author"], "tags": ";".join(quote["tags"])})
```
Run the scraper:
```bash
python3 scraper.py --headless
```
## Advanced SeleniumBase Scraping Use Cases
### Automate Form Filling and Submission
```python
from seleniumbase import BaseCase
BaseCase.main(__name__, __file__)


class LoginTest(BaseCase):
    def test_submit_login_form(self):
        # Open the login page and fill in the credentials
        self.open("https://quotes.toscrape.com/login")
        self.type("#username", "test")
        self.type("#password", "test")
        self.click("input[type=\"submit\"]")
        # Text that appears on the page after the form is submitted
        self.assert_text("Top Ten tags")
```
Run the test:
```bash
pytest login.py
```
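
In a scraper, the same steps can run outside a pytest test class. The sketch below assumes an already open SeleniumBase session is passed in as `sb`, and it uses the same `open`, `type`, `click`, and `assert_text` calls as the test above. The function name `submit_login_form` and its parameters are hypothetical.

```python
def submit_login_form(sb, username: str, password: str) -> None:
    """Fill in and submit the login form using an open SeleniumBase session."""
    sb.open("https://quotes.toscrape.com/login")
    sb.type("#username", username)
    sb.type("#password", password)
    sb.click('input[type="submit"]')
    # Raises an exception if the expected text does not appear after submitting.
    sb.assert_text("Top Ten tags")
```

Inside a `with SB() as sb:` block, call `submit_login_form(sb, "test", "test")` before scraping pages that require a login.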
### Bypass Simple Anti-Bot Technologies
```python
from seleniumbase import SB

with SB(uc=True) as sb:
    url = "https://www.scrapingcourse.com/antibot-challenge"
    # Open the page in UC Mode, reconnecting after a short delay
    sb.uc_open_with_reconnect(url, reconnect_time=4)
    # Click the CAPTCHA checkbox if one is shown
    sb.uc_gui_click_captcha()
    sb.save_screenshot("screenshot.png")
```
### Bypass Complex Anti-Bot Technologies
```python
from seleniumbase import SB

with SB(uc=True, test=True) as sb:
    url = "https://gitlab.com/users/sign_in"
    # Open the page in CDP Mode
    sb.activate_cdp_mode(url)
    sb.uc_gui_click_captcha()
    # Give the page time to load after the challenge
    sb.sleep(2)
    sb.save_screenshot("screenshot.png")
```
## Conclusion
SeleniumBase offers advanced features for web scraping, including UC Mode and CDP Mode for bypassing anti-bot measures. For more robust solutions, consider using cloud-based browsers like [Scraping Browser from Bright Data](https://brightdata.com/products/scraping-browser).