https://github.com/nouraalgohary/web-scraping
- Host: GitHub
- URL: https://github.com/nouraalgohary/web-scraping
- Owner: NouraAlgohary
- Created: 2023-12-20T18:10:05.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-21T07:26:41.000Z (over 2 years ago)
- Last Synced: 2025-06-19T04:07:39.060Z (about 1 year ago)
- Topics: pandas, selenium, webscraping
- Language: Python
- Homepage:
- Size: 56.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Toscrape
🛠️ Web Scraping Exploration with Selenium
Take a gentle dive into the basics of web scraping with this repository! Using Selenium, the project walks you through extracting data from books and quotes websites.
It's a simple yet effective exercise to get hands-on experience with web scraping techniques. The data collected is neatly organized into a CSV file, offering a practical glimpse into data processing.
Whether you're new to web scraping or just looking for a straightforward example, this repository provides a humble starting point for your exploration. Happy coding!
## [1. Books to Scrape](http://books.toscrape.com/)

## [2. Quotes to Scrape](https://quotes.toscrape.com/)

## Files
- [booksToScrape.csv](https://github.com/NouraAlgohary/Web-Scraping/blob/main/booksToScrape.csv): Books data as a CSV file
- [QuotesToScrape.csv](https://github.com/NouraAlgohary/Web-Scraping/blob/main/QuotesToScrape.csv): Quotes data as a CSV file
- [books_web_scraping.py](https://github.com/NouraAlgohary/Web-Scraping/blob/main/books_web_scraping.py): Scraping code for the books website
- [quotes_web_scraping.py](https://github.com/NouraAlgohary/Web-Scraping/blob/main/quotes_web_scraping.py): Scraping code for the quotes website
## Steps
### Setting Up Libraries
Selenium is a powerful web automation library for Python, widely used for web scraping and testing.
```pip install selenium```
Pandas is a versatile data manipulation library in Python, commonly employed for data analysis and storage, such as saving data to CSV files.
```pip install pandas```
### Getting Started
1. Create a webdriver instance
```
from selenium import webdriver
# Launch Chrome and open the Books to Scrape home page
driver = webdriver.Chrome()
url = "http://books.toscrape.com/"
driver.get(url)
```
2. A Chrome window should open and show the banner
```Chrome is being controlled by automated test software.```
### Explicit Waits
Use explicit waits for a smoother web scraping experience:
```
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# next_page_button_locator must be a (By, selector) tuple
try:
    # Explicitly wait for the next page button to be present
    WebDriverWait(driver, 20).until(EC.presence_of_element_located(next_page_button_locator))
    # Explicitly wait for the next page button to be clickable
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(next_page_button_locator))
    # Find the next page button and click it
    next_page_button = driver.find_element(*next_page_button_locator)
    next_page_button.click()
except Exception as e:
    print(f"Exception: {type(e).__name__} - {e}. Refreshing the page and retrying click.")
    driver.refresh()
```
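The `except` branch above refreshes the page but does not show the retry itself. One possible way to make the retry explicit is a small helper like the sketch below. The name `click_with_retry` and its parameters are hypothetical; they are not part of this repository or of Selenium. The helper only takes two callables, so it can wrap the wait-and-click logic and `driver.refresh`.

```python
def click_with_retry(click, on_failure, attempts=3):
    """Call click() until it succeeds, at most `attempts` times.

    on_failure() runs after each failed try that will be followed by another
    (for example, driver.refresh). Returns the 1-based attempt that succeeded
    and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            click()
            return attempt
        except Exception as exc:  # broad on purpose, like the snippet above
            print(f"Attempt {attempt} failed: {type(exc).__name__} - {exc}")
            if attempt == attempts:
                raise
            on_failure()


# Hypothetical usage with the names from the snippet above:
# click_with_retry(
#     lambda: WebDriverWait(driver, 20)
#     .until(EC.element_to_be_clickable(next_page_button_locator))
#     .click(),
#     driver.refresh,
# )
```

Capping the number of attempts keeps a permanently missing button (for example, on the last page) from turning into an endless refresh loop.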
### Data Extraction
Use the locator strategies on ```By``` to identify elements:
```
from selenium.webdriver.common.by import By
```
- ```find_element(By.CSS_SELECTOR, some_string)``` Finds an element by CSS selector. It replaces the older ```find_element_by_css_selector```
- ```find_element(By.XPATH, some_string)``` Finds an element by XPath. It replaces ```find_element_by_xpath```
- ```find_element(By.CLASS_NAME, some_string)``` Finds an element by class name. It replaces ```find_element_by_class_name```

The older ```find_element_by_*``` helpers were removed in Selenium 4.3. Each of the calls above returns an instance of ```WebElement```.
#### WebElement
- ```element.click()``` Clicks the element
- ```element.get_attribute('class')``` Returns the value of an attribute such as class or title
- ```element.text``` Returns the element's visible text
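Values read through `element.text` and `element.get_attribute('class')` come back as strings, so numeric fields usually need a conversion step before they go into a table. The helpers below are a minimal sketch of that step. `parse_rating` and `parse_price` are hypothetical names, and the input formats (`'star-rating Three'` for the class attribute, `'£51.77'` for the price text) are assumptions about the Books to Scrape markup rather than something this README states.

```python
import re

# Assumption: rating words as they appear in the class attribute on Books to Scrape
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr):
    """Turn a class string such as 'star-rating Three' into 3.

    Returns None when no known rating word is present.
    """
    for word in class_attr.split():
        if word in RATING_WORDS:
            return RATING_WORDS[word]
    return None


def parse_price(text):
    """Pull the first decimal number out of text such as '£51.77'.

    Returns None when the text contains no number.
    """
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None
```

Returning `None` for unexpected input keeps one odd row from stopping the whole scrape; the missing values can be inspected later in the data frame.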
### Store data
Save a list of lists as a data frame using Pandas
```
import pandas as pd
df = pd.DataFrame(books_list)
```
Save the data frame to a CSV file for further use
```
df.to_csv('path-to-folder/booksToScrape.csv', index=True)
```
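Without column names, `pd.DataFrame(books_list)` labels the columns 0, 1, 2, and so on. The sketch below shows one way to name them. The column names and the two sample rows are invented for illustration and are not taken from the repository's CSV files. It also passes `index=False`, which leaves out the row-index column that `index=True` writes.

```python
import pandas as pd

# Made-up sample rows standing in for the scraped data: [title, price, rating]
books_list = [
    ["Example Book A", 10.50, 3],
    ["Example Book B", 7.25, 5],
]

# Name the columns explicitly so the CSV header is readable
df = pd.DataFrame(books_list, columns=["title", "price", "rating"])

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

Passing a path as the first argument, as in the call above this sketch, writes the same content to disk.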
### Finally
Close the browser
```
driver.quit()
```
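If an exception interrupts the script before the last line, `driver.quit()` is never reached and the browser window can stay open. A hedged sketch of one way to guard against that is a small context manager. `managed_driver` and `factory` are hypothetical names, not part of Selenium or of this repository.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the with-body raises
        driver.quit()


# Hypothetical usage (requires Selenium and Chrome):
# from selenium import webdriver
# with managed_driver(webdriver.Chrome) as driver:
#     driver.get("http://books.toscrape.com/")
```

A plain `try` / `finally` around the scraping code achieves the same thing; the context manager just keeps the cleanup next to the driver creation.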