https://github.com/victorcezeh/understanding-web-scraping
Python web scraping projects using requests and BeautifulSoup.
https://github.com/victorcezeh/understanding-web-scraping
beautifulsoup git github python requests
Last synced: 8 months ago
JSON representation
Python web scraping projects using requests and BeautifulSoup.
- Host: GitHub
- URL: https://github.com/victorcezeh/understanding-web-scraping
- Owner: victorcezeh
- License: mit
- Created: 2025-07-28T19:41:47.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-07-28T22:48:19.000Z (11 months ago)
- Last Synced: 2025-07-29T00:25:25.174Z (11 months ago)
- Topics: beautifulsoup, git, github, python, requests
- Language: Python
- Homepage:
- Size: 42 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Web Scraping Projects
A collection of Python web scraping projects demonstrating various techniques using `requests`, `BeautifulSoup`, and `lxml`. These projects showcase fundamental web scraping concepts from basic single-page extraction to multi-page data collection.
📖 Table of Contents
- [Technologies Used](#️-technologies-used)
- [Project Structure](#-project-structure)
- [Projects Overview](#-projects-overview)
- [Key Learning Concepts](#-key-learning-concepts)
- [Getting Started](#-getting-started)
- [Usage Examples](#-usage-examples)
- [Skills Demonstrated](#-skills-demonstrated)
- [Target Websites](#-target-websites)
- [Notes](#-notes)
- [Ethical Considerations](#-ethical-considerations)
## 🛠️ Technologies Used
- **Python 3.x**
- **requests** - For making HTTP requests
- **BeautifulSoup4** - For parsing HTML content
- **lxml** - As the XML/HTML parser engine
## 📁 Project Structure
```
understanding-web-scraping/
├── practice-src-code/
│ ├── authors_scraper.py
│ ├── basic_page_scraper.py
│ ├── book_ratings_scraper.py
│ ├── multi_page_authors_scraper.py
│ ├── my_computer_image.jpg
│ ├── quotes_scraper.py
│ ├── top_tags_scraper.py
│ ├── wikipedia_elements_scraper.py
│ └── wikipedia_image_downloader.py
├── .gitignore
├── README.md
└── requirements.txt
```
## 🚀 Projects Overview
### 1. Book Ratings Scraper (`book_ratings_scraper.py`)
**Target Site:** [books.toscrape.com](http://books.toscrape.com)
**Objective:** Extract titles of all books with 2-star ratings across multiple pages (1-50)
- Demonstrates multi-page scraping
- CSS class-based element selection
- List data collection
### 2. Wikipedia Elements Scraper (`wikipedia_elements_scraper.py`)
**Target Site:** [Wikipedia - Grace Hopper](https://en.wikipedia.org/wiki/Grace_Hopper)
**Objective:** Extract specific elements from table of contents
- Shows targeted element selection
- Working with Wikipedia's CSS structure
- Text extraction techniques
### 3. Basic Page Scraper (`basic_page_scraper.py`)
**Target Site:** [example.com](https://example.com)
**Objective:** Extract page title and paragraph content
- Fundamental scraping concepts
- Basic HTML element selection
- Data type exploration
### 4. Wikipedia Image Downloader (`wikipedia_image_downloader.py`)
**Target Site:** [Wikipedia - Deep Blue (chess computer)](https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer))
**Objective:** Extract and download images from Wikipedia pages
- Image URL extraction
- Binary file downloading
- File I/O operations
### 5. Quotes Scraper (`quotes_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Collect all quotes from the first page
- CSS class selection
- Text content extraction
- List building
### 6. Authors Scraper (`authors_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Extract unique author names from the first page
- Set usage for duplicate removal
- Author name collection
- Unique data handling
### 7. Top Tags Scraper (`top_tags_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Extract the top ten tags from the homepage sidebar
- Sidebar content scraping
- Text cleaning with `.strip()`
- Popular content extraction
### 8. Multi-Page Authors Scraper (`multi_page_authors_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Extract all unique authors from all pages using two different approaches
- **Method 1:** Fixed range (when you know the total pages)
- **Method 2:** Dynamic detection (robust approach for unknown page count)
- Advanced pagination handling
- End-of-content detection
## 🎯 Key Learning Concepts
### Web Scraping Fundamentals
- Making HTTP requests with `requests.get()`
- Parsing HTML with `BeautifulSoup`
- CSS selector usage for element targeting
- Text extraction and cleaning
### Data Collection Strategies
- **Lists** - For ordered data collection
- **Sets** - For unique data collection (automatic duplicate removal)
- **File I/O** - For saving downloaded content
### Pagination Techniques
- **URL templating** with `.format()` for dynamic page URLs
- **Fixed range pagination** for known page counts
- **Dynamic pagination** with condition checking for unknown page counts
- **End detection** using content-based signals
## 🚀 Getting Started
### Prerequisites
```bash
pip install requests beautifulsoup4 lxml
```
### Running the Scripts
Each script can be run independently:
```bash
python book_ratings_scraper.py
python quotes_scraper.py
python multi_page_authors_scraper.py
# ... etc
```
## 📋 Usage Examples
### Basic Single Page Scraping
```python
import requests
import bs4
url = "https://example.com"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, "lxml")
title = soup.select("title")[0].get_text()
```
### Multi-Page Scraping with Known Page Count
```python
base_url = "https://site.com/page/{}/"
data = []
for page in range(1, 11): # Pages 1-10
url = base_url.format(page)
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, "lxml")
# Extract data...
```
### Dynamic Multi-Page Scraping
```python
page = 1
while True:
url = base_url.format(page)
response = requests.get(url)
if "No content found!" in response.text:
break
# Process page...
page += 1
```
## 🎓 Skills Demonstrated
- **HTTP Requests** - Fetching web page content
- **HTML Parsing** - Extracting structured data from web pages
- **CSS Selectors** - Targeting specific elements
- **Data Structures** - Using lists and sets effectively
- **File Operations** - Downloading and saving binary content
- **Control Flow** - Implementing various looping strategies
- **String Manipulation** - Cleaning and formatting extracted text
- **Pagination Logic** - Handling multi-page data sources
## 🔍 Target Websites
- **books.toscrape.com** - Demo bookstore for scraping practice
- **quotes.toscrape.com** - Demo quotes site for scraping practice
- **Wikipedia** - Real-world content extraction
- **example.com** - Basic HTML structure demonstration
## 📝 Notes
- All scripts include comprehensive comments and docstrings
- Code follows clean, readable Python practices
- Each project demonstrates different aspects of web scraping
- Projects progress from basic to more advanced techniques
- Suitable for learning and portfolio demonstration
## 🚨 Ethical Considerations
- Respect rate limits and server resources
- Use scraped data responsibly
- Consider the website's terms of service
- These projects use demo/educational websites designed for scraping practice
---
*This repository showcases fundamental web scraping techniques and serves as a learning resource for Python web scraping with requests and BeautifulSoup.*