https://github.com/victorcezeh/understanding-web-scraping

Python web scraping projects using requests and BeautifulSoup.

beautifulsoup git github python requests

Last synced: 9 months ago

Host: GitHub
URL: https://github.com/victorcezeh/understanding-web-scraping
Owner: victorcezeh
License: mit
Created: 2025-07-28T19:41:47.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-28T22:48:19.000Z (about 1 year ago)
Last Synced: 2025-07-29T00:25:25.174Z (about 1 year ago)
Topics: beautifulsoup, git, github, python, requests
Language: Python
Homepage:
Size: 42 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Web Scraping Projects

A collection of Python web scraping projects demonstrating various techniques using `requests`, `BeautifulSoup`, and `lxml`. These projects showcase fundamental web scraping concepts from basic single-page extraction to multi-page data collection.

## 📖 Table of Contents

- [Technologies Used](#️-technologies-used)
- [Project Structure](#-project-structure)
- [Projects Overview](#-projects-overview)
- [Key Learning Concepts](#-key-learning-concepts)
- [Getting Started](#-getting-started)
- [Usage Examples](#-usage-examples)
- [Skills Demonstrated](#-skills-demonstrated)
- [Target Websites](#-target-websites)
- [Notes](#-notes)
- [Ethical Considerations](#-ethical-considerations)

## 🛠️ Technologies Used

- **Python 3.x**
- **requests** - For making HTTP requests
- **BeautifulSoup4** - For parsing HTML content
- **lxml** - The HTML/XML parser backend that BeautifulSoup uses

## 📁 Project Structure

```
understanding-web-scraping/
├── practice-src-code/
│   ├── authors_scraper.py
│   ├── basic_page_scraper.py
│   ├── book_ratings_scraper.py
│   ├── multi_page_authors_scraper.py
│   ├── my_computer_image.jpg
│   ├── quotes_scraper.py
│   ├── top_tags_scraper.py
│   ├── wikipedia_elements_scraper.py
│   └── wikipedia_image_downloader.py
├── .gitignore
├── README.md
└── requirements.txt
```

## 🚀 Projects Overview

### 1. Book Ratings Scraper (`book_ratings_scraper.py`)
**Target Site:** [books.toscrape.com](http://books.toscrape.com)
**Objective:** Extract titles of all books with 2-star ratings across multiple pages (1-50)
- Demonstrates multi-page scraping
- CSS class-based element selection
- List data collection
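
The rating filter can be sketched roughly as follows. This is a minimal illustration rather than the repository's script: it assumes each book sits in an `article.product_pod` element, that the rating is encoded as a class on a `p.star-rating` tag, and that the full title is stored in the link's `title` attribute. The names `two_star_titles` and `SAMPLE_HTML` are hypothetical, and the built-in `html.parser` is used so the sketch runs without `lxml`.

```python
import bs4


def two_star_titles(html):
    """Return titles of books whose rating element carries the 'Two' class."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    titles = []
    for book in soup.select("article.product_pod"):
        # Assumed markup: <p class="star-rating Two"> marks a 2-star book.
        if book.select_one("p.star-rating.Two"):
            titles.append(book.select_one("h3 a")["title"])
    return titles


# Hypothetical sample markup, for illustration only.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Two"></p>
  <h3><a href="#" title="Book A">Book A</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Five"></p>
  <h3><a href="#" title="Book B">Book B</a></h3>
</article>
"""

print(two_star_titles(SAMPLE_HTML))  # ['Book A']
```

In a multi-page run, a function like this would be called once per fetched page, with the results extended into one shared list.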

### 2. Wikipedia Elements Scraper (`wikipedia_elements_scraper.py`)
**Target Site:** [Wikipedia - Grace Hopper](https://en.wikipedia.org/wiki/Grace_Hopper)
**Objective:** Extract specific elements from the table of contents
- Shows targeted element selection
- Working with Wikipedia's CSS structure
- Text extraction techniques

### 3. Basic Page Scraper (`basic_page_scraper.py`)
**Target Site:** [example.com](https://example.com)
**Objective:** Extract page title and paragraph content
- Fundamental scraping concepts
- Basic HTML element selection
- Data type exploration

### 4. Wikipedia Image Downloader (`wikipedia_image_downloader.py`)
**Target Site:** [Wikipedia - Deep Blue (chess computer)](https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer))
**Objective:** Extract and download images from Wikipedia pages
- Image URL extraction
- Binary file downloading
- File I/O operations
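
The three steps above (URL extraction, binary download, file write) might look like the sketch below. It is an assumption-laden illustration, not the repository's code: `image_urls`, `save_binary`, the sample markup, and the `upload.example.org` host are all hypothetical, and the bytes written are a stand-in for what `requests.get(url).content` would return.

```python
import os
import tempfile
from urllib.parse import urljoin

import bs4


def image_urls(html, page_url):
    """Return absolute URLs for every <img> that has a src attribute."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # urljoin resolves relative and protocol-relative ("//host/...") sources.
    return [urljoin(page_url, img["src"]) for img in soup.select("img[src]")]


def save_binary(data, path):
    """Write raw bytes to disk; mode "wb" avoids any text decoding."""
    with open(path, "wb") as handle:
        handle.write(data)


# Hypothetical sample markup and host, for illustration only.
PAGE_URL = "https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer)"
SAMPLE_HTML = '<img src="//upload.example.org/deep_blue.jpg"><img alt="no source">'

urls = image_urls(SAMPLE_HTML, PAGE_URL)
print(urls)  # ['https://upload.example.org/deep_blue.jpg']

# In a real download the bytes would come from requests.get(url).content.
target = os.path.join(tempfile.mkdtemp(), "image.jpg")
save_binary(b"\xff\xd8fake-jpeg-bytes", target)
```

Writing `response.content` (bytes) rather than `response.text` (decoded string) is what keeps the image file intact.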

### 5. Quotes Scraper (`quotes_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Collect all quotes from the first page
- CSS class selection
- Text content extraction
- List building

### 6. Authors Scraper (`authors_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Extract unique author names from the first page
- Set usage for duplicate removal
- Author name collection
- Unique data handling
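
A set makes the duplicate removal automatic, as this small sketch suggests. It assumes author names appear in elements with the class `author`; the function name `unique_authors` and the sample markup are hypothetical.

```python
import bs4


def unique_authors(html):
    """Collect author names into a set so repeats are dropped automatically."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    authors = set()
    for tag in soup.select(".author"):
        authors.add(tag.get_text(strip=True))
    return authors


# Hypothetical sample markup, for illustration only.
SAMPLE_HTML = """
<div class="quote"><small class="author">Albert Einstein</small></div>
<div class="quote"><small class="author">Jane Austen</small></div>
<div class="quote"><small class="author">Albert Einstein</small></div>
"""

# Sets are unordered, so sort before printing for a stable result.
print(sorted(unique_authors(SAMPLE_HTML)))  # ['Albert Einstein', 'Jane Austen']
```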

### 7. Top Tags Scraper (`top_tags_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Extract the top ten tags from the homepage sidebar
- Sidebar content scraping
- Text cleaning with `.strip()`
- Popular content extraction
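
The role of `.strip()` is easiest to see on markup where the tag text is surrounded by whitespace. The sketch below assumes each sidebar tag is wrapped in an element with the class `tag-item`; `top_tags` and the sample HTML are hypothetical.

```python
import bs4


def top_tags(html):
    """Return the sidebar tag names with surrounding whitespace removed."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # get_text() keeps the newlines and indentation around each link,
    # so .strip() is needed to leave only the tag name.
    return [item.get_text().strip() for item in soup.select(".tag-item")]


# Hypothetical sample markup, for illustration only.
SAMPLE_HTML = """
<div class="tags-box">
  <span class="tag-item">
    <a class="tag" href="/tag/love/">love</a>
  </span>
  <span class="tag-item">
    <a class="tag" href="/tag/inspirational/">inspirational</a>
  </span>
</div>
"""

print(top_tags(SAMPLE_HTML))  # ['love', 'inspirational']
```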

### 8. Multi-Page Authors Scraper (`multi_page_authors_scraper.py`)
**Target Site:** [quotes.toscrape.com](http://quotes.toscrape.com)
**Objective:** Extract all unique authors from all pages using two different approaches
- **Method 1:** Fixed range (when you know the total pages)
- **Method 2:** Dynamic detection (robust approach for unknown page count)
- Advanced pagination handling
- End-of-content detection

## 🎯 Key Learning Concepts

### Web Scraping Fundamentals
- Making HTTP requests with `requests.get()`
- Parsing HTML with `BeautifulSoup`
- CSS selector usage for element targeting
- Text extraction and cleaning

### Data Collection Strategies
- **Lists** - For ordered data collection
- **Sets** - For unique data collection (automatic duplicate removal)
- **File I/O** - For saving downloaded content

### Pagination Techniques
- **URL templating** with `.format()` for dynamic page URLs
- **Fixed range pagination** for known page counts
- **Dynamic pagination** with condition checking for unknown page counts
- **End detection** using content-based signals
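
One way to combine URL templating, dynamic pagination, and end detection is sketched below. To keep it runnable offline, the page fetcher is passed in as a function; `collect_pages`, `fake_fetch`, the `site.example` URLs, and the `max_pages` safety cap are all assumptions made for this illustration, not part of the repository.

```python
def collect_pages(fetch, base_url, end_marker, max_pages=100):
    """Fetch numbered pages until one contains the end marker."""
    pages = []
    page = 1
    # max_pages is a safety cap so a changed end message cannot loop forever.
    while page <= max_pages:
        html = fetch(base_url.format(page))
        if end_marker in html:
            break
        pages.append(html)
        page += 1
    return pages


# Hypothetical stand-in for a site with two pages of content.
FAKE_SITE = {
    "https://site.example/page/1/": "<p>first</p>",
    "https://site.example/page/2/": "<p>second</p>",
}


def fake_fetch(url):
    """Return stored HTML, or the end-of-content message for unknown pages."""
    return FAKE_SITE.get(url, "No content found!")


pages = collect_pages(fake_fetch, "https://site.example/page/{}/", "No content found!")
print(len(pages))  # 2
```

With a live site, `fetch` could be something like `lambda url: requests.get(url).text`; the end marker has to match whatever message the target site actually serves past its last page.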

## 🚀 Getting Started

### Prerequisites
```bash
pip install requests beautifulsoup4 lxml
```

### Running the Scripts
The scripts live in `practice-src-code/`, and each one can be run independently from that directory:
```bash
cd practice-src-code
python book_ratings_scraper.py
python quotes_scraper.py
python multi_page_authors_scraper.py
# ... etc
```

## 📋 Usage Examples

### Basic Single Page Scraping
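```python
import requests
import bs4

url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on 4xx/5xx responses

# Parse the HTML with the lxml backend and read the <title> text
soup = bs4.BeautifulSoup(response.text, "lxml")
title = soup.select("title")[0].get_text()
```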

### Multi-Page Scraping with Known Page Count
```python
import requests
import bs4

base_url = "https://site.com/page/{}/"
data = []

for page in range(1, 11):  # Pages 1-10
    url = base_url.format(page)
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, "lxml")
    # Extract data from soup and append it to data...
```

### Dynamic Multi-Page Scraping
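```python
import requests

base_url = "https://site.com/page/{}/"  # URL template with a page-number slot
page = 1

while True:
    url = base_url.format(page)
    response = requests.get(url)

    # Stop at the site's "no more content" page (the message is site-specific)
    if "No content found!" in response.text:
        break

    # Process page...
    page += 1
```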

## 🎓 Skills Demonstrated

- **HTTP Requests** - Fetching web page content
- **HTML Parsing** - Extracting structured data from web pages
- **CSS Selectors** - Targeting specific elements
- **Data Structures** - Using lists and sets effectively
- **File Operations** - Downloading and saving binary content
- **Control Flow** - Implementing various looping strategies
- **String Manipulation** - Cleaning and formatting extracted text
- **Pagination Logic** - Handling multi-page data sources

## 🔍 Target Websites

- **books.toscrape.com** - Demo bookstore for scraping practice
- **quotes.toscrape.com** - Demo quotes site for scraping practice
- **Wikipedia** - Real-world content extraction
- **example.com** - Basic HTML structure demonstration

## 📝 Notes

- All scripts include comprehensive comments and docstrings
- Code follows clean, readable Python practices
- Each project demonstrates different aspects of web scraping
- Projects progress from basic to more advanced techniques
- Suitable for learning and portfolio demonstration

## 🚨 Ethical Considerations

- Respect rate limits and server resources
- Use scraped data responsibly
- Consider the website's terms of service
- Most of these projects target demo sites designed for scraping practice (books.toscrape.com and quotes.toscrape.com); Wikipedia and example.com are the exceptions
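
As a rough sketch of how the first and third points can be put into practice, the standard library's `urllib.robotparser` can check whether a URL is allowed, and a short pause between requests keeps the load on the server low. The helper names, the `demo-bot` user agent, and the robots.txt rules below are hypothetical, and the rules are parsed from a list instead of being downloaded so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robots(lines):
    """Parse robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that the rules permit for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def throttled(items, delay_seconds):
    """Yield items one by one, sleeping between consecutive items."""
    for index, item in enumerate(items):
        if index:
            time.sleep(delay_seconds)
        yield item


# Hypothetical rules and URLs, for illustration only.
robots = build_robots(["User-agent: *", "Disallow: /private/"])
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/data.html",
]

for url in throttled(allowed_urls(robots, "demo-bot", candidates), 0.01):
    print(url)  # only https://example.com/index.html is printed
```

Against a real site, the parser would normally be pointed at the live file with `set_url()` and `read()`, and the delay would be closer to a second or more.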

---

*This repository showcases fundamental web scraping techniques and serves as a learning resource for Python web scraping with requests and BeautifulSoup.*
