https://github.com/oxylabs/web-scraping-data-parsing-beautiful-soup

Web Scraping and Data Parsing Using Beautiful Soup

beautifulsoup data-parsing github-python python web-scraping

Host: GitHub
URL: https://github.com/oxylabs/web-scraping-data-parsing-beautiful-soup
Owner: oxylabs
Created: 2022-11-22T10:52:45.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2024-04-19T11:14:43.000Z (over 1 year ago)
Last Synced: 2024-11-17T02:09:16.589Z (about 1 year ago)
Topics: beautifulsoup, data-parsing, github-python, python, web-scraping
Language: Python
Homepage:
Size: 9.77 KB
Stars: 5
Watchers: 2
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

# Web Scraping and Data Parsing Using Beautiful Soup

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=web-scraping-data-parsing-beautiful-soup-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

This project provides a clear and concise example of how to fetch content from a website using the [Requests](https://pypi.org/project/requests/) module and then parse it using [BeautifulSoup](https://pypi.org/project/beautifulsoup4/).

## Setting Up

To run this example, you will need Python 3. We recommend [setting up a virtual environment](https://docs.python.org/3/library/venv.html).

Install the dependencies by running:
```bash
$ pip install requests
$ pip install BeautifulSoup4
$ pip install pandas
```

**Note**: You can also install them by using the [requirements.txt](src/requirements.txt) file included in this repository.

```bash
$ pip install -r src/requirements.txt
```
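
To confirm that the installation worked, a short check like the following may help. It is only a sketch: the helper name `installed_version` is hypothetical and not part of this repository, and it relies on the standard-library `importlib.metadata` module (Python 3.8 or newer).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the status of each dependency used in this tutorial.
for name in ('requests', 'beautifulsoup4', 'pandas'):
    print(name, installed_version(name) or 'NOT INSTALLED')
```

If any line reports `NOT INSTALLED`, rerun the corresponding `pip install` command inside the active virtual environment.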

## Web Scraping

The scraping target is https://books.toscrape.com, a mock bookstore website.

Import the `requests` module and use it to fetch a page from the site:

```python
import requests
response = requests.get('https://books.toscrape.com')
```

Once the response is retrieved, check whether the request was successful by verifying the `status_code` property:

```python
if response.status_code != 200:
    # Any status other than 200 is treated as a failure
    print('Page not found')
    exit(1)

print('Successfully fetched the page')
```

Save the script as `src/scrape.py` and run it.

```bash
$ python3 src/scrape.py

Successfully fetched the page
```

The `requests` module has retrieved the HTML content from the website. All that's left is to parse it.

A working example can be found [here](src/scrape.py).
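
The status check above covers HTTP error responses but not network failures such as timeouts or refused connections. A slightly more defensive variant might look like the sketch below. The function name `fetch_html` and the 10-second timeout are assumptions made for illustration and do not come from the repository.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Fetch a page and return its raw body, raising on HTTP errors."""
    # Without a timeout, requests may wait indefinitely for a response.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html('https://books.toscrape.com')` inside a `try` block that catches `requests.RequestException` would handle both HTTP errors and connection problems in one place, since `requests.HTTPError` is a subclass of that exception.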

## Parse HTML

Take a look at the structure of the HTML that you're trying to scrape.

```html
<article>
    ...
    <h3>
        <a href="..." title="A Light in the Attic">A Light in the ...</a>
    </h3>
    ...
</article>
```

The book info is neatly wrapped in an `article` tag. Inside the article, there's a heading (`h3`) that contains an anchor (`a`), which contains the title of the book inside an attribute.

```html
<a href="..." title="A Light in the Attic">A Light in the ...</a>
```
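
The reason to read the attribute rather than the anchor text is that the visible text is truncated. The sketch below illustrates the difference on a hand-written fragment modeled on the structure described above. The fragment is an assumption, not a copy of the live page, and the snippet uses the `BeautifulSoup` class that the following steps introduce.

```python
from bs4 import BeautifulSoup

# Hand-written fragment modeled on the described structure (an assumption,
# not a copy of the live page).
html = '''
<article>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
</article>
'''

soup = BeautifulSoup(html, 'html.parser')
anchor = soup.article.h3.a

visible_text = anchor.get_text()  # truncated text shown on the page
full_title = anchor['title']      # complete title stored in the attribute

print(visible_text)
print(full_title)
```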

To parse this HTML content, use the BeautifulSoup4 library.

First, import `BeautifulSoup`:
```python
from bs4 import BeautifulSoup
```

Then, create an instance of the `BeautifulSoup` class and pass it the HTML content retrieved earlier, along with the name of the parser to use:

```python
soup = BeautifulSoup(response.content, 'html.parser')
```

Retrieve all the `article` tags:

```python
articles = soup.find_all('article')
```

Define a `titles` list that will hold all the book titles extracted from the current HTML:

```python
titles = []
```

Iterate through every article to extract the `title` attribute of the anchor tag. You may want to print the title as well, just to check that the script works as expected:

```python
for article in articles:
    # The full title is stored in the anchor's title attribute
    title = article.h3.a.attrs['title']
    titles.append(title)
    print(title)
```
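
The chained access `article.h3.a.attrs['title']` raises an `AttributeError` if an article has no `h3` or no anchor, and a `KeyError` if the anchor has no `title` attribute. Where the markup is less uniform than on this mock site, a more tolerant loop may be preferable. The following is a sketch under that assumption. The names `extract_titles` and `sample_html` are hypothetical, and the sample markup is hand-written for illustration.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Collect anchor title attributes, skipping articles that lack one."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = []
    for article in soup.find_all('article'):
        # find() returns None when no matching descendant exists.
        heading = article.find('h3')
        anchor = heading.find('a') if heading else None
        # get() returns None instead of raising when the attribute is absent.
        title = anchor.get('title') if anchor else None
        if title:
            titles.append(title)
    return titles


# Hand-written sample: the second article has no heading or link.
sample_html = '''
<article><h3><a title="A Light in the Attic">A Light in the ...</a></h3></article>
<article><p>No heading or link here</p></article>
<article><h3><a title="Tipping the Velvet">Tipping the Velvet</a></h3></article>
'''

print(extract_titles(sample_html))
```

The incomplete article is skipped silently here. Logging or counting the skipped entries would be a reasonable alternative when missing data should not go unnoticed.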

Save the script as `src/parse.py` and run it:

```bash
$ python3 src/parse.py
Successfully fetched the page
A Light in the Attic
Tipping the Velvet
Soumission
...
```

All the book titles have been parsed successfully!

A working example can be found [here](src/parse.py).

## Save to CSV

Printing everything to standard output can become messy at times. Instead, it is a good idea to save the results into a CSV file.

Start by removing the `print` call from the loop:
```python
print(title) # delete this!
```

Next, import the `pandas` library and create a data frame object. In the constructor, pass a dictionary that maps the column name ("Title") to the list of titles parsed previously:

```python
import pandas
data_frame = pandas.DataFrame({'Title': titles})
```

Finally, save the data frame to a file by using the `to_csv` method:

```python
data_frame.to_csv('books.csv', index=False, encoding='utf-8')
```
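
To check what `to_csv` produces without writing a file, the data frame can be serialized to a string and read back. The snippet below is a minimal sketch of that round trip. The three titles are taken from the sample output in this README, and the variable names `csv_text` and `round_trip` are chosen for illustration.

```python
import io

import pandas

titles = ['A Light in the Attic', 'Tipping the Velvet', 'Soumission']
data_frame = pandas.DataFrame({'Title': titles})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = data_frame.to_csv(index=False)

# Read the text back to confirm that the column and values survive the round trip.
round_trip = pandas.read_csv(io.StringIO(csv_text))

print(csv_text)
print(round_trip['Title'].tolist())
```

Passing `index=False` keeps the row numbers out of the output, so the file contains only the `Title` column.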

Save the script as `src/save.py` and execute it.

```bash
$ cd src
$ python3 save.py
Successfully fetched the page
```

Use the `cat` Unix utility to print the CSV file:

```bash
$ cat books.csv
Title
A Light in the Attic
Tipping the Velvet
Soumission
...
```

The newly created file now contains all the book titles from the web page.

The final version of the script can be found [here](src/save.py).
