Using Python & Beautiful Soup to Parse Data
- Host: GitHub
- URL: https://github.com/oxylabs/beautiful-soup-parsing-tutorial
- Owner: oxylabs
- Created: 2022-08-18T07:15:40.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-04-04T13:29:35.000Z (10 months ago)
- Last Synced: 2024-11-17T02:11:42.200Z (2 months ago)
- Topics: beautiful-soup-parsing, beautifulsoup, beautifulsoup-html-parser, github-python, how-to-use-beautiful-soup-in-python, parser, python, python-beautiful-soup, python-web-scraper
- Language: Python
- Homepage:
- Size: 11.7 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Using Python and Beautiful Soup to Parse Data: Intro Tutorial
[![Oxylabs promo code](https://user-images.githubusercontent.com/129506779/250792357-8289e25e-9c36-4dc0-a5e2-2706db797bb5.png)](https://oxylabs.go2cloud.org/aff_c?offer_id=7&aff_id=877&url_id=112)
[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/GbxmdGhZjq)
## Installing Beautiful Soup
```bash
pip install beautifulsoup4
```

## Getting started
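To check that the installation worked, a quick import is enough (this snippet is just a sanity check, not part of the tutorial files):

```python
# Verify that Beautiful Soup is importable and can parse a trivial snippet.
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.text)  # hello
```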
A sample HTML file will help demonstrate the main methods Beautiful Soup uses to parse data. This file is much simpler than your average modern website, but it will be sufficient for the scope of this tutorial.
```html
<html>
<head>
<title>What is a Proxy?</title>
<meta charset="utf-8">
</head>
<body>
<h2>Proxy types</h2>
<p>There are many different ways to categorize proxies. However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.</p>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
```
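If you want to follow along, one way to create the sample file is to write it from Python (the filename index.html matches the one used in the snippets below):

```python
# Write the sample page to index.html so the later snippets can open it.
sample = """<html>
<head>
<title>What is a Proxy?</title>
<meta charset="utf-8">
</head>
<body>
<h2>Proxy types</h2>
<p>There are many different ways to categorize proxies. However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.</p>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>"""

with open('index.html', 'w') as f:
    f.write(sample)
```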
## Traversing for HTML tags
First, we can use Beautiful Soup to extract a list of all the tags used in our sample HTML file. For this, we will use the soup.descendants generator.
```python
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
for child in soup.descendants:
if child.name:
print(child.name)
```
After running this code (right-click the code and click "Run"), you should get the output below:
```
html
head
title
meta
body
h2
p
ul
li
li
li
li
li
```
What just happened? Beautiful Soup traversed our HTML file and printed all the HTML tags it found, in order. Let's take a quick look at what each line does.
```python
from bs4 import BeautifulSoup
```
This tells Python to use the Beautiful Soup library.
```python
with open('index.html', 'r') as f:
contents = f.read()
```
And this code, as you could probably guess, gives an instruction to open our sample HTML file and read its contents.
```python
soup = BeautifulSoup(contents, features="html.parser")
```
This line creates a BeautifulSoup object from the file contents, using Python's built-in HTML parser (html.parser). Other parsers, such as lxml, could also be used, but lxml is a separate external library, and for the purposes of this tutorial the built-in parser will do just fine.
```python
for child in soup.descendants:
if child.name:
print(child.name)
```
The final piece of code, the loop over the soup.descendants generator, tells Beautiful Soup to walk through the document, find the HTML tags, and print their names in the PyCharm console. The results can also easily be exported to a .csv file, but we will get to that later.
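The same descendants walk can do more than print. As a small variation (using an inline HTML string so it runs on its own), you can tally how often each tag appears:

```python
from collections import Counter
from bs4 import BeautifulSoup

html = "<ul><li>a</li><li>b</li><li>c</li></ul>"
soup = BeautifulSoup(html, features="html.parser")

# Count how many times each tag name appears in the tree.
counts = Counter(child.name for child in soup.descendants if child.name)
print(counts)  # Counter({'li': 3, 'ul': 1})
```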
## Getting the full content of tags
To get the content of tags, this is what we can do:
```python
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
print(soup.h2)
print(soup.p)
print(soup.li)
```
These simple instructions output each specified tag with its full content, in the given order. Here's what the output should look like:
```html
<h2>Proxy types</h2>
<p>There are many different ways to categorize proxies. However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.</p>
<li>Residential proxies</li>
```
You could also remove the HTML tags and print text only, by using, for example:
```python
print(soup.li.text)
```
Which in our case will give the following output:
```
Residential proxies
```
Note that this only prints the first instance of the specified tag. Let’s continue to see how to find elements by ID or using the find_all method to filter elements by specific criteria.
## Using Beautiful Soup to find elements by ID
We can use two similar ways to find elements by ID:
```python
print(soup.find('ul', attrs={'id': 'proxytypes'}))
```
or
```python
print(soup.find('ul', id='proxytypes'))
```
Both of these will output the same result in the Python Console:
```html
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
```
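One caveat worth knowing: when nothing matches the given criteria, find returns None rather than raising an error, so a guard is useful (a minimal sketch with an inline HTML string):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul id='proxytypes'><li>Residential proxies</li></ul>", "html.parser")

# find returns None when no element matches the criteria.
missing = soup.find('ul', id='no-such-id')
print(missing)  # None

found = soup.find('ul', id='proxytypes')
if found is not None:
    print(found.li.text)  # Residential proxies
```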
## Finding all specified tags and extracting text
The find_all method is a great way to extract specific data from an HTML file. It accepts many criteria that make it a flexible tool, letting us filter data in convenient ways. For this tutorial, though, we don't need anything more complex. Let's find all the items of our list and print them as text only:
```python
for tag in soup.find_all('li'):
print(tag.text)
```
This is how the full code should look:
```python
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
for tag in soup.find_all('li'):
print(tag.text)
```
And here’s the output:
```
Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies
```
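The same loop can also be written as a list comprehension, or you can flatten the whole list element at once with get_text; both are standard Beautiful Soup patterns (shown here with an inline HTML string):

```python
from bs4 import BeautifulSoup

html = "<ul><li>Residential proxies</li><li>Datacenter proxies</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <li> in one expression.
names = [tag.text for tag in soup.find_all('li')]
print(names)  # ['Residential proxies', 'Datacenter proxies']

# get_text with a separator flattens the whole <ul> at once.
print(soup.ul.get_text(separator=', '))  # Residential proxies, Datacenter proxies
```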
## Exporting data to a .csv file
To export the extracted data, we will use the pandas library. Install it with pip:
```bash
pip install pandas
```
Add this line to the beginning of your code to import the library:
```python
import pandas as pd
```
Going further, let's add some lines that will export the list we extracted earlier to a .csv file. This is how our full code should look:
```python
from bs4 import BeautifulSoup
import pandas as pd
with open('index.html', 'r') as f:
contents = f.read()
soup = BeautifulSoup(contents, features="html.parser")
results = soup.find_all('li')
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
```
What happened here? Let’s take a look:
```python
results = soup.find_all('li')
```
This line finds all instances of the `<li>` tag and stores them in the results variable.
```python
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
```
These lines create a pandas DataFrame with a single "Names" column from the results list and write it to a file called names.csv.
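Note that results holds whole tag objects, so the CSV rows include the `<li>` markup. If you want text only, extract .text before building the DataFrame (a small variation on the code above, using an inline HTML string so it runs on its own):

```python
import csv
from bs4 import BeautifulSoup
import pandas as pd

html = "<ul><li>Residential proxies</li><li>Datacenter proxies</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Store plain strings instead of tag objects.
results = [tag.text for tag in soup.find_all('li')]
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')

# Read the file back to confirm what was written.
with open('names.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # [['Names'], ['Residential proxies'], ['Datacenter proxies']]
```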