# Countries of the World: A Simple Example - Walkthrough
> A walkthrough for the web scraping exercise found on https://www.scrapethissite.com
## Step 0: Install necessary dependencies
### - Step 0.1: Create a virtual environment:
```bash
python -m venv venv
```

### - Step 0.2: Activate the virtual environment:
***Windows***:
```bash
venv\Scripts\activate
```

***Mac/Linux***:
```bash
source venv/bin/activate
```
### - Step 0.3: Install dependencies:
```bash
pip install requests beautifulsoup4
```
* * *
## Step 1: Get the Webpage HTML
Start by fetching the HTML content of the target webpage. We'll use the `requests` library to do this.

```python
import requests

# Define the URL of the webpage
url = 'https://www.scrapethissite.com/pages/simple/'

# Send a GET request to fetch the webpage content
response = requests.get(url)

# Extract the HTML content from the response
html_content = response.text
```
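Network requests can fail, so it's worth confirming the request succeeded before parsing. Here is a minimal sketch using the error-handling helpers built into `requests` (the `timeout` value is an arbitrary choice, not part of the original exercise):

```python
import requests

url = 'https://www.scrapethissite.com/pages/simple/'

# timeout (in seconds) is an arbitrary safety net; without it, requests can wait indefinitely
response = requests.get(url, timeout=10)

# raise_for_status() raises requests.exceptions.HTTPError for 4xx/5xx responses
response.raise_for_status()

html_content = response.text
```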
* * *
## Step 2: Spot HTML Patterns
Inspect the HTML structure for any recurring patterns. It appears that each country's data is contained within `<div>` elements having the class `"country"`.
```html
<div class="col-md-4 country">
    <h3 class="country-name">Afghanistan</h3>
    <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Kabul</span><br>
        <strong>Population:</strong> <span class="country-population">29121286</span><br>
        <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">647500.0</span><br>
    </div>
</div>
<div class="col-md-4 country">
    <h3 class="country-name">Anguilla</h3>
    <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">The Valley</span><br>
        <strong>Population:</strong> <span class="country-population">13254</span><br>
        <strong>Area (km<sup>2</sup>):</strong> <span class="country-area">102.0</span><br>
    </div>
</div>
```
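Before writing any parsing code, you can sanity-check the pattern directly on the raw HTML from Step 1. A rough sketch (a plain substring count, assuming each country block contains exactly one `country-name` heading):

```python
# html_content comes from Step 1; each country block should contain
# exactly one element with the "country-name" class
print(html_content.count('country-name'))  # expect one match per country
```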
* * *
## Step 3: Parse the HTML
We'll use `BeautifulSoup` to parse the HTML content and make it ready for extraction.
```python
from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
```
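At this point `soup` is queryable. As a quick, illustrative sanity check (not part of the original walkthrough), you can print the page title and count how many elements match the pattern spotted in Step 2:

```python
# the page's <title>, to confirm we parsed the right document
print(soup.title.text.strip())

# number of elements matching the selector we'll use in Step 4
print(len(soup.select('div.country')))
```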
* * *
## Step 4: Extract the Data
Having identified the pattern, we can gather all `<div>` elements with the class `"country"` as individual datasets. Then, we'll loop through each dataset and extract details like country name, capital, population, and area.
```python
# Find all <div> elements with the class "country"
country_divs = soup.select('div.country')

# Iterate through each <div> element for data extraction
for country_div in country_divs:
    # Extract the country name element
    country_name = country_div.select_one('h3.country-name')
    # gets:
    # <h3 class="country-name">
    #     Andorra
    # </h3>

    # We only need the text:
    country_name = country_name.text
    # returns:
    # Andorra
    # but with extra whitespace

    # We can remove the extra whitespace:
    country_name = country_name.strip()
    # returns:
    # Andorra
    # without any extra whitespace

    # We can achieve clean text extraction in a single line by chaining the
    # .text and .strip() calls. For example, to extract the country capital:
    country_capital = country_div.select_one('span.country-capital').text.strip()
    # This line not only selects the country capital element but also extracts
    # its text content and removes any leading or trailing whitespace, ensuring
    # clean and properly formatted data.

    # Extract country population
    country_population = country_div.select_one('span.country-population').text.strip()

    # Extract country area
    country_area = country_div.select_one('span.country-area').text.strip()
```
* * *
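One optional refinement before assembling the full script (not part of the original exercise): the extracted values are plain strings, so if you intend to compute with the population or area, convert them to numbers first. A small hypothetical helper:

```python
def to_number(text):
    # hypothetical helper, not part of the original walkthrough:
    # converts a scraped numeric string to int or float
    text = text.strip().replace(',', '')  # tolerate thousands separators
    return float(text) if '.' in text else int(text)

print(to_number('29121286'))   # -> 29121286
print(to_number('647,500.0'))  # -> 647500.0
```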
### Put it all together:
```python
import json  # optional, added for output formatting

import requests
from bs4 import BeautifulSoup

url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, 'html.parser')
country_divs = soup.select('div.country')

country_data = []
# note the [:5]: only the first 5 countries will be extracted
for country_div in country_divs[:5]:
    country_name = country_div.select_one('h3.country-name').text.strip()
    country_capital = country_div.select_one('span.country-capital').text.strip()
    country_population = country_div.select_one('span.country-population').text.strip()
    country_area = country_div.select_one('span.country-area').text.strip()

    country_data.append({
        'country_name': country_name,
        'country_capital': country_capital,
        'country_population': country_population,
        'country_area': country_area
    })

# you could just do:
# print(country_data)
# but json.dumps will output country_data with indentation
print(json.dumps(country_data, indent=4))
```
### Outputs:
```json
[
{
"country_name": "Andorra",
"country_capital": "Andorra la Vella",
"country_population": "84000",
"country_area": "468.0"
},
{
"country_name": "United Arab Emirates",
"country_capital": "Abu Dhabi",
"country_population": "4975593",
"country_area": "82880.0"
},
{
"country_name": "Afghanistan",
"country_capital": "Kabul",
"country_population": "29121286",
"country_area": "647500.0"
},
{
"country_name": "Antigua and Barbuda",
"country_capital": "St. John's",
"country_population": "86754",
"country_area": "443.0"
},
{
"country_name": "Anguilla",
"country_capital": "The Valley",
"country_population": "13254",
"country_area": "102.0"
}
]
```
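As a possible next step beyond the original exercise, the same `country_data` list can be written to disk. A minimal sketch (the `countries.json` filename is an arbitrary choice):

```python
import json

# country_data as built by the script above; a one-entry stand-in here
country_data = [
    {'country_name': 'Andorra', 'country_capital': 'Andorra la Vella',
     'country_population': '84000', 'country_area': '468.0'},
]

# write the scraped data to a JSON file, using the same indentation as the printed output
with open('countries.json', 'w', encoding='utf-8') as f:
    json.dump(country_data, f, indent=4, ensure_ascii=False)
```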