Hockey Teams: Forms, Searching and Pagination - Walkthrough
https://github.com/layer-se7en/scrapethissite-forms
- Host: GitHub
- URL: https://github.com/layer-se7en/scrapethissite-forms
- Owner: layer-se7en
- License: mit
- Created: 2024-03-20T01:49:54.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-03-20T07:23:21.000Z (10 months ago)
- Last Synced: 2024-12-31T03:29:33.692Z (10 days ago)
- Topics: beautifulsoup, python, requests, webscraping
- Language: Python
- Homepage: https://www.scrapethissite.com/pages/forms/
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Hockey Teams: Forms, Searching and Pagination - Walkthrough
> A walkthrough for the web-scraping exercise found on https://www.scrapethissite.com
## Step 0: Install necessary dependencies
### Step 0.1: Create a virtual environment:
```bash
python -m venv venv
```

### Step 0.2: Activate the virtual environment:
***Windows***:
```bash
venv\Scripts\activate
```

***Mac/Linux***:
```bash
source venv/bin/activate
```
### Step 0.3: Install dependencies:
```bash
pip install requests beautifulsoup4
```
* * *
## Step 1: Get the Webpage HTML
Start by fetching the HTML content of the target webpage. We'll use the `requests` library to do this.

```python
import requests

# Define the URL of the webpage
url = 'https://www.scrapethissite.com/pages/forms/'

# Send a GET request to fetch the webpage content
response = requests.get(url)

# Extract the HTML content from the response
html_content = response.text
```
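Before parsing anything, it's worth confirming the request actually succeeded. A small optional addition using `requests`' built-in checks:

```python
# Optional sanity check: raise an exception for a 4xx/5xx response
response.raise_for_status()
print(response.status_code)  # 200 on success
```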
* * *
## Step 2: Spot HTML Patterns

Inspect the HTML structure for any recurring patterns. It appears that each team's data is contained within `<tr>` elements having the class `"team"`.
```html
<tr class="team">
    <td class="name">Boston Bruins</td>
    <td class="year">1990</td>
    <td class="wins">44</td>
    <td class="losses">24</td>
    <td class="ot-losses"></td>
    <td class="pct">0.55</td>
    <td class="gf">299</td>
    <td class="ga">264</td>
    <td class="diff">35</td>
</tr>
<tr class="team">
    <td class="name">Buffalo Sabres</td>
    <td class="year">1990</td>
    <td class="wins">31</td>
    <td class="losses">30</td>
    <td class="ot-losses"></td>
    <td class="pct">0.388</td>
    <td class="gf">292</td>
    <td class="ga">278</td>
    <td class="diff">14</td>
</tr>
```
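If you'd rather confirm the pattern from Python than from the browser's dev tools, a quick sketch (this jumps slightly ahead to `BeautifulSoup`, which Step 3 introduces properly):

```python
from bs4 import BeautifulSoup

# Parse the page fetched in Step 1 and pretty-print the first
# <tr class="team"> row to confirm the structure.
soup = BeautifulSoup(html_content, 'html.parser')
print(soup.select_one('tr.team').prettify())
```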
* * *
## Step 3: Parse the HTML

We'll use `BeautifulSoup` to parse the HTML content and make it ready for extraction.
```python
from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
```
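`'html.parser'` is Python's built-in parser, so it needs no extra installs. BeautifulSoup can also use the faster third-party `lxml` parser; a sketch, assuming you've run `pip install lxml`:

```python
# Same parse, but with the third-party lxml parser
soup = BeautifulSoup(html_content, 'lxml')
```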
* * *
## Step 4: Extract the Data

Having identified the pattern, we can gather all `<tr>` elements with the class `"team"` as individual datasets. Then, we'll loop through each dataset and extract details like name, wins, losses, etc.
```python
# Find all <tr> elements with the class "team"
team_trs = soup.select('tr.team')

# Iterate through each element for data extraction
for team_tr in team_trs:
    # Extract Team Name
    name = team_tr.select_one('td.name')
    # gets:
    # <td class="name">
    #     Boston Bruins
    # </td>
    # We only need the text:
    name = name.text
    # returns:
    # Boston Bruins
    # But with extra whitespace
    # We can remove the extra whitespace:
    name = name.strip()
    # returns:
    # Boston Bruins
    # Without any extra whitespace

    # We can achieve clean text extraction in a single line by chaining .text and .strip().
    # For example, to extract the year, we can write:

    # Extract Year
    year = team_tr.select_one('td.year').text.strip()
    # This line not only selects the year element but also extracts its text content
    # and removes any leading or trailing whitespace, ensuring clean and properly formatted data.

    # Extract Wins
    wins = team_tr.select_one('td.wins').text.strip()
    # Extract Losses
    losses = team_tr.select_one('td.losses').text.strip()
    # Extract OT Losses
    ot_losses = team_tr.select_one('td.ot-losses').text.strip()
    # Extract Win %
    win_pct = team_tr.select_one('td.pct').text.strip()
    # Extract Goals For
    goals_for = team_tr.select_one('td.gf').text.strip()
    # Extract Goals Against
    goals_against = team_tr.select_one('td.ga').text.strip()
    # Extract Goal Difference
    goal_difference = team_tr.select_one('td.diff').text.strip()
```

* * *
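The nine near-identical `select_one(...).text.strip()` lines can also be collapsed by mapping output keys to their CSS selectors. A sketch of that idea (the `FIELDS` name is just for illustration):

```python
# Hypothetical mapping from output key to the CSS selector for that column
FIELDS = {
    'team_name': 'td.name',
    'year': 'td.year',
    'wins': 'td.wins',
    'losses': 'td.losses',
    'ot_losses': 'td.ot-losses',
    'win_pct': 'td.pct',
    'goals_for': 'td.gf',
    'goals_against': 'td.ga',
    'goal_difference': 'td.diff',
}

# One dict per team row, built with a dict comprehension
team_data = [
    {key: tr.select_one(sel).text.strip() for key, sel in FIELDS.items()}
    for tr in soup.select('tr.team')
]
```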
### Put it all together:
```python
import json  # optional, added for output formatting
import requests
from bs4 import BeautifulSoup

url = 'https://www.scrapethissite.com/pages/forms/'
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
team_trs = soup.select('tr.team')

team_data = []
# note [:5], will only extract the first 5 teams
for team_tr in team_trs[:5]:
    name = team_tr.select_one('td.name').text.strip()
    year = team_tr.select_one('td.year').text.strip()
    wins = team_tr.select_one('td.wins').text.strip()
    losses = team_tr.select_one('td.losses').text.strip()
    ot_losses = team_tr.select_one('td.ot-losses').text.strip()
    win_pct = team_tr.select_one('td.pct').text.strip()
    goals_for = team_tr.select_one('td.gf').text.strip()
    goals_against = team_tr.select_one('td.ga').text.strip()
    goal_difference = team_tr.select_one('td.diff').text.strip()
    team_data.append({
        'team_name': name,
        'year': year,
        'wins': wins,
        'losses': losses,
        'ot_losses': ot_losses,
        'win_pct': win_pct,
        'goals_for': goals_for,
        'goals_against': goals_against,
        'goal_difference': goal_difference
    })

# you could just do:
# print(team_data)
# but this will output team_data with indentation:
print(json.dumps(team_data, indent=4))
```
### Outputs:
```json
[
{
"team_name": "Boston Bruins",
"year": "1990",
"wins": "44",
"losses": "24",
"ot_losses": "",
"win_pct": "0.55",
"goals_for": "299",
"goals_against": "264",
"goal_difference": "35"
},
{
"team_name": "Buffalo Sabres",
"year": "1990",
"wins": "31",
"losses": "30",
"ot_losses": "",
"win_pct": "0.388",
"goals_for": "292",
"goals_against": "278",
"goal_difference": "14"
},
{
"team_name": "Calgary Flames",
"year": "1990",
"wins": "46",
"losses": "26",
"ot_losses": "",
"win_pct": "0.575",
"goals_for": "344",
"goals_against": "263",
"goal_difference": "81"
},
{
"team_name": "Chicago Blackhawks",
"year": "1990",
"wins": "49",
"losses": "23",
"ot_losses": "",
"win_pct": "0.613",
"goals_for": "284",
"goals_against": "211",
"goal_difference": "73"
},
{
"team_name": "Detroit Red Wings",
"year": "1990",
"wins": "34",
"losses": "38",
"ot_losses": "",
"win_pct": "0.425",
"goals_for": "273",
"goals_against": "298",
"goal_difference": "-25"
}
]
```
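Every value in the output above is a string, since `.text` returns raw text (note `ot_losses` coming back as an empty string for these rows). If you want real numbers, a minimal post-processing sketch (the `to_number` helper is just for illustration):

```python
# Convert the string fields to numbers; an empty string
# (like ot_losses here) becomes None.
def to_number(value, cast):
    return cast(value) if value else None

typed_data = [
    {
        **row,
        'year': to_number(row['year'], int),
        'wins': to_number(row['wins'], int),
        'losses': to_number(row['losses'], int),
        'ot_losses': to_number(row['ot_losses'], int),
        'win_pct': to_number(row['win_pct'], float),
        'goals_for': to_number(row['goals_for'], int),
        'goals_against': to_number(row['goals_against'], int),
        'goal_difference': to_number(row['goal_difference'], int),
    }
    for row in team_data
]
```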
***
## This is only 1/3 of what we need to do
- [x] Fetch data from the first page.
- [ ] Fetch data from all or a specific page.
- [ ] Fetch data for a queried team name.

### Understanding Query Parameters
Our base URL is: `https://www.scrapethissite.com/pages/forms/`
To modify the behavior of data retrieval, you can append the following parameters to the base URL (see the sketch after this list):
- `per_page=[25, 50, 100]`: Specifies the number of items per page.
- `page_num=[1...24]` at 25 per page (the default):
  - Example: `?page_num=24&per_page=25` or `?page_num=24`
- `page_num=[1...12]` at 50 per page:
  - Example: `?page_num=12&per_page=50`
- `page_num=[1...6]` at 100 per page:
  - Example: `?page_num=6&per_page=100`
- `q=""` to query for a specific team:
  - Example: `?q=boston`
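Rather than building these query strings by hand, `requests` can encode them for you through its `params` argument:

```python
import requests

base_url = 'https://www.scrapethissite.com/pages/forms/'

# requests URL-encodes the parameters into the query string for us
response = requests.get(base_url, params={'q': 'boston', 'per_page': '50', 'page_num': '1'})
print(response.url)
# https://www.scrapethissite.com/pages/forms/?q=boston&per_page=50&page_num=1
```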
***

## Step 1: Break code up into reusable modules
### (separation of concerns)
```python
def fetch_team_data(url):
    response = requests.get(url)
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup

def extract_team_data(soup):
    team_trs = soup.select('tr.team')
    team_data = []
    for team_tr in team_trs:
        name = team_tr.select_one('td.name').text.strip()
        year = team_tr.select_one('td.year').text.strip()
        wins = team_tr.select_one('td.wins').text.strip()
        losses = team_tr.select_one('td.losses').text.strip()
        ot_losses = team_tr.select_one('td.ot-losses').text.strip()
        win_pct = team_tr.select_one('td.pct').text.strip()
        goals_for = team_tr.select_one('td.gf').text.strip()
        goals_against = team_tr.select_one('td.ga').text.strip()
        goal_difference = team_tr.select_one('td.diff').text.strip()
        team_data.append({
            'team_name': name,
            'year': year,
            'wins': wins,
            'losses': losses,
            'ot_losses': ot_losses,
            'win_pct': win_pct,
            'goals_for': goals_for,
            'goals_against': goals_against,
            'goal_difference': goal_difference
        })
    return team_data
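
# A quick smoke test of the two pieces together (not part of the
# original walkthrough; same URL as used throughout):
if __name__ == '__main__':
    soup = fetch_team_data('https://www.scrapethissite.com/pages/forms/')
    print(extract_team_data(soup)[:2])  # peek at the first two team dicts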
```

### A modification we can make to `fetch_team_data`:
```python
def fetch_team_data(base_url, params={}):
    # Initialize parameters with the given params
    parameters = params.copy()

    # If 'per_page' is not provided, set it to 100 to minimize requests
    if 'per_page' not in parameters:
        parameters['per_page'] = '100'

    # Send a GET request to fetch the webpage content
    response = requests.get(base_url, params=parameters)

    # Extract the HTML content from the response
    html_content = response.text

    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Start the list of parsed soup objects with the first page
    soup_list = [soup]

    # If 'page_num' is provided or params is empty, return just this page
    if 'page_num' in params or not params:
        return soup_list

    # Initialize page number
    page_num = 2  # Start from the second page

    # Loop to fetch data from subsequent pages
    while True:
        # Update page number in parameters
        parameters['page_num'] = page_num

        # Send a GET request with updated parameters
        response = requests.get(base_url, params=parameters)

        # Extract the HTML content from the response
        html_content = response.text

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Select a team row from the parsed HTML
        team_tr = soup.select_one('tr.team')

        # If a team row is found, keep the page and move to the next one
        if team_tr:
            soup_list.append(soup)
            page_num += 1
        else:
            # If no team row is found, stop fetching
            break

    return soup_list
```

### Then all we need to do:
```python
# Define the base URL for fetching data
base_url = 'https://www.scrapethissite.com/pages/forms/'

# Initialize an empty list to store extracted team data
team_data = []

# Define optional parameters for the request
params = {'q': 'sharks', 'per_page': '25', 'page_num': '1'}

# Fetch the HTML content for each page of the request
team_soups = fetch_team_data(base_url, params=params)

# Extract data from each page
for team_soup in team_soups:
    team_data.extend(extract_team_data(team_soup))

# Output the first 5 extracted records (note the [:5]) with indentation
print(json.dumps(team_data[:5], indent=4))
```
### Output:
```json
[
{
"team_name": "San Jose Sharks",
"year": "1991",
"wins": "17",
"losses": "58",
"ot_losses": "",
"win_pct": "0.212",
"goals_for": "219",
"goals_against": "359",
"goal_difference": "-140"
},
{
"team_name": "San Jose Sharks",
"year": "1992",
"wins": "11",
"losses": "71",
"ot_losses": "",
"win_pct": "0.131",
"goals_for": "218",
"goals_against": "414",
"goal_difference": "-196"
},
{
"team_name": "San Jose Sharks",
"year": "1993",
"wins": "33",
"losses": "35",
"ot_losses": "",
"win_pct": "0.393",
"goals_for": "252",
"goals_against": "265",
"goal_difference": "-13"
},
{
"team_name": "San Jose Sharks",
"year": "1994",
"wins": "19",
"losses": "25",
"ot_losses": "",
"win_pct": "0.396",
"goals_for": "129",
"goals_against": "161",
"goal_difference": "-32"
},
{
"team_name": "San Jose Sharks",
"year": "1995",
"wins": "20",
"losses": "55",
"ot_losses": "",
"win_pct": "0.244",
"goals_for": "252",
"goals_against": "357",
"goal_difference": "-105"
}
]
```
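One thing worth adding when fetching every page: a short pause between requests keeps the load on the site light. A minimal sketch (the `polite_get` helper name is just for illustration; you'd call it in place of the bare `requests.get` inside `fetch_team_data`):

```python
import time

import requests


def polite_get(url, params=None, delay=1.0):
    # Fetch a page, then sleep briefly so back-to-back calls
    # (like the pagination loop above) are spaced out.
    response = requests.get(url, params=params)
    time.sleep(delay)
    return response
```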
**One thing to consider is error boundaries/exceptions, as a user can provide a `page_num` that doesn't have content, or an invalid `per_page` number.**

### Basic Error Boundary:
```python
def fetch_team_data(base_url, params={}):
    parameters = params.copy()
    # Fall back to 100 per page if per_page is missing or invalid
    if 'per_page' not in parameters or parameters['per_page'] not in {'25', '50', '100'}:
        parameters['per_page'] = '100'
    response = requests.get(base_url, params=parameters)
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    team_tr = soup.select_one('tr.team')
    # No team row at all means the query/page combination has no content
    if not team_tr:
        return None
    soup_list = [soup]
    if 'page_num' in params or not params:
        return soup_list
    page_num = 2
    while True:
        parameters['page_num'] = page_num
        response = requests.get(base_url, params=parameters)
        html_content = response.text
        soup = BeautifulSoup(html_content, 'html.parser')
        team_trs = soup.select('tr.team')
        if team_trs:
            soup_list.append(soup)
            # A full page suggests another page may follow;
            # note int(): per_page is stored as a string
            if len(team_trs) == int(parameters['per_page']):
                page_num += 1
            else:
                break
        else:
            break
    return soup_list
...
base_url = 'https://www.scrapethissite.com/pages/forms/'
team_data = []
params = {'q': 'boston', 'page_num': '12'}
team_soups = fetch_team_data(base_url, params=params)
if team_soups:
    for team_soup in team_soups:
        team_data.extend(extract_team_data(team_soup))
    print(json.dumps(team_data[:25], indent=4))
else:
    print('An error occurred')
```
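The boundary above only covers result pages with no matching rows; network-level problems (timeouts, DNS failures, 4xx/5xx responses) would still raise unhandled exceptions. A minimal sketch of a guard for those, using `requests`' exception hierarchy (`safe_get` is a hypothetical helper name):

```python
import requests


def safe_get(url, params=None):
    # Return the response, or None on any network failure
    # or HTTP error status.
    try:
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()  # raises on 4xx/5xx responses
        return response
    except requests.exceptions.RequestException as exc:
        print(f'Request failed: {exc}')
        return None
```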