Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/layer-se7en/scrapethissite-ajax
Oscar Winning Films: AJAX and Javascript - Walkthrough
https://github.com/layer-se7en/scrapethissite-ajax
beautifulsoup python requests webscraping
Last synced: 1 day ago
JSON representation
Oscar Winning Films: AJAX and Javascript - Walkthrough
- Host: GitHub
- URL: https://github.com/layer-se7en/scrapethissite-ajax
- Owner: layer-se7en
- License: mit
- Created: 2024-03-20T08:25:25.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-03-20T08:26:14.000Z (10 months ago)
- Last Synced: 2024-12-31T03:29:33.792Z (10 days ago)
- Topics: beautifulsoup, python, requests, webscraping
- Language: Python
- Homepage: https://www.scrapethissite.com/pages/ajax-javascript/
- Size: 2.93 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Oscar Winning Films: AJAX and Javascript - Walkthrough
> walkthrough for the web-scrapping exercise found on https://www.scrapethissite.com
## Step 0: Install necessary dependencies
### - Step 0.1: Create a virtual environment:
```bash
python -m venv venv
```### - Step 0.2: Activate the virtual environment:
***Windows***:
```bash
venv\Scripts\activate
```***Mac/Linux***:
```bash
source venv/bin/activate
```
### - Step 0.3: Install dependencies:
```bash
pip install requests beautifulsoup4
```
* * *
## Step 1: Get the Webpage HTML
Start by fetching the HTML content of the target webpage. We'll use the `requests` library to do this.```python
import requests# Define the URL of the webpage
url = 'https://www.scrapethissite.com/pages/ajax-javascript/'# Send a GET request to fetch the webpage content
response = requests.get(url)# Extract the HTML content from the response
html_content = response.text
```
* * *
## Step 2: Spot HTML PatternsInspect the HTML structure for any recurring patterns. It appears that each movie year is contained within `` elements having the class ``"year-link"``.
```html
```
* * *
## Step 3: Parse the HTMLWe'll use ``BeautifulSoup``, to parse the HTML content and make it ready for extraction.
```python
from bs4 import BeautifulSoup# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
```
* * *
## Step 4: Extract the DataHaving identified the pattern, we can gather all `` elements with the class ``"year-link"`` as individual datasets. Then, we'll GET movie data for each year; from which we'll extract details like movie title, nominations, awards, etc...
```python
# Find all elements with the class "year-link"
years_tags = soup.select('a.year-link')movie_data = []
# Extract the JSON content from the response
movie_data.extend(response.json)
```* * *
### Put it all together:
```python
import json
import requests
from bs4 import BeautifulSoupdef fetch_movie_data(url):
# Send a GET request to fetch the webpage content
response = requests.get(url)# Extract the HTML content from the response
html_content = response.text# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')# Return the BeautifulSoup object
return soupdef extract_movie_data(soup, url):
# Select all tags with class 'year-link'
years_tags = soup.select('a.year-link')movie_data = []
# Iterate through each tag containing year data
for year_tag in years_tags:
# Extract year from the text of the tag
year = year_tag.text.strip()# Send a GET request to fetch the webpage content for the given year
response = requests.get(url, params={'ajax': 'true', 'year': year})# Extract JSON content from the response
json_content = response.json()# Add extracted movie data to the list
movie_data.extend(json_content)return movie_data
url = 'https://www.scrapethissite.com/pages/ajax-javascript/'
soup = fetch_movie_data(url)
movie_data = extract_movie_data(soup, url)
print(json.dumps(movie_data[:5], indent=4))
```
### Outputs:
```json
[
{
"title": "Spotlight ",
"year": 2015,
"awards": 2,
"nominations": 6,
"best_picture": true
},
{
"title": "Mad Max: Fury Road ",
"year": 2015,
"awards": 6,
"nominations": 10
},
{
"title": "The Revenant ",
"year": 2015,
"awards": 3,
"nominations": 12
},
{
"title": "Bridge of Spies",
"year": 2015,
"awards": 1,
"nominations": 6
},
{
"title": "The Big Short ",
"year": 2015,
"awards": 1,
"nominations": 5
}
]
```