Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/layer-se7en/scrapethissite-ajax

Oscar Winning Films: AJAX and Javascript - Walkthrough
https://github.com/layer-se7en/scrapethissite-ajax

beautifulsoup python requests webscraping

Last synced: 1 day ago
JSON representation

Oscar Winning Films: AJAX and Javascript - Walkthrough

Awesome Lists containing this project

README

        

# Oscar Winning Films: AJAX and Javascript - Walkthrough
> walkthrough for the web-scrapping exercise found on https://www.scrapethissite.com
## Step 0: Install necessary dependencies
### - Step 0.1: Create a virtual environment:
```bash
python -m venv venv
```

### - Step 0.2: Activate the virtual environment:

***Windows***:
```bash
venv\Scripts\activate
```

***Mac/Linux***:
```bash
source venv/bin/activate
```
### - Step 0.3: Install dependencies:
```bash
pip install requests beautifulsoup4
```
* * *
## Step 1: Get the Webpage HTML
Start by fetching the HTML content of the target webpage. We'll use the `requests` library to do this.

```python
import requests

# Define the URL of the webpage
url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

# Send a GET request to fetch the webpage content
response = requests.get(url)

# Extract the HTML content from the response
html_content = response.text
```
* * *
## Step 2: Spot HTML Patterns

Inspect the HTML structure for any recurring patterns. It appears that each movie year is contained within `` elements having the class ``"year-link"``.

```html

```
* * *
## Step 3: Parse the HTML

We'll use ``BeautifulSoup``, to parse the HTML content and make it ready for extraction.

```python
from bs4 import BeautifulSoup

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')
```
* * *
## Step 4: Extract the Data

Having identified the pattern, we can gather all `` elements with the class ``"year-link"`` as individual datasets. Then, we'll GET movie data for each year; from which we'll extract details like movie title, nominations, awards, etc...

```python
# Find all
elements with the class "year-link"
years_tags = soup.select('a.year-link')

movie_data = []

# Iterate through each element for data extraction
for years_tag in years_tags:

# Extract Year
year = years_tag.text.strip()

# Send a GET request to fetch the webpage content
response = requests.get(url, params={ 'ajax': 'true', 'year': year})

# Extract the JSON content from the response
movie_data.extend(response.json)
```

* * *
### Put it all together:
```python
import json
import requests
from bs4 import BeautifulSoup

def fetch_movie_data(url):

# Send a GET request to fetch the webpage content
response = requests.get(url)

# Extract the HTML content from the response
html_content = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Return the BeautifulSoup object
return soup

def extract_movie_data(soup, url):

# Select all tags with class 'year-link'
years_tags = soup.select('a.year-link')

movie_data = []

# Iterate through each tag containing year data
for year_tag in years_tags:
# Extract year from the text of the
tag
year = year_tag.text.strip()

# Send a GET request to fetch the webpage content for the given year
response = requests.get(url, params={'ajax': 'true', 'year': year})

# Extract JSON content from the response
json_content = response.json()

# Add extracted movie data to the list
movie_data.extend(json_content)

return movie_data

url = 'https://www.scrapethissite.com/pages/ajax-javascript/'

soup = fetch_movie_data(url)

movie_data = extract_movie_data(soup, url)

print(json.dumps(movie_data[:5], indent=4))
```
### Outputs:
```json
[
{
"title": "Spotlight ",
"year": 2015,
"awards": 2,
"nominations": 6,
"best_picture": true
},
{
"title": "Mad Max: Fury Road ",
"year": 2015,
"awards": 6,
"nominations": 10
},
{
"title": "The Revenant ",
"year": 2015,
"awards": 3,
"nominations": 12
},
{
"title": "Bridge of Spies",
"year": 2015,
"awards": 1,
"nominations": 6
},
{
"title": "The Big Short ",
"year": 2015,
"awards": 1,
"nominations": 5
}
]
```