https://github.com/viiviiiix/scrape-this-site-sandbox

A collection of projects that is used to learn web scraping.

beautifulsoup python scrape-this-site web-scraping


Host: GitHub
URL: https://github.com/viiviiiix/scrape-this-site-sandbox
Owner: VIIVIIIIX
Created: 2024-04-18T13:08:35.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-05-01T13:00:56.000Z (9 months ago)
Last Synced: 2024-11-13T16:13:46.643Z (2 months ago)
Topics: beautifulsoup, python, scrape-this-site, web-scraping
Language: Python
Homepage:
Size: 12.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Scrape This Site - Sandbox

A collection of projects that we'll use to learn web scraping.

- **Countries of the World: A Simple Example**

A single page that lists information about all the countries in the world.

The following information can be scraped:

- Country Name
- Country Capital
- Country Population
- Country Area
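
As a rough illustration of pulling these four fields out of a page, here is a minimal sketch. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies. The CSS class names (`country`, `country-name`, and so on) and the sample HTML are assumptions made for this example, not taken from the repository's code.

```python
from html.parser import HTMLParser

# Assumed markup: each country sits in an element with class "country" and
# each field has its own class. These class names are assumptions.
FIELD_CLASSES = {
    "country-name": "name",
    "country-capital": "capital",
    "country-population": "population",
    "country-area": "area",
}


class CountryParser(HTMLParser):
    """Collect one dict per country block from the assumed markup."""

    def __init__(self):
        super().__init__()
        self.countries = []
        self._field = None  # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "country" in classes:
            self.countries.append({})  # a new country block starts here
        for cls in classes:
            if cls in FIELD_CLASSES and self.countries:
                self._field = FIELD_CLASSES[cls]
                self.countries[-1][self._field] = ""

    def handle_data(self, data):
        if self._field:
            self.countries[-1][self._field] += data

    def handle_endtag(self, tag):
        # In the assumed markup, fields are <h3> or <span> elements.
        if self._field and tag in ("h3", "span"):
            value = self.countries[-1][self._field]
            self.countries[-1][self._field] = value.strip()
            self._field = None


# Sample HTML written for this example.
SAMPLE = """
<div class="col-md-4 country">
  <h3 class="country-name"><i class="flag-icon"></i> Andorra</h3>
  <span class="country-capital">Andorra la Vella</span>
  <span class="country-population">84000</span>
  <span class="country-area">468.0</span>
</div>
"""

parser = CountryParser()
parser.feed(SAMPLE)
print(parser.countries)
```

With BeautifulSoup, the same idea would typically be expressed with class-based lookups on the parsed tree instead of a hand-written parser.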

- **Hockey Teams: Forms, Searching and Pagination**

Browse through a database of NHL team stats since 1990 and build a scraper that handles common website interface components.

The following information can be scraped:

- Team Name
- Year
- Wins
- Losses
- OT-Losses
- Win %
- Goals For (GF)
- Goals Against (GA)
- Difference (+ / -)
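
Pagination and search forms usually come down to query-string parameters. The sketch below shows one way to build page URLs and walk them until a page comes back empty. The base URL, the parameter names (`page_num`, `per_page`, `q`), and the `fetch_rows` callback are assumptions for illustration; check the real form and pagination links before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical base URL and parameter names, used only for illustration.
BASE_URL = "https://www.scrapethissite.com/pages/forms/"


def page_url(page_num, per_page=25, query=None):
    """Build the URL for one page of results, optionally with a search term."""
    params = {"page_num": page_num, "per_page": per_page}
    if query:
        params["q"] = query
    return f"{BASE_URL}?{urlencode(params)}"


def scrape_all(fetch_rows, max_pages=100):
    """Walk pages in order and stop at the first page that yields no rows.

    fetch_rows is a caller-supplied function: URL in, list of rows out.
    """
    rows = []
    for page_num in range(1, max_pages + 1):
        page_rows = fetch_rows(page_url(page_num))
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows


# Stand-in for real HTTP requests: two fake pages with placeholder rows.
fake_pages = {
    page_url(1): [{"team": "Team A"}],
    page_url(2): [{"team": "Team B"}],
}
rows = scrape_all(lambda url: fake_pages.get(url, []))
print(rows)
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.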

- **Oscar Winning Films: AJAX and Javascript**

Click through a bunch of great films. Learn how content is added to the page asynchronously with Javascript and how you can scrape it.

The following information can be scraped:

- Title
- Nominations
- Awards
- Best Picture
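
Content that is added asynchronously often arrives as JSON from a separate request, which can be found in the browser's network tab and requested directly. The sketch below parses such a payload into the four fields listed above. The payload shape and key names are assumptions, and the sample films are made up.

```python
import json

# Made-up payload in the shape a JSON endpoint might return.
# The key names are assumptions for this example.
SAMPLE_RESPONSE = """
[
  {"title": "Example Film A", "nominations": 9, "awards": 6, "best_picture": true},
  {"title": "Example Film B", "nominations": 5, "awards": 1}
]
"""


def parse_films(payload):
    """Turn a JSON payload into a list of dicts with the four fields."""
    films = []
    for item in json.loads(payload):
        films.append(
            {
                "title": item["title"].strip(),
                "nominations": int(item["nominations"]),
                "awards": int(item["awards"]),
                # Treat a missing flag as "not Best Picture".
                "best_picture": bool(item.get("best_picture", False)),
            }
        )
    return films


films = parse_films(SAMPLE_RESPONSE)
print(films)
```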

- **Turtles All the Way Down: Frames & iFrames**

Some older sites might still use frames to break up their pages. Modern ones might be using iFrames to expose data. Learn about turtles as you scrape content inside frames.

The following information can be scraped:

- Species Name
- Description
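
Content inside a frame lives in a separate document, so a scraper typically has to find each frame's `src` and request that URL as well. The sketch below collects frame sources and resolves them against the page URL using only the standard library. The example URL and HTML are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSourceParser(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(page_url, html):
    """Return absolute URLs for all frames found in the page's HTML."""
    parser = FrameSourceParser()
    parser.feed(html)
    # urljoin resolves relative and root-relative paths against the page URL.
    return [urljoin(page_url, src) for src in parser.sources]


# Placeholder page URL and HTML for this example.
PAGE_URL = "https://example.com/pages/frames/"
HTML = '<iframe src="/pages/frames/?frame=i"></iframe><iframe src="inner.html"></iframe>'

urls = frame_urls(PAGE_URL, HTML)
print(urls)
```

Each returned URL can then be fetched and parsed like any other page.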

- **Spoofing Headers**

Sometimes you need to make your web scraper's HTTP requests appear to come from a browser so that the web server returns the same data you see in your browser.

- When the headers are accepted, the returned HTML contains the message "Headers properly spoofed, request appears to be coming from a browser :)".
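
One common way to do this is to send browser-like `User-Agent` and `Accept` headers. The sketch below only builds the request with the standard library and does not send it. The header values and the URL are illustrative assumptions, and the headers a given server actually checks may differ.

```python
from urllib.request import Request

# Example browser-like header values (assumptions for illustration).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
}


def build_request(url):
    """Create a request object carrying the browser-like headers."""
    return Request(url, headers=BROWSER_HEADERS)


req = build_request("https://example.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would send it. With the `requests` library, the same dictionary can be passed as `headers=` to `requests.get`.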

# How to Run?

1. Clone this repository.

```
git clone https://github.com/VIIVIIIIX/scrape-this-site-sandbox.git
```

2. Create a virtual environment.

```
cd scrape-this-site-sandbox
python3 -m venv .venv
```

3. Activate the virtual environment and install necessary libraries.

```
source .venv/bin/activate
pip install -r requirements.txt
```

4. Change into a project's directory and run its script to generate a CSV file containing the scraped data. Run each `cd` command below from the repository root.

- Countries of the World

```
cd countries-of-the-world
python3 countries.py
```

- Hockey Teams

```
cd hockey-teams
python3 hockey-teams.py
```

- Oscar Winning Films

```
cd oscar-winning-films
python3 oscar.py
```

- Turtles iframes

```
cd turtles-iframes
python3 turtles.py
```

- Spoofing Headers

```
cd spoofing-headers
python3 spoofing-headers.py
```
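
For reference, the CSV-generation step each script performs could look roughly like the sketch below. It uses the standard library's `csv` module and is not the repository's actual code. The function name `rows_to_csv` and the sample row are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output the same on every platform.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Sample row for illustration.
sample_rows = [{"name": "Andorra", "capital": "Andorra la Vella"}]
csv_text = rows_to_csv(sample_rows, ["name", "capital"])
print(csv_text)
```

To write to disk instead, open the file with `open(path, "w", newline="")` and pass that file object to `csv.DictWriter`.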