# Steam statistics ETL

## Disclaimer:
:warning:By **using** some or all of the elements of this code, **you** assume **responsibility for any consequences!**

:warning:The **licenses** for the technologies this code **depends on** may be **changed by their authors**.

## Description of the pipeline:
This pipeline collects statistical information about all games distributed through the Steam platform, including:
* :moneybag:Price
* :label:Tags
* :globe_with_meridians:Publisher
* :hammer_and_wrench:Developer
* :date:Steam release date

Unfortunately, the [Steam Web API](https://developer.valvesoftware.com/wiki/Steam_Web_API) does not currently provide this information through its [methods](https://wiki.teamfortress.com/wiki/WebAPI).

This pipeline was developed to solve that problem.

The pipeline also retrieves directly from the Steam Web API:
* :abc:Application names
* :id:And their Steam ids.

:spiral_calendar:Additionally, the pipeline records the scan date.
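The app list step can be sketched against the public `GetAppList` endpoint of the Steam Web API. This is only an illustrative sketch, not the pipeline's own code; the helper names are invented here:

```python
import requests

# Public Steam Web API endpoint that returns every appid/name pair.
STEAM_APP_LIST_URL = "https://api.steampowered.com/ISteamApps/GetAppList/v2/"


def parse_app_list(payload):
    """Extract (appid, name) pairs from a GetAppList JSON response."""
    return [(app["appid"], app["name"]) for app in payload["applist"]["apps"]]


def fetch_app_list():
    """Request the full list of Steam applications."""
    response = requests.get(STEAM_APP_LIST_URL, timeout=30)
    response.raise_for_status()
    return parse_app_list(response.json())


if __name__ == "__main__":
    apps = fetch_app_list()
    print(f"Steam currently lists {len(apps)} applications")
```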

****

## Required:
The application code is written in Python and depends on it.

**Python** version 3.6 [Python Software Foundation License; Zero-Clause BSD license from Python 3.8.6]:
* :octocat:[Python GitHub](https://github.com/python)
* :bookmark_tabs:[Python internet page](https://www.python.org/)

## Required Packages:
Used to build the Luigi task pipeline.

**Luigi** [Apache License 2.0]:
* :octocat:[Luigi GitHub](https://github.com/spotify/luigi)

Used to work with tabular data.

**Pandas** [BSD-3-Clause license]:
* :octocat:[Pandas GitHub](https://github.com/pandas-dev/pandas/)
* :bookmark_tabs:[Pandas internet page](https://pandas.pydata.org/)

Used to create a random user agent.

**fake-useragent** [Apache-2.0 license]:
* :octocat:[fake-useragent GitHub](https://github.com/fake-useragent/fake-useragent)

Used to send requests and receive responses.

**Requests** [Apache-2.0 license]:
* :octocat:[Requests GitHub](https://github.com/psf/requests)
* :bookmark_tabs:[Requests internet page](https://requests.readthedocs.io/en/latest/)

Used for scraping.

**BeautifulSoup4** [MIT]:
* :octocat:[BeautifulSoup4 GitHub](https://github.com/getanewsletter/BeautifulSoup4)
* :bookmark_tabs:[BeautifulSoup4 internet page](https://www.crummy.com/software/BeautifulSoup/)

Used to bring table cells to the desired values.

**NumPy** [BSD-3-Clause license]:
* :octocat:[NumPy GitHub](https://github.com/numpy/numpy)
* :bookmark_tabs:[NumPy internet page](https://numpy.org/)

Used to save data in parquet format.

**PyArrow** [Apache-2.0 license]:
* :octocat:[PyArrow GitHub](https://github.com/apache/arrow)
* :bookmark_tabs:[PyArrow internet page](https://arrow.apache.org/)

Used to monitor the progress of certain tasks from the terminal while they are running.

**tqdm** [MIT / Mozilla Public License 2.0]:
* :octocat:[tqdm GitHub](https://github.com/tqdm/tqdm)
* :bookmark_tabs:[tqdm internet page](https://tqdm.github.io/)

## Installing the Required Packages:
```bash
pip install luigi
pip install pandas
pip install fake_useragent
pip install requests
pip install beautifulsoup4
pip install numpy
pip install pyarrow
pip install tqdm
```
## Launch:
If your OS has a bash shell, the ETL pipeline can be started using the bash script:
```bash
./start_steam_statistics_ETL.sh
```
Variables are defined at the beginning of this script; by changing their values you can configure the pipeline.

**File location:**

./:open_file_folder:Steam_statistics_ETL

   └── :page_facing_up:start_steam_statistics_ETL.sh

The script contains an example of all the necessary arguments to run.

To launch the pipeline through this script, do not forget to make it executable.
```bash
chmod +x ./start_steam_statistics_ETL.sh
```
The tasks can also be run directly with the 'python' command.

**Example of one task with 'python' command:**
```bash
python3 -B -m steam_statistics_luigi_ETL AllSteamProductsData.AllSteamProductsData \
--AllSteamProductsData.AllSteamProductsData-landing-path-part $all_steam_products_data_path \
--AllSteamProductsData.AllSteamProductsData-date-path-part $date_path_part \
--AllSteamProductsData.AllSteamProductsData-file-mask $all_steam_products_data_file_mask \
--AllSteamProductsData.AllSteamProductsData-ancestor-file-mask $all_steam_products_data_ancestor_file_mask \
--AllSteamProductsData.AllSteamProductsData-file-name $all_steam_products_data_file_name \
--AllSteamProductsData.AllSteamProductsData-logfile-path $all_steam_products_logfile_path \
--AllSteamProductsData.AllSteamProductsData-loglevel $all_steam_products_loglevel
```
The example above shows the launch of one task.

Also note that the task pipeline itself is described in the 'steam_statistics_luigi_ETL.py' script.

**File location:**

./:open_file_folder:Steam_statistics_ETL

   └── :page_facing_up:steam_statistics_luigi_ETL.py

## Launch in Docker:
To run a Docker container with the ETL, you can use the ready-made Dockerfile.

To do this, run the build command with administrator rights on Windows, or with sudo privileges on Unix-like systems.

**Example docker build command:**
```commandline
docker build -t luigi-steam -f C:\Git\Steam_statistics_ETL\Docker\luigi_ETL.df C:\Git\Steam_statistics_ETL\Docker\
```
Wait for the image to build.

Then start the container with a shell command.

**Example docker run command:**
```commandline
docker run -d -t --name Steam_Statistics luigi-steam
```

## Description of tasks:
**AllSteamProductsData**
* Retrieves the list of applications from the Steam Web API.
* If this is not the first launch, saves the difference from the previous launch as its result.
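The "difference from the previous launch" step can be sketched with pandas. The column names `appid` and `name` are assumptions for illustration, not necessarily the pipeline's own schema:

```python
import pandas as pd


def new_apps_since_last_run(previous, current):
    """Return the rows of `current` whose appid is absent from `previous`."""
    return current[~current["appid"].isin(previous["appid"])].reset_index(drop=True)


previous = pd.DataFrame({"appid": [10, 20],
                         "name": ["Counter-Strike", "Team Fortress Classic"]})
current = pd.DataFrame({"appid": [10, 20, 30],
                        "name": ["Counter-Strike", "Team Fortress Classic", "Day of Defeat"]})

# Only the newly listed application (appid 30) survives the filter.
diff = new_apps_since_last_run(previous, current)
```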
****
**GetSteamProductsDataInfo**
* Requests the application pages received from the previous task.
* Masquerades as a new user on every request and waits a random 3 to 6 seconds before each new request.
* Scrapes data from these pages.
* Filters out everything that is not an application or an add-on to one.
* Sifts out DLC and saves it to a file separate from applications.
* Separately saves applications and add-ons that are not available in the current region.
* Saves each response to a local cache in case the pipeline crashes.
* Reads the local cache of applications and DLC on every unsuccessful instance.
* Deletes the local cache of applications and add-ons after successful instances.
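The masquerading and throttling described above could look roughly like this. The `apphub_AppName` selector and the helper names are assumptions for illustration, not the pipeline's actual code:

```python
import random
import time

import requests
from bs4 import BeautifulSoup


def parse_app_name(html):
    """Pull the application name out of a Steam store page, if present."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_="apphub_AppName")
    return node.get_text(strip=True) if node else None


def polite_get(url):
    """Fetch a page with a fresh random User-Agent after a 3-6 second pause."""
    from fake_useragent import UserAgent  # lazy import; requires fake-useragent
    time.sleep(random.uniform(3, 6))      # random delay between requests
    headers = {"User-Agent": UserAgent().random}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response.text
```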

Result features:
* Some apps require registration to scrape, so information about them cannot be collected.

All columns in that application's row will be empty except 'id' and 'name'.
* The 'rating_30d_percent' and 'rating_30d_count' columns can be empty if no one has left a review in the last 30 days.
* 'not_available_in_steam_now' is set as the 'price' column value if the app is no longer available in the Steam store.

Usually this situation coincides with the previous point.
* Some apps and DLC may be missing tags. Most often this applies to various OSTs,
in which case the cell will be empty.
* If the application or DLC has not been released yet, the cells of the 'steam_release_date' column will be marked "in the pipelane".

****

**[SteamAppInfoCSVJoiner, SteamDLCInfoCSVJoiner]**
* Collects the results of all successful instances of the previous task and merges them into a new file containing the application statistics.
* Fills empty cells with 'nan'.
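A minimal sketch of such a joiner with pandas, assuming the per-instance results are CSV files matched by a glob pattern (the function name and pattern are illustrative):

```python
import glob

import pandas as pd


def join_csv_results(pattern):
    """Concatenate per-instance CSV results and fill empty cells with 'nan'."""
    frames = [pd.read_csv(path) for path in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    return merged.fillna("nan")
```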

## Known Problems:
### FakeUserAgentError:
When a task tries to send a request to a page, a number of **fake-useragent errors** appear:
```text
FakeUserAgentError:
Error occurred during loading data. Trying to use cache server https://fake-useragent.herokuapp.com/browsers/0.1.11
...
urllib.error.HTTPError: HTTP Error 503: Service Unavailable
...
fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached
...
```
This is due to an outdated version of fake-useragent.

**Solution:**

Update fake-useragent from the terminal or command line.
```bash
pip install fake-useragent --upgrade
```
### bs4.FeatureNotFound:
```text
raise FeatureNotFound:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
```
It looks like you don't have the lxml library installed.

**Solution:**

Install the lxml library from the terminal or command line.
```bash
pip install lxml
```

## Known Bugs:
* Applications that have no price receive not 0 as a value, but literally emptiness.
* Sometimes it is not possible to scrape information about an application's publisher and developer.
In that case the value in the cell will be empty.

***

**Thank you** for your interest in my work.