Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/manuelandersen/football-pipeline
DE Zoomcamp 2024 Final Project
- Host: GitHub
- URL: https://github.com/manuelandersen/football-pipeline
- Owner: manuelandersen
- License: mit
- Created: 2024-07-08T00:10:32.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2024-07-24T23:43:00.000Z (4 months ago)
- Last Synced: 2024-09-25T23:01:29.400Z (about 2 months ago)
- Topics: bigquery, data-engineering, data-lake, data-warehouse, dbt, dbt-cloud, etl-pipeline, google-cloud, looker-studio, mageai, python
- Language: Python
- Homepage:
- Size: 975 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# ⚽ Football Transfermarkt Data Pipeline
![](docs/images/transfermarkt-logo.jpeg)
## Table of contents
- [Introduction](#introduction)
- [Problem description](#problem-description)
- [Architecture Overview](#architecture-overview)
- [Data source](#data-source)
- [Data Ingestion](#data-ingestion)
- [Data Storage](#data-storage)
- [Transformations](#transformations)
- [Dashboard](#dashboard)
- [Usage](#usage)
- [Further Improvements](#further-improvements)
## Architecture
![](docs/images/giphy.gif)
## Introduction
This repository represents my final project for the [Data Engineer Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp). It aims to analyze a large dataset of football data scraped from the page [transfermarkt.es](https://www.transfermarkt.es/).
## Problem Description
The project aims to answer several questions regarding players and their statistics. The focus is on market value, goals, and assists that players have accumulated over the years of their football careers.
## Architecture Overview
We use [Mage](https://www.mage.ai/) as the orchestrator for the whole pipeline, running inside a Docker container.
### Data Source
The data comes from a Kaggle [dataset](https://www.kaggle.com/datasets/davidcariboo/player-scores) and consists of 9 CSV files:
- appearances.csv (124.22 MB)
- club_games.csv (8.45 MB)
- clubs.csv (96.06 MB)
- competitions.csv (7.47 MB)
- game_events.csv (75.46 MB)
- game_lineups.csv (244.38 MB)
- games.csv (19.88 MB)
- player_valuations.csv (15.8 MB)
- players.csv (10.33 MB)
### Data Ingestion
- Batch: for the data lake, we use the [Kaggle API](https://github.com/Kaggle/kaggle-api) to extract all 9 CSV files into a Mage block that prepares them for upload.
- Batch: for the data warehouse, we use Mage to download the data from the Google Cloud Storage bucket and prepare it for loading.
### Data Storage
Both the data lake and the data warehouse are managed by Terraform as a way to learn IaC.
- Google Cloud Storage: a Google Cloud Storage bucket was created using Terraform. Data is stored as Parquet files to reduce memory consumption. Every dataset is stored as a single file, except for game_lineups, which is partitioned by year and month.
- BigQuery: a BigQuery dataset was created using Terraform.
### Transformations
We use [dbt Cloud](https://www.getdbt.com/product/dbt-cloud) to manage all the transformations inside the data warehouse. We clean and join the appearances and player_valuations data to get yearly aggregations of the statistics we care about.
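The project's transformations are dbt models, so the following is not the repository's code. It is a minimal Python sketch of the clean, join, and aggregate-by-year logic described above. The column names (`player_id`, `date`, `goals`, `assists`, `market_value_in_eur`), the function name `yearly_player_stats`, and the choice to keep the highest valuation of each year are all assumptions made for illustration.

```python
from collections import defaultdict


def yearly_player_stats(appearances, valuations):
    """Sum goals and assists per (player, year) and attach that year's peak valuation.

    Both arguments are iterables of dict rows, e.g. from csv.DictReader.
    Column names are assumed, not taken from the project's dbt models.
    """
    totals = defaultdict(lambda: {"goals": 0, "assists": 0, "market_value": None})

    # Aggregate match statistics by player and calendar year (dates as YYYY-MM-DD).
    for row in appearances:
        key = (row["player_id"], int(row["date"][:4]))
        totals[key]["goals"] += int(row["goals"])
        totals[key]["assists"] += int(row["assists"])

    # Left join: keep every (player, year) seen in appearances and
    # record the highest valuation reported in that same year, if any.
    for row in valuations:
        key = (row["player_id"], int(row["date"][:4]))
        if key not in totals:
            continue
        value = int(row["market_value_in_eur"])
        current = totals[key]["market_value"]
        if current is None or value > current:
            totals[key]["market_value"] = value

    return dict(totals)


# Hypothetical sample rows, shaped like csv.DictReader output (all strings).
appearances = [
    {"player_id": "10", "date": "2022-08-14", "goals": "2", "assists": "1"},
    {"player_id": "10", "date": "2022-11-02", "goals": "1", "assists": "0"},
    {"player_id": "10", "date": "2023-01-20", "goals": "0", "assists": "3"},
]
valuations = [
    {"player_id": "10", "date": "2022-06-01", "market_value_in_eur": "5000000"},
    {"player_id": "10", "date": "2022-12-15", "market_value_in_eur": "7000000"},
]

stats = yearly_player_stats(appearances, valuations)
```

Years with appearances but no valuation keep `market_value` as `None`, which mirrors a left join from appearances to valuations. An inner join would drop those rows instead.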
## Dashboard
> [!WARNING]
> Since I was using Google Cloud free credits, all the data was deleted and the report is no longer available.

Here is the [report](https://lookerstudio.google.com/reporting/affeeeed-5583-4da6-988a-06170c6d15cf). And here is a quick look at how it looked:
![](docs/images/looker1.gif)
![](docs/images/looker2.gif)
## Usage
You can refer to this [video]() where we give a detailed tour of the project and how it works.
If you prefer instructions, check the [README of the docs section](docs/README.md).
## Further Improvements
- Add unit tests and integrate them into the CI/CD pipeline (besides the dbt ones).
- Add a Makefile.