# ⚽ Football Transfermarkt Data Pipeline

![](docs/images/transfermarkt-logo.jpeg)

## πŸ“‹ Table of contents
- [Introduction](#-introduction)
- [Problem Description](#-problem-description)
- [Architecture Overview](#-architecture-overview)
- [Data Source](#-data-source)
- [Data Ingestion](#-data-ingestion)
- [Data Storage](#-data-storage)
- [Transformations](#-transformations)
- [Dashboard](#-dashboard)
- [Usage](#-usage)
- [Further Improvements](#-further-improvements)

## πŸ—οΈ Architecture

![](docs/images/giphy.gif)

## πŸ“– Introduction

This repository contains my final project for the [Data Engineer Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp). It analyzes a large dataset of football data scraped from [transfermarkt.es](https://www.transfermarkt.es/).

## ❓ Problem Description

The project aims to answer several questions about players and their statistics, focusing on the market value, goals, and assists that players accumulate over their careers.

## πŸ—οΈ Architecture Overview

We use πŸ§™[Mage](https://www.mage.ai/) as the orchestrator for the whole pipeline, running inside a 🐳Docker container.

### πŸ“₯ Data Source

The data comes from a Kaggle [dataset](https://www.kaggle.com/datasets/davidcariboo/player-scores) and consists of 9 CSV files:

- appearances.csv (124.22 MB)
- club_games.csv (8.45 MB)
- clubs.csv (96.06 MB)
- competitions.csv (7.47 MB)
- game_events.csv (75.46 MB)
- game_lineups.csv (244.38 MB)
- games.csv (19.88 MB)
- player_valuations.csv (15.8 MB)
- players.csv (10.33 MB)

### πŸ“₯ Data Ingestion

- Batch: for data ingestion, we use the [Kaggle API](https://github.com/Kaggle/kaggle-api) to download all 9 CSV files inside a πŸ§™Mage block and prepare them for the data lake (see the sketch after this list).
- Batch: for the data warehouse, we use πŸ§™Mage to read the data back from the Google Cloud Storage bucket and prepare it for loading into BigQuery.
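
A minimal sketch of what the Kaggle download block could look like, assuming credentials are configured via `~/.kaggle/kaggle.json` or the `KAGGLE_USERNAME`/`KAGGLE_KEY` environment variables; the function name and local path are illustrative, not the exact block in this repo:

```python
import kaggle
import pandas as pd

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_player_scores(*args, **kwargs) -> pd.DataFrame:
    # Reads credentials from ~/.kaggle/kaggle.json or the
    # KAGGLE_USERNAME / KAGGLE_KEY environment variables.
    kaggle.api.authenticate()

    # Download and unzip all 9 CSV files from the dataset.
    kaggle.api.dataset_download_files(
        'davidcariboo/player-scores',
        path='./data',
        unzip=True,
    )

    # Return one of the files so downstream blocks can pick it up.
    return pd.read_csv('./data/players.csv')
```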

### πŸ’Ύ Data Storage

Both the data lake and the data warehouse are provisioned with Terraform, as a way to practice Infrastructure as Code (IaC).

- Google Cloud Storage: the bucket was created using Terraform. Data is stored as Parquet files to reduce storage size. Every dataset is stored as a single file, except for game_lineups, which is partitioned by year and month (see the sketch after this list).
- BigQuery: a BigQuery dataset was created using Terraform.
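
As an illustration of the partitioning step, here is a minimal sketch using pandas/pyarrow with `gcsfs` to write game_lineups partitioned by year and month. The bucket name is hypothetical, and the assumption that `game_lineups.csv` carries a `date` column is mine, not taken from the repo:

```python
import pandas as pd

# Hypothetical bucket name; the `date` column is assumed to exist.
df = pd.read_csv('./data/game_lineups.csv', parse_dates=['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month

# pandas delegates to pyarrow; gcsfs resolves the gs:// URL.
# With partition_cols, the path is treated as a directory root.
df.to_parquet(
    'gs://football-pipeline-lake/game_lineups',
    partition_cols=['year', 'month'],
    index=False,
)
```

Hive-style partitioning like this lets downstream readers prune by year and month instead of scanning the whole dataset.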

### πŸ”„ Transformations

We use [dbt Cloud](https://www.getdbt.com/product/dbt-cloud) to manage all transformations inside the data warehouse. We clean and join the appearances and player_valuations tables to build yearly aggregations of the statistics we care about (see the sketch below).
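
The actual models live in dbt (SQL), but the shape of the join and aggregation is roughly the following. This pandas sketch is only illustrative, and the column names (`player_id`, `date`, `goals`, `assists`, `market_value_in_eur`) are assumptions based on the public Kaggle schema:

```python
import pandas as pd

# Assumed column names, based on the public Kaggle schema.
appearances = pd.read_parquet('gs://football-pipeline-lake/appearances')
valuations = pd.read_parquet('gs://football-pipeline-lake/player_valuations')

appearances['year'] = pd.to_datetime(appearances['date']).dt.year
valuations['year'] = pd.to_datetime(valuations['date']).dt.year

# Goals and assists accumulated per player per year.
stats = (
    appearances
    .groupby(['player_id', 'year'], as_index=False)[['goals', 'assists']]
    .sum()
)

# Peak market value per player per year.
values = (
    valuations
    .groupby(['player_id', 'year'], as_index=False)['market_value_in_eur']
    .max()
)

yearly_stats = stats.merge(values, on=['player_id', 'year'], how='left')
```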

## πŸ“Š Dashboard

> [!WARNING]
> Since I was using the free credits of Google Cloud, all the underlying data was deleted and the report is no longer available.

Here is the [report](https://lookerstudio.google.com/reporting/affeeeed-5583-4da6-988a-06170c6d15cf), and here is a quick look at it:

![](docs/images/looker1.gif)

![](docs/images/looker2.gif)

## πŸ”§ Usage

You can refer to this [video]() for a full tour of the project and how it works.

If you prefer written instructions, check the [README in the docs section](docs/README.md).

## πŸš€ Further Improvements

- Add unit tests and integrate them into the CI/CD pipeline (in addition to the existing dbt tests).
- Add a Makefile.