Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/airscholar/footballdataengineering
An end-to-end data engineering pipeline that fetches data from Wikipedia, cleans and transforms it with Apache Airflow and saves it on Azure Data Lake. Other processing takes place on Azure Data Factory, Azure Synapse and Tableau.
- Host: GitHub
- URL: https://github.com/airscholar/footballdataengineering
- Owner: airscholar
- Created: 2023-10-02T19:11:25.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-02T19:22:44.000Z (over 1 year ago)
- Last Synced: 2024-04-18T02:57:13.289Z (9 months ago)
- Topics: apache-airflow, azure-data-factory, azure-data-lake-gen2, azure-databricks, azure-synapse-analytics, data-engineering, dataengineering
- Language: Python
- Homepage: https://www.youtube.com/watch?v=tKIXUqz17W8
- Size: 469 KB
- Stars: 9
- Watchers: 2
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Football Data Engineering
This Python-based project crawls data from Wikipedia using Apache Airflow, cleans it, and pushes it to Azure Data Lake for processing.
## Table of Contents
1. [System Architecture](#system-architecture)
2. [Requirements](#requirements)
3. [Getting Started](#getting-started)
4. [Running the Code With Docker](#running-the-code-with-docker)
5. [How It Works](#how-it-works)
6. [Video](#video)

## System Architecture
![system_architecture.png](assets%2Fsystem_architecture.png)

## Requirements
- Python 3.9 (minimum)
- Docker
- PostgreSQL
- Apache Airflow 2.6 (minimum)

## Getting Started
1. Clone the repository.
```bash
git clone https://github.com/airscholar/FootballDataEngineering.git
```
2. Install Python dependencies.
```bash
pip install -r requirements.txt
```
## Running the Code With Docker
1. Start the services with Docker Compose:
```bash
docker compose up -d
```
2. Trigger the DAG on the Airflow UI.

## How It Works
1. Fetches data from Wikipedia.
2. Cleans the data.
3. Transforms the data.
4. Pushes the data to Azure Data Lake.

## Video
[![FootballDataEngineering](https://img.youtube.com/vi/tKIXUqz17W8/0.jpg)](https://www.youtube.com/watch?v=tKIXUqz17W8)
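
To make the four steps listed under How It Works more concrete, here is a minimal sketch of the fetch, clean, and transform stages using only the Python standard library. It is not the project's actual code: the function names (`extract_rows`, `clean_rows`, `transform_rows`, `to_csv`), the `Stadium`/`Capacity` columns, and the sample HTML are all hypothetical, and an inline HTML string stands in for a downloaded Wikipedia page. The final upload to Azure Data Lake is left out because it needs credentials; the sketch stops at producing CSV text that such a step could write.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Step 1 (fetch): pull raw table rows out of an HTML document."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def clean_rows(rows):
    """Step 2 (clean): strip whitespace and footnote markers like '[1]'."""
    cleaned = []
    for row in rows:
        cells = [re.sub(r"\[\w+\]", "", cell).strip() for cell in row]
        if any(cells):  # drop rows that are empty after cleaning
            cleaned.append(cells)
    return cleaned


def transform_rows(rows):
    """Step 3 (transform): build typed records keyed by the header row."""
    header, *body = rows
    keys = [name.lower().replace(" ", "_") for name in header]
    records = []
    for row in body:
        record = dict(zip(keys, row))
        # Hypothetical numeric column: "50,000" becomes 50000.
        record["capacity"] = int(record["capacity"].replace(",", ""))
        records.append(record)
    return records


def to_csv(records):
    """Serialise records to CSV text, ready for a separate upload step."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented sample data standing in for a fetched Wikipedia table.
SAMPLE_HTML = """
<table>
<tr><th>Stadium</th><th>Capacity</th></tr>
<tr><td> Example Arena[1] </td><td>50,000</td></tr>
<tr><td>Sample Park</td><td>42,500[a]</td></tr>
</table>
"""

if __name__ == "__main__":
    records = transform_rows(clean_rows(extract_rows(SAMPLE_HTML)))
    print(to_csv(records))
```

In an Airflow deployment, each of these functions would typically become its own task so that a failure in one stage can be retried without repeating the others.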