https://github.com/hawa1222/data-stream-etl
Python-MySQL ETL pipeline to centralise personal data from sources like YouTube and Apple into a structured database, enabling advanced data analysis and application development.
- Host: GitHub
- URL: https://github.com/hawa1222/data-stream-etl
- Owner: hawa1222
- Created: 2024-03-04T21:06:59.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-06T11:24:23.000Z (8 months ago)
- Last Synced: 2024-12-26T15:44:32.532Z (5 months ago)
- Topics: api-integration, etl, html-parser, mysql, pipeline, python, xml-parser
- Language: Python
- Homepage:
- Size: 1.39 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Stream ETL
This repository contains a Python-based ETL pipeline that centralises data from multiple sources into a structured database, enabling data analysis and application development. The project uses Redis for caching, AWS S3 and MySQL for data storage, and Airflow for scheduling and monitoring.
## Project Outline
- Establish a comprehensive personal data repository, laying the foundation for personalised applications and dashboards with the ability to query historical data.
- Utilise a range of technologies such as Python, MySQL, API integration, HTML scraping, Redis, S3, and Airflow.
- Implement diverse data extraction techniques such as API calls, HTML scraping, and manual exports.
- Design physical data models to store data in a structured format.
- Integrate data from various formats, including CSV, JSON, and XML, into distinct MySQL tables.
- Develop a modular pipeline for scalable data integration and transformation.

### Data Sources
- **Apple Health**: Large XML file covering daily activity, walking, running and fitness metrics, heart rate, blood glucose levels, sleep analysis, steps, and more.
- **Strava**: JSON data covering performance metrics, sport types, equipment information, and activity details.
- **YouTube**: JSON data covering YouTube likes/dislikes, subscriptions, etc.
- **Daylio**: CSV data covering mood tracking, activities, and notes.
- **Spend**: CSV data covering 6+ years of financial transactions.

### Physical Data Model
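
As a purely illustrative example of the physical data model, the snippet below creates a hypothetical MySQL table for Strava activities using `mysql-connector-python`. The table name, column names, and connection details are assumptions for this sketch; the actual schema is defined by the project's data model.

```python
import mysql.connector  # pip install mysql-connector-python

# Hypothetical table for Strava activities; names and types are illustrative only.
CREATE_STRAVA_ACTIVITIES = """
CREATE TABLE IF NOT EXISTS strava_activities (
    activity_id   BIGINT PRIMARY KEY,
    name          VARCHAR(255),
    sport_type    VARCHAR(64),
    start_date    DATETIME,
    distance_m    DECIMAL(10, 2),
    moving_time_s INT,
    gear_id       VARCHAR(32)
)
"""

# Placeholder credentials; in practice these come from config.py / environment variables.
connection = mysql.connector.connect(
    host="localhost", user="etl_user", password="change-me", database="data_stream"
)
cursor = connection.cursor()
cursor.execute(CREATE_STRAVA_ACTIVITIES)
connection.commit()
cursor.close()
connection.close()
```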

## Project Architecture
- **Extractors**: Modules to retrieve data from various sources, using Redis for caching and S3 as a data lake.
- **Transformers**: Modules to clean, standardise, and manipulate data formats.
- **Loader**: Modules responsible for inserting data into the MySQL database.
- **Utility**: Modules containing helper functions and classes for database interactions (`DatabaseHandler`), file management (`FileManager`), logging (`setup_logging`), and more.
- **Validation**: Script for post-load data validation to ensure data integrity and consistency.

### Data Flow Diagram
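
A minimal sketch of the extract → transform → load flow described above. The function signatures, bucket name, and cache policy are assumptions for illustration, not the repository's actual implementation; in the real pipeline, the load step would go through the `DatabaseHandler` utility rather than a stub.

```python
import json

import boto3  # S3 as a data lake
import redis  # response caching

# Illustrative infrastructure handles; hosts and bucket name are placeholders.
CACHE = redis.Redis(host="localhost", port=6379, db=0)
S3 = boto3.client("s3")
BUCKET = "data-stream-etl-lake"


def extract(source: str, fetch) -> dict:
    """Fetch raw data, using Redis as a cache and S3 as the raw data lake."""
    cached = CACHE.get(source)
    if cached is not None:
        return json.loads(cached)
    raw = fetch()                               # e.g. an API call, HTML scrape, or file read
    CACHE.setex(source, 3600, json.dumps(raw))  # cache the raw response for one hour
    S3.put_object(Bucket=BUCKET, Key=f"raw/{source}.json", Body=json.dumps(raw))
    return raw


def transform(raw: dict) -> list[dict]:
    """Clean and standardise raw records into rows ready for MySQL."""
    return [{k.lower(): v for k, v in record.items()} for record in raw.get("items", [])]


def load(rows: list[dict], table: str) -> None:
    """Insert transformed rows into the target MySQL table (stubbed for the sketch)."""
    print(f"Would insert {len(rows)} rows into {table}")


if __name__ == "__main__":
    raw = extract("strava", fetch=lambda: {"items": [{"Sport_Type": "Run"}]})
    load(transform(raw), table="strava_activities")
```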

## Requirements
- Python (version 3.12.3)
- MySQL (version 8.3.0)
- Redis (version 7.2.5)
- AWS S3
- Airflow (version 2.9.2)

## Setup Instructions
Clone the repository:
```bash
git clone https://github.com/hawa1222/data-stream-etl.git
cd data-stream-etl
```

Set up Python environment:
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Set up MySQL:
- Install MySQL server and create a new database.
- Update database connection details in `config.py` file.

Set up Redis:
- Install Redis server and start service.
- Update Redis connection details in `config.py` file.

Set up AWS S3:
- Create an AWS account and set up an S3 bucket.
- Update AWS S3 connection details in `config.py` file.
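
The MySQL, Redis, and S3 steps above all point at `config.py`. Below is a minimal sketch of what such a settings module might look like, assuming it reads credentials from environment variables; the variable names are illustrative, not necessarily the project's actual ones (see the environment variable setup further down).

```python
import os

# Illustrative settings module; the repository's actual config.py may differ.
MYSQL = {
    "host": os.getenv("MYSQL_HOST", "localhost"),
    "port": int(os.getenv("MYSQL_PORT", "3306")),
    "user": os.getenv("MYSQL_USER", "etl_user"),
    "password": os.getenv("MYSQL_PASSWORD", ""),
    "database": os.getenv("MYSQL_DATABASE", "data_stream"),
}

REDIS = {
    "host": os.getenv("REDIS_HOST", "localhost"),
    "port": int(os.getenv("REDIS_PORT", "6379")),
}

S3 = {
    "bucket": os.getenv("S3_BUCKET", ""),
    "region": os.getenv("AWS_REGION", "eu-west-2"),
}
```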

Set up Airflow:
- Airflow is included in `requirements.txt`, or install it manually:
```bash
pip install apache-airflow
```
- Initialise Airflow database:
```bash
airflow db migrate
```
- Create an Airflow user:
```bash
# you will be prompted to set a password for the new user
airflow users create \
    --username admin \
    --firstname admin \
    --lastname admin \
    --role Admin \
    --email admin@example.com
```
- Start Airflow web server:
```bash
airflow webserver --port 8080
```
- Start Airflow scheduler:
```bash
airflow scheduler
```
- Access Airflow web interface at `http://localhost:8080`.
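
For scheduling, a DAG along the following lines, placed in your Airflow DAGs folder, could trigger the pipeline daily. The DAG id, schedule, and path to `main.py` are assumptions for illustration; the repository's actual DAG may be structured differently.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG; id, schedule, and paths are assumptions, not the project's actual DAG.
with DAG(
    dag_id="data_stream_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command=(
            "cd /path/to/data-stream-etl && source .venv/bin/activate && python main.py"
        ),
    )
```
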
Configure environment variables:
- Copy `.env_template` to `.env`
- Fill in all required variables in `.env`

## Usage
To execute the ETL process manually, run the following command in your terminal:
```bash
python main.py
```