https://github.com/hamamiasma/python-data-engineering-projekt-
Extract, Transform, Load (ETL) pipeline in Python that processes CSV, JSON, and XML files into unified structured data.
- Host: GitHub
- URL: https://github.com/hamamiasma/python-data-engineering-projekt-
- Owner: hamamiasma
- Created: 2025-04-28T09:57:36.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-04-28T10:13:59.000Z (11 months ago)
- Last Synced: 2025-04-28T11:31:06.330Z (11 months ago)
- Topics: batch-processing, csv, data-engineering, data-pipeline, etl-pipeline, json, pandas, python, xml
- Size: 1.95 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# About This Project
Welcome to the **ETL Data Engineering Project**!
This project demonstrates how to build a simple ETL (Extract, Transform, Load) pipeline using **Python** and **Pandas**.
The pipeline collects data from multiple file formats (CSV, JSON, and XML), transforms it into a unified structure, and saves it to a new CSV file.
## Main Features
- 📥 Extract data from multiple CSV, JSON, and XML files automatically.
- 🔄 Transform data: Convert units (height from inches to meters, weight from pounds to kilograms).
- 💾 Load the cleaned and transformed data into a single target CSV file.
- 📝 Log every ETL step into a logfile with timestamps.
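The unit conversion in the transform step can be sketched as follows. This is a minimal illustration using plain Python rather than Pandas, and it assumes each record is a dict with the hypothetical keys `name`, `height` (inches), and `weight` (pounds); the actual column names in the project may differ.

```python
# Exact conversion factors (international inch and avoirdupois pound).
INCH_TO_METER = 0.0254
POUND_TO_KILOGRAM = 0.45359237


def transform(rows):
    """Return new rows with height in meters and weight in kilograms.

    Assumes hypothetical keys "name", "height" (inches) and "weight"
    (pounds); values may be strings or numbers.
    """
    transformed = []
    for row in rows:
        transformed.append({
            "name": row["name"],
            # Round to two decimals to keep the output CSV readable.
            "height": round(float(row["height"]) * INCH_TO_METER, 2),
            "weight": round(float(row["weight"]) * POUND_TO_KILOGRAM, 2),
        })
    return transformed


if __name__ == "__main__":
    sample = [{"name": "alex", "height": "70", "weight": "150"}]
    print(transform(sample))
```

Returning new rows instead of modifying the input keeps the extracted data untouched, which makes it easier to compare values before and after the conversion.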
## Technologies Used
- Python 3
- Pandas
- Glob
- Logging
- CSV, JSON, and XML file handling
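As a rough idea of how `glob` and `logging` from the list above could work together, the sketch below discovers source files by extension and writes timestamped messages to a logfile. The function names, the logger name `etl`, and the log format are assumptions made for illustration, not taken from the project's code.

```python
import glob
import logging
import os


def setup_logging(log_path):
    """Return a logger that writes timestamped messages to log_path."""
    logger = logging.getLogger("etl")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path)
        # %(asctime)s adds the timestamp in front of every message.
        handler.setFormatter(logging.Formatter("%(asctime)s, %(message)s"))
        logger.addHandler(handler)
    return logger


def find_source_files(directory):
    """Return all CSV, JSON, and XML files in directory, sorted by path."""
    paths = []
    for pattern in ("*.csv", "*.json", "*.xml"):
        paths.extend(glob.glob(os.path.join(directory, pattern)))
    return sorted(paths)
```

A caller might run `logger = setup_logging("log_file.txt")` once at startup and then call `logger.info("Extract phase started")` before each ETL step.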
## Target Audience
- Data Engineering beginners
- Python developers
- Students interested in ETL pipelines
- Anyone curious about batch processing
## 📂 Project Structure
## Introduction
> **Have you ever wondered how data is collected from multiple sources and combined into a single source of information?**
>
> This type of data collection is called **batch processing**, and today we will be exploring a type of batch processing called **Extract, Transform, and Load (ETL)**.
>
> **ETL** does exactly what the name implies. It is the process of extracting large amounts of data from multiple sources and formats and transforming it into one specific format before loading it into a database or target file.
>
> **Example:**
> Imagine you are the owner of a start-up that built an AI model to predict if someone is at risk for diabetes based on their height and body weight.
> Some of your data is in **CSV** format, while the rest is in **JSON** and **XML** files.

## ETL Process Overview
- **Extract**: Data is extracted from CSV, XML, and JSON files.
- **Transform**: Data is converted (e.g. height from inches to meters).
- **Load**: The processed data is saved to a new CSV file.
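The extract and load steps could look roughly like the sketch below. It uses only the standard library (`csv`, `json`, `xml.etree.ElementTree`) instead of Pandas, and it makes several assumptions that are not confirmed by the project: the field names in `FIELDS`, a JSON file holding an array of objects, and an XML file with one `<person>` element per record.

```python
import csv
import json
import os
import xml.etree.ElementTree as ET

FIELDS = ["name", "height", "weight"]  # hypothetical column names


def extract_csv(path):
    with open(path, newline="") as f:
        return [{k: row[k] for k in FIELDS} for row in csv.DictReader(f)]


def extract_json(path):
    # Assumes the file contains a JSON array of objects.
    with open(path) as f:
        return [{k: str(item[k]) for k in FIELDS} for item in json.load(f)]


def extract_xml(path):
    # Assumes <person> elements with one child element per field.
    root = ET.parse(path).getroot()
    return [{k: p.findtext(k) for k in FIELDS} for p in root.iter("person")]


EXTRACTORS = {".csv": extract_csv, ".json": extract_json, ".xml": extract_xml}


def extract(paths):
    """Read every file with the extractor matching its extension."""
    rows = []
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        rows.extend(EXTRACTORS[extension](path))
    return rows


def load(rows, target_path):
    """Write the unified rows to a single target CSV file."""
    with open(target_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Dispatching on the file extension keeps each format's parsing logic in its own small function, so supporting another format would only require one more entry in `EXTRACTORS`.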
## Running the Project
```bash
python etl.py
```