https://github.com/istinnew/etl-pipeline-ganz-project
End-to-end ETL pipeline project for collecting, transforming, and loading data into a cloud-based database using Python, MySQL, and Google Cloud Analytics
https://github.com/istinnew/etl-pipeline-ganz-project
cloud cloud-engineering cloud-services data data-science dataanalytics database database-schema googlecloud mysql mysql-database python python-lambda
Last synced: 7 months ago
JSON representation
End-to-end ETL pipeline project for collecting, transforming, and loading data into a cloud-based database using Python, MySQL, and Google Cloud Analytics
- Host: GitHub
- URL: https://github.com/istinnew/etl-pipeline-ganz-project
- Owner: IstinNew
- License: mit
- Created: 2024-11-26T09:35:48.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-12-04T11:20:39.000Z (10 months ago)
- Last Synced: 2025-02-02T02:28:15.656Z (8 months ago)
- Topics: cloud, cloud-engineering, cloud-services, data, data-science, dataanalytics, database, database-schema, googlecloud, mysql, mysql-database, python, python-lambda
- Language: Jupyter Notebook
- Homepage:
- Size: 305 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# ETL Pipeline Ganz Project
## Project Overview
The end-to-end EXTRACT -TRANSFER - LOAD (ETL) Pipeline Ganz project is designed to showcase an end-to-end data engineering pipeline. The project involves collecting data through web scraping or APIs, storing the data in a MySQL database, moving the pipeline to the cloud using Google Cloud Analytics, and automating the entire data collection and storage process.## Project Mind Map
- Graphical Representation
## How I Built It
The project is built using Python for data processing, MySQL for data storage, and Google Cloud Analytics for cloud deployment. The key steps include:
- **Data Collection**: Extracting data from various sources using web scraping techniques and APIs.
- **Data Storage**: Storing the collected data in a MySQL database for structured storage and querying.
- **Pipeline to the Cloud**: Deploying the ETL pipeline to Google Cloud Analytics for scalability and reliability.
- **Pipeline Automation**: Automating the data collection and storage process using scheduling tools.## Project Structure
- `scripts/`: Python scripts for data extraction, transformation, and loading.
- `queries/`: SQL run scripts for storage and querying. _(not complete at this stage)_
- `data/`: Directory for data files, charts, presentations etc.
- `config/`: Configuration files for database and settings.## Skills Demonstrated
- **Python Programming**: Efficient coding practices for data manipulation and pipeline management.
- **Data Cleaning**: Handling missing data, outliers, and inconsistencies in the data.
- **Data Warehousing**: Using MySQL for data storage and Google Cloud Analytics for cloud deployment.
- **API Integration**: Extracting data from various APIs.
- **Web Scraping**: Using BeautifulSoup and Requests for data extraction from websites.
- **Cloud Deployment**: Deploying and managing ETL pipelines on Google Cloud Analytics.## ETL Process Summary
1. **Data Collection**: Data is collected from multiple sources using web scraping and APIs.
2. **Data Cleaning**: Data is cleaned and preprocessed to ensure quality.
3. **Data Transformation**: Data is transformed into a consistent format suitable for analysis.
4. **Data Storage**: The transformed data is stored in a MySQL database.
5. **Pipeline to the Cloud**: The ETL pipeline is deployed to Google Cloud Analytics for better scalability and reliability.## Challenges Overcome
- **Data Quality Issues**: Addressed missing values and outliers to ensure high-quality data.
- **Integration of Multiple Data Sources**: Efficiently managed data from diverse sources with varying formats.
- **Scalability**: Ensured the ETL pipeline can handle large datasets and scale as needed.## Accomplishments
- **Diverse Data Sources**: Successfully integrated and processed data from multiple sources.
- **Efficient ETL Pipeline**: Built a robust pipeline that ensures timely and accurate data processing.
- **Cloud Deployment**: Successfully deployed the ETL pipeline to Google Cloud Analytics for enhanced performance and scalability.
- **Automation**: Automated the data collection and storage process, reducing manual intervention.## Getting Started
Refer to the [INSTRUCTIONS](https://github.com/IstinNew/ETL-Pipeline-Ganz-Project/blob/main/INSTRUCTIONS.md) file for detailed steps on setting up and running the ETL pipeline.Happy data processing! 😊📊✨