https://github.com/jibbs1703/tickit-data-pipeline
This repository demonstrates the construction of a robust data pipeline using an orchestrator together with on-premises and cloud resources. It collects data from on-premises SQL and NoSQL databases and loads it into a SQL database in the cloud.
- Host: GitHub
- URL: https://github.com/jibbs1703/tickit-data-pipeline
- Owner: jibbs1703
- License: mit
- Created: 2024-11-04T17:24:07.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-06-06T00:47:28.000Z (4 months ago)
- Last Synced: 2025-06-06T01:19:04.671Z (4 months ago)
- Topics: aws-glue, aws-glue-crawler, aws-glue-data-catalog, aws-redshift, aws-s3, boto3, data-lake, database, etl-pipeline, medallion-architecture, mongodb, precommit-hooks
- Language: Python
- Homepage:
- Size: 54.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Tickit Data Pipeline
## Overview
Welcome to the Tickit Data Lake project! This project demonstrates the construction of a scalable
and robust data pipeline, leveraging Apache Airflow for orchestration and automation. It provides
a practical example of a modern data pipeline that handles the extraction, loading, and
transformation (ELT) of batch data, designed to support the analytical needs of a business.

## Key Features and Technologies
- Automated Orchestration: Airflow is the core orchestration engine, responsible for scheduling, monitoring,
and managing the entire data pipeline. It defines the workflow as a Directed Acyclic Graph (DAG), ensuring
dependencies between tasks are handled correctly. Airflow's retry, logging, and alerting features keep
the pipeline reliable.
- Integration of Multiple Data Sources: The project integrates several data sources (see the DAG sketch after this list), including:
1. On-premises SQL and NoSQL databases
2. Cloud-hosted SQL and NoSQL databases
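The README does not show the DAG itself, so here is a minimal sketch of what the orchestration described above might look like, assuming Airflow 2.4+; the DAG id, task ids, and the extract/load callables are hypothetical placeholders, not the repository's actual code.

```python
# Minimal Airflow DAG sketch (assumes Airflow 2.4+).
# DAG id, task ids, and callables are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_source():
    """Placeholder: pull a batch from an on-premises SQL or NoSQL source."""
    ...


def load_to_cloud():
    """Placeholder: land the extracted batch in the cloud database."""
    ...


with DAG(
    dag_id="tickit_pipeline",          # hypothetical DAG id
    start_date=datetime(2024, 11, 1),
    schedule="@daily",                 # batch pipeline, run once per day
    catchup=False,
    default_args={"retries": 2},       # Airflow retries failed tasks
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_to_cloud)

    extract >> load  # dependency: load runs only after extract succeeds
```

The `>>` operator encodes the DAG edge, which is how Airflow guarantees the extraction step completes before loading begins.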
## Value
This project serves as a valuable example of building modern data pipelines with Airflow, showcasing best
practices for data ingestion, processing, and transformation. It provides a solid foundation for building a
robust data platform to support a wide range of analytical needs. A minimal sketch of a data lake landing step follows.
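The repository's topics mention boto3, aws-s3, and a medallion architecture; a hedged sketch of how a raw batch might land in the bronze layer of the data lake is shown below. The bucket name, object key, and local file path are hypothetical, not taken from the repository.

```python
# Minimal sketch of a bronze-layer landing step, assuming boto3 is
# installed and AWS credentials are configured. All names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Land a raw batch export in the bronze layer of the data lake.
s3.upload_file(
    Filename="exports/tickit_sales.csv",   # hypothetical local export
    Bucket="tickit-data-lake",             # hypothetical bucket name
    Key="bronze/sales/tickit_sales.csv",   # medallion bronze-layer prefix
)
```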
## Project Setup
- Clone the repository.
```bash
git clone https://github.com/jibbs1703/Tickit-Data-Pipeline
cd Tickit-Data-Pipeline
```
```bash
# Build Tickit test container
docker build -t test-tickit .
```
```bash
# Run Tickit test container
docker run -it --name tickit-test-container -v .:/app test-tickit
```
```bash
# Clean up Tickit test container
docker stop tickit-test-container
docker rm tickit-test-container
```
```bash
# Remove all containers on the host (not just this project's)
docker rm $(docker ps -a -q)
```