https://github.com/jibbs1703/tickit-data-pipeline

This repository demonstrates the creation of a robust data pipeline using an orchestrator together with on-prem and cloud resources. It collects data from on-premises SQL and NoSQL databases and loads it into a SQL database in the cloud.
aws-glue aws-glue-crawler aws-glue-data-catalog aws-redshift aws-s3 boto3 data-lake database etl-pipeline medallion-architecture mongodb precommit-hooks


# Tickit Data Pipeline

## Overview
Welcome to the Tickit Data Lake project! This project demonstrates the construction of a scalable
and robust data pipeline, leveraging Apache Airflow for orchestration and automation. It provides
a practical example of building a modern data pipeline that handles the extraction, loading, and
transformation (ELT) of batch data, designed to support the analytical needs of a business.
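As a rough illustration of the ELT pattern described above — land raw batch data first, then transform inside the target database — the following sketch uses Python's built-in `sqlite3` as a stand-in for the cloud SQL target. The table and column names, and the sample records, are hypothetical and not taken from this repository:

```python
import sqlite3

# Extract: raw batch records, e.g. pulled from an on-prem source (hypothetical data).
raw_sales = [("2024-01-01", "TICKET-1", "19.99"), ("2024-01-02", "TICKET-2", "24.50")]

conn = sqlite3.connect(":memory:")  # stand-in for the cloud SQL database
conn.execute("CREATE TABLE raw_sales (sale_date TEXT, ticket_id TEXT, amount TEXT)")

# Load: land the data as-is, untyped, before any transformation (the "EL" in ELT).
conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_sales)

# Transform: clean and type the data inside the target database (the "T" in ELT).
conn.execute(
    "CREATE TABLE sales AS "
    "SELECT DATE(sale_date) AS sale_date, ticket_id, CAST(amount AS REAL) AS amount "
    "FROM raw_sales"
)
total = conn.execute("SELECT ROUND(SUM(amount), 2) FROM sales").fetchone()[0]
print(total)
```

The key ordering is that the `CAST`/cleaning step runs in SQL after loading, rather than in application code before it — which is what distinguishes ELT from classic ETL.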

## Key Features and Technologies

- Automated Orchestration: Airflow is the core orchestration engine, responsible for scheduling, monitoring,
and managing the entire data pipeline. It defines the workflow as a Directed Acyclic Graph (DAG), ensuring
dependencies between tasks are correctly handled. Airflow's robust features enable task retries, logging,
and alerting, ensuring pipeline reliability.

- Integration of Multiple Data Sources: The project seamlessly integrates with various data sources including:
1. On-premises SQL and NoSQL databases
2. Cloud-hosted SQL and NoSQL databases
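In the repository itself, Airflow's DAG handles this task ordering. As a dependency-free illustration of the same idea, Python's standard-library `graphlib` can topologically sort a small extract → load → transform task graph. The task names below are illustrative, not taken from this project's DAGs:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical task names).
dag = {
    "extract_onprem_sql": set(),
    "extract_onprem_nosql": set(),
    "load_to_s3": {"extract_onprem_sql", "extract_onprem_nosql"},
    "transform_in_warehouse": {"load_to_s3"},
}

# A valid execution order: every task runs only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Airflow adds scheduling, retries, logging, and alerting on top of this basic dependency resolution, but the acyclic-graph ordering is the core guarantee.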

## Value
This project serves as a valuable example of building a modern data pipeline using Airflow, showcasing best
practices for data ingestion, processing, and transformation. It provides a solid foundation for building a
robust data platform to support a wide range of analytical needs.

## Project Setup
- Clone the repository.
```bash
git clone https://github.com/jibbs1703/Tickit-Data-Pipeline
cd Tickit-Data-Pipeline
```
```bash
# Build Tickit Test Container
docker build -t test-tickit .
```

```bash
# Run Tickit Test Container
docker run -it --name tickit-test-container -v "$(pwd)":/app test-tickit
```

```bash
# Cleanup Tickit Test Container
docker stop tickit-test-container
docker rm tickit-test-container
```
```bash
# Remove all stopped containers (running containers must be stopped first)
docker rm $(docker ps -a -q)
```