https://github.com/mtholahan/guided-capstone-project
An end-to-end pipeline for high-frequency equity market data: designed database schemas, ingested daily trade and quote records from CSV/JSON into Spark, implemented EOD batch loads with deduplication, and engineered ETL jobs to calculate trade indicators, moving averages, and bid/ask movements for market analysis.
azure big-data bootcamp csv data-engineering data-pipeline etl finance json parquet pyspark spark springboard stock-market
- Host: GitHub
- URL: https://github.com/mtholahan/guided-capstone-project
- Owner: mtholahan
- License: other
- Created: 2025-07-23T16:14:56.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-09-15T04:58:06.000Z (7 months ago)
- Last Synced: 2025-09-15T05:38:12.794Z (7 months ago)
- Topics: azure, big-data, bootcamp, csv, data-engineering, data-pipeline, etl, finance, json, parquet, pyspark, spark, springboard, stock-market
- Homepage:
- Size: 440 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Guided Capstone Project
## 📖 Abstract
This guided capstone focuses on designing and implementing an end-to-end data pipeline to process and analyze high-frequency equity market data. The simulated client, Spring Capital, is an investment bank that depends on real-time analytics for trade and quote data across multiple exchanges. The engineering goal is to create a scalable platform that ingests raw trade and quote records, applies daily ETL processes, and generates key financial indicators to support decision-making.

The pipeline design spans multiple stages:

* **Schema design:** normalized trade and quote tables with composite keys for efficient querying.
* **Data ingestion:** parsing semi-structured daily exchange files (CSV and JSON) to extract valid records and discard malformed ones.
* **Batch load:** an end-of-day process that consolidates daily submissions, resolves late-arriving corrections, and ensures only the most current records persist.
* **Analytical ETL:** deriving business-critical metrics, including latest trade price, rolling 30-minute average, and bid/ask price movements relative to the prior day's close.
* **Orchestration:** scheduling jobs with retry logic and status tracking to guarantee operational reliability.

By the end of the project, the platform demonstrates scalable, fault-tolerant data engineering practices, combining database design, PySpark data ingestion, and workflow orchestration. This project bridges foundational design skills with applied big data engineering in a realistic financial services context.
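The batch-load and analytical-ETL stages above can be sketched in plain Python. This is a minimal illustration of the logic only: the repository itself implements these steps in PySpark (where the deduplication is typically a window partitioned by the composite key and ordered by arrival time), and the function names (`apply_latest`, `rolling_avg`) and record fields shown here are assumed for illustration, not taken from the repo.

```python
from datetime import datetime, timedelta

def apply_latest(records):
    """EOD deduplication: for each composite key
    (trade_dt, symbol, exchange, event_seq_nb), keep only the record
    with the latest arrival time, so a late-arriving correction
    supersedes the original submission."""
    latest = {}
    for rec in records:
        key = (rec["trade_dt"], rec["symbol"], rec["exchange"], rec["event_seq_nb"])
        if key not in latest or rec["arrival_tm"] > latest[key]["arrival_tm"]:
            latest[key] = rec
    return list(latest.values())

def rolling_avg(trades, window=timedelta(minutes=30)):
    """Trailing moving average: for each trade (timestamp, price) in a
    time-sorted list, average all trades within the past `window`."""
    out = []
    for i, (ts, _) in enumerate(trades):
        in_window = [p for t, p in trades[: i + 1] if ts - t <= window]
        out.append((ts, sum(in_window) / len(in_window)))
    return out

# Original submission plus a late-arriving correction for the same event.
records = [
    {"trade_dt": "2020-08-05", "symbol": "SYMA", "exchange": "NYSE",
     "event_seq_nb": 1, "arrival_tm": datetime(2020, 8, 5, 9, 30), "trade_pr": 75.0},
    {"trade_dt": "2020-08-05", "symbol": "SYMA", "exchange": "NYSE",
     "event_seq_nb": 1, "arrival_tm": datetime(2020, 8, 5, 16, 5), "trade_pr": 75.5},
]

trades = [
    (datetime(2020, 8, 5, 9, 30), 100.0),
    (datetime(2020, 8, 5, 9, 45), 102.0),
    (datetime(2020, 8, 5, 10, 10), 104.0),
]

deduped = apply_latest(records)   # only the corrected record survives
averages = rolling_avg(trades)    # trailing 30-minute average per trade
```

In the PySpark version the same "keep latest" rule is usually expressed with a window function (`row_number()` over the composite key ordered by arrival time descending, filtered to row 1), which scales the identical logic to the full daily volume.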
*Generated automatically via Python + Jinja2 + SQL Server table `tblMiniProjectProgress` on 09-15-2025 18:04:08*