https://github.com/mtholahan/guided-capstone-project
An end-to-end pipeline for high-frequency equity market data: designed database schemas, ingested daily trade and quote records from CSV/JSON into Spark, implemented EOD batch loads with deduplication, and engineered ETL jobs to calculate trade indicators, moving averages, and bid/ask movements for market analysis.
azure big-data bootcamp csv data-engineering data-pipeline etl finance json parquet pyspark spark springboard stock-market
- Host: GitHub
- URL: https://github.com/mtholahan/guided-capstone-project
- Owner: mtholahan
- License: other
- Created: 2025-07-23T16:14:56.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-09-15T04:58:06.000Z (7 months ago)
- Last Synced: 2025-09-15T05:38:12.794Z (7 months ago)
- Topics: azure, big-data, bootcamp, csv, data-engineering, data-pipeline, etl, finance, json, parquet, pyspark, spark, springboard, stock-market
- Homepage:
- Size: 440 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Guided Capstone Project
## 📖 Abstract
This guided capstone focuses on designing and implementing an end-to-end data pipeline to process and analyze high-frequency equity market data. The simulated client, Spring Capital, is an investment bank that depends on real-time analytics for trade and quote data across multiple exchanges. The engineering goal is to create a scalable platform that ingests raw trade and quote records, applies daily ETL processes, and generates key financial indicators to support decision-making.

The pipeline design spans multiple stages:

* **Schema design:** normalized trade and quote tables with composite keys for efficient querying.
* **Data ingestion:** parsing semi-structured daily exchange files (CSV and JSON) to extract valid records and discard malformed ones.
* **Batch load:** an end-of-day process that consolidates daily submissions, resolves late-arriving corrections, and ensures only the most current records persist.
* **Analytical ETL:** deriving business-critical metrics, including latest trade price, rolling 30-minute average, and bid/ask price movements relative to the prior day's close.
* **Orchestration:** scheduling jobs with retry logic and status tracking to guarantee operational reliability.

By the end of the project, the platform demonstrates scalable, fault-tolerant data engineering practices, combining database design, PySpark data ingestion, and workflow orchestration. This project bridges foundational design skills with applied big data engineering in a realistic financial services context.
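The batch-load and analytical-ETL stages above can be sketched in plain Python. This is a minimal illustration of the logic only: the repository itself implements these steps in PySpark (where the deduplication is typically a window partitioned by the composite key and ordered by arrival time), and the function names (`apply_latest`, `rolling_avg`) and record fields shown here are assumed for illustration, not taken from the repo.

```python
from datetime import datetime, timedelta

def apply_latest(records):
    """EOD deduplication: for each composite key
    (trade_dt, symbol, exchange, event_seq_nb), keep only the record
    with the latest arrival time, so a late-arriving correction
    supersedes the original submission."""
    latest = {}
    for rec in records:
        key = (rec["trade_dt"], rec["symbol"], rec["exchange"], rec["event_seq_nb"])
        if key not in latest or rec["arrival_tm"] > latest[key]["arrival_tm"]:
            latest[key] = rec
    return list(latest.values())

def rolling_avg(trades, window=timedelta(minutes=30)):
    """Trailing moving average: for each trade (timestamp, price) in a
    time-sorted list, average all trades within the past `window`."""
    out = []
    for i, (ts, _) in enumerate(trades):
        in_window = [p for t, p in trades[: i + 1] if ts - t <= window]
        out.append((ts, sum(in_window) / len(in_window)))
    return out

# Original submission plus a late-arriving correction for the same event.
records = [
    {"trade_dt": "2020-08-05", "symbol": "SYMA", "exchange": "NYSE",
     "event_seq_nb": 1, "arrival_tm": datetime(2020, 8, 5, 9, 30), "trade_pr": 75.0},
    {"trade_dt": "2020-08-05", "symbol": "SYMA", "exchange": "NYSE",
     "event_seq_nb": 1, "arrival_tm": datetime(2020, 8, 5, 16, 5), "trade_pr": 75.5},
]

trades = [
    (datetime(2020, 8, 5, 9, 30), 100.0),
    (datetime(2020, 8, 5, 9, 45), 102.0),
    (datetime(2020, 8, 5, 10, 10), 104.0),
]

deduped = apply_latest(records)   # only the corrected record survives
averages = rolling_avg(trades)    # trailing 30-minute average per trade
```

In the PySpark version the same "keep latest" rule is usually expressed with a window function (`row_number()` over the composite key ordered by arrival time descending, filtered to row 1), which scales the identical logic to the full daily volume.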
*Generated automatically via Python + Jinja2 + SQL Server table `tblMiniProjectProgress` on 09-15-2025 18:04:08*