{"id":50526281,"url":"https://github.com/moto123a/enterprise-rail-freight-data-platform","last_synced_at":"2026-06-03T08:04:19.527Z","repository":{"id":339461145,"uuid":"1162021369","full_name":"moto123a/enterprise-rail-freight-data-platform","owner":"moto123a","description":"Enterprise-style real-time rail freight data platform using Kafka, Spark Structured Streaming, Airflow Bronze/Silver/Gold, Trino SQL KPIs, and Redshift star schema marts.","archived":false,"fork":false,"pushed_at":"2026-02-19T19:58:45.000Z","size":47,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-19T22:25:05.180Z","etag":null,"topics":["airflow","data-engineering","delta-lake","etl","iceberg","kafka","lakehouse","python","redshift","spark","sql","star-schema","streaming","structured-streaming","trino"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/moto123a.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":"governance/certified_datasets.md","roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-19T19:32:57.000Z","updated_at":"2026-02-19T19:58:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/moto123a/enterprise-rail-freight-data-platform","commit_stats":null,"previous_names":["moto123a/enterprise-rail-freight-data-platform"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/moto123a/enterprise-rail-freight
-data-platform","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moto123a%2Fenterprise-rail-freight-data-platform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moto123a%2Fenterprise-rail-freight-data-platform/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moto123a%2Fenterprise-rail-freight-data-platform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moto123a%2Fenterprise-rail-freight-data-platform/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/moto123a","download_url":"https://codeload.github.com/moto123a/enterprise-rail-freight-data-platform/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/moto123a%2Fenterprise-rail-freight-data-platform/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33854130,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","data-engineering","delta-lake","etl","iceberg","kafka","lakehouse","python","redshift","spark","sql","star-schema","streaming","structured-streaming","trino"],"created_at":"2026-06-03T08:04:18.124Z","updated_at":"2026-06-03T08:04:19.521Z","avatar_url":"https://github.com/moto123a
.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🚆 Real-Time Rail Logistics Shipment Tracking Pipeline (Enterprise Data Platform)\n\n## Overview\nThis repository demonstrates an enterprise-style **real-time data engineering platform** for rail freight shipment tracking. It ingests streaming shipment + telemetry events, applies **streaming ETL**, publishes **certified datasets (Gold)**, and serves analytics through **Trino SQL** and a **Redshift star schema**.\n\nThis project is intentionally structured to mirror real-world data platform patterns:\n- **Bronze / Silver / Gold** lakehouse layers  \n- **Certified datasets** + dataset contracts  \n- **Streaming + batch compatibility**  \n- **Warehouse marts (star schema)** for BI  \n\n---\n\n## Architecture\n**Kafka (freight_events, telemetry_events)**  \n→ **Spark Structured Streaming** (parse, standardize, validate)  \n→ **Lakehouse layers (Bronze/Silver/Gold)**  \n→ **Trino** (KPI analytics queries)  \n→ **Redshift** (dimensional marts / star schema)  \n→ **Airflow** (Bronze→Silver→Gold→Warehouse orchestration)\n\n---\n\n## Tech Stack\n- **Streaming:** Kafka, Spark Structured Streaming (Spark)\n- **Orchestration:** Apache Airflow (Bronze/Silver/Gold)\n- **Lakehouse Concepts:** Delta Lake, Apache Iceberg (architecture/governance)\n- **Query Engine:** Trino (KPI queries)\n- **Warehouse:** AWS Redshift (star schema + marts)\n- **Languages:** Python, SQL\n- **Geospatial Concepts:** Haversine distance + route classification (basic)\n\n---\n\n## Repo Structure\n\n- `kafka-ingestion/`: Producers for freight + telemetry streams\n- `spark-stream-processing/`: Spark streaming consumer / transformations\n- `airflow-orchestration/`: Airflow DAG for Bronze→Silver→Gold→Warehouse\n- `warehouse/`: Star schema (dims/facts) + BI mart view\n- `trino-query/`: KPI queries (SQL) for analytics\n- `docs/`: Architecture + dataset contracts documentation\n- `governance/`: Certified datasets definitions\n- `geospatial/`: Route logic + anomaly helper functions\n\n---\n\n## What This Platform Produces (Gold / Certified)\n- **gold_shipment_lifecycle:** shipment events standardized for reporting\n- **gold_hub_dwell_time:** dwell-time KPI outputs by hub/terminal (concept)\n- **gold_on_time_performance:** on-time / delivered-ratio KPIs by corridor (concept)\n\n---\n\n## Data Governance\n- **Dataset contracts:** required fields + validation rules  \n  See: `docs/dataset_contracts.md`\n- **Certified datasets definitions:** Gold layer expectations  \n  See: `governance/certified_datasets.md`\n\n---\n\n## KPI Queries (Trino)\nSample queries included:\n- Shipment volume by corridor  \n- Delivered ratio / on-time proxy by corridor  \n- Delay hotspots by hub  \n- Telemetry anomaly scan (speed/status/temp rules)  \n\nSee: `trino-query/analytics_queries.sql`\n\n---\n\n## Redshift Dimensional Model (Star Schema)\n- `dim_date`, `dim_location`, `dim_status`\n- `fact_shipment_events`\n- `mart_on_time_summary` view\n\nSee: `warehouse/star_schema.sql`\n\n---\n\n## Why This Looks “Enterprise” (Not a Toy)\n- Clean separation of ingestion, processing, orchestration, governance, and marts  \n- Bronze/Silver/Gold layering + certified datasets  \n- Streaming + batch compatible modeling  \n- Query + warehouse patterns used in production data platforms  \n\n---\n\n## Author\n**Pavan Krishna**  \nSoftware Engineer | Data \u0026 Streaming Systems  \npavankrishna310@gmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoto123a%2Fenterprise-rail-freight-data-platform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmoto123a%2Fenterprise-rail-freight-data-platform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmoto123a%2Fenterprise-rail-freight-data-platform/lists"}