https://github.com/1adityakadam/uber_data_analytics

End-to-end Google BigQuery + Looker Studio data analytics project transforming NYC taxi data into actionable intelligence

bigquery looker-studio mage-ai-pipeline numpy pandas sql

Host: GitHub
URL: https://github.com/1adityakadam/uber_data_analytics
Owner: 1adityakadam
Created: 2025-02-25T06:39:58.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-25T23:46:31.000Z (over 1 year ago)
Last Synced: 2025-03-26T00:35:30.502Z (over 1 year ago)
Topics: bigquery, looker-studio, mage-ai-pipeline, numpy, pandas, sql
Language: Jupyter Notebook
Homepage:
Size: 56.4 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

![image](https://github.com/user-attachments/assets/e2beafed-5bd9-42a4-be79-7a65230d4a8e)

# 🚖 NYC Yellow Taxi Urban Mobility Intelligence Platform

> *A business-driven, end-to-end cloud analytics pipeline transforming 5.5M raw trip records into actionable revenue, operational, and strategic insights - built on Google Cloud Platform.*

---

## 🧩 Business Problem

New York City's taxi and ride-share ecosystem processes **millions of trips per month**, yet most operators, city planners, and mobility platforms still make pricing, staffing, and routing decisions based on intuition rather than data.

**The core problem:** Raw TLC trip data exists - but it lives in raw Parquet files, disconnected from business logic, unqueried, and unvisualized. No stakeholder can answer in real time:

- Where are we losing revenue to inefficient routing?
- Which zones are chronically underserved during peak hours?
- What share of surcharge and fee revenue are we actually capturing?
- How does weekday vs. weekend demand shift our fleet allocation strategy?

Without answers to these questions, **revenue is left on the table, costs go uncontrolled, and operational decisions are made blind.**

---

## 💡 Why This Matters

| Stakeholder | What They Need | What This Solves |
|---|---|---|
| Fleet Operations | Peak demand forecasting | Day-of-week ridership trends |
| Revenue Management | Fare & surcharge optimization | Zone-level revenue breakdown |
| City Planning / TLC | Mobility equity analysis | Pickup/dropoff zone distribution |
| Product & Strategy | Competitive vendor benchmarking | Vendor market share analysis |
| Finance | Cost attribution | Congestion surcharge contribution |

Taxi and ride-share is a **$3B+ annual market in NYC alone**. A 1% improvement in routing efficiency or dynamic pricing decisions at scale can represent **tens of millions in recovered revenue or cost savings.**

---

## 🎯 Objective

Design and deploy a **production-grade, cloud-native analytics pipeline** that:

1. Ingests raw Yellow Taxi trip data (November 2024, ~5.5M records)
2. Transforms it into a clean, query-optimized dimensional data model
3. Surfaces business KPIs through an interactive Looker Studio dashboard
4. Enables stakeholders to answer strategic questions **in seconds, not days**

---

## 📊 Business Impact

> *Modeled on realistic NYC taxi market assumptions. Figures represent estimated opportunity based on observed data patterns.*

| Impact Area | Finding | Estimated Business Value |
|---|---|---|
| **Revenue Visibility** | $203.15M total fares tracked across 5.5M trips | Full revenue attribution previously unavailable |
| **Peak Demand Capture** | Thu/Fri demand spikes identified | Fleet reallocation could recover ~2–5% of missed trips |
| **Airport Revenue Concentration** | LGA + JFK = top pickup zones | Targeted surge pricing opportunity: ~$1.2M–$2M/mo |
| **Vendor Concentration Risk** | CMT holds 84.8% share | Diversification risk flag for procurement strategy |
| **Congestion Surcharge Leakage** | Avg. $1.43/trip collected | At 5.5M trips: ~$7.9M/mo; tracking gaps could indicate $500K+ in missed collection |
| **Operational Efficiency** | Avg trip: 14.1 miles, $61.76 fare | Benchmarks routing efficiency vs. optimal fleet deployment |
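
The surcharge estimate in the table is a simple product of trip count and average surcharge per trip. A minimal sketch of that arithmetic follows, using integer cents to avoid floating-point drift. The inputs are the rounded figures quoted in the table, not values recomputed from the raw data, so the result (about $7.87M per month) is only as precise as they are.

```python
# Rounded inputs taken from the table above (not recomputed from raw data).
TRIPS = 5_500_000            # ~5.5M trips in November 2024
AVG_SURCHARGE_CENTS = 143    # $1.43 average congestion surcharge per trip


def monthly_surcharge_cents(trips: int, avg_cents: int) -> int:
    """Total surcharge collected in one month, in cents."""
    return trips * avg_cents


total_cents = monthly_surcharge_cents(TRIPS, AVG_SURCHARGE_CENTS)
total_dollars = total_cents // 100
print(f"${total_dollars:,} per month")
```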

---

## 🔍 Methodology & Decision Process

### Why a Dimensional Model (Star Schema)?

I chose a **star schema** over a flat denormalized table for three reasons:
- **Query performance at scale:** BigQuery's columnar engine optimizes joins between a central fact table and narrow dimension tables.
- **Business flexibility:** Adding new dimensions (e.g., weather, events) doesn't require restructuring existing tables.
- **BI tool compatibility:** Looker Studio and most BI tools natively understand fact/dimension relationships.

**Alternative considered:** A single flat table would have been simpler to build but would degrade query performance at 50M+ row scale and reduce flexibility for ad hoc analysis.
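
To make the fact/dimension split concrete, here is a minimal sketch of a two-table star join. It uses Python's built-in `sqlite3` as a local stand-in for BigQuery, and the column names and sample rows are illustrative assumptions, not the repo's actual DDL.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery here; the columns and rows
# are hypothetical and exist only to show the join pattern.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_vendor (
        vendor_id   INTEGER PRIMARY KEY,
        vendor_name TEXT NOT NULL
    );
    CREATE TABLE fact_trips (
        trip_id       INTEGER PRIMARY KEY,
        vendor_id     INTEGER NOT NULL REFERENCES dim_vendor (vendor_id),
        trip_distance REAL,
        total_amount  REAL
    );
    INSERT INTO dim_vendor VALUES (1, 'Vendor A'), (2, 'Vendor B');
    INSERT INTO fact_trips VALUES
        (1, 1, 2.5, 18.0),
        (2, 1, 9.0, 42.0),
        (3, 2, 1.0, 10.0);
    """
)

# The fact table carries keys and measures; descriptive attributes are
# looked up through the dimension at query time.
revenue_by_vendor = conn.execute(
    """
    SELECT v.vendor_name, COUNT(*) AS trips, SUM(f.total_amount) AS revenue
    FROM fact_trips AS f
    JOIN dim_vendor AS v USING (vendor_id)
    GROUP BY v.vendor_name
    ORDER BY v.vendor_name
    """
).fetchall()
print(revenue_by_vendor)
```

Adding another dimension in this layout means one new lookup table and one new key column on the fact table; existing queries keep working unchanged.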

### Metric Definition Choices

- **"Total Revenue"** was defined as `total_amount` (excludes cash tips per TLC data dictionary). This was a deliberate choice - cash tip inclusion would require driver-reported fields with known accuracy issues.
- **"Travellers"** was mapped to `passenger_count`, not trip count, to better represent actual ridership volume.
- **Congestion Surcharge** was tracked separately from fare revenue to allow policy impact analysis. The TLC `congestion_surcharge` field records the New York State congestion surcharge, so changes to that surcharge directly affect this metric.
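
The metric definitions above can be sketched in a few lines of plain Python. The field names follow the TLC data dictionary; the sample rows and the `compute_kpis` helper are hypothetical and only illustrate how the definitions differ from a naive trip count.

```python
# Toy rows using TLC field names; the values are made up for illustration.
trips = [
    {"passenger_count": 1, "total_amount": 20.0, "congestion_surcharge": 2.5},
    {"passenger_count": 3, "total_amount": 45.5, "congestion_surcharge": 2.5},
    {"passenger_count": 2, "total_amount": 12.0, "congestion_surcharge": 0.0},
]


def compute_kpis(rows):
    """Apply the metric definitions above to a list of trip records."""
    return {
        # Total Revenue: sum of total_amount (cash tips are not in this field).
        "total_revenue": sum(r["total_amount"] for r in rows),
        # Travellers: sum of passenger_count, not the number of trips.
        "travellers": sum(r["passenger_count"] for r in rows),
        "trip_count": len(rows),
        # Congestion surcharge is kept as its own measure.
        "congestion_surcharge": sum(r["congestion_surcharge"] for r in rows),
    }


kpis = compute_kpis(trips)
print(kpis)
```

With these three rows, travellers (6) is twice the trip count (3), which is why the choice between the two matters for ridership reporting.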

### Assumptions Made

- November 2024 data treated as representative of Q4 seasonal patterns
- Vendor classification based on TLC's VendorID field (self-reported by TPEP providers)
- Zone analysis uses TLC Taxi Zone boundaries (not raw GPS coordinates)

---

## 🏗️ Tech Stack

**Cloud Infrastructure**
- Google Cloud Storage - raw data lake (parquet ingestion)
- Google Cloud VM - compute for ETL orchestration
- Google BigQuery - scalable analytical query engine
- Google Looker Studio - business intelligence & dashboard layer

**ETL & Orchestration**
- Mage AI - pipeline orchestration (extraction → transformation → load)

**Data Modeling**
- Star schema design (1 fact table + 4 dimension tables)
- ERD designed in Lucidchart

**Analytics**
- BigQuery SQL - aggregations, window functions, zone-level joins
- KPI development: revenue per trip, zone concentration index, vendor share

**AI Tools**
- ChatGPT / Claude - see AI Usage section below

---

## 🤖 How AI Was Used

Transparency matters. Here's exactly where AI accelerated this project - and where human judgment drove the decisions:

| Task | AI Role | My Role |
|---|---|---|
| SQL query optimization | Suggested indexing patterns and JOIN ordering for BigQuery | Validated query plans, adjusted for actual schema |
| Data dictionary interpretation | Summarized TLC field descriptions quickly | Decided which fields to include in the model and why |
| ERD layout suggestions | Proposed initial table structure | Redesigned based on BI tool requirements and query patterns |
| README drafting | Generated initial structure | Rewrote framing, business context, and all quantified impact |
| Debugging Mage AI pipeline | Suggested error resolution patterns | Diagnosed root cause, implemented and tested fixes |

**Bottom line:** AI compressed the research-and-drafting phases by ~40%. Every architectural decision, business framing, and analytical conclusion was human-led and validated.

---

## 🔄 How This Framework Applies Elsewhere

This pipeline is not NYC-taxi-specific. The same architecture applies directly to:

| Industry | Data Source | Business Question |
|---|---|---|
| Ride-share (Lyft/Uber) | Trip logs | Dynamic pricing optimization by zone/hour |
| Logistics (FedEx/UPS) | Delivery records | Route efficiency & SLA risk by region |
| Retail | POS transaction data | Store traffic patterns & revenue per zone |
| Healthcare | Patient encounter records | Facility utilization & care gap analysis |
| Airlines | Flight + passenger data | Revenue per seat-mile, load factor optimization |

The **star schema + cloud pipeline + BI dashboard** pattern is a transferable playbook for any organization sitting on raw transactional data without analytical infrastructure.

---

## 📋 Step-by-Step Reproduction Guide

**1. Data Acquisition**
- Download `yellow_tripdata_2024-11.parquet` from the [NYC TLC trip record data portal](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
- Upload to a Google Cloud Storage bucket

**2. Environment Setup**
- Provision a GCP VM (e2-standard-2 or higher recommended)
- Install Mage AI: `pip install mage-ai`
- Configure GCP service account with BigQuery + GCS permissions

**3. ETL Pipeline (Mage AI)**
- Create a new Mage pipeline with three blocks:
  - *Data loader:* Read parquet from GCS
  - *Transformer:* Apply cleaning logic (null removal, type casting, date parsing)
  - *Data exporter:* Write to BigQuery staging table
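
The transformer step (null removal, type casting, date parsing) could look roughly like the sketch below. It is written as a plain Python function rather than a Mage block so it runs on its own. The list of required columns and the drop-on-error rule are assumptions, since the repo's actual cleaning rules are not shown here.

```python
from datetime import datetime

# Hypothetical choice of required columns (TLC field names).
REQUIRED = ("tpep_pickup_datetime", "tpep_dropoff_datetime",
            "passenger_count", "trip_distance", "total_amount")


def clean_trip(raw):
    """Return a cleaned record, or None if the row should be dropped."""
    # Null removal: drop rows missing any required field.
    if any(raw.get(col) in (None, "") for col in REQUIRED):
        return None
    try:
        return {
            # Date parsing: ISO-8601 strings become datetime objects.
            "tpep_pickup_datetime": datetime.fromisoformat(raw["tpep_pickup_datetime"]),
            "tpep_dropoff_datetime": datetime.fromisoformat(raw["tpep_dropoff_datetime"]),
            # Type casting: counts to int, measures to float.
            "passenger_count": int(float(raw["passenger_count"])),
            "trip_distance": float(raw["trip_distance"]),
            "total_amount": float(raw["total_amount"]),
        }
    except (TypeError, ValueError):
        return None  # unparseable values are treated like nulls


raw_rows = [
    {"tpep_pickup_datetime": "2024-11-01 08:15:00",
     "tpep_dropoff_datetime": "2024-11-01 08:40:00",
     "passenger_count": "2.0", "trip_distance": "3.4", "total_amount": "21.5"},
    {"tpep_pickup_datetime": "2024-11-01 09:00:00",
     "tpep_dropoff_datetime": None,
     "passenger_count": "1", "trip_distance": "1.2", "total_amount": "9.8"},
]
cleaned = [row for row in map(clean_trip, raw_rows) if row is not None]
print(len(cleaned), "of", len(raw_rows), "rows kept")
```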

**4. Data Modeling (BigQuery)**
- Run dimensional model SQL to split staging table into:
  - `fact_trips` - core trip metrics
  - `dim_vendor` - vendor lookup
  - `dim_datetime` - date/time attributes
  - `dim_zone` - pickup/dropoff zone mapping (join with TLC zone lookup CSV)
  - `dim_ratecode` - rate type lookup
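
As one illustration of the split, the sketch below derives a `dim_datetime` table from staging rows and replaces the raw timestamp in `fact_trips` with a surrogate key. The attribute names (`datetime_id`, `hour`, `weekday_name`, and so on) and the sequential key scheme are assumptions for illustration; the repo's SQL may model this differently.

```python
from datetime import datetime

# Locale-independent weekday names, indexed by datetime.weekday() (Monday=0).
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def build_dim_datetime(pickup_times):
    """Build dim_datetime rows plus a timestamp -> surrogate key lookup."""
    dim_rows, key_by_ts = [], {}
    for ts in sorted(set(pickup_times)):
        key = len(dim_rows) + 1          # simple sequential surrogate key
        key_by_ts[ts] = key
        dim_rows.append({
            "datetime_id": key,
            "pickup_datetime": ts,
            "hour": ts.hour,
            "day": ts.day,
            "month": ts.month,
            "weekday_name": WEEKDAYS[ts.weekday()],
        })
    return dim_rows, key_by_ts


staging = [
    {"tpep_pickup_datetime": datetime(2024, 11, 1, 8, 15), "total_amount": 21.5},
    {"tpep_pickup_datetime": datetime(2024, 11, 2, 23, 5), "total_amount": 40.0},
    {"tpep_pickup_datetime": datetime(2024, 11, 1, 8, 15), "total_amount": 12.0},
]
dim_datetime, key_by_ts = build_dim_datetime(
    row["tpep_pickup_datetime"] for row in staging
)
# fact_trips keeps the measure and swaps the timestamp for the dimension key.
fact_trips = [
    {"datetime_id": key_by_ts[row["tpep_pickup_datetime"]],
     "total_amount": row["total_amount"]}
    for row in staging
]
print(len(dim_datetime), "dimension rows for", len(fact_trips), "fact rows")
```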

**5. Analytics Layer**
- Run aggregation queries for: total revenue, avg fare, trip volume by day, zone rankings, vendor share
- Full SQL scripts available in `/sql/` directory of this repo
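
The shape of two of these aggregations (trip volume by day of week and vendor share) is sketched below. This is not the SQL from the repo: it runs against an in-memory SQLite table with made-up rows and hypothetical column names, and SQLite's `strftime('%w', ...)` (0 = Sunday) is used where a BigQuery query would use something like `EXTRACT(DAYOFWEEK FROM ...)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE trips (vendor_id INTEGER, pickup_datetime TEXT, total_amount REAL);
    INSERT INTO trips VALUES
        (1, '2024-11-07 08:00:00', 20.0),
        (1, '2024-11-07 18:30:00', 30.0),
        (2, '2024-11-08 09:15:00', 15.0),
        (1, '2024-11-09 12:00:00', 35.0);
    """
)

# Trip volume and revenue by day of week ('0' = Sunday ... '6' = Saturday).
by_day = conn.execute(
    """
    SELECT strftime('%w', pickup_datetime) AS dow,
           COUNT(*) AS trips,
           SUM(total_amount) AS revenue
    FROM trips
    GROUP BY dow
    ORDER BY dow
    """
).fetchall()

# Vendor share of trips as a percentage of all trips.
vendor_share = conn.execute(
    """
    SELECT vendor_id,
           ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM trips), 1) AS share_pct
    FROM trips
    GROUP BY vendor_id
    ORDER BY vendor_id
    """
).fetchall()

print(by_day)
print(vendor_share)
```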

**6. Dashboard (Looker Studio)**
- Connect Looker Studio to BigQuery dataset
- Build KPI scorecards, trend lines, treemaps, and zone tables
- [View live dashboard →](https://lookerstudio.google.com/reporting/432a79b8-781d-4081-94b3-33cfdb9444cb)

---

## 📈 Key Results

- **5.5M** total travellers | **$203.15M** total revenue | **$61.76** avg fare
- **14.1 miles** avg trip distance | **$1.43** avg congestion surcharge
- Peak demand: **Thursdays & Fridays**
- Top pickups: **LaGuardia Airport, JFK Airport, Times Square**
- Vendor split: **CMT 84.8% / VeriFone 15.2%**
- **65.2%** of pickups in Yellow Zone

---

## 💭 Key Learnings

**What I'd improve:**
- Add real-time streaming ingestion via Pub/Sub to move from batch → near-real-time analytics
- Incorporate weather and event data as external dimensions to explain demand anomalies
- Build a revenue forecasting model on top of the cleaned dimensional data

**What surprised me:**
- The degree of airport concentration in pickup patterns - LGA/JFK together represent a disproportionate share of total revenue, suggesting pricing strategy should be heavily airport-weighted
- CMT's 84.8% vendor dominance; this level of concentration likely reflects hardware/contract lock-in, not organic market share

**Technical challenge:**
- Mage AI's GCS connector required manual service account scope configuration not documented in standard guides - resolved by examining IAM audit logs

---

## 🧠 Context

This is a **portfolio project** built to demonstrate end-to-end data engineering and business analytics capability using publicly available TLC data. All business impact figures are modeled estimates based on observed data patterns and reasonable market assumptions - not actual company financials.

- **Role simulated:** Senior Data Analyst / Analytics Engineer
- **Stakeholders simulated:** Fleet Operations, Revenue Management, City Planning
- **Constraints simulated:** Cloud cost optimization, query performance at scale, self-serve BI requirements

---

## 📬 Let's Connect

If you're working on urban mobility analytics, cloud data pipelines, or business intelligence strategy - or if you're a recruiter evaluating this work - I'd love to connect.

- 💼 [LinkedIn](https://www.linkedin.com/in/1adityakadam)
- 📁 [More Projects](https://www.github.com/1adityakadam)
- 📧 [Email](mailto:askadam@iu.edu)

*Feedback, collaboration ideas, and critical questions all welcome.*

---

*Built with Google BigQuery · Looker Studio · Mage AI · GCP · Python*
