https://github.com/1adityakadam/uber_data_analytics
End-to-end Google BigQuery + Looker Studio data analytics project transforming NYC taxi data into actionable intelligence
- Host: GitHub
- URL: https://github.com/1adityakadam/uber_data_analytics
- Owner: 1adityakadam
- Created: 2025-02-25T06:39:58.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-25T23:46:31.000Z (about 1 year ago)
- Last Synced: 2025-03-26T00:35:30.502Z (about 1 year ago)
- Topics: bigquery, looker-studio, mage-ai-pipeline, numpy, pandas, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 56.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md

# NYC Yellow Taxi Urban Mobility Intelligence Platform
> *A business-driven, end-to-end cloud analytics pipeline transforming 5.5M raw trip records into actionable revenue, operational, and strategic insights - built on Google Cloud Platform.*
---
## Business Problem
New York City's taxi and ride-share ecosystem processes **millions of trips per month**, yet most operators, city planners, and mobility platforms still make pricing, staffing, and routing decisions based on intuition rather than data.
**The core problem:** Raw TLC trip data exists - but it lives in unstructured parquet files, disconnected from business logic, unqueried, and unvisualized. No stakeholder can answer in real time:
- Where are we losing revenue to inefficient routing?
- Which zones are chronically underserved during peak hours?
- What share of surcharge and fee revenue are we actually capturing?
- How does weekday vs. weekend demand shift our fleet allocation strategy?
Without answers to these questions, **revenue is left on the table, costs go uncontrolled, and operational decisions are made blind.**
---
## Why This Matters
| Stakeholder | What They Need | What This Solves |
|---|---|---|
| Fleet Operations | Peak demand forecasting | Day-of-week ridership trends |
| Revenue Management | Fare & surcharge optimization | Zone-level revenue breakdown |
| City Planning / TLC | Mobility equity analysis | Pickup/dropoff zone distribution |
| Product & Strategy | Competitive vendor benchmarking | Vendor market share analysis |
| Finance | Cost attribution | Congestion surcharge contribution |
Taxi and ride-share is a **$3B+ annual market in NYC alone**. A 1% improvement in routing efficiency or dynamic pricing decisions at scale can represent **tens of millions in recovered revenue or cost savings.**
---
## Objective
Design and deploy a **production-grade, cloud-native analytics pipeline** that:
1. Ingests raw Yellow Taxi trip data (November 2024, ~5.5M records)
2. Transforms it into a clean, query-optimized dimensional data model
3. Surfaces business KPIs through an interactive Looker Studio dashboard
4. Enables stakeholders to answer strategic questions **in seconds, not days**
---
## Business Impact
> *Modeled on realistic NYC taxi market assumptions. Figures represent estimated opportunity based on observed data patterns.*
| Impact Area | Finding | Estimated Business Value |
|---|---|---|
| **Revenue Visibility** | $203.15M total fares tracked across 5.5M trips | Full revenue attribution previously unavailable |
| **Peak Demand Capture** | Thu/Fri demand spikes identified | Fleet reallocation could recover ~2-5% of missed trips |
| **Airport Revenue Concentration** | LGA + JFK = top pickup zones | Targeted surge pricing opportunity: ~$1.2M-$2M/mo |
| **Vendor Concentration Risk** | CMT holds 84.8% share | Diversification risk flag for procurement strategy |
| **Congestion Surcharge Leakage** | Avg. $1.43/trip collected | At 5.5M trips: ~$7.9M/mo; tracking gaps could indicate $500K+ in missed collection |
| **Operational Efficiency** | Avg trip: 14.1 miles, $61.76 fare | Benchmarks routing efficiency vs. optimal fleet deployment |
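
The congestion surcharge row is plain arithmetic on the table's own figures: the trip count multiplied by the average surcharge per trip. A minimal sketch of that check, with illustrative variable names that do not come from the repo:

```python
# Figures taken from the Business Impact table.
trips = 5_500_000        # trips in the month
avg_surcharge = 1.43     # average congestion surcharge per trip, USD
possible_gap = 500_000   # the "$500K+" missed-collection scenario, USD

# Monthly surcharge collected = trips x average surcharge per trip.
collected = trips * avg_surcharge

# How large the hypothetical gap is relative to what was collected.
gap_share = possible_gap / collected

print(f"Surcharge collected per month: ${collected:,.0f}")
print(f"Hypothetical gap as a share of collections: {gap_share:.1%}")
```
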
---
## Methodology & Decision Process
### Why a Dimensional Model (Star Schema)?
I chose a **star schema** over a flat denormalized table for three reasons:
- **Query performance at scale:** BigQuery's columnar storage scans only the columns a query references, and joins from a central fact table to small, narrow dimension tables are inexpensive.
- **Business flexibility:** Adding new dimensions (e.g., weather, events) doesn't require restructuring existing tables.
- **BI tool compatibility:** Looker Studio and most BI tools natively understand fact/dimension relationships.
**Alternative considered:** A single flat table would have been simpler to build but would degrade query performance at 50M+ row scale and reduce flexibility for ad hoc analysis.
### Metric Definition Choices
- **"Total Revenue"** was defined as `total_amount` (excludes cash tips per TLC data dictionary). This was a deliberate choice - cash tip inclusion would require driver-reported fields with known accuracy issues.
- **"Travellers"** was mapped to `passenger_count`, not trip count, to better represent actual ridership volume.
- **Congestion Surcharge** was tracked separately from fare revenue to allow policy impact analysis (NYS MTA congestion pricing changes directly affect this metric).
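
To make these definitions concrete, here is a minimal sketch of the three metrics computed side by side. The field names `total_amount`, `passenger_count`, and `congestion_surcharge` follow the TLC data dictionary; the `Trip` class, the `summarize` helper, and the sample rows are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trip:
    total_amount: float          # total charged to the passenger (excludes cash tips)
    passenger_count: int         # passengers on the trip
    congestion_surcharge: float  # congestion surcharge for the trip


def summarize(trips):
    """Compute the headline metrics using the definitions above."""
    return {
        "total_revenue": sum(t.total_amount for t in trips),
        "travellers": sum(t.passenger_count for t in trips),  # riders, not trips
        "trips": len(trips),
        # Reported as its own metric so it can be analyzed separately.
        "congestion_surcharge": sum(t.congestion_surcharge for t in trips),
    }


# Made-up sample rows, not real TLC data.
sample = [
    Trip(total_amount=20.0, passenger_count=1, congestion_surcharge=2.5),
    Trip(total_amount=65.5, passenger_count=3, congestion_surcharge=0.0),
    Trip(total_amount=14.5, passenger_count=2, congestion_surcharge=2.5),
]
metrics = summarize(sample)
print(metrics)
```

Note that "travellers" (6 in this sample) and trip count (3) diverge as soon as any trip carries more than one passenger, which is why the two are kept as separate metrics.
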
### Assumptions Made
- November 2024 data treated as representative of Q4 seasonal patterns
- Vendor classification based on TLC's VendorID field (self-reported by TPEP providers)
- Zone analysis uses TLC Taxi Zone boundaries (not raw GPS coordinates)
---
## Tech Stack
**Cloud Infrastructure**
- Google Cloud Storage - raw data lake (parquet ingestion)
- Google Cloud VM - compute for ETL orchestration
- Google BigQuery - scalable analytical query engine
- Google Looker Studio - business intelligence & dashboard layer
**ETL & Orchestration**
- Mage AI - pipeline orchestration (extraction → transformation → load)
**Data Modeling**
- Star schema design (1 fact table + 4 dimension tables)
- ERD designed in Lucidchart
**Analytics**
- BigQuery SQL - aggregations, window functions, zone-level joins
- KPI development: revenue per trip, zone concentration index, vendor share
**AI Tools**
- ChatGPT / Claude - see AI Usage section below
---
## How AI Was Used
Transparency matters. Here's exactly where AI accelerated this project - and where human judgment drove the decisions:
| Task | AI Role | My Role |
|---|---|---|
| SQL query optimization | Suggested indexing patterns and JOIN ordering for BigQuery | Validated query plans, adjusted for actual schema |
| Data dictionary interpretation | Summarized TLC field descriptions quickly | Decided which fields to include in the model and why |
| ERD layout suggestions | Proposed initial table structure | Redesigned based on BI tool requirements and query patterns |
| README drafting | Generated initial structure | Rewrote framing, business context, and all quantified impact |
| Debugging Mage AI pipeline | Suggested error resolution patterns | Diagnosed root cause, implemented and tested fixes |
**Bottom line:** AI compressed the research-and-drafting phases by ~40%. Every architectural decision, business framing, and analytical conclusion was human-led and validated.
---
## How This Framework Applies Elsewhere
This pipeline is not NYC-taxi-specific. The same architecture applies directly to:
| Industry | Data Source | Business Question |
|---|---|---|
| Ride-share (Lyft/Uber) | Trip logs | Dynamic pricing optimization by zone/hour |
| Logistics (FedEx/UPS) | Delivery records | Route efficiency & SLA risk by region |
| Retail | POS transaction data | Store traffic patterns & revenue per zone |
| Healthcare | Patient encounter records | Facility utilization & care gap analysis |
| Airlines | Flight + passenger data | Revenue per seat-mile, load factor optimization |
The **star schema + cloud pipeline + BI dashboard** pattern is a transferable playbook for any organization sitting on raw transactional data without analytical infrastructure.
---
## Step-by-Step Reproduction Guide
**1. Data Acquisition**
- Download `yellow_tripdata_2024-11.parquet` from the [NYC TLC trip record data portal](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
- Upload to a Google Cloud Storage bucket
**2. Environment Setup**
- Provision a GCP VM (e2-standard-2 or higher recommended)
- Install Mage AI: `pip install mage-ai`
- Configure GCP service account with BigQuery + GCS permissions
**3. ETL Pipeline (Mage AI)**
- Create a new Mage pipeline with three blocks:
  - *Data loader:* Read parquet from GCS
  - *Transformer:* Apply cleaning logic (null removal, type casting, date parsing)
  - *Data exporter:* Write to BigQuery staging table
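
The transformer's cleaning logic (null removal, type casting, date parsing) can be sketched roughly as follows. This is a pure-Python stand-in that works on a list of dicts so it runs without any dependencies; in the actual pipeline the same steps would presumably run inside a Mage transformer block on a DataFrame. The column names come from the TLC data dictionary, while `REQUIRED`, `clean_rows`, and the sample rows are hypothetical.

```python
from datetime import datetime

# Columns that must be present for a row to be kept (an assumed choice).
REQUIRED = (
    "tpep_pickup_datetime",
    "tpep_dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "total_amount",
)


def clean_rows(rows):
    """Drop rows with missing required fields, then cast types and parse dates."""
    cleaned = []
    for row in rows:
        # Null removal: skip any row missing a required value.
        if any(row.get(col) is None for col in REQUIRED):
            continue
        cleaned.append({
            # Date parsing: text timestamps become datetime objects.
            "tpep_pickup_datetime": datetime.fromisoformat(row["tpep_pickup_datetime"]),
            "tpep_dropoff_datetime": datetime.fromisoformat(row["tpep_dropoff_datetime"]),
            # Type casting: counts may arrive as "2.0", so go through float first.
            "passenger_count": int(float(row["passenger_count"])),
            "trip_distance": float(row["trip_distance"]),
            "total_amount": float(row["total_amount"]),
        })
    return cleaned


# Made-up raw rows; the second one has a null passenger_count and is dropped.
raw = [
    {"tpep_pickup_datetime": "2024-11-01 08:15:00",
     "tpep_dropoff_datetime": "2024-11-01 08:40:00",
     "passenger_count": "2.0", "trip_distance": "3.4", "total_amount": "21.80"},
    {"tpep_pickup_datetime": "2024-11-01 09:00:00",
     "tpep_dropoff_datetime": "2024-11-01 09:10:00",
     "passenger_count": None, "trip_distance": "1.1", "total_amount": "9.30"},
]
cleaned = clean_rows(raw)
print(len(cleaned), "of", len(raw), "rows kept")
```
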
**4. Data Modeling (BigQuery)**
- Run dimensional model SQL to split staging table into:
  - `fact_trips` - core trip metrics
  - `dim_vendor` - vendor lookup
  - `dim_datetime` - date/time attributes
  - `dim_zone` - pickup/dropoff zone mapping (join with TLC zone lookup CSV)
  - `dim_ratecode` - rate type lookup
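
The split from staging table to fact and dimension tables follows a common pattern, sketched below for a single dimension. This uses an in-memory SQLite database as a stand-in for BigQuery so it runs anywhere; it is not the repo's SQL, and BigQuery syntax differs in places. The `staging_trips` table name, the `vendor_key` surrogate key, and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Staging table with a few made-up rows (column names follow the TLC schema).
CREATE TABLE staging_trips (
    VendorID INTEGER, RatecodeID INTEGER,
    PULocationID INTEGER, DOLocationID INTEGER,
    tpep_pickup_datetime TEXT, passenger_count INTEGER,
    trip_distance REAL, total_amount REAL
);
INSERT INTO staging_trips VALUES
    (1, 1, 138, 230, '2024-11-01 08:15:00', 2,  9.8, 52.10),
    (2, 2, 132, 161, '2024-11-01 09:05:00', 1, 17.2, 84.00),
    (1, 1, 230, 138, '2024-11-02 18:30:00', 3, 10.1, 55.40);

-- Dimension: one row per distinct vendor, with a surrogate key.
CREATE TABLE dim_vendor (
    vendor_key INTEGER PRIMARY KEY,
    VendorID   INTEGER UNIQUE
);
INSERT INTO dim_vendor (VendorID)
SELECT DISTINCT VendorID FROM staging_trips ORDER BY VendorID;

-- Fact: trip measures plus a foreign key into the dimension.
CREATE TABLE fact_trips AS
SELECT v.vendor_key, s.RatecodeID, s.PULocationID, s.DOLocationID,
       s.tpep_pickup_datetime, s.passenger_count,
       s.trip_distance, s.total_amount
FROM staging_trips AS s
JOIN dim_vendor AS v ON v.VendorID = s.VendorID;
""")

# A typical star-schema query: join the fact table back to the dimension.
revenue_by_vendor = conn.execute("""
    SELECT v.VendorID, ROUND(SUM(f.total_amount), 2)
    FROM fact_trips AS f
    JOIN dim_vendor AS v USING (vendor_key)
    GROUP BY v.VendorID
    ORDER BY v.VendorID
""").fetchall()
print(revenue_by_vendor)
```

The other dimensions (`dim_datetime`, `dim_zone`, `dim_ratecode`) would follow the same shape: select the distinct values, assign a key, and reference that key from `fact_trips`.
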
**5. Analytics Layer**
- Run aggregation queries for: total revenue, avg fare, trip volume by day, zone rankings, vendor share
- Full SQL scripts available in `/sql/` directory of this repo
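
As a rough illustration of two of these aggregations (trip volume by day and vendor share), here is a small Python sketch. The repo's analytics layer is written in BigQuery SQL; this only shows the logic on made-up rows, and every name in it is hypothetical.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Made-up (pickup timestamp, VendorID) pairs for illustration only.
trips = [
    ("2024-11-07 08:00:00", 1),
    ("2024-11-07 17:30:00", 1),
    ("2024-11-08 09:15:00", 2),
    ("2024-11-09 22:45:00", 1),
]

# Trip volume by day of week: map each pickup timestamp to a weekday name.
trips_by_day = Counter(
    DAY_NAMES[datetime.fromisoformat(ts).weekday()] for ts, _ in trips
)

# Vendor share: each vendor's trip count divided by total trips.
vendor_counts = Counter(vendor for _, vendor in trips)
vendor_share = {v: n / len(trips) for v, n in vendor_counts.items()}

print(dict(trips_by_day))
print(vendor_share)
```
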
**6. Dashboard (Looker Studio)**
- Connect Looker Studio to BigQuery dataset
- Build KPI scorecards, trend lines, treemaps, and zone tables
- [View live dashboard →](https://lookerstudio.google.com/reporting/432a79b8-781d-4081-94b3-33cfdb9444cb)
---
## Key Results
- **5.5M** total travellers | **$203.15M** total revenue | **$61.76** avg fare
- **14.1 miles** avg trip distance | **$1.43** avg congestion surcharge
- Peak demand: **Thursdays & Fridays**
- Top pickups: **LaGuardia Airport, JFK Airport, Times Square**
- Vendor split: **CMT 84.8% / VeriFone 15.2%**
- **65.2%** of pickups in Yellow Zone
---
## Key Learnings
**What I'd improve:**
- Add real-time streaming ingestion via Pub/Sub to move from batch → near-real-time analytics
- Incorporate weather and event data as external dimensions to explain demand anomalies
- Build a revenue forecasting model on top of the cleaned dimensional data
**What surprised me:**
- The degree of airport concentration in pickup patterns - LGA/JFK together represent a disproportionate share of total revenue, suggesting pricing strategy should be heavily airport-weighted
- CMT's 84.8% vendor dominance; this level of concentration likely reflects hardware/contract lock-in, not organic market share
**Technical challenge:**
- Mage AI's GCS connector required manual service account scope configuration not documented in standard guides - resolved by examining IAM audit logs
---
## Context
This is a **portfolio project** built to demonstrate end-to-end data engineering and business analytics capability using publicly available TLC data. All business impact figures are modeled estimates based on observed data patterns and reasonable market assumptions - not actual company financials.
**Role simulated:** Senior Data Analyst / Analytics Engineer
**Stakeholders simulated:** Fleet Operations, Revenue Management, City Planning
**Constraints simulated:** Cloud cost optimization, query performance at scale, self-serve BI requirements
---
## Let's Connect
If you're working on urban mobility analytics, cloud data pipelines, or business intelligence strategy - or if you're a recruiter evaluating this work - I'd love to connect.
- [LinkedIn](https://www.linkedin.com/in/1adityakadam)
- [More Projects](https://www.github.com/1adityakadam)
- [Email](mailto:askadam@iu.edu)
*Feedback, collaboration ideas, and critical questions all welcome.*
---
*Built with Google BigQuery · Looker Studio · Mage AI · GCP · Python*