{"id":26710519,"url":"https://github.com/1adityakadam/uber_data_analytics","last_synced_at":"2026-04-13T16:31:50.937Z","repository":{"id":279367665,"uuid":"938565102","full_name":"1adityakadam/Uber_data_analytics","owner":"1adityakadam","description":"End to end Google Bigquery + Looker Studio Data Analytics Project Transforming NYC Taxi Data into Actionable Intelligence ","archived":false,"fork":false,"pushed_at":"2025-03-25T23:46:31.000Z","size":59109,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-26T00:35:30.502Z","etag":null,"topics":["bigquery","looker-studio","mage-ai-pipeline","numpy","pandas","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/1adityakadam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-25T06:39:58.000Z","updated_at":"2025-03-25T23:46:34.000Z","dependencies_parsed_at":"2025-03-12T16:37:01.191Z","dependency_job_id":null,"html_url":"https://github.com/1adityakadam/Uber_data_analytics","commit_stats":null,"previous_names":["1adityakadam/uber_data_analytics"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1adityakadam%2FUber_data_analytics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1adityakadam%2FUber_data_analytics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1adityakadam%2FUber_data_analytics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/1adityakadam%2FUber_data_analytics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/1adityakadam","download_url":"https://codeload.github.com/1adityakadam/Uber_data_analytics/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245814746,"owners_count":20676808,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","looker-studio","mage-ai-pipeline","numpy","pandas","sql"],"created_at":"2025-03-27T09:19:45.630Z","updated_at":"2026-04-13T16:31:50.927Z","avatar_url":"https://github.com/1adityakadam.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\u003ca href=\"https://lookerstudio.google.com/reporting/432a79b8-781d-4081-94b3-33cfdb9444cb\"\u003e\n\n![image](https://github.com/user-attachments/assets/e2beafed-5bd9-42a4-be79-7a65230d4a8e)\n\u003c/a\u003e\n\u003cbr\u003e\n\n# 🚖 NYC Yellow Taxi Urban Mobility Intelligence Platform\n\n\u003e *A business-driven, end-to-end cloud analytics pipeline transforming 5.5M raw trip records into actionable revenue, operational, and strategic insights - built on Google Cloud Platform.*\n\n---\n\n## 🧩 Business Problem\n\nNew York City's taxi and ride-share ecosystem processes **millions of trips per month**, yet most operators, city planners, and mobility platforms still make pricing, staffing, and routing decisions based on intuition rather than data.\n\n**The core problem:** Raw TLC trip data exists - but it lives in unstructured parquet files, disconnected from business logic, unqueried, and unvisualized. No stakeholder can answer in real time:\n\n- Where are we losing revenue to inefficient routing?\n- Which zones are chronically underserved during peak hours?\n- What share of surcharge and fee revenue are we actually capturing?\n- How does weekday vs. weekend demand shift our fleet allocation strategy?\n\nWithout answers to these questions, **revenue is left on the table, costs go uncontrolled, and operational decisions are made blind.**\n\n---\n\n## 💡 Why This Matters\n\n| Stakeholder | What They Need | What This Solves |\n|---|---|---|\n| Fleet Operations | Peak demand forecasting | Day-of-week ridership trends |\n| Revenue Management | Fare \u0026 surcharge optimization | Zone-level revenue breakdown |\n| City Planning / TLC | Mobility equity analysis | Pickup/dropoff zone distribution |\n| Product \u0026 Strategy | Competitive vendor benchmarking | Vendor market share analysis |\n| Finance | Cost attribution | Congestion surcharge contribution |\n\nTaxi and ride-share is a **$3B+ annual market in NYC alone**. A 1% improvement in routing efficiency or dynamic pricing decisions at scale can represent **tens of millions in recovered revenue or cost savings.**\n\n---\n\n## 🎯 Objective\n\nDesign and deploy a **production-grade, cloud-native analytics pipeline** that:\n\n1. Ingests raw Yellow Taxi trip data (November 2024, ~5.5M records)\n2. Transforms it into a clean, query-optimized dimensional data model\n3. Surfaces business KPIs through an interactive Looker Studio dashboard\n4. Enables stakeholders to answer strategic questions **in seconds, not days**\n\n---\n\n## 📊 Business Impact\n\n\u003e *Modeled on realistic NYC taxi market assumptions. Figures represent estimated opportunity based on observed data patterns.*\n\n| Impact Area | Finding | Estimated Business Value |\n|---|---|---|\n| **Revenue Visibility** | $203.15M total fares tracked across 5.5M trips | Full revenue attribution previously unavailable |\n| **Peak Demand Capture** | Thu/Fri demand spikes identified | Fleet reallocation could recover ~2–5% of missed trips |\n| **Airport Revenue Concentration** | LGA + JFK = top pickup zones | Targeted surge pricing opportunity: ~$1.2M–$2M/mo |\n| **Vendor Concentration Risk** | CMT holds 84.8% share | Diversification risk flag for procurement strategy |\n| **Congestion Surcharge Leakage** | Avg. $1.43/trip collected | At 5.5M trips: $7.8M/mo; tracking gaps could indicate $500K+ in missed collection |\n| **Operational Efficiency** | Avg trip: 14.1 miles, $61.76 fare | Benchmarks routing efficiency vs. optimal fleet deployment |\n\n---\n\n## 🔍 Methodology \u0026 Decision Process\n\n### Why a Dimensional Model (Star Schema)?\n\nI chose a **star schema** over a flat denormalized table for three reasons:\n- **Query performance at scale:** BigQuery's columnar engine optimizes joins between a central fact table and narrow dimension tables.\n- **Business flexibility:** Adding new dimensions (e.g., weather, events) doesn't require restructuring existing tables.\n- **BI tool compatibility:** Looker Studio and most BI tools natively understand fact/dimension relationships.\n\n**Alternative considered:** A single flat table would have been simpler to build but would degrade query performance at 50M+ row scale and reduce flexibility for ad hoc analysis.\n\n### Metric Definition Choices\n\n- **\"Total Revenue\"** was defined as `total_amount` (excludes cash tips per TLC data dictionary). This was a deliberate choice - cash tip inclusion would require driver-reported fields with known accuracy issues.\n- **\"Travellers\"** was mapped to `passenger_count`, not trip count, to better represent actual ridership volume.\n- **Congestion Surcharge** was tracked separately from fare revenue to allow policy impact analysis (NYS MTA congestion pricing changes directly affect this metric).\n\n### Assumptions Made\n\n- November 2024 data treated as representative of Q4 seasonal patterns\n- Vendor classification based on TLC's VendorID field (self-reported by TPEP providers)\n- Zone analysis uses TLC Taxi Zone boundaries (not raw GPS coordinates)\n\n---\n\n## 🏗️ Tech Stack\n\n**Cloud Infrastructure**\n- Google Cloud Storage - raw data lake (parquet ingestion)\n- Google Cloud VM - compute for ETL orchestration\n- Google BigQuery - scalable analytical query engine\n- Google Looker Studio - business intelligence \u0026 dashboard layer\n\n**ETL \u0026 Orchestration**\n- Mage AI - pipeline orchestration (extraction → transformation → load)\n\n**Data Modeling**\n- Star schema design (1 fact table + 4 dimension tables)\n- ERD designed in Lucidchart\n\n**Analytics**\n- BigQuery SQL - aggregations, window functions, zone-level joins\n- KPI development: revenue per trip, zone concentration index, vendor share\n\n**AI Tools**\n- ChatGPT / Claude - see AI Usage section below\n\n---\n\n## 🤖 How AI Was Used\n\nTransparency matters. Here's exactly where AI accelerated this project - and where human judgment drove the decisions:\n\n| Task | AI Role | My Role |\n|---|---|---|\n| SQL query optimization | Suggested indexing patterns and JOIN ordering for BigQuery | Validated query plans, adjusted for actual schema |\n| Data dictionary interpretation | Summarized TLC field descriptions quickly | Decided which fields to include in the model and why |\n| ERD layout suggestions | Proposed initial table structure | Redesigned based on BI tool requirements and query patterns |\n| README drafting | Generated initial structure | Rewrote framing, business context, and all quantified impact |\n| Debugging Mage AI pipeline | Suggested error resolution patterns | Diagnosed root cause, implemented and tested fixes |\n\n**Bottom line:** AI compressed the research-and-drafting phases by ~40%. Every architectural decision, business framing, and analytical conclusion was human-led and validated.\n\n---\n\n## 🔄 How This Framework Applies Elsewhere\n\nThis pipeline is not NYC-taxi-specific. The same architecture applies directly to:\n\n| Industry | Data Source | Business Question |\n|---|---|---|\n| Ride-share (Lyft/Uber) | Trip logs | Dynamic pricing optimization by zone/hour |\n| Logistics (FedEx/UPS) | Delivery records | Route efficiency \u0026 SLA risk by region |\n| Retail | POS transaction data | Store traffic patterns \u0026 revenue per zone |\n| Healthcare | Patient encounter records | Facility utilization \u0026 care gap analysis |\n| Airlines | Flight + passenger data | Revenue per seat-mile, load factor optimization |\n\nThe **star schema + cloud pipeline + BI dashboard** pattern is a transferable playbook for any organization sitting on raw transactional data without analytical infrastructure.\n\n---\n\n## 📋 Step-by-Step Reproduction Guide\n\n**1. Data Acquisition**\n- Download `yellow_tripdata_2024-11.parquet` from the [NYC TLC trip record data portal](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page)\n- Upload to a Google Cloud Storage bucket\n\n**2. Environment Setup**\n- Provision a GCP VM (e2-standard-2 or higher recommended)\n- Install Mage AI: `pip install mage-ai`\n- Configure GCP service account with BigQuery + GCS permissions\n\n**3. ETL Pipeline (Mage AI)**\n- Create a new Mage pipeline with three blocks:\n  - *Data loader:* Read parquet from GCS\n  - *Transformer:* Apply cleaning logic (null removal, type casting, date parsing)\n  - *Data exporter:* Write to BigQuery staging table\n\n**4. Data Modeling (BigQuery)**\n- Run dimensional model SQL to split staging table into:\n  - `fact_trips` - core trip metrics\n  - `dim_vendor` - vendor lookup\n  - `dim_datetime` - date/time attributes\n  - `dim_zone` - pickup/dropoff zone mapping (join with TLC zone lookup CSV)\n  - `dim_ratecode` - rate type lookup\n\n**5. Analytics Layer**\n- Run aggregation queries for: total revenue, avg fare, trip volume by day, zone rankings, vendor share\n- Full SQL scripts available in `/sql/` directory of this repo\n\n**6. Dashboard (Looker Studio)**\n- Connect Looker Studio to BigQuery dataset\n- Build KPI scorecards, trend lines, treemaps, and zone tables\n- [View live dashboard →](https://lookerstudio.google.com/reporting/432a79b8-781d-4081-94b3-33cfdb9444cb)\n\n---\n\n## 📈 Key Results\n\n- **5.5M** total travellers | **$203.15M** total revenue | **$61.76** avg fare\n- **14.1 miles** avg trip distance | **$1.43** avg congestion surcharge\n- Peak demand: **Thursdays \u0026 Fridays**\n- Top pickups: **LaGuardia Airport, JFK Airport, Times Square**\n- Vendor split: **CMT 84.8% / VeriFone 15.2%**\n- **65.2%** of pickups in Yellow Zone\n\n---\n\n## 💭 Key Learnings\n\n**What I'd improve:**\n- Add real-time streaming ingestion via Pub/Sub to move from batch → near-real-time analytics\n- Incorporate weather and event data as external dimensions to explain demand anomalies\n- Build a revenue forecasting model on top of the cleaned dimensional data\n\n**What surprised me:**\n- The degree of airport concentration in pickup patterns - LGA/JFK together represent a disproportionate share of total revenue, suggesting pricing strategy should be heavily airport-weighted\n- CMT's 84.8% vendor dominance; this level of concentration likely reflects hardware/contract lock-in, not organic market share\n\n**Technical challenge:**\n- Mage AI's GCS connector required manual service account scope configuration not documented in standard guides - resolved by examining IAM audit logs\n\n---\n\n## 🧠 Context\n\nThis is a **portfolio project** built to demonstrate end-to-end data engineering and business analytics capability using publicly available TLC data. All business impact figures are modeled estimates based on observed data patterns and reasonable market assumptions - not actual company financials.\n\n**Role simulated:** Senior Data Analyst / Analytics Engineer  \n**Stakeholders simulated:** Fleet Operations, Revenue Management, City Planning  \n**Constraints simulated:** Cloud cost optimization, query performance at scale, self-serve BI requirements\n\n---\n\n## 📬 Let's Connect\n\nIf you're working on urban mobility analytics, cloud data pipelines, or business intelligence strategy - or if you're a recruiter evaluating this work - I'd love to connect.\n\n- 💼 [LinkedIn](https://www.linkedin.com/in/1adityakadam)\n- 📁 [More Projects](https://www.github.com/1adityakadam)\n- 📧 [Email](mailto:askadam@iu.edu)\n\n*Feedback, collaboration ideas, and critical questions all welcome.*\n\n---\n\n*Built with Google BigQuery · Looker Studio · Mage AI · GCP · Python*\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1adityakadam%2Fuber_data_analytics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F1adityakadam%2Fuber_data_analytics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F1adityakadam%2Fuber_data_analytics/lists"}