https://github.com/atechguide/nyc-taxi-data-analysis

Spark App to Analyse NYC Taxi Data

# NYC Taxi Data Analysis

# Tech Stack
- Spark
- Scala
- sbt

# Analysis
- Which zones have the most pickups/drop-offs overall? [MostPickupDropoffs.scala]
- What are the peak hours for taxi trips? [PeakHoursForTaxi.scala]
- How are the trips distributed by length? Why are people taking the cab? [TripDistribution.scala]
- What are the peak hours for long/short trips? [PeakHoursForLongShortTrips.scala]
- What are the top 3 pickup and drop-off zones for long/short trips? [TopPickUpAndDropOffForLongShortTrips.scala]
- How are people paying for rides on long/short trips? [PeoplePayingForLongShortTrips.scala]
- How is the payment type evolving with time? [PaymentTypeEvolvingWithTime.scala]
- Can we explore a ride-sharing opportunity by grouping close short trips? [RideSharingOppertunity.scala]
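As a rough illustration of how questions like these map onto Spark's DataFrame API, here is a minimal sketch of the peak-hours analysis, overall and split by trip length. It is not the code from the listed files. The column names `pickup_datetime` and `trip_distance`, the object name `TaxiAnalysisSketch`, and the 30-mile long-trip threshold are all assumptions made for the example, and the in-memory rows in `main` are made-up sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, hour, lit}

object TaxiAnalysisSketch {
  // Assumed cut-off (in miles) between "short" and "long" trips; illustrative only.
  val LongTripThresholdMiles = 30.0

  /** Counts pickups per hour of day, busiest hour first. */
  def peakHours(trips: DataFrame): DataFrame =
    trips
      .withColumn("hour_of_day", hour(col("pickup_datetime")))
      .groupBy("hour_of_day")
      .agg(count(lit(1)).as("total_trips"))
      .orderBy(col("total_trips").desc)

  /** Same count, computed separately for long and short trips. */
  def peakHoursByLength(trips: DataFrame): DataFrame =
    trips
      .withColumn("hour_of_day", hour(col("pickup_datetime")))
      .withColumn("is_long", col("trip_distance") >= LongTripThresholdMiles)
      .groupBy("is_long", "hour_of_day")
      .agg(count(lit(1)).as("total_trips"))
      .orderBy(col("is_long"), col("total_trips").desc)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TaxiAnalysisSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up sample standing in for the real trip data.
    val trips = Seq(
      ("2016-01-01 08:15:00", 2.5),
      ("2016-01-01 08:45:00", 35.0),
      ("2016-01-01 18:05:00", 1.2)
    ).toDF("pickup_datetime", "trip_distance")
      .withColumn("pickup_datetime", col("pickup_datetime").cast("timestamp"))

    peakHours(trips).show()
    peakHoursByLength(trips).show()

    spark.stop()
  }
}
```

Against the real dataset, `trips` would instead be loaded with `spark.read.parquet(...)`, and the column names would need to match the actual schema.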

# Data Sources
- [www1.nyc.gov](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
- [academictorrents.com](http://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e)

## Data Size

- ~ 1.4 billion taxi rides between 2009 and 2016
- ~ 400 GB uncompressed CSV
- ~ 35 GB snappy parquet
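
The README does not say how the Parquet copy was produced. One way to get from CSV to Snappy-compressed Parquet with Spark is sketched below. The object name `CsvToParquetSketch`, the `convert` helper, and both paths are hypothetical, and the sketch assumes the CSV files have a header row.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object CsvToParquetSketch {
  /** Reads CSV files under csvPath and rewrites them as Snappy-compressed Parquet. */
  def convert(spark: SparkSession, csvPath: String, parquetPath: String): Unit =
    spark.read
      .option("header", "true")       // assumption: first line holds column names
      .option("inferSchema", "true")  // costs an extra pass over the data; an explicit schema avoids it
      .csv(csvPath)
      .write
      .mode(SaveMode.Overwrite)
      .option("compression", "snappy") // Snappy is also Spark's default Parquet codec
      .parquet(parquetPath)

  def main(args: Array[String]): Unit = {
    require(args.length == 2, "usage: CsvToParquetSketch <csvPath> <parquetPath>")
    val spark = SparkSession.builder()
      .appName("CsvToParquetSketch")
      .master("local[*]")
      .getOrCreate()

    convert(spark, args(0), args(1))

    spark.stop()
  }
}
```

Parquet stores data by column and compresses each column separately, which is consistent with the large size difference between the two formats listed above.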

# References
This project was built as part of the [rockthejvm.com Spark Essentials with Scala](https://rockthejvm.com/p/spark-essentials) course.