https://github.com/atechguide/nyc-taxi-data-analysis
Spark App to Analyse NYC Taxi Data
- Host: GitHub
- URL: https://github.com/atechguide/nyc-taxi-data-analysis
- Owner: aTechGuide
- Created: 2020-05-03T09:50:18.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2020-05-03T12:00:54.000Z (about 6 years ago)
- Last Synced: 2024-12-31T04:41:57.009Z (over 1 year ago)
- Topics: project, sbt, spark
- Language: Scala
- Homepage:
- Size: 4.17 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# NYC Taxi Data Analysis
# Tech Stack
- Spark
- Scala
- sbt
# Analysis
- Which zones have the most pickups/drop-offs overall? [MostPickupDropoffs.scala]
- What are the peak hours for taxi rides? [PeakHoursForTaxi.scala]
- How are the trips distributed by length? Why are people taking the cab? [TripDistribution.scala]
- What are the peak hours for long/short trips? [PeakHoursForLongShortTrips.scala]
- What are the top 3 pick up and drop off zones for long/short trips? [TopPickUpAndDropOffForLongShortTrips.scala]
- How are people paying for rides on long/short trips? [PeoplePayingForLongShortTrips.scala]
- How is the payment type evolving with time? [PaymentTypeEvolvingWithTime.scala]
- Can we explore a ride-sharing opportunity by grouping close short trips? [RideSharingOppertunity.scala]
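To give a feel for the kind of query behind the questions above, here is a minimal sketch of three of them using Spark DataFrames. It is not taken from the repository's source files. The column names (`tpep_pickup_datetime`, `PULocationID`, `trip_distance`) are assumed to follow the TLC yellow taxi schema, and the object name `TaxiAnalysisSketch`, the input path, and the 30-mile long-trip threshold are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, hour}

// Illustrative only: names and thresholds here are assumptions, not the repository's code.
object TaxiAnalysisSketch {

  // Counts trips per pickup hour (0-23), busiest hour first.
  // Assumes a timestamp column named "tpep_pickup_datetime".
  def peakHours(trips: DataFrame): DataFrame =
    trips
      .withColumn("hour_of_day", hour(col("tpep_pickup_datetime")))
      .groupBy("hour_of_day")
      .agg(count("*").as("total_trips"))
      .orderBy(col("total_trips").desc)

  // Counts pickups per zone ID, busiest zone first.
  // Assumes a column named "PULocationID".
  def pickupsByZone(trips: DataFrame): DataFrame =
    trips
      .groupBy("PULocationID")
      .agg(count("*").as("total_trips"))
      .orderBy(col("total_trips").desc)

  // Adds a boolean "is_long" column so later queries can split long and short trips.
  // Assumes a numeric "trip_distance" column in miles; the threshold is a free parameter.
  def withTripLength(trips: DataFrame, longTripMiles: Double): DataFrame =
    trips.withColumn("is_long", col("trip_distance") >= longTripMiles)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Taxi Analysis Sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input path; point this at your own Parquet copy of the data.
    val trips = spark.read.parquet("data/yellow_taxi.parquet")

    peakHours(trips).show()
    pickupsByZone(trips).show(3)
    // 30.0 miles is an arbitrary example threshold.
    withTripLength(trips, 30.0).groupBy("is_long").count().show()

    spark.stop()
  }
}
```

Keeping each question as a function from `DataFrame` to `DataFrame` makes it easy to try the logic on a small in-memory sample before running it against the full dataset.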
# Data Sources
- [www1.nyc.gov](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)
- [academictorrents.com](http://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e)
## Data Size
- ~ 1.4 billion taxi rides between 2009 and 2016
- ~ 400 GB uncompressed CSV
- ~ 35 GB snappy parquet
# References
This project was built as part of the [rockthejvm.com Spark Essentials with Scala](https://rockthejvm.com/p/spark-essentials) course.