# ETL-Apache-Spark-NYC-Taxi-Data

# Overview
The goal of this project is to do some ETL (Extract, Transform, and Load) on NYC taxi data and its geographical information using Apache Spark, performing various transformations with Spark's Python API (PySpark) and SQL, and finally saving the processed data as CSV files partitioned by the number of executors in the Spark session. The project also applies some Spark tuning, passing partition configuration such as `spark.sql.files.maxPartitionBytes` and `spark.sql.shuffle.partitions`.
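
As a rough illustration, a session with these options might be created as below. The values shown are placeholders, not the project's actual settings.

```python
from pyspark.sql import SparkSession

# Illustrative tuning values only; the project's real settings may differ.
spark = (
    SparkSession.builder
    .appName("nyc-taxi-etl")
    .config("spark.sql.files.maxPartitionBytes", "134217728")  # ~128 MB per input partition
    .config("spark.sql.shuffle.partitions", "8")               # partitions used after shuffles/joins
    .getOrCreate()
)
```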

# Transformation

The Spark ETL process involves the following transformation steps on the data (a minimal PySpark sketch follows the list):

#### Date Format Change: Convert date columns to a specific date format as required.
#### Column Split: Split specified columns into multiple columns as required.
#### Column Concatenation: Combine multiple columns to create new composite columns.
#### Column Dropping: Remove unnecessary columns that have more than 90% null values.
#### DataFrame Join: Combine data from multiple data frames to create a unified dataset.
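
The sketch below illustrates one possible way to express these steps in PySpark. The file path and column names (`pickup_datetime`, `medallion`, `hack_license`, ...) are assumptions for illustration; the notebooks linked below contain the actual code.

```python
from pyspark.sql import functions as F

# `spark` is the SparkSession created earlier.
df = spark.read.csv("data/trip_data_sample.csv", header=True, inferSchema=True)

# Date format change: parse a string column into a proper timestamp
df = df.withColumn("pickup_datetime", F.to_timestamp("pickup_datetime", "yyyy-MM-dd HH:mm:ss"))

# Column split: derive several columns from one
df = (df
      .withColumn("pickup_date", F.to_date("pickup_datetime"))
      .withColumn("pickup_hour", F.hour("pickup_datetime")))

# Column concatenation: build a composite column from several others
df = df.withColumn("ride_id", F.concat_ws("_", "medallion", "hack_license"))

# Column dropping: remove columns with more than 90% null values
total = df.count()
mostly_null = [c for c in df.columns
               if df.filter(F.col(c).isNull()).count() / total > 0.9]
df = df.drop(*mostly_null)

# DataFrame join: combine with another DataFrame (e.g., a borough lookup)
# df = df.join(boroughs_df, on="pickup_borough", how="left")
```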

For further details, see these notebooks:
[NYC_taxi_ETL1 .ipynb](https://github.com/mervat-khaled/ETL-Apache-Spark-NYC-Taxi-Data/blob/main/NYC_taxi_ETL1%20.ipynb)
[NYC_taxi_ETL2 .ipynb](https://github.com/mervat-khaled/ETL-Apache-Spark-NYC-Taxi-Data/blob/main/NYC_taxi_ETL2%20.ipynb)

# Technologies Used
Dockerized Apache Spark 3.5.0 is used to process the data and store the results as CSV files.

#### Spark image environment:

![Screenshots/image_config.png](Screenshots/image_config.png)

# Dataset:
We have a sample of NYC taxi data, which can be downloaded from [here](http://www.andresmh.com/nyctaxitrips/); this sample contains 100,000 records/rows. Each row of the file represents a single taxi ride in CSV format. For each ride, we have some attributes of the cab (a hashed version of the medallion number) as well as the driver (a hashed version of the hack license, which is what licenses to drive taxis are called), some temporal information about when the trip started and ended, and the longitude/latitude coordinates for where the passenger(s) were picked up and dropped off.

We are mainly interested in the following attributes of each trip (a minimal loading sketch follows the list):

* Some Unique ID for the car (license)
* Pick-up location
* Pick-up time
* Drop-off location
* Drop-off time
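
A rough sketch of loading the sample and keeping just these fields is shown below. The column names follow the public trip-data schema and are assumptions here; they may differ from the ones used in the notebooks.

```python
# Load the sample and keep only the columns of interest.
trips = (
    spark.read.csv("data/trip_data_sample.csv", header=True, inferSchema=True)
    .select(
        "hack_license",           # hashed licence -> unique ID for the driver/car
        "pickup_datetime", "pickup_longitude", "pickup_latitude",
        "dropoff_datetime", "dropoff_longitude", "dropoff_latitude",
    )
)
```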

# Problem:

We need to compute one important statistic: utilization. Utilization is the fraction of time that a cab is on the road and occupied by one or more passengers. One factor that impacts utilization is the passenger's destination: a cab that drops off passengers near Union Square at midday is much more likely to find its next fare in just a minute or two, whereas a cab that drops someone off at 2 AM on Staten Island may have to drive all the way back to Manhattan before it finds its next fare.

We need to compute:

1. The average time it takes for a taxi to find its next fare (trip) per destination borough,
2. The number of trips that started and ended within the same borough.

# Steps:
To carry out the required analysis, we had to deal with two types of data: temporal data, such as dates and times, and geospatial information, such as longitude/latitude points and spatial boundaries.
Since the taxi ride data set provides only the longitude and latitude of the pickup and drop-off locations, we needed to enrich it with the boroughs of the respective locations. For this, we loaded another data set that specifies the boundaries of each borough in GeoJSON format, supplied as a separate [file](https://github.com/mervat-khaled/ETL-Apache-Spark-NYC-Taxi-Data/blob/main/data/nyc-boroughs.geojson?short_path=a7cec63). We created a data frame out of it and broadcast it to the workers in order to join it with the larger data frame.
To enrich the taxi ride data set with pick-up and drop-off borough names, we needed to query the GeoJSON data for the name of the borough containing the longitude/latitude of each pick-up and drop-off point. To achieve this, we used the geometry library [shapely](https://shapely.readthedocs.io/en/stable/), which provides several APIs for handling geometric shapes, including querying whether one shape is contained in another.
To let Spark handle these, we defined them as user-defined functions (UDFs), as follows:
![Screenshots/broadcast.png](Screenshots/broadcast.png)
![Screenshots/UDF_shapely.png](Screenshots/UDF_shapely.png)
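
The screenshots show the actual code; a minimal sketch of the same idea is given below. The GeoJSON property name (`borough`) and the column names are assumptions for illustration.

```python
import json

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
from shapely.geometry import shape, Point

# Load the borough polygons from the GeoJSON file on the driver.
with open("data/nyc-boroughs.geojson") as f:
    features = json.load(f)["features"]

# Keep (borough_name, polygon) pairs and broadcast them to the workers.
boroughs = [(feat["properties"]["borough"], shape(feat["geometry"])) for feat in features]
boroughs_bc = spark.sparkContext.broadcast(boroughs)

@F.udf(returnType=StringType())
def find_borough(lon, lat):
    """Return the borough whose polygon contains the given point, or 'Unknown'."""
    if lon is None or lat is None:
        return "Unknown"
    point = Point(float(lon), float(lat))
    for name, polygon in boroughs_bc.value:
        if polygon.contains(point):
            return name
    return "Unknown"

# Enrich the trips with pick-up and drop-off borough names.
trips = (
    trips
    .withColumn("pickup_borough", find_borough("pickup_longitude", "pickup_latitude"))
    .withColumn("dropoff_borough", find_borough("dropoff_longitude", "dropoff_latitude"))
)
```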

Then we saved the joined data as CSV files, partitioned by the number of cores/executors in the Spark session.

![Screenshots/saving_processedData.png](Screenshots/saving_processedData.png)
![Screenshots/CSVs.png](Screenshots/CSVs.png)
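
A minimal sketch of such a write step, assuming the session's default parallelism matches the number of cores/executors and using an illustrative output path:

```python
# Repartition to the session's parallelism so the output directory
# contains roughly one CSV part-file per core/executor slot.
num_parts = spark.sparkContext.defaultParallelism

(
    trips
    .repartition(num_parts)
    .write
    .mode("overwrite")
    .option("header", True)
    .csv("output/processed_trips")
)
```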

# Analysis
1- The number of trips that started and ended within the same borough
![Screenshots/first_query.png](Screenshots/first_query.png)
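
The screenshot shows the actual query; a rough DataFrame-API equivalent, using the borough columns from the earlier sketch, might look like this:

```python
from pyspark.sql import functions as F

# Trips that started and ended within the same borough, counted per borough.
same_borough = (
    trips
    .filter(F.col("pickup_borough") == F.col("dropoff_borough"))
    .groupBy("pickup_borough")
    .count()
)
same_borough.show()
```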

2- The average time it takes for a taxi to find its next fare (trip) per destination borough.

To answer this question, I used the `lead` window function in a SQL statement:
![Screenshots/window_function.png](Screenshots/window_function.png)
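
The exact query is in the screenshot; a rough sketch of the idea, assuming the column names from the earlier sketches, is:

```python
trips.createOrReplaceTempView("trips")

# For each driver, pair every drop-off with the next pick-up time (LEAD),
# then average the gap per destination borough.
wait_per_borough = spark.sql("""
    SELECT dropoff_borough,
           AVG(next_pickup_ts - dropoff_ts) AS avg_wait_seconds
    FROM (
        SELECT dropoff_borough,
               UNIX_TIMESTAMP(dropoff_datetime) AS dropoff_ts,
               LEAD(UNIX_TIMESTAMP(pickup_datetime)) OVER (
                   PARTITION BY hack_license
                   ORDER BY pickup_datetime
               ) AS next_pickup_ts
        FROM trips
    ) per_trip
    WHERE next_pickup_ts IS NOT NULL
    GROUP BY dropoff_borough
    ORDER BY avg_wait_seconds
""")
wait_per_borough.show()
```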

The answer to the question was:
![Screenshots/window_function2.png](Screenshots/window_function2.png)

The Unknown borough name here most likely means that the drop-off was in another borough, outside the mapped boundaries.
The average wait time for a taxi going from one borough to another is longer than the average wait within the same borough.