https://github.com/muhammadrauhan/project-using-pyspark

Cleaned and Processed an E-Commerce Orders Dataset using PySpark.

# Cleaning an E-Commerce Orders Dataset with PySpark



## :bulb: About
In this project, I stepped into the role of a data engineer at an e-commerce company and used PySpark to process the data. A peer machine learning team asked me to clean a dataset containing information about orders made last year. They plan to use the cleaned data to build a demand forecasting model, and they shared their requirements for the output table format.

An analyst shared a Parquet file called `orders_data.parquet` for me to clean and preprocess.

You can see the dataset schema below along with the cleaning requirements:


*[Image: dataset schema and cleaning requirements]*

## :jigsaw: Project Approach
My step-by-step approach throughout the project was:
+ Read and load the raw orders data.
+ Processed and standardized the order date column.
+ Cleaned product information columns for consistency.
+ Extracted state details from addresses and counted unique states.
+ Prepared and exported the final cleaned dataset for analysis.

- ### Load the Parquet File into a PySpark DataFrame:


*[Screenshot: first rows and record count of the loaded DataFrame]*

Before any transformation, `orders_data.parquet` contains **185,950** records.

- ### Dealing with the `order_date` Column:

Created a `time_of_day` column.


*[Screenshot: DataFrame with the new `time_of_day` column]*

- #### Removed the Rows Containing Orders Made at Night:


*[Screenshot: record count after removing night-time orders]*

After removing the orders placed at night, **176,762** records remain.








To see the rest of my cleaning, transformation, and data processing steps, open the notebook: [Data Processing File](https://github.com/muhammadrauhan/Project-using-PySpark/blob/main/data-cleaning-and-processing.ipynb).









> [!NOTE]
> I completed this project on **DataCamp**. If you want to try it yourself, visit the [DataCamp PySpark Project](https://app.datacamp.com/learn/projects/2355).