https://github.com/muhammadrauhan/project-using-pyspark

Cleaned and Processed an E-Commerce Orders Dataset using PySpark.

apache-spark data-cleaning-and-preprocessing data-processing pyspark

Host: GitHub
URL: https://github.com/muhammadrauhan/project-using-pyspark
Owner: muhammadrauhan
License: mit
Created: 2025-10-15T06:40:27.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-10-15T09:34:53.000Z (9 months ago)
Last Synced: 2025-10-15T19:31:43.155Z (8 months ago)
Topics: apache-spark, data-cleaning-and-preprocessing, data-processing, pyspark
Language: Jupyter Notebook
Homepage:
Size: 10.7 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Cleaning an E-Commerce Orders Dataset with PySpark

## :bulb: About
In this project, I stepped into the role of a data engineer at an e-commerce company and used PySpark, a powerful tool for data processing. A peer Machine Learning team asked me to clean a dataset containing information about orders made last year. They plan to use the cleaned data to build a demand forecasting model, and they shared their requirements for the desired output table format.

An analyst shared a Parquet file called `orders_data.parquet` for me to clean and preprocess.

You can see the dataset schema below along with the cleaning requirements:

## :jigsaw: Project Approach
My step-by-step approach throughout the project was:
+ Read and loaded the raw orders data.
+ Processed and standardized the order date column.
+ Cleaned product information columns for consistency.
+ Extracted state details from addresses and counted unique states.
+ Prepared and exported the final cleaned dataset for analysis.
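
The last three steps in the list above (cleaning product columns, extracting the state, exporting) could be sketched roughly as follows. This is not the notebook's code: the column names `product`, `category`, `purchase_address`, and `purchase_state`, the address layout, the output path, and the helper names are all assumptions made for illustration.

```python
import re

# Assumed address layout: "<street>, <city>, <ST> <zip>".
# The README does not show the real format, so this pattern is a guess.
STATE_PATTERN = r",\s*([A-Z]{2})\s+\d{5}$"


def extract_state(address):
    """Pure-Python check of the pattern: return the state code or None."""
    match = re.search(STATE_PATTERN, address)
    return match.group(1) if match else None


def clean_and_export(df, out_path="orders_data_clean.parquet"):
    """Lowercase product columns, add a state column, count states, write Parquet.

    All column names and out_path are hypothetical.
    """
    # Imported here so the pure-Python helper above works without Spark installed.
    from pyspark.sql import functions as F

    cleaned = (
        df.withColumn("product", F.lower(F.trim(F.col("product"))))
        .withColumn("category", F.lower(F.trim(F.col("category"))))
        # regexp_extract returns an empty string when the pattern does not match.
        .withColumn(
            "purchase_state",
            F.regexp_extract(F.col("purchase_address"), STATE_PATTERN, 1),
        )
    )

    # Number of distinct states found in the data.
    n_states = cleaned.select("purchase_state").distinct().count()

    # Overwrite any previous export at the same path.
    cleaned.write.mode("overwrite").parquet(out_path)
    return cleaned, n_states
```

Trying `extract_state` on a few sample addresses first is a cheap way to check the pattern before running it over the full DataFrame.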

- ### Load the Parquet File into a PySpark DataFrame:

The raw `orders_data.parquet` file contains **185,950** records before any transformation is applied.
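
A minimal sketch of this loading step, assuming a local Spark session and that `orders_data.parquet` sits in the working directory. The helper names `build_session` and `load_orders` and the app name are hypothetical, chosen for illustration rather than taken from the notebook.

```python
def build_session(app_name="orders-cleaning"):
    """Create (or reuse) a Spark session. The app name is arbitrary."""
    # Imported here so load_orders can be exercised without Spark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_orders(spark, path="orders_data.parquet"):
    """Read the raw orders Parquet file into a DataFrame."""
    return spark.read.parquet(path)
```

Calling `load_orders(build_session()).count()` on the raw file should then report the record count quoted above, and `printSchema()` on the same DataFrame shows the column types.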

- ### Dealing with the `order_date` Column:

Created a `time_of_day` column derived from `order_date`.
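
One way to derive such a column is to bucket the hour of `order_date` with chained `when` conditions. The sketch below assumes `order_date` is a timestamp. The hour boundaries and the labels `night`, `morning`, `afternoon`, and `evening` are assumptions, since this README does not state them, and the helper names are hypothetical.

```python
# Assumed hour ranges (lower bound inclusive, upper bound exclusive).
# These boundaries and labels are illustrative, not taken from the README.
TIME_OF_DAY_BOUNDS = [
    (0, 5, "night"),
    (5, 12, "morning"),
    (12, 18, "afternoon"),
    (18, 24, "evening"),
]


def label_for_hour(hour):
    """Pure-Python reference for the mapping, handy for quick checks."""
    for start, end, label in TIME_OF_DAY_BOUNDS:
        if start <= hour < end:
            return label
    raise ValueError(f"hour out of range: {hour}")


def add_time_of_day(df, ts_col="order_date"):
    """Add a time_of_day column based on the hour of a timestamp column."""
    # Imported here so label_for_hour works without Spark installed.
    from pyspark.sql import functions as F

    hour = F.hour(F.col(ts_col))
    expr = None
    for start, end, label in TIME_OF_DAY_BOUNDS:
        cond = (hour >= start) & (hour < end)
        # Start the chain with F.when, then extend it with Column.when.
        expr = F.when(cond, label) if expr is None else expr.when(cond, label)

    # Rows with a null timestamp get a null time_of_day.
    return df.withColumn("time_of_day", expr)
```

Keeping the boundaries in one list means the Spark expression and the pure-Python check cannot drift apart.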

- #### Removed Rows Containing Orders Made at Night:

After removing the rows for orders made at night, the total count drops to **176,762** records, which is 9,188 fewer than the raw file.
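
Filtering on the `time_of_day` column could look like the sketch below. It assumes night orders are labelled `'night'` (the README does not give the label), and `drop_night_orders` is a hypothetical helper name. The condition is passed to `DataFrame.filter` as a SQL expression string.

```python
def drop_night_orders(df, label_col="time_of_day", night_label="night"):
    """Keep only rows whose time-of-day label is not the night label.

    Rows where label_col is null are dropped as well, because a
    comparison against null does not evaluate to true in Spark SQL.
    """
    return df.filter(f"{label_col} <> '{night_label}'")
```

Comparing `df.count()` before and after this call is a quick way to confirm how many rows were dropped.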

To see the rest of my cleaning, transformation, and data processing approach, open my notebook here: [Data Processing File](https://github.com/muhammadrauhan/Project-using-PySpark/blob/main/data-cleaning-and-processing.ipynb).

> [!NOTE]
> I completed this project on **DataCamp**. If you want to try it yourself, visit the [DataCamp PySpark Project](https://app.datacamp.com/learn/projects/2355).
