https://github.com/muhammadrauhan/project-using-pyspark
Cleaned and Processed an E-Commerce Orders Dataset using PySpark.
- Host: GitHub
- URL: https://github.com/muhammadrauhan/project-using-pyspark
- Owner: muhammadrauhan
- License: mit
- Created: 2025-10-15T06:40:27.000Z
- Default Branch: main
- Last Pushed: 2025-10-15T09:34:53.000Z
- Last Synced: 2025-10-15T19:31:43.155Z
- Topics: apache-spark, data-cleaning-and-preprocessing, data-processing, pyspark
- Language: Jupyter Notebook
- Homepage:
- Size: 10.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Cleaning an E-Commerce Orders Dataset with PySpark
## :bulb: About
In this project, I stepped into the role of a data engineer at an e-commerce company and used PySpark, a powerful tool for data processing. A peer Machine Learning team asked me to clean a dataset containing information about orders made last year. They plan to use the cleaned data to build a demand forecasting model, so they shared their requirements for the desired output table format.
An analyst shared a Parquet file called `orders_data.parquet` for me to clean and preprocess.
You can see the dataset schema below along with the cleaning requirements:
## :jigsaw: Project Approach
My step-by-step approach throughout the project was:
+ Read and loaded the raw orders data.
+ Processed and standardized the order date column.
+ Cleaned product information columns for consistency.
+ Extracted state details from addresses and counted unique states.
+ Prepared and exported the final cleaned dataset for analysis.
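As a rough illustration of the state-extraction step in the list above, here is a minimal sketch. The column names `purchase_address` and `purchase_state`, the helper names, and the address layout (`street, city, ST zip`) are all assumptions made for this example; the notebook may do this differently.

```python
import re

# Assumed address layout: "917 1st St, Dallas, TX 75001"
STATE_PATTERN = r",\s*([A-Z]{2})\s+\d{5}$"


def extract_state(address):
    """Return the two-letter state code from an address, or None if absent."""
    match = re.search(STATE_PATTERN, address)
    return match.group(1) if match else None


def count_unique_states(addresses):
    """Count distinct state codes, ignoring addresses that do not match."""
    return len({s for s in map(extract_state, addresses) if s is not None})


def add_purchase_state(df, address_col="purchase_address"):
    """Spark version: add a purchase_state column (column names are assumed)."""
    # Imported here so the pure-Python helpers above work without Spark.
    from pyspark.sql import functions as F

    # regexp_extract returns an empty string when the pattern does not match.
    return df.withColumn(
        "purchase_state",
        F.regexp_extract(F.col(address_col), STATE_PATTERN, 1),
    )


def count_states_spark(df):
    """Count distinct values in the (assumed) purchase_state column."""
    return df.select("purchase_state").distinct().count()
```

The pure-Python helpers are handy for checking the pattern on a few sample addresses before applying the same expression to the full DataFrame.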
- ### Load the Parquet File into a PySpark DataFrame:
The total count for `orders_data.parquet` is **185,950** records before any transformation is applied.
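Loading the file and checking the row count could look roughly like the sketch below. The function names and the app name `orders-cleaning` are hypothetical; only the file name `orders_data.parquet` comes from the project description.

```python
def build_spark(app_name="orders-cleaning"):
    """Create (or reuse) a SparkSession; requires PySpark to be installed."""
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()


def load_orders(spark, path="orders_data.parquet"):
    """Read the Parquet file into a DataFrame."""
    return spark.read.parquet(path)


def schema_and_count(df):
    """Print the schema and return the number of rows."""
    df.printSchema()
    return df.count()
```

With the file in the working directory, `schema_and_count(load_orders(build_spark()))` should return the raw record count reported for this dataset.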
- ### Dealing with the `order_date` Column:
Created a `time_of_day` column from `order_date`.
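One way to derive such a column is to bucket the hour of each timestamp. The sketch below is an assumption-laden example: the cut-off hours (5, 12, 18), the label names, and the helper names are illustrative and not confirmed by this README.

```python
def time_of_day_label(hour):
    """Map an hour (0-23) to a label; the cut-offs are assumed, not sourced."""
    if hour < 5:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def add_time_of_day(df, ts_col="order_date"):
    """Spark version of the same bucketing, applied to a timestamp column."""
    # Imported here so the pure-Python helper above works without Spark.
    from pyspark.sql import functions as F

    hour = F.hour(F.col(ts_col))
    return df.withColumn(
        "time_of_day",
        # Keep missing timestamps as null instead of mislabelling them.
        F.when(hour.isNull(), F.lit(None).cast("string"))
        .when(hour < 5, "night")
        .when(hour < 12, "morning")
        .when(hour < 18, "afternoon")
        .otherwise("evening"),
    )
```

The plain function mirrors the Spark expression branch for branch, which makes the boundaries easy to check by hand.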
- #### Removed the Rows Containing Orders Made at Night Time:
After removing the rows for orders made at night, the total count is **176,762** records.
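A filter of this kind, together with the arithmetic behind the two counts, might be sketched as follows. The label value `'night'` and the helper names are assumptions; the two record counts are the ones quoted in this README.

```python
def drop_night_orders(df):
    """Keep only rows whose time_of_day is not 'night' (SQL-string filter)."""
    # Rows with a null time_of_day are dropped too: NULL != 'night' is not true.
    return df.filter("time_of_day != 'night'")


def rows_removed(before, after):
    """Return (rows removed, share of the original) for two row counts."""
    removed = before - after
    return removed, removed / before


removed, share = rows_removed(185_950, 176_762)
print(f"{removed:,} rows removed ({share:.2%} of the raw data)")
# prints: 9,188 rows removed (4.94% of the raw data)
```

So the night-time filter drops a little under 5% of the raw records.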
To see the rest of my cleaning, transformation, and data processing approach, open my notebook here: [Data Processing File](https://github.com/muhammadrauhan/Project-using-PySpark/blob/main/data-cleaning-and-processing.ipynb).
> [!NOTE]
> I completed this project on **DataCamp**. If you want to do this project yourself, you can visit the site here: [DataCamp PySpark Project](https://app.datacamp.com/learn/projects/2355).