{"id":32720904,"url":"https://github.com/muhammadrauhan/project-using-pyspark","last_synced_at":"2026-05-15T11:37:01.353Z","repository":{"id":318829156,"uuid":"1076644809","full_name":"muhammadrauhan/Project-using-PySpark","owner":"muhammadrauhan","description":"Cleaned and Processed an E-Commerce Orders Dataset using PySpark.","archived":false,"fork":false,"pushed_at":"2025-10-15T09:34:53.000Z","size":11,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-15T19:31:43.155Z","etag":null,"topics":["apache-spark","data-cleaning-and-preprocessing","data-processing","pyspark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/muhammadrauhan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-15T06:40:27.000Z","updated_at":"2025-10-15T11:18:51.000Z","dependencies_parsed_at":"2025-10-16T13:47:05.838Z","dependency_job_id":"f7bacb84-8dd1-448b-9b24-95c6a5ee98ea","html_url":"https://github.com/muhammadrauhan/Project-using-PySpark","commit_stats":null,"previous_names":["muhammadrauhan/project-using-pyspark"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/muhammadrauhan/Project-using-PySpark","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/muhammadrauhan%2FProject-using-PySpark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/muhammadrauhan%2FProject-using-PySpark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/muhammadrauhan%2FProject-using-PySpark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/muhammadrauhan%2FProject-using-PySpark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/muhammadrauhan","download_url":"https://codeload.github.com/muhammadrauhan/Project-using-PySpark/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/muhammadrauhan%2FProject-using-PySpark/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":282349847,"owners_count":26654800,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-02T02:00:06.609Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","data-cleaning-and-preprocessing","data-processing","pyspark"],"created_at":"2025-11-02T20:01:23.049Z","updated_at":"2025-11-02T20:03:59.053Z","avatar_url":"https://github.com/muhammadrauhan.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cleaning an E-Commerce Orders Dataset with PySpark\n\u003cdiv align=\"center\"\u003e\n    \u003cimg width=\"1224\" height=\"346\" alt=\"data check\" 
src=\"https://github.com/user-attachments/assets/6054c8b8-7f04-447c-9fa7-31e91ef0bd12\" /\u003e\n\u003c/div\u003e\n\n## :bulb: About\nIn this project, I stepped into the role of a data engineer at an e-commerce company and used PySpark, a powerful tool for data processing. A peer Machine Learning team asked me to clean the data containing information about orders made last year. They plan to use the cleaned data to build a demand forecasting model, and they shared their requirements for the desired output table format.\n\nAn analyst shared a Parquet file called `orders_data.parquet` for me to clean and preprocess.\n\nYou can see the dataset schema below along with the cleaning requirements:\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg width=\"745\" height=\"400\" alt=\"data\" src=\"https://github.com/user-attachments/assets/3f06c77a-f7c5-4b26-a59f-2ccdf4ccd855\" /\u003e\n\u003c/div\u003e\n\n## :jigsaw: Project Approach\nMy step-by-step approach throughout the project was:\n+ Read and loaded the raw orders data.\n+ Processed and standardized the order date column.\n+ Cleaned product information columns for consistency.\n+ Extracted state details from addresses and counted unique states.\n+ Prepared and exported the final cleaned dataset for analysis.\n\n- ### Load the Parquet File into a PySpark DataFrame:\n\n  \u003cdiv align=\"center\"\u003e\n      \u003cimg width=\"744\" height=\"290\" alt=\"data-head\" src=\"https://github.com/user-attachments/assets/54adef6c-3701-4aba-873d-f51ccd77cedb\" /\u003e\n  \u003c/div\u003e\n\n  Here you can see that the total count for `orders_data.parquet` is **185,950** records, before any transformation is applied.\n\n- ### Dealing with the `order_date` Column:\n\n  Created a `time_of_day` column.\n\n  \u003cdiv align=\"center\"\u003e\n      \u003cimg width=\"744\" height=\"301\" alt=\"column\" 
src=\"https://github.com/user-attachments/assets/61ceae12-53e7-4669-9e9d-f912cd1de5c1\" /\u003e\n  \u003c/div\u003e\n\n  - #### Removed the Rows Containing Orders Made at Night Time:\n\n    \u003cdiv align=\"center\"\u003e\n        \u003cimg width=\"750\" height=\"79\" alt=\"rem-row\" src=\"https://github.com/user-attachments/assets/b2a1faa6-9a70-4919-add0-6332ec5bef14\" /\u003e\n    \u003c/div\u003e\n\n    Here you can see that the total count is **176,762** records after removing the rows containing orders made at night time.\n    \u003cbr\u003e\n    \u003cbr\u003e\n    \u003cbr\u003e\n    \n  \nTo see the rest of my cleaning, transformation, and data processing approach, see my notebook here: [Data Processing File](https://github.com/muhammadrauhan/Project-using-PySpark/blob/main/data-cleaning-and-processing.ipynb).\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\u003cbr\u003e\n\n\u003e [!NOTE]\n\u003e I completed this project on **DataCamp**. If you want to do this project, you can visit the site by clicking on this link: [DataCamp PySpark Project](https://app.datacamp.com/learn/projects/2355).\n\n\n    \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmuhammadrauhan%2Fproject-using-pyspark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmuhammadrauhan%2Fproject-using-pyspark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmuhammadrauhan%2Fproject-using-pyspark/lists"}