https://github.com/akshpraj/data-cleaning-and-preprocessing

Sales Data Cleaning and Preprocessing - Jupyter Notebook

# 🧹 Sales Data Cleaning Project

### 📌 Objective

To clean and prepare a raw sales dataset by handling missing values, duplicates, inconsistent text formats, and incorrect data types. The cleaned dataset will be used for downstream analysis or reporting.
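
Before any cleaning, it can help to measure how much of each problem is present. The snippet below is a minimal sketch of such an audit, not the project's actual notebook code: the sample rows, the column names, and the `audit` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical raw data standing in for the sales dataset (not from the project).
raw = pd.DataFrame({
    "Order Date": ["01-02-2023", "01-02-2023", "15/03/2023", None],
    "Customer Name": ["Asha", "Asha", None, "Ravi"],
    "Gender": ["male", "male", "FEMALE", "Male"],
    "Sales Amount": ["100.5", "100.5", "250", "75"],
})


def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: its dtype, missing count, and distinct count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isnull().sum(),
        "distinct": df.nunique(),  # NaN/None is not counted as a distinct value
    })


report = audit(raw)
duplicate_rows = int(raw.duplicated().sum())  # exact repeats of an earlier row

print(report)
print("duplicate rows:", duplicate_rows)
```

A high `distinct` count on a column such as `Gender` hints at inconsistent text, and a numeric column reported with a text dtype hints at a type problem.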

### 🧰 Tools Used

- Python (Pandas)
- Jupyter Notebook / Google Colab
- Excel (alternative for non-programmatic data cleaning)

## 📝 Cleaning Steps Performed

| Task | Description |
|------|-------------|
| 🔍 Missing Values | Identified using `.isnull()` and handled by imputation or row removal. |
| ♻️ Duplicates | Removed using `.drop_duplicates()` or Excel's "Remove Duplicates". |
| 🧑‍💼 Standardized Text | Gender, country names, etc. were cleaned for consistency (e.g., male, Male, MALE → Male). |
| 📆 Date Format Fixes | Converted all dates to a consistent format (DD-MM-YYYY). |
| 🏷️ Column Name Cleanup | Renamed headers to lowercase with underscores (e.g., Order Date → order_date). |
| 🔢 Data Type Corrections | Ensured numeric fields (like age, sales) have the correct type and dates are stored as datetime. |
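
The steps above could be sketched in pandas roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the sample rows, the column names, the `clean_sales` helper, and the choice to fill missing names with "Unknown" are hypothetical, and the input dates are assumed to already be in DD-MM-YYYY order.

```python
import pandas as pd

# Hypothetical raw data; column names and values are illustrative only.
raw = pd.DataFrame({
    "Order Date": ["01-02-2023", "01-02-2023", "15-03-2023", "28-03-2023"],
    "Customer Name": ["Asha", "Asha", None, "Ravi"],
    "Gender": [" male", " male", "FEMALE", "Male"],
    "Sales Amount": ["100.5", "100.5", "250", "75"],
    "Quantity": ["2", "2", "1", "3"],
})


def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # Column name cleanup: "Order Date" becomes "order_date".
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )

    # Standardize text: trim whitespace and title-case ("MALE" becomes "Male").
    out["gender"] = out["gender"].str.strip().str.title()

    # Remove duplicates after standardizing, so rows that differed only in
    # casing or whitespace are also caught.
    out = out.drop_duplicates().reset_index(drop=True)

    # Missing values: impute a placeholder (row removal is the alternative).
    out["customer_name"] = out["customer_name"].fillna("Unknown")

    # Dates: parse day-first strings into real datetime values.
    out["order_date"] = pd.to_datetime(out["order_date"], format="%d-%m-%Y")

    # Data types: numeric text becomes float / integer columns.
    out["sales_amount"] = pd.to_numeric(out["sales_amount"])
    out["quantity"] = pd.to_numeric(out["quantity"]).astype("int64")
    return out


clean = clean_sales(raw)
print(clean.dtypes)
```

Keeping `order_date` as a datetime column and rendering DD-MM-YYYY only on export, for example with `clean["order_date"].dt.strftime("%d-%m-%Y")`, preserves sorting and date arithmetic.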

## 🧼 Example Summary of Changes

- Removed 5 duplicate rows
- Filled 12 missing 'customer_name' values with "Unknown"
- Standardized 'Gender' column to: ['Male', 'Female']
- Converted 'order_date' to datetime format
- Renamed columns: "Order Date" → "order_date", "Sales Amount" → "sales_amount"
- Cast 'quantity' and 'age' columns to integer
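
A summary like this can be generated while cleaning instead of being written by hand. The sketch below shows one possible way to do that; the data and variable names are hypothetical, and its counts come from the sample rows, not from the project's dataset.

```python
import pandas as pd

# Hypothetical frame with already-renamed columns; values are illustrative.
df = pd.DataFrame({
    "customer_name": ["Asha", "Asha", None, "Ravi", None],
    "quantity": ["2", "2", "1", "3", "4"],
})

summary = []

# Count rows before and after de-duplication to report how many were dropped.
rows_before = len(df)
df = df.drop_duplicates().reset_index(drop=True)
summary.append(f"Removed {rows_before - len(df)} duplicate rows")

# Count missing names before filling them, so the log reflects the real number.
missing = int(df["customer_name"].isnull().sum())
df["customer_name"] = df["customer_name"].fillna("Unknown")
summary.append(f"Filled {missing} missing 'customer_name' values with \"Unknown\"")

# Record the type conversion once it has succeeded.
df["quantity"] = pd.to_numeric(df["quantity"]).astype("int64")
summary.append("Cast 'quantity' column to integer")

for line in summary:
    print("-", line)
```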