https://github.com/an4pdm/data_cleaning_for_cafe
This project transforms and cleans data from a cafe by eliminating redundancies and improving data quality and persistence. It uses pandas for data manipulation, with a focus on data integrity and on storing intermediate results for future analysis.
- Host: GitHub
- URL: https://github.com/an4pdm/data_cleaning_for_cafe
- Owner: An4PDM
- Created: 2025-04-10T21:14:17.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-24T23:04:48.000Z (about 1 year ago)
- Last Synced: 2025-05-05T01:02:19.799Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 332 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# *Data Cleaning for Cafe*
This project transforms and cleans data from a cafe by eliminating redundancies and improving data quality and persistence. It uses **Pandas** for data manipulation, with a focus on data integrity and on storing intermediate results for future analysis.
## Objectives
- Detect and remove duplicate or inconsistent records
- Convert invalid data types to numeric where applicable
- Replace or handle missing values
- Improve overall data structure and clarity
## Tools & Libraries
- Python
- Pandas
- Jupyter Notebook (for development and visualization)
## Data Cleaning Steps
The dataset was cleaned and transformed incrementally, with each step saved as a `.pkl` file for reproducibility and version control.
### Checkpoints
- **data_step1.pkl**
  - Set all `Price Per Unit` values for each item correctly
  - Converted `Quantity`, `Price Per Unit`, and `Total Spent` to numeric types
  - Replaced non-numeric values with `NaN` (using `pd.to_numeric` with `errors='coerce'`)
  - Imputed missing values in `Quantity` by dividing `Total Spent` by `Price Per Unit`
  - Updated missing values in `Total Spent` by multiplying `Quantity` and `Price Per Unit`
- **data_step2.pkl** (in progress)
  - Removed NaN and replaced values in `Item` and `Payment Method`
  - Removed redundant or duplicate rows
  - Standardized column names (e.g., lowercase, underscores)
  - Trimmed whitespace in string values
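The two checkpoints above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `clean_step1` and `clean_step2`, the `PRICE_BY_ITEM` values, the placeholder strings `"ERROR"` and `"UNKNOWN"`, and the sample rows are all assumptions made for the example.

```python
import pandas as pd

NUMERIC_COLS = ["Quantity", "Price Per Unit", "Total Spent"]

# Hypothetical menu prices, used only for this illustration.
PRICE_BY_ITEM = {"Coffee": 2.0, "Tea": 1.5, "Cake": 3.0}


def clean_step1(df: pd.DataFrame, prices: dict) -> pd.DataFrame:
    """Coerce numeric columns, then fill gaps from the other two values."""
    out = df.copy()

    # Anything that cannot be parsed as a number becomes NaN.
    for col in NUMERIC_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")

    # Use the known price for each item; keep the existing value otherwise.
    out["Price Per Unit"] = out["Item"].map(prices).fillna(out["Price Per Unit"])

    # Quantity = Total Spent / Price Per Unit, only where Quantity is missing.
    out["Quantity"] = out["Quantity"].fillna(
        out["Total Spent"] / out["Price Per Unit"]
    )

    # Total Spent = Quantity * Price Per Unit, only where Total Spent is missing.
    out["Total Spent"] = out["Total Spent"].fillna(
        out["Quantity"] * out["Price Per Unit"]
    )
    return out


def clean_step2(df: pd.DataFrame, placeholders=("ERROR", "UNKNOWN")) -> pd.DataFrame:
    """Tidy the text columns, drop incomplete and duplicate rows, rename columns."""
    out = df.copy()

    for col in ["Item", "Payment Method"]:
        trimmed = out[col].str.strip()
        # Assumed placeholder strings are treated as missing values.
        out[col] = trimmed.mask(trimmed.isin(placeholders))

    out = out.dropna(subset=["Item", "Payment Method"]).drop_duplicates()

    # "Price Per Unit" -> "price_per_unit"
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out.reset_index(drop=True)


# Made-up sample rows, not taken from the real dataset.
raw = pd.DataFrame({
    "Item": ["Coffee", "Tea", "Coffee", "Cake", "Coffee"],
    "Quantity": ["2", "ERROR", "3", "1", "2"],
    "Price Per Unit": ["2.0", "1.5", "UNKNOWN", "3.0", "2.0"],
    "Total Spent": ["4.0", "4.5", "6.0", None, "4.0"],
    "Payment Method": ["Cash", " Card ", "Cash", "UNKNOWN", "Cash"],
})

step1 = clean_step1(raw, PRICE_BY_ITEM)
step2 = clean_step2(step1)
print(step2)
```

In this sketch `fillna` only touches rows where the target value is missing, so values that were already valid are left unchanged.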
### File Naming Convention
Each step is saved as `data_stepN.pkl`, where `N` indicates the transformation phase.
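One way to follow this naming convention is a pair of small helpers around `DataFrame.to_pickle` and `pd.read_pickle`. The helper names (`checkpoint_path`, `save_checkpoint`, `load_checkpoint`) and the temporary folder are assumptions for this sketch, not part of the project.

```python
import tempfile
from pathlib import Path

import pandas as pd


def checkpoint_path(step: int, folder: Path) -> Path:
    """Build the path for a given transformation phase, e.g. data_step1.pkl."""
    return folder / f"data_step{step}.pkl"


def save_checkpoint(df: pd.DataFrame, step: int, folder: Path) -> Path:
    """Pickle the DataFrame under the data_stepN.pkl name and return the path."""
    path = checkpoint_path(step, folder)
    df.to_pickle(path)
    return path


def load_checkpoint(step: int, folder: Path) -> pd.DataFrame:
    """Read back a previously saved checkpoint."""
    return pd.read_pickle(checkpoint_path(step, folder))


# Round trip in a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    df = pd.DataFrame({"Quantity": [2.0, 3.0], "Total Spent": [4.0, 4.5]})
    saved_name = save_checkpoint(df, 1, folder).name
    restored = load_checkpoint(1, folder)
    round_trip_ok = restored.equals(df)

print(saved_name, round_trip_ok)
```

Pickle files preserve dtypes exactly, which suits intermediate checkpoints, but they should only be loaded from sources you trust.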
## Key Learnings
- Data validation and type conversion using `pd.to_numeric()`
- Filtering rows with conditions (`isna()`, `notna()`)
- Creating new DataFrames from cleaned Series
- Good practices in data preprocessing for analysis
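As a small illustration of the filtering and DataFrame-building points above, the following sketch uses made-up rows and variable names (`quantity`, `valid`, `rejected`, `cleaned`) that are assumptions for the example only.

```python
import pandas as pd

# Made-up rows: one missing item and one unparseable quantity.
df = pd.DataFrame({
    "Item": ["Coffee", None, "Tea"],
    "Quantity": ["2", "3", "oops"],
})

# Validate and convert in one step; "oops" becomes NaN.
quantity = pd.to_numeric(df["Quantity"], errors="coerce")

# isna() flags missing entries, notna() flags usable ones.
missing_item = df["Item"].isna()
valid = df["Item"].notna() & quantity.notna()

# Rows that failed validation, kept aside for inspection.
rejected = df[~valid]

# Build a new DataFrame from the cleaned Series.
cleaned = pd.DataFrame({
    "Item": df.loc[valid, "Item"],
    "Quantity": quantity[valid],
})

print(cleaned)
```

Keeping the rejected rows in their own DataFrame makes it easy to check what was dropped before discarding it.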
## Output
The final cleaned DataFrame is ready for further use in dashboards, analysis, or machine learning tasks.
---
Feel free to fork or use it as a reference in your own data projects!