https://github.com/an4pdm/data_cleaning_for_cafe


# ☕ *Data Cleaning for Cafe*

This project cleans and transforms data from a cafe by eliminating redundancies and improving data quality and persistence. It uses **Pandas** for data manipulation, with a focus on data integrity and on storing the results efficiently for future analysis.

## 📌 Objectives

- Detect and remove duplicate or inconsistent records
- Convert columns stored with the wrong data type to numeric where applicable
- Replace or handle missing values
- Improve overall data structure and clarity

## ๐Ÿ› ๏ธ Tools & Libraries

- Python
- Pandas
- Jupyter Notebook (for development and visualization)

## 🧼 Data Cleaning Steps

The dataset was cleaned and transformed incrementally, with each step saved as a `.pkl` file for reproducibility and version control.

### ✅ Checkpoints

- **data_step1.pkl**
  - Set the correct `Price Per Unit` for each item
  - Converted `Quantity`, `Price Per Unit`, and `Total Spent` to numeric types
  - Replaced non-numeric values with `NaN` (using `pd.to_numeric` with `errors='coerce'`)
  - Imputed missing values in `Quantity` by dividing `Total Spent` by `Price Per Unit`
  - Imputed missing values in `Total Spent` by multiplying `Quantity` by `Price Per Unit`

- **data_step2.pkl** (in progress)
  - Removed `NaN` entries and replaced values in `Item` and `Payment Method`
  - Removed redundant or duplicate rows
  - Standardized column names (e.g., lowercase, underscores)
  - Trimmed whitespace in string values
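
The two checkpoints above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function names `clean_step1` and `clean_step2` are hypothetical, the placeholder strings `"ERROR"` and `"UNKNOWN"` are an assumption about how invalid entries appear in the raw data, and dropping rows with a missing `Item` or `Payment Method` is only one possible reading of "removed `NaN`".

```python
import pandas as pd

NUMERIC_COLS = ["Quantity", "Price Per Unit", "Total Spent"]
TEXT_COLS = ["item", "payment_method"]  # names after standardization


def clean_step1(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce the numeric columns, then fill gaps from the other two columns."""
    out = df.copy()
    # Anything that cannot be parsed as a number becomes NaN.
    for col in NUMERIC_COLS:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Quantity = Total Spent / Price Per Unit, applied only where Quantity is missing.
    out["Quantity"] = out["Quantity"].fillna(
        out["Total Spent"] / out["Price Per Unit"]
    )
    # Total Spent = Quantity * Price Per Unit, applied only where Total Spent is missing.
    out["Total Spent"] = out["Total Spent"].fillna(
        out["Quantity"] * out["Price Per Unit"]
    )
    return out


def clean_step2(df: pd.DataFrame, placeholders=("ERROR", "UNKNOWN")) -> pd.DataFrame:
    """Standardize names, tidy text columns, drop incomplete and duplicate rows."""
    out = df.copy()
    # "Payment Method" -> "payment_method"
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    for col in TEXT_COLS:
        trimmed = out[col].str.strip()
        # Assumed placeholder strings are treated as missing values.
        out[col] = trimmed.mask(trimmed.isin(list(placeholders)))
    # Drop rows that still lack an item or payment method, then exact duplicates.
    out = out.dropna(subset=TEXT_COLS).drop_duplicates()
    return out.reset_index(drop=True)
```

A row where two of the three numeric columns are missing stays incomplete after `clean_step1`, because neither formula has enough information to fill it.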

### 🔄 File Naming Convention

Each step is saved as `data_stepN.pkl`, where `N` indicates the transformation phase.
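
One way to apply this naming convention is a small pair of helpers around `DataFrame.to_pickle` and `pd.read_pickle`. The helper names (`checkpoint_path`, `save_checkpoint`, `load_checkpoint`) and the `directory` parameter are hypothetical and not part of the project.

```python
from pathlib import Path

import pandas as pd


def checkpoint_path(step: int, directory: str = ".") -> Path:
    """Build the path for a given transformation phase, e.g. data_step1.pkl."""
    return Path(directory) / f"data_step{step}.pkl"


def save_checkpoint(df: pd.DataFrame, step: int, directory: str = ".") -> Path:
    """Pickle the DataFrame under the data_stepN.pkl convention."""
    path = checkpoint_path(step, directory)
    df.to_pickle(path)
    return path


def load_checkpoint(step: int, directory: str = ".") -> pd.DataFrame:
    """Reload a previously saved step, preserving dtypes and index."""
    return pd.read_pickle(checkpoint_path(step, directory))
```

Pickle preserves dtypes exactly, which is convenient between notebook sessions, but pickle files should only be loaded from trusted sources.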

## 💡 Key Learnings

- Data validation and type conversion using `pd.to_numeric()`
- Filtering rows with conditions (`isna()`, `notna()`)
- Creating new DataFrames from cleaned Series
- Good practices in data preprocessing for analysis
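
As a small illustration of the filtering and DataFrame-building points above, the sketch below uses a made-up three-row table. The values and the variable names are purely illustrative and do not come from the cafe dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example only.
df = pd.DataFrame({
    "Item": ["Coffee", None, "Tea"],
    "Quantity": [2.0, 1.0, np.nan],
})

# Boolean masks select rows by missingness.
missing_item = df[df["Item"].isna()]        # rows where Item is missing
has_quantity = df[df["Quantity"].notna()]   # rows where Quantity is present

# Conditions combine with & (and) or | (or); each side needs its own mask.
complete = df[df["Item"].notna() & df["Quantity"].notna()]

# Clean each column as a Series, then assemble a new DataFrame from them.
items = complete["Item"].str.upper()
quantities = complete["Quantity"].astype(int)
cleaned = pd.DataFrame({"item": items, "quantity": quantities})
```

Because both Series keep the index of `complete`, the new DataFrame lines the values up row by row without any extra alignment step.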

## ๐Ÿ“ Output

The final cleaned DataFrame is ready for further use in dashboards, analysis, or machine learning tasks.

---

Feel free to fork or use it as a reference in your own data projects!