https://github.com/tharadol07969/data_cleaning_with_postgresql
This project focuses on cleaning the FoodYum Grocery Store Sales dataset using PostgreSQL.
https://github.com/tharadol07969/data_cleaning_with_postgresql
data-cleaning postgresql query sql
Last synced: about 1 year ago
JSON representation
This project focuses on cleaning the FoodYum Grocery Store Sales dataset using PostgreSQL.
- Host: GitHub
- URL: https://github.com/tharadol07969/data_cleaning_with_postgresql
- Owner: Tharadol07969
- Created: 2025-04-10T08:14:35.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-10T08:29:20.000Z (about 1 year ago)
- Last Synced: 2025-04-10T09:40:34.201Z (about 1 year ago)
- Topics: data-cleaning, postgresql, query, sql
- Homepage:
- Size: 94.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# FoodYum Grocery Store Sales: Data Cleaning Project with PostgreSQL
## Table of Contents
1. [Introduction](#introduction)
2. [Objectives](#objectives)
3. [Dataset Description](#dataset-description)
4. [Tools & Environment](#tools--environment)
5. [Project Structure](#project-structure)
6. [Database Schema](#database-schema)
7. [Data Ingestion](#data-ingestion)
8. [Data Cleaning Process](#data-cleaning-process)
- [8.1 Exploratory Checks](#81-exploratory-checks)
- [8.2 Cleaning Query](#82-cleaning-query)
9. [Validation & Testing](#validation--testing)
10. [Conclusion](#conclusion)
11. [Appendix](#appendix)
---
## Introduction
This project focuses on cleaning the FoodYum Grocery Store Sales dataset using PostgreSQL. The goal is to transform raw transaction records into a consistent, analysis-ready table by handling missing values, standardizing formats, and applying business rules.
## Objectives
- Identify and resolve missing or malformed values.
- Standardize categorical fields and numeric formats.
- Replace missing values with appropriate defaults (e.g., median for continuous fields).
- Prepare a clean `products` table for downstream analytics.
## Dataset Description
**FoodYum Grocery Store Sales**
- FoodYum is a U.S.-based grocery chain selling produce, meat, dairy, bakery, snacks, and household staples.
- As food costs rise, the company needs consistent data to ensure broad product availability across price ranges.
- The raw dataset captures product metadata and sales metrics for the last full year of the loyalty program.
| Column | Type | Description |
|----------------------|-----------|-------------------------------------------------------------------------------------------------|
| product_id | INTEGER | Unique identifier for each product |
| product_type | TEXT | Category (Produce, Meat, Dairy, Bakery, Snacks); missing → `Unknown` |
| brand | TEXT | Brand name; missing or `-` → `Unknown` |
| weight | TEXT | Weight in grams (e.g., `500 grams`); extract numeric value and round to 2 decimals; missing → median weight |
| price | NUMERIC | Price in USD; missing → median price |
| average_units_sold | INTEGER | Average monthly units sold; missing → `0` |
| year_added | INTEGER | Year first added to stock; missing → `2022` |
| stock_location | TEXT | Warehouse code (`A`,`B`,`C`,`D`); normalize to uppercase; missing → `Unknown` |
## Tools & Environment
- **Database:** PostgreSQL v17
- **GUI:** pgAdmin 4
- **Optional Scripting:** Python 3.x with `psycopg2`, `pandas`
- **Environment Variables:** Store DB credentials in a `.env` file (not committed to VCS)
## Project Structure
```
foodyum-data-cleaning/
├── data/
│ ├── raw/ # Original CSV files
│ │ ├── foodyum_raw.csv
│ │ └── foodyum_raw_for_pgadmin4.csv
│ └── clean/ # Exported cleaned CSV
│ └── foodyum_clean.csv
├── sql/
│ ├── 01_create_tables.sql # Table definitions
│ ├── 02_data_ingestion.sql # COPY commands
│ ├── 03_data_cleaning.sql # Cleaning queries
│ └── 04_data_validation.sql # Data quality checks
├── .env # DB credentials
├── README.md # Project documentation
└── requirements.txt # Python dependencies
```
## Database Schema
Table: `products`
```sql
CREATE TABLE public.products (
product_id INTEGER PRIMARY KEY,
product_type TEXT,
brand TEXT,
weight TEXT,
price NUMERIC,
average_units_sold INTEGER,
year_added INTEGER,
stock_location TEXT
);
ALTER TABLE public.products OWNER TO postgres;
```
## Data Ingestion
Load the raw CSV into the `products` table:
```sql
\copy products FROM 'path/to/foodyum_raw_for_pgadmin4.csv' CSV HEADER;
```
## Data Cleaning Process
### 8.1 Exploratory Checks
```sql
-- 1. product_type distribution
SELECT product_type, COUNT(*) AS cnt
FROM products
GROUP BY product_type;
-- 2. brand distribution (including '-' placeholder)
SELECT brand, COUNT(*) AS cnt
FROM products
GROUP BY brand;
-- 3. Inspect weight format
SELECT DISTINCT weight
FROM products
LIMIT 10;
-- 4. price summary
SELECT MIN(price), MAX(price), AVG(price)
FROM products;
-- 5. average_units_sold summary
SELECT MIN(average_units_sold), MAX(average_units_sold), AVG(average_units_sold)
FROM products;
-- 6. missing year_added
SELECT COUNT(*)
FROM products
WHERE year_added IS NULL;
-- 7. stock_location variations
SELECT stock_location, COUNT(*)
FROM products
GROUP BY stock_location;
```
### 8.2 Cleaning Query
```sql
WITH clean AS (
SELECT
product_id,
COALESCE(NULLIF(product_type, ''), 'Unknown') AS product_type,
COALESCE(NULLIF(brand, '-'), 'Unknown') AS brand,
NULLIF(SPLIT_PART(weight, ' ', 1), '')::NUMERIC AS weight,
price,
COALESCE(average_units_sold, 0) AS average_units_sold,
COALESCE(year_added, 2022) AS year_added,
UPPER(COALESCE(stock_location, 'Unknown')) AS stock_location
FROM products
),
med AS (
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY weight) AS median_weight,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price) AS median_price
FROM clean
WHERE weight IS NOT NULL AND price IS NOT NULL
)
SELECT
c.product_id,
c.product_type,
c.brand,
ROUND(COALESCE(c.weight, m.median_weight), 2) AS weight,
ROUND(COALESCE(c.price, m.median_price), 2) AS price,
c.average_units_sold,
c.year_added,
c.stock_location
FROM clean c
CROSS JOIN med m;
```
## Validation & Testing
- Verify no NULLs remain:
```sql
SELECT COUNT(*)
FROM (
SELECT * FROM cleaned_products
) t
WHERE product_type IS NULL
OR brand IS NULL
OR weight IS NULL
OR price IS NULL
OR stock_location IS NULL;
```
- Check data types and ranges (e.g., weight > 0, price > 0).
## Conclusion
The cleaning pipeline standardizes text fields, handles missing values with domain-specific defaults, and ensures numeric columns are cast correctly. The resulting `cleaned_products` table is ready for analytics and reporting.
## Appendix
- SQL Scripts:
- `sql/01_create_tables.sql`
- `sql/02_data_ingestion.sql`
- `sql/03_data_cleaning.sql`
- `sql/04_data_validation.sql`
- Cleaned data export: `data/clean/foodyum_clean.csv`