{"id":27455544,"url":"https://github.com/tharadol07969/data_cleaning_with_postgresql","last_synced_at":"2025-04-15T16:44:01.718Z","repository":{"id":287160075,"uuid":"963796655","full_name":"Tharadol07969/data_cleaning_with_postgresql","owner":"Tharadol07969","description":"This project focuses on cleaning the FoodYum Grocery Store Sales dataset using PostgreSQL.","archived":false,"fork":false,"pushed_at":"2025-04-10T08:29:20.000Z","size":97,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-10T09:40:34.201Z","etag":null,"topics":["data-cleaning","postgresql","query","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Tharadol07969.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-04-10T08:14:35.000Z","updated_at":"2025-04-10T08:35:18.000Z","dependencies_parsed_at":"2025-04-10T09:40:45.036Z","dependency_job_id":"f45f5103-7ac5-4569-a459-38fe3b003f3f","html_url":"https://github.com/Tharadol07969/data_cleaning_with_postgresql","commit_stats":null,"previous_names":["tharadol07969/data_cleaning_with_postgresql"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tharadol07969%2Fdata_cleaning_with_postgresql","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tharadol07969%2Fdata_cleaning_with_postgresql/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tharadol07969%2Fdat
a_cleaning_with_postgresql/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Tharadol07969%2Fdata_cleaning_with_postgresql/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Tharadol07969","download_url":"https://codeload.github.com/Tharadol07969/data_cleaning_with_postgresql/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249110806,"owners_count":21214403,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","postgresql","query","sql"],"created_at":"2025-04-15T16:44:01.042Z","updated_at":"2025-04-15T16:44:01.711Z","avatar_url":"https://github.com/Tharadol07969.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# FoodYum Grocery Store Sales: Data Cleaning Project with PostgreSQL\n\n## Table of Contents\n1. [Introduction](#introduction)\n2. [Objectives](#objectives)\n3. [Dataset Description](#dataset-description)\n4. [Tools \u0026 Environment](#tools--environment)\n5. [Project Structure](#project-structure)\n6. [Database Schema](#database-schema)\n7. [Data Ingestion](#data-ingestion)\n8. [Data Cleaning Process](#data-cleaning-process)\n   - [8.1 Exploratory Checks](#81-exploratory-checks)\n   - [8.2 Cleaning Query](#82-cleaning-query)\n9. [Validation \u0026 Testing](#validation--testing)\n10. [Conclusion](#conclusion)\n11. [Appendix](#appendix)\n\n---\n\n## Introduction\nThis project focuses on cleaning the FoodYum Grocery Store Sales dataset using PostgreSQL. 
The goal is to transform raw transaction records into a consistent, analysis-ready table by handling missing values, standardizing formats, and applying business rules.\n\n## Objectives\n- Identify and resolve missing or malformed values.\n- Standardize categorical fields and numeric formats.\n- Replace missing values with appropriate defaults (e.g., median for continuous fields).\n- Prepare a clean `products` table for downstream analytics.\n\n## Dataset Description\n**FoodYum Grocery Store Sales**\n- FoodYum is a U.S.-based grocery chain selling produce, meat, dairy, bakery, snacks, and household staples.\n- As food costs rise, the company needs consistent data to ensure broad product availability across price ranges.\n- The raw dataset captures product metadata and sales metrics for the last full year of the loyalty program.\n\n| Column               | Type      | Description                                                                                     |\n|----------------------|-----------|-------------------------------------------------------------------------------------------------|\n| product_id           | INTEGER   | Unique identifier for each product                                                              |\n| product_type         | TEXT      | Category (Produce, Meat, Dairy, Bakery, Snacks); missing → `Unknown`                            |\n| brand                | TEXT      | Brand name; missing or `-` → `Unknown`                                                          |\n| weight               | TEXT      | Weight in grams (e.g., `500 grams`); extract numeric value and round to 2 decimals; missing → median weight |\n| price                | NUMERIC   | Price in USD; missing → median price                                                           |\n| average_units_sold   | INTEGER   | Average monthly units sold; missing → `0`                                                       |\n| year_added           | INTEGER   | Year first added to 
stock; missing → `2022`                                                     |\n| stock_location       | TEXT      | Warehouse code (`A`,`B`,`C`,`D`); normalize to uppercase; missing → `Unknown`                    |\n\n## Tools \u0026 Environment\n- **Database:** PostgreSQL v17\n- **GUI:** pgAdmin 4\n- **Optional Scripting:** Python 3.x with `psycopg2`, `pandas`\n- **Environment Variables:** Store DB credentials in a `.env` file (not committed to VCS)\n\n## Project Structure\n```\nfoodyum-data-cleaning/\n├── data/\n│   ├── raw/                    # Original CSV files\n│   │   ├── foodyum_raw.csv\n│   │   └── foodyum_raw_for_pgadmin4.csv\n│   └── clean/                  # Exported cleaned CSV\n│       └── foodyum_clean.csv\n├── sql/\n│   ├── 01_create_tables.sql    # Table definitions\n│   ├── 02_data_ingestion.sql   # COPY commands\n│   ├── 03_data_cleaning.sql    # Cleaning queries\n│   └── 04_data_validation.sql  # Data quality checks\n├── .env                        # DB credentials\n├── README.md                   # Project documentation\n└── requirements.txt            # Python dependencies\n```\n\n## Database Schema\nTable: `products`\n\n```sql\nCREATE TABLE public.products (\n  product_id           INTEGER PRIMARY KEY,\n  product_type         TEXT,\n  brand                TEXT,\n  weight               TEXT,\n  price                NUMERIC,\n  average_units_sold   INTEGER,\n  year_added           INTEGER,\n  stock_location       TEXT\n);\nALTER TABLE public.products OWNER TO postgres;\n```\n\n## Data Ingestion\nLoad the raw CSV into the `products` table:\n```sql\n\\copy products FROM 'path/to/foodyum_raw_for_pgadmin4.csv' CSV HEADER;\n```\n\n## Data Cleaning Process\n\n### 8.1 Exploratory Checks\n```sql\n-- 1. product_type distribution\nSELECT product_type, COUNT(*) AS cnt\nFROM products\nGROUP BY product_type;\n\n-- 2. brand distribution (including '-' placeholder)\nSELECT brand, COUNT(*) AS cnt\nFROM products\nGROUP BY brand;\n\n-- 3. 
Inspect weight format\nSELECT DISTINCT weight\nFROM products\nLIMIT 10;\n\n-- 4. price summary\nSELECT MIN(price), MAX(price), AVG(price)\nFROM products;\n\n-- 5. average_units_sold summary\nSELECT MIN(average_units_sold), MAX(average_units_sold), AVG(average_units_sold)\nFROM products;\n\n-- 6. missing year_added\nSELECT COUNT(*)\nFROM products\nWHERE year_added IS NULL;\n\n-- 7. stock_location variations\nSELECT stock_location, COUNT(*)\nFROM products\nGROUP BY stock_location;\n```\n\n### 8.2 Cleaning Query\n```sql\nWITH clean AS (\n  SELECT\n    product_id,\n    COALESCE(NULLIF(product_type, ''), 'Unknown')       AS product_type,\n    COALESCE(NULLIF(brand, '-'), 'Unknown')            AS brand,\n    NULLIF(SPLIT_PART(weight, ' ', 1), '')::NUMERIC    AS weight,\n    price,\n    COALESCE(average_units_sold, 0)                    AS average_units_sold,\n    COALESCE(year_added, 2022)                         AS year_added,\n    -- Uppercase only real codes so the default stays 'Unknown' rather than 'UNKNOWN'\n    COALESCE(UPPER(stock_location), 'Unknown')         AS stock_location\n  FROM products\n),\nmed AS (\n  -- Ordered-set aggregates ignore NULLs, so each median uses every non-NULL value in its own column\n  SELECT\n    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY weight) AS median_weight,\n    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY price)  AS median_price\n  FROM clean\n)\n\nSELECT\n  c.product_id,\n  c.product_type,\n  c.brand,\n  -- PERCENTILE_CONT returns double precision; cast to NUMERIC because ROUND(x, 2) requires NUMERIC\n  ROUND(COALESCE(c.weight, m.median_weight::NUMERIC), 2) AS weight,\n  ROUND(COALESCE(c.price,  m.median_price::NUMERIC),  2) AS price,\n  c.average_units_sold,\n  c.year_added,\n  c.stock_location\nFROM clean c\nCROSS JOIN med m;\n```\n\n## Validation \u0026 Testing\n- Verify no NULLs remain (this assumes the output of the cleaning query has been saved as `cleaned_products`):\n  ```sql\n  SELECT COUNT(*)\n  FROM (\n    SELECT * FROM cleaned_products\n  ) t\n  WHERE product_type IS NULL\n    OR brand IS NULL\n    OR weight IS NULL\n    OR price IS NULL\n    OR stock_location IS NULL;\n  ```\n- Check data types and ranges (e.g., weight \u003e 0, price \u003e 0).\n\n## Conclusion\nThe cleaning pipeline standardizes text fields, handles missing values with 
domain-specific defaults, and ensures numeric columns are cast correctly. The resulting `cleaned_products` table is ready for analytics and reporting.\n\n## Appendix\n- SQL Scripts:\n  - `sql/01_create_tables.sql`\n  - `sql/02_data_ingestion.sql`\n  - `sql/03_data_cleaning.sql`\n  - `sql/04_data_validation.sql`\n- Cleaned data export: `data/clean/foodyum_clean.csv`\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftharadol07969%2Fdata_cleaning_with_postgresql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftharadol07969%2Fdata_cleaning_with_postgresql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftharadol07969%2Fdata_cleaning_with_postgresql/lists"}