{"id":15157890,"url":"https://github.com/sushant-suresh/data_cleaning_project_using_postgresql","last_synced_at":"2026-01-20T04:32:19.873Z","repository":{"id":256070269,"uuid":"854250397","full_name":"Sushant-Suresh/Data_Cleaning_Project_Using_PostgreSQL","owner":"Sushant-Suresh","description":"Data Cleaning Using PostgreSQL.","archived":false,"fork":false,"pushed_at":"2024-10-20T15:18:35.000Z","size":2563,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T14:47:39.044Z","etag":null,"topics":["data-cleaning","postgresql","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sushant-Suresh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-08T19:10:33.000Z","updated_at":"2024-10-20T15:18:38.000Z","dependencies_parsed_at":null,"dependency_job_id":"9b4e446a-3052-46f7-927e-b94222298c97","html_url":"https://github.com/Sushant-Suresh/Data_Cleaning_Project_Using_PostgreSQL","commit_stats":{"total_commits":8,"total_committers":1,"mean_commits":8.0,"dds":0.0,"last_synced_commit":"71c955bc083fec304ecb20e8030754387b44b29c"},"previous_names":["sushant-suresh/data_cleaning_project"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sushant-Suresh%2FData_Cleaning_Project_Using_PostgreSQL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sushant-Suresh%2FData_Cleaning_Project_Using_PostgreSQL/tags","releas
es_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sushant-Suresh%2FData_Cleaning_Project_Using_PostgreSQL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sushant-Suresh%2FData_Cleaning_Project_Using_PostgreSQL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sushant-Suresh","download_url":"https://codeload.github.com/Sushant-Suresh/Data_Cleaning_Project_Using_PostgreSQL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247675632,"owners_count":20977376,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","postgresql","sql"],"created_at":"2024-09-26T20:20:23.981Z","updated_at":"2026-01-20T04:32:19.844Z","avatar_url":"https://github.com/Sushant-Suresh.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Cleaning Project\n![aiimg](https://github.com/user-attachments/assets/142c227c-9629-4060-b6af-114fc05a26b6)\n\nIn this project, I used PostgreSQL to clean the [Nashville Housing dataset](https://github.com/AlexTheAnalyst/PortfolioProjects/blob/main/Nashville%20Housing%20Data%20for%20Data%20Cleaning.xlsx). 
The dataset has more than 56,000 rows and 19 columns.\n\nThe following tasks were performed:\n- **Standardize the \"Sale Price\" field as an INT datatype**\n- **Fill in the NULL values in the \"address\" field using a self-join**\n- **Parse long-formatted addresses into individual columns (Address, City, State)**\n- **Standardize the “Sold as Vacant” field (from Y/N to Yes and No)**\n- **Remove duplicates**\n\n## Schema\n```sql\n-- Creating `nashville` Table\nDROP TABLE IF EXISTS nashville;\nCREATE TABLE nashville (uniqueid INT, parcelid VARCHAR(80), landuse VARCHAR(80),\n                        address VARCHAR(80), saledate DATE, saleprice VARCHAR(10),\n                        legalreference VARCHAR(80), soldasvacant VARCHAR(5),\t\n                        ownername VARCHAR(80), owneraddress VARCHAR(80), acreage FLOAT,\n                        taxdistrict VARCHAR(80), landvalue INT, buildingvalue INT, totalvalue INT,\n                        yearbuilt INT, bedrooms INT, fullbath INT, halfbath INT);\n```\n\n## Overview of data\n```sql\n-- Querying the Table to See the Table Structure \u0026 Data\nSELECT * FROM nashville;\n```\n**Output:**\n\n![1](https://github.com/user-attachments/assets/215afb50-1830-4fc0-88b0-aab74eba5496)\n\n## Standardize 'saleprice' column by converting it into INT datatype\n```sql\n-- While Importing the .csv File, the `saleprice` Column Couldn't be Imported as INT Datatype. 
Fixing This Issue:\n\t\t\t\t\t\n-- Identifying Rows with Non-Integer Values\nSELECT saleprice\nFROM nashville\nWHERE saleprice ~ '[^0-9]';\n```\n**Output:**\n\n![3](https://github.com/user-attachments/assets/e04ee72d-9d2f-4c34-9810-b53731c26325)\n\n```sql\n-- Creating a New Column `saleprice_int`\nALTER TABLE nashville\nADD COLUMN saleprice_int INT;\n\n-- Converting and Importing Data as INT Datatype\nUPDATE nashville\nSET saleprice_int = CAST(REGEXP_REPLACE(saleprice, '[^0-9]', '', 'g') AS INT);\n\n-- Checking the output\nSELECT saleprice, saleprice_int\nFROM nashville\nWHERE saleprice ~ '[^0-9]';\n```\n**Output:**\n\n![5](https://github.com/user-attachments/assets/708d54a7-7b76-4419-8195-9544d2d24786)\n\n```sql\n-- Deleting the `saleprice` column\nALTER TABLE nashville\nDROP COLUMN saleprice;\n```\n\n## Populate missing values in 'address' column\n```sql\n-- Entries With the Same `parcelid` Have the same `address`. I'll be Using this Information to Fill up NULL `address` Values:\n\n-- Identifying Rows With NULL Values in `address` Column\nSELECT *\nFROM nashville\nWHERE address IS NULL;\n```\n**Output:**\n\n![6](https://github.com/user-attachments/assets/abe7246b-b702-456c-9a9e-2af15de72557)\n\n```sql\n-- Using SELF-JOIN to Populate the NULL `address` Values With an `address` Having the Same `parcelid`\nSELECT a.parcelid, a.address,\n       b.parcelid, b.address,\n       COALESCE(a.address, b.address) AS address_notnull\nFROM nashville AS a\nJOIN nashville AS b\nON a.parcelid = b.parcelid\nAND a.uniqueid \u003c\u003e b.uniqueid\nWHERE a.address IS NULL;\n```\n**Output:**\n\n![7](https://github.com/user-attachments/assets/ead3d7db-300d-4f3a-9e71-bfab7ccce0a9)\n\n```sql\n-- Updating the `address` Column\nUPDATE nashville AS a\nSET address = COALESCE(b.address, a.address)\nFROM nashville AS b\nWHERE a.parcelid = b.parcelid\nAND a.uniqueid \u003c\u003e b.uniqueid\nAND a.address IS NULL;\n\t\t\t\t\t\n-- Checking the output\nSELECT COUNT(*) AS address_null_count\nFROM 
nashville\nWHERE address IS NULL;\n```\n**Output:**\n\n![8](https://github.com/user-attachments/assets/5a90c50d-b79a-44df-a620-fb9e1d0466c2)\n\n## Breaking 'address' column into individual columns (address_, city)\n```sql\n-- Splitting Data Stored in the `address` Column Using the ',' Delimiter:\n\n-- Adding the New Columns\nALTER TABLE nashville\nADD COLUMN address_ VARCHAR(80),\nADD COLUMN city VARCHAR(80);\n\n-- Updating the New Columns\nUPDATE nashville\nSET address_ = LEFT(address, POSITION(',' IN address) - 1),\n    city = TRIM(SUBSTRING(address FROM POSITION(',' IN address) + 1));\n\n-- Checking the Results\nSELECT address, address_, city\nFROM nashville;\n```\n**Output:**\n\n![9](https://github.com/user-attachments/assets/f2ccb2b9-c5fe-4aa6-9f07-dd1cc60800e2)\n\n```sql\n-- Deleting the `address` column\nALTER TABLE nashville\nDROP COLUMN address;\n```\n## Breaking 'owneraddress' column into individual columns (owner_address, owner_city, owner_state)\n```sql\n-- Splitting Data Stored in the `owneraddress` Column Using the ',' Delimiter. 
This column also has NULL values:\n\n-- Identifying NULL Values in `owneraddress` column\nSELECT * FROM nashville\nWHERE owneraddress IS NULL;\n```\n**Output:**\n\n![10](https://github.com/user-attachments/assets/41d88997-b822-44dd-b10e-dd6f8a605233)\n\n```sql\n-- Adding the New Columns\nALTER TABLE nashville\nADD COLUMN owner_address VARCHAR(80),\nADD COLUMN owner_city VARCHAR(80),\nADD COLUMN owner_state VARCHAR(80);\n\n-- Updating the New Columns\nUPDATE nashville\nSET owner_address = CASE\n                        WHEN owneraddress IS NOT NULL THEN TRIM(SPLIT_PART(owneraddress, ',', 1))\n                        ELSE NULL\n                     END,\n        owner_city = CASE\n                         WHEN owneraddress IS NOT NULL THEN TRIM(SPLIT_PART(owneraddress, ',', 2))\n                         ELSE NULL\n                     END,\n       owner_state = CASE\n                         WHEN owneraddress IS NOT NULL THEN TRIM(SPLIT_PART(owneraddress, ',', 3))\n                         ELSE NULL\n                     END;\n\n-- Checking the Results\nSELECT owneraddress, owner_address, owner_city, owner_state\nFROM nashville;\n```\n**Output:**\n\n![11](https://github.com/user-attachments/assets/015fb136-b91e-41ed-99aa-5f6bb73e51f3)\n\n```sql\n-- Deleting the `owneraddress` column\nALTER TABLE nashville\nDROP COLUMN owneraddress;\n```\n## Standardize 'soldasvacant' column\n```sql\n-- In `soldasvacant` Column, There are 4 Values — Y, N, Yes, No — Instead of 2 - Yes and No. Fixing This Issue:\n\n-- Checking the Data Stored in `soldasvacant` Column\nSELECT soldasvacant, COUNT(*)\nFROM nashville\nGROUP BY soldasvacant;\n```\n**Output:**\n\n![12](https://github.com/user-attachments/assets/b0de4a16-6fd8-479f-9a36-895c64062745)\n\n```sql\n-- Formatting Values - Y and N Into Yes and No Respectively. 
\nUPDATE nashville\nSET SoldAsVacant =  CASE\n                        WHEN SoldAsVacant = 'Y' THEN 'Yes'\n                        WHEN SoldAsVacant = 'N' THEN 'No'\n                        ELSE SoldAsVacant\n                        END;\n\n-- Checking the Results\nSELECT soldasvacant, COUNT(*)\nFROM nashville\nGROUP BY soldasvacant;\n```\n**Output:**\n\n![13](https://github.com/user-attachments/assets/7a983865-c8ac-4d1a-b01e-b50db859f2d6)\n\n## Removing Duplicate Rows\n```sql\n/*\nIf `parcelid`, `address_`, `saleprice_int`, `saledate`, \u0026 `legalreference` are the Same for Multiple Rows,\nThen it is a Duplicate. I will be Using This Information to Remove Duplicates:\n*/\n\n-- Identifying Duplicate Records\nWITH ranked_rows AS (SELECT uniqueid, parcelid, address_, saleprice_int, saledate, legalreference,\n                     ROW_NUMBER() OVER (PARTITION BY parcelid, address_, saleprice_int, saledate, legalreference\n                     ORDER BY uniqueid) AS row_num\n                     FROM nashville)\nSELECT uniqueid, parcelid, address_, saleprice_int, saledate, legalreference\nFROM ranked_rows\nWHERE row_num \u003e 1;\n```\n**Output:**\n\n![14](https://github.com/user-attachments/assets/ad8af2ed-fa59-4699-9b26-ad1982e7ecc1)\n\n```sql\n-- Deleting Duplicate Records\nWITH ranked_rows AS (SELECT uniqueid, parcelid, address_, saleprice_int, saledate, legalreference,\n                     ROW_NUMBER() OVER (PARTITION BY parcelid, address_, saleprice_int, saledate, legalreference\n                     ORDER BY uniqueid) AS row_num\n                     FROM nashville)\nDELETE FROM nashville\nUSING ranked_rows\nWHERE nashville.uniqueid = ranked_rows.uniqueid\nAND ranked_rows.row_num \u003e 1;\n```\n\n## Results\n- **A standard 'saleprice' column having values of only INT datatype**\n- **An 'address' column with no NULL values**\n- **Long-formatted address parsed into individual columns for both property address and owner address**\n- **A standard 'soldasvacant' 
column (from Y/N to Yes and No)**\n- **No duplicate entries**\n\n### Having clean data will ultimately increase overall productivity and allow for the highest quality information in our decision-making.\n\nAll the SQL code used in this project can be found in the files above together with the raw dataset.\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsushant-suresh%2Fdata_cleaning_project_using_postgresql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsushant-suresh%2Fdata_cleaning_project_using_postgresql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsushant-suresh%2Fdata_cleaning_project_using_postgresql/lists"}