{"id":26282073,"url":"https://github.com/busesimsek/sql-data-cleaning-project","last_synced_at":"2025-03-14T16:39:15.878Z","repository":{"id":275753275,"uuid":"924716308","full_name":"busesimsek/SQL-Data-Cleaning-Project","owner":"busesimsek","description":"A data cleaning project focusing on layoff trends, using MySQL to handle missing values, remove duplicates, standardize data, and ensure consistent formatting for accurate analysis.","archived":false,"fork":false,"pushed_at":"2025-03-02T08:50:48.000Z","size":388,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-02T09:29:35.152Z","etag":null,"topics":["data-cleaning","mysql","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/busesimsek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-30T14:22:45.000Z","updated_at":"2025-03-02T08:50:51.000Z","dependencies_parsed_at":"2025-02-04T12:41:25.071Z","dependency_job_id":null,"html_url":"https://github.com/busesimsek/SQL-Data-Cleaning-Project","commit_stats":null,"previous_names":["busesimsek/sql-data-cleaning-project"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/busesimsek%2FSQL-Data-Cleaning-Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/busesimsek%2FSQL-Data-Cleaning-Project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/busesimsek%2FSQL-Data-
Cleaning-Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/busesimsek%2FSQL-Data-Cleaning-Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/busesimsek","download_url":"https://codeload.github.com/busesimsek/SQL-Data-Cleaning-Project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243613257,"owners_count":20319489,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","mysql","sql"],"created_at":"2025-03-14T16:39:15.320Z","updated_at":"2025-03-14T16:39:15.847Z","avatar_url":"https://github.com/busesimsek.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"![image](https://github.com/user-attachments/assets/e116911f-c252-4851-a0e2-6b45400a3387)\n\n# Data Cleaning for Layoffs\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Dataset](#dataset)\n3. [Tools](#tools)\n4. [Objectives](#objectives)\n5. [Data Cleaning Process](#data-cleaning-process)\n   - [Importing the Raw Data](#importing-the-raw-data)\n   - [Removing Duplicates](#removing-duplicates)\n   - [Standardizing Data](#standardizing-data)\n   - [Handling Null Values](#handling-null-values)\n   - [Final Cleaned Dataset](#final-cleaned-dataset)\n6. [How to Use](#how-to-use)\n   - [Prerequisites](#prerequisites)\n   - [Steps to Run](#steps-to-run)\n7. [Contributions](#contributions)\n8. [Future Improvements](#future-improvements)\n9. 
[Contact](#contact)\n\n---\n\n## Overview\n\nThis project focuses on cleaning and preparing a dataset of layoffs for further analysis. The dataset, originally from Kaggle, was imported into MySQL, cleaned, and standardized to ensure consistency and accuracy for potential analysis.\n\n### Dataset\n\n- **Original Dataset:** [Layoffs Dataset on Kaggle](https://www.kaggle.com/datasets/swaptr/layoffs-2022)\n- **Raw Data:** [layoffs.json](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/Dataset/layoffs.json) (imported into MySQL)\n- **Cleaned Data:** [final_cleaned_data.csv](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/final_cleaned_data.csv) (final output after cleaning)\n\n---\n\n## Tools\n\n- **SQL**: Used for data processing, transformation, and scripting tasks within MySQL.\n- **Database Management System**: MySQL to host, manage, and manipulate the dataset.\n- **Data Cleaning Tools**: SQL for handling missing values, removing duplicates, and standardizing data formats.\n\n---\n\n## Objectives\n\n1. Handle missing or null values.\n2. Remove duplicate records.\n3. Ensure consistent formatting for dates, strings, and numerical values.\n4. Standardize column values for better analysis.\n\n---\n\n## Data Cleaning Process\n\nThe data cleaning was done in multiple phases, as outlined below:\n\n### Importing the Raw Data\n- The raw data was first imported into a MySQL table `layoffs` from [layoffs.json](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/Dataset/layoffs.json). The original dataset contained 3887 rows.\n\n### Removing Duplicates\n- A staging table `layoffs_staging` was created to preserve the original data. 
Using the `ROW_NUMBER()` function, duplicates were identified and removed based on specific columns (company, location, total laid off, etc.).\n- Two duplicates were identified for 'Beyond Meat' and 'Cazoo' and were successfully removed.\n- After removing duplicates, the dataset contained 3885 rows.\n\n### Standardizing Data\n- **Whitespace Cleanup:** Removed extra spaces from text columns (e.g., `company`).\n- **Misspelled Locations:** Corrected common misspellings in the `location` column (e.g., \"Ferdericton\" → \"Fredericton\").\n- **Country Standardization:** Unified variations of country names (e.g., \"UAE\" → \"United Arab Emirates\").\n- **Date Formatting:** Converted the `date` column from text to the proper `DATE` type.\n- **Numeric Conversions:** Converted the `total_laid_off` and `funds_raised` columns from text to integers, rounding where necessary.\n\n### Handling Null Values\n- Null values in key columns (`industry`, `total_laid_off`, `percentage_laid_off`) were handled by replacing empty strings with NULL values.\n- Empty rows, where both `total_laid_off` and `percentage_laid_off` were NULL, were removed. This reduced the dataset size to 3248 rows.\n\n### Final Cleaned Dataset\n- The final cleaned dataset, now with standardized data, was exported to [final_cleaned_data.csv](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/final_cleaned_data.csv) for further use in analysis.\n\n---\n\n## How to Use\n\n### Prerequisites\n1. MySQL or any compatible database system.\n2. 
The project files ([layoffs.json](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/Dataset/layoffs.json), [layoffs.csv](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/Dataset/layoffs.csv), [Data Cleaning for Layoffs.sql](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/Data%20Cleaning%20for%20Layoffs.sql), and [final_cleaned_data.csv](https://github.com/busesimsek/SQL-Data-Cleaning-Project/blob/main/final_cleaned_data.csv)).\n\n### Steps to Run\n1. **Set up the MySQL database:**\n   - Import `layoffs.json` into the MySQL database using the provided SQL script.\n   - Create the necessary tables (`layoffs`, `layoffs_staging`, `layoffs_staging2`).\n\n2. **Execute the Cleaning SQL Script:**\n   - Run the `Data Cleaning for Layoffs.sql` script to perform the data cleaning steps. This script includes commands for:\n     - Removing duplicates.\n     - Standardizing data formats.\n     - Handling missing values.\n     - Dropping unnecessary columns.\n\n3. **Output:**\n   - After executing the SQL script, the cleaned dataset will be saved as `final_cleaned_data.csv`.\n\n---\n\n## Contributions\n\nFeel free to fork the repository and contribute by suggesting improvements or submitting pull requests. 
This project is part of ongoing efforts to clean and analyze datasets for meaningful insights.\n\n---\n\n## Future Improvements\n\n- Incorporating additional analysis to identify trends in layoffs across different industries or locations.\n- Enhancing the data validation process to automatically detect and handle other potential anomalies in future datasets.\n\n---\n\n## Contact\n\nFor questions or feedback, feel free to reach out to me on GitHub or via email.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbusesimsek%2Fsql-data-cleaning-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbusesimsek%2Fsql-data-cleaning-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbusesimsek%2Fsql-data-cleaning-project/lists"}