https://github.com/kokila-m/data-quality-check-script-using-python



## Sports Car Prices Dataset

This dataset, available on Kaggle, lists prices and specifications for a range of sports cars. Each record includes the car's make (brand), model, year, engine size, horsepower, torque, 0-60 MPH acceleration time, and price in USD. It is useful for analyzing how features such as engine size and horsepower relate to price. It covers popular brands such as Porsche, Lamborghini, Ferrari, and McLaren. Engine sizes range from 2.0L to 8.0L, and prices from $25,000 to $3,000,000. The data is for educational use only and was generated by large language models, not collected from real sources.

[Dataset on Kaggle](https://www.kaggle.com/datasets/rkiattisak/sports-car-prices-dataset)

## Dataset Analysis Script

This Python script runs three data quality checks on the **Sports Car Prices Dataset**:

1. **Missing Values**: The script identifies rows with missing values and outputs a count of missing values per column to ensure data completeness.

2. **Duplicates**: It detects any duplicate rows in the dataset to prevent redundancy and ensure the integrity of the analysis.

3. **Outliers**: The script detects outliers in numerical columns using the **Interquartile Range (IQR)** method. This helps in identifying extreme values that could skew the analysis.
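The three checks above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the toy DataFrame, its column names (`make`, `horsepower`, `price_usd`), the helper function names, and the conventional 1.5 × IQR fence are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset's column names and values may differ.
df = pd.DataFrame({
    "make": ["Porsche", "Ferrari", "McLaren", "Porsche", "Lamborghini", "Ferrari"],
    "horsepower": [379, 661, 710, 379, 630, np.nan],
    "price_usd": [101200, 333750, 298000, 101200, 274390, 3000000],
})

def missing_per_column(frame):
    """Count missing (NaN/None) values in each column."""
    return frame.isna().sum()

def duplicate_row_count(frame):
    """Count rows that exactly repeat an earlier row."""
    return int(frame.duplicated().sum())

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

missing = missing_per_column(df)
duplicates = duplicate_row_count(df)
price_outliers = df.loc[iqr_outlier_mask(df["price_usd"])]

print(missing)
print("Duplicate rows:", duplicates)
print(price_outliers)
```

In this toy frame the fourth row repeats the first, one horsepower value is missing, and the 3,000,000 price falls above the upper IQR fence, so each check has something to report.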

### Output Summary:
The script generates a summary report that details:
- The number of missing values in each column.
- The number of duplicate rows found.
- The results of outlier detection using the IQR method.
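One way to assemble such a summary is to collect the three results into a single dictionary and print it. The sketch below is an assumption about how a report could be built, not the script's actual output format; `quality_report`, the column names, and the 1.5 multiplier are hypothetical.

```python
import numpy as np
import pandas as pd

def quality_report(frame, k=1.5):
    """Collect missing, duplicate, and IQR-outlier counts into one dict."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Column-wise comparison: each column is tested against its own fences.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return {
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "outliers_per_column": outside.sum().to_dict(),
    }

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "engine_l": [2.0, 3.8, 4.0, 6.5, 4.0, np.nan],
    "price_usd": [25000, 110000, 150000, 420000, 150000, 3000000],
})

report = quality_report(df)
for section, result in report.items():
    print(f"{section}: {result}")
```

Returning a plain dictionary keeps the report easy to print, log, or serialize to JSON, at the cost of losing the row-level detail of which values were flagged.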

### Libraries Used:
- **NumPy**: Used for numerical operations on arrays.
- **Pandas**: Essential for data manipulation, including handling missing values and duplicates and performing data analysis.
- **Matplotlib**: Used for plotting, especially for visualizing outliers and distributions.
- **Seaborn**: Built on top of Matplotlib, it simplifies statistical graphics and is used for visualizations such as box plots that help reveal outliers.
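As an illustration of how Seaborn and Matplotlib could work together here, the sketch below draws one box plot per numeric column. The function name `boxplot_numeric`, the toy data, and the figure sizing are assumptions; the script's actual plots may look different.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def boxplot_numeric(frame):
    """Draw one horizontal box plot per numeric column and return the figure."""
    numeric = frame.select_dtypes(include="number")
    n = len(numeric.columns)
    fig, axes = plt.subplots(1, n, figsize=(4 * n, 3), squeeze=False)
    for ax, col in zip(axes[0], numeric.columns):
        sns.boxplot(x=numeric[col], ax=ax)
        ax.set_title(col)
    fig.tight_layout()
    return fig

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "make": ["Porsche", "Ferrari", "McLaren", "Lamborghini"],
    "horsepower": [379, 661, 710, 630],
    "price_usd": [101200, 333750, 298000, 3000000],
})

fig = boxplot_numeric(df)
```

Calling `fig.savefig("boxplots.png")` afterwards would write the figure to disk. By default, Seaborn's box plot draws points beyond 1.5 × IQR from the quartiles as individual markers, which matches the usual IQR outlier convention.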

### Requirements:
- Python
- Pandas
- NumPy
- Matplotlib
- Seaborn

The script is flexible and can be adapted to other datasets by loading them into a pandas DataFrame.
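For instance, swapping in a different dataset could be as small as changing what is passed to `pd.read_csv`. The sketch below uses an in-memory CSV with made-up columns (`city`, `temp_c`) as a stand-in for a file path, so it is an assumption about usage rather than part of the script.

```python
import io
import pandas as pd

# Stand-in for a file on disk; in practice you would pass a path such as
# "your_data.csv" (hypothetical) to pd.read_csv instead.
csv_text = io.StringIO("city,temp_c\nOslo,4.5\nCairo,\nOslo,4.5\n")
df = pd.read_csv(csv_text)

# The same checks apply regardless of which columns the dataset has.
missing = df.isna().sum()
duplicates = int(df.duplicated().sum())

print(missing)
print("Duplicate rows:", duplicates)
```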