https://github.com/sayande01/cars_msrp_prediction_pca

This project conducts an advanced exploratory data analysis (EDA) on a car dataset with the objective of identifying natural groupings, selecting optimal features for predicting the Manufacturer Suggested Retail Price (MSRP), and leveraging Principal Component Analysis (PCA) for dimensionality reduction.
https://github.com/sayande01/cars_msrp_prediction_pca

dimensionality-reduction feature-selection linear-regression pca

Last synced: about 2 months ago
JSON representation

Host: GitHub
URL: https://github.com/sayande01/cars_msrp_prediction_pca
Owner: sayande01
Created: 2024-11-14T13:53:54.000Z (7 months ago)
Default Branch: main
Last Pushed: 2024-11-14T13:59:06.000Z (7 months ago)
Last Synced: 2025-02-13T02:38:37.844Z (4 months ago)
Topics: dimensionality-reduction, feature-selection, linear-regression, pca
Language: Jupyter Notebook
Homepage:
Size: 702 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

Here’s a refined, professional description that provides a detailed breakdown of each section for a GitHub project:

---

### Title:
**Advanced EDA and Predictive Modeling on Car Dataset: Clustering, MSRP Prediction, and PCA Analysis**

### Description:
This project conducts an advanced exploratory data analysis (EDA) on a car dataset with the objective of identifying natural groupings, selecting optimal features for predicting the Manufacturer Suggested Retail Price (MSRP), and leveraging Principal Component Analysis (PCA) for dimensionality reduction. The workflow is designed to explore and validate features that impact MSRP, optimize model accuracy through clustering and feature engineering, and evaluate the benefits of reducing dimensions via PCA on model performance.

### Project Workflow
This project follows a structured analytical approach consisting of the following key steps:

1. **Data Preparation**:
- **Data Cleaning**: Address any missing or inconsistent values, ensure data types are correct, and make any necessary adjustments for analysis.
- **Data Segmentation**: Separate numerical columns relevant to clustering and modeling into a new DataFrame (e.g., engine horsepower, cylinder count, city miles per gallon).

2. **KMeans Clustering**:
- **Data Standardization**: Standardize numerical features to ensure all variables are on a comparable scale, as required by KMeans.
- **Optimal Cluster Selection**: Generate an Elbow Plot to determine the ideal number of clusters (K) by identifying the point where the sum of squared distances (inertia) plateaus.
- **Cluster Assignment**: Apply the KMeans algorithm using the optimal K value and add the resulting cluster IDs to the original dataset for future analysis.
- **Cluster Analysis**: Visually inspect each cluster’s characteristics using boxplots and analyze feature distributions across clusters to understand distinct groupings of car specifications.

3. **Feature Selection and MSRP Prediction**:
- **Correlation Analysis**: Use a heatmap to identify strong correlations among numerical features and detect multicollinearity, a common issue that may impact prediction accuracy.
- **Visual Pairwise Relationships**: Create pair plots to examine individual feature distributions and relationships, focusing on features relevant to MSRP prediction.
- **Multicollinearity Handling**: Based on correlation findings, determine whether to retain, drop, or transform correlated features. Document the reasoning for each decision.
- **Baseline Model Creation**: Build a baseline regression model using all available features to predict MSRP and evaluate the model’s performance metrics (e.g., R², RMSE) on both training and testing data.
- **Refined Feature Selection Model (Model 1)**: Based on feature selection results, construct a second model using only the optimal features. Compare Model 1’s performance against the baseline to assess improvements in prediction accuracy.

4. **Principal Component Analysis (PCA)**:
- **PCA Feasibility Analysis**: Assess the suitability of PCA based on observed multicollinearity and variance structure in the feature set.
- **PCA Execution**: Perform PCA on the standardized dataset and generate key metrics, including eigenvalues, eigenvectors, and a cumulative explained variance plot.
- **Dimensionality Reduction**: Select a reduced number of principal components based on cumulative variance (e.g., capturing 90% of the total variance), and transform the dataset accordingly.
- **Model 2 (PCA-Reduced Model)**: Build a predictive model on the PCA-transformed features and evaluate its performance in comparison to both the baseline and Model 1.
- **Performance Comparison**: Analyze and compare the metrics of all models to understand the effects of dimensionality reduction on MSRP prediction accuracy.

5. **Insights and Documentation**:
- **Detailed Findings**: Summarize findings for each step, highlighting insights from clustering, the impact of feature selection on MSRP prediction, and trade-offs introduced by PCA.
- **Project Documentation**: Include all code, visualizations, and analysis in a well-documented Jupyter notebook, with a clear narrative guiding readers through the analytical process and key conclusions.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/sayande01/cars_msrp_prediction_pca

Awesome Lists containing this project

README