https://github.com/asenacak/usedcarsml
- Host: GitHub
- URL: https://github.com/asenacak/usedcarsml
- Owner: asenacak
- License: mit
- Created: 2024-04-24T15:09:29.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-23T18:09:16.000Z (6 months ago)
- Last Synced: 2025-01-18T07:44:39.559Z (4 months ago)
- Topics: data-preprocessing, decision-tree-regressor, feature-selection, gridsearchcv, hyperband, hyperparameter-tuning, imputation-methods, keras, keras-tuner, linear-regression, machine-learning, neural-networks, normalization, one-hot-encoding, price-prediction, random-forest-regression, selectkbest, tensorflow, used-cars-price-prediction, xgboost-regression
- Language: Jupyter Notebook
- Homepage:
- Size: 229 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Used Cars Price Prediction

This project involves predicting the prices of used cars using various machine learning models. The dataset used for this analysis comes from the [US Used Cars dataset (3 million)](https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset), containing detailed information about used cars in the United States.
## Project Overview
The goal of this project is to build and compare multiple regression models to predict the prices of used cars. The models included in this analysis are:
- **Linear Regression**
- **Decision Trees**
- **Random Forest**
- **eXtreme Gradient Boosting (XGBoost)**
- **Deep Neural Networks (DNN)**

The performance of each model is evaluated using **Root Mean Squared Error (RMSE)** as the key metric. Additionally, techniques such as **GridSearchCV** and **Hyperband (Keras Tuner)** are used to fine-tune model parameters and achieve optimal performance.
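
As a reminder of what the metric measures, here is a minimal sketch of RMSE in plain Python. The function name `rmse` and the sample prices are illustrative assumptions and are not taken from the project's notebooks.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Average the squared residuals, then take the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices for illustration only, not values from the dataset.
actual = [10000.0, 15000.0, 20000.0]
predicted = [11000.0, 14000.0, 20000.0]
print(rmse(actual, predicted))
```

Because the residuals are squared before averaging, a few large errors raise RMSE more than many small ones, and the result is expressed in the same units as the target.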
## Data Source
The dataset used in this project is publicly available on Kaggle:
- **[US Used Cars dataset (3 million)](https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset)**
## Data Cleaning and Preprocessing
1. **Handling Placeholder Characters**: The dataset contains placeholder characters represented as `--`, which are cleaned before further processing.
2. **Missing Values**: A comprehensive approach is used to deal with missing data:
   - Removal of missing entries in certain columns.
   - Mean imputation, mode imputation, and multiple imputation techniques are applied where appropriate.
3. **Categorical Encoding**: One-hot encoding is applied to convert categorical variables into a numerical format that can be used by machine learning models.
4. **Normalization**: Numerical columns are standardized to ensure consistent scaling across features.
5. **Feature Selection**: The **SelectKBest** method is used to identify the most informative features for predictive modeling, enhancing model performance.
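
The steps above can be sketched with pandas and scikit-learn as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the column names, the toy values, the choice of `f_regression` as the scoring function, and `k=2` are all hypothetical. Row removal and multiple imputation are omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame; column names are hypothetical, not the dataset's schema.
df = pd.DataFrame({
    "price": [12000.0, 8500.0, 15500.0, 9900.0, 21000.0, 7000.0],
    "mileage": [45000.0, np.nan, 30000.0, 82000.0, 15000.0, 120000.0],
    "horsepower": ["150", "--", "200", "130", "250", "110"],
    "body_type": ["SUV", "Sedan", None, "Sedan", "SUV", "Sedan"],
})

# 1. Treat the `--` placeholder as a missing value, then parse the numbers.
df = df.replace("--", np.nan)
df["horsepower"] = pd.to_numeric(df["horsepower"])

# 2. Mean imputation for numeric columns, mode imputation for the categorical one.
num_cols = ["mileage", "horsepower"]
for col in num_cols:
    df[col] = df[col].fillna(df[col].mean())
df["body_type"] = df["body_type"].fillna(df["body_type"].mode()[0])

# 3. One-hot encode the categorical column.
X = pd.get_dummies(df.drop(columns="price"), columns=["body_type"], dtype=float)
y = df["price"]

# 4. Standardize numeric columns to zero mean and unit variance.
X[num_cols] = StandardScaler().fit_transform(X[num_cols])

# 5. Keep the k features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)
selected_names = list(X.columns[selector.get_support()])
print(selected_names)
```

In a real workflow the imputation statistics, the scaler, and the selector would normally be fit on the training split only and then applied to the test split, so that no information leaks from the held-out data.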
## Model Evaluation
After preprocessing, five regression models are trained and evaluated using **RMSE** as the key metric. The results are as follows:
- **Random Forest**: Best performance with an RMSE of **0.00265**.
- **XGBoost**: Second-best performance with an RMSE of **0.00278**.
- **Decision Tree**: RMSE of **0.00282**, closely following XGBoost.
- **Deep Neural Network (DNN)**: RMSE of **0.00283**, slightly higher than Decision Tree.
- **Linear Regression**: RMSE of **0.00393**, roughly 48% higher than Random Forest and the weakest of the five.
## Model Tuning
- **GridSearchCV**: Used for hyperparameter tuning of the XGBoost and Decision Tree models.
- **Keras Tuner (Hyperband)**: Used to optimize the Deep Neural Network's hyperparameters.
## Conclusion
Among the five models tested, **Random Forest** outperformed the others, achieving the lowest RMSE, indicating its superior ability to predict used car prices. **XGBoost**, **Decision Tree**, and **DNN** performed similarly, while **Linear Regression** lagged behind in terms of predictive accuracy.
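
For readers who want to reproduce this kind of comparison, the tune-then-evaluate workflow can be sketched with scikit-learn. The example below is an illustration only: it uses synthetic data from `make_regression` in place of the preprocessed car features, and the parameter grid, `cv=3`, and the 80/20 split are assumptions, since the README does not state the notebook's actual settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the preprocessed car features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical search space; the project's actual grid is not given here.
param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 5, 10]}

# Exhaustive search over the grid, scored by (negated) RMSE with 3-fold CV.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

# RMSE of the refit best model on the held-out test split.
test_rmse = float(np.sqrt(mean_squared_error(y_test, search.predict(X_test))))
print(search.best_params_, test_rmse)
```

scikit-learn maximizes scores, which is why the RMSE scorer is negated. `search.best_score_` is therefore at most zero, and its absolute value is the mean cross-validated RMSE of the best configuration.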
## How to Run the Project
1. Clone this repository.
```bash
git clone https://github.com/asenacak/UsedCarsML.git
```
2. Download the dataset from the following link: [US Used Cars dataset (3 million)](https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset).
3. Install the necessary dependencies.
4. Run the Jupyter notebooks:
   * `Used_Cars_Data_Cleaning.ipynb`: Data preprocessing and cleaning.
   * `models_usedcars.ipynb`: Model building, evaluation, and hyperparameter tuning.
## Dependencies
* Python 3
* Jupyter Notebook
* pandas
* scikit-learn
* XGBoost
* TensorFlow/Keras
* Keras Tuner
* matplotlib
## License
This project is licensed under the MIT License.