https://github.com/sadegh15khedry/housing-prices-prediction-using-randomforest
This repository contains an implementation of a random forest model to predict housing prices using the Boston Housing dataset.
- Host: GitHub
- URL: https://github.com/sadegh15khedry/housing-prices-prediction-using-randomforest
- Owner: sadegh15khedry
- License: mit
- Created: 2024-06-20T11:02:14.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2024-08-12T06:28:01.000Z (about 1 year ago)
- Last Synced: 2025-01-10T10:58:15.813Z (9 months ago)
- Topics: csv, joblib, jupyter-notebook, matplotlib, numpy, pandas, pil, python, random-forest, seaborn, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 5.59 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Random Forest Regressor for Boston Housing Dataset
## Introduction
This project utilizes a Random Forest Regressor model to predict housing prices using the Boston Housing dataset. It includes various components such as data preprocessing, model training, evaluation, and utility functions for saving and loading models and data.
## Table of Contents
- [Random Forest Regressor](#random-forest-regressor)
- [Installation](#installation)
- [Usage](#usage)
- [Dataset](#dataset)
- [Folder Structure](#folder-structure)
- [Data Exploration](#data-exploration)
- [Data Preprocessing](#data-preprocessing)
- [Model Training](#model-training)
- [Model Evaluation](#model-evaluation)
- [Utils](#utils)
- [Results](#results)
- [Contributing](#contributing)
- [License](#license)

## Random Forest Regressor
The Random Forest Regressor is an ensemble learning method that constructs multiple decision trees during training and outputs the mean prediction of the individual trees for regression tasks. Here is a brief overview of how it works:
1. **Ensemble Method**: It combines multiple decision trees to improve generalizability and robustness over a single decision tree model.
2. **Tree Construction**: Each tree is built using a subset of the training data and a random selection of features. This randomness helps to reduce overfitting.
3. **Prediction**: For regression tasks, predictions are made by averaging the predictions of all the individual trees in the forest.
4. **Hyperparameters**: Important hyperparameters include the number of trees (`n_estimators`), maximum depth of each tree (`max_depth`), and the number of features considered for splitting at each node (`max_features`).
Random Forests are widely used due to their ability to handle large datasets with high dimensionality and noisy data, while also providing good accuracy and robustness.
For more details on the implementation and parameters, refer to the `model_training.ipynb` notebook and the scikit-learn documentation on [RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
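As a concrete illustration of these hyperparameters, the sketch below trains a scikit-learn `RandomForestRegressor` on the project's CSV. The file path and parameter values are illustrative assumptions, not the exact settings used in the notebooks.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative only: the path assumes running from the notebooks/ folder, and the
# hyperparameter values are not necessarily those used in model_training.ipynb.
df = pd.read_csv('../datasets/housing_prices_boston.csv')
X = df.drop(columns=['MEDV'])  # features
y = df['MEDV']                 # target: median value of owner-occupied homes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(
    n_estimators=100,   # number of trees in the forest
    max_depth=None,     # grow each tree until its leaves are pure
    max_features=1.0,   # fraction of features considered at each split
    random_state=42,
)
model.fit(X_train, y_train)
print(model.predict(X_test.head()))
```

Increasing `n_estimators` generally trades extra training time for more stable predictions.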
## Installation
Ensure Python 3.x is installed along with the dependencies listed in `requirements.txt`. Install them using pip:
```bash
pip install -r requirements.txt
```

## Usage
Clone the repository and navigate to the project directory:
```bash
git clone https://github.com/sadegh15khedry/Housing-Prices-Prediction-Using-RandomForest.git
cd Housing-Prices-Prediction-Using-RandomForest
```

Run the Jupyter notebooks for different aspects of the project:
- `data_exploration.ipynb`: Explore the dataset and visualize correlations.
- `data_preprocessing.ipynb`: Preprocess the dataset by removing duplicates and null values, and split into training and test sets.
- `model_training.ipynb`: Train the Random Forest Regressor model.
- `model_evaluation.ipynb`: Evaluate the trained model and calculate Mean Squared Error (MSE).

## Dataset
The Boston Housing dataset contains various factors that might influence housing prices in Boston suburbs. Features include crime rate, property tax rate, and accessibility to highways. The target variable is the median value of owner-occupied homes (MEDV).
## Folder Structure
The project follows a standard folder structure convention:
- **datasets/**: Contains dataset files.
- **models/**: Stores trained machine learning models.
- **notebooks/**: Jupyter notebooks for data exploration, preprocessing, model training, and evaluation.
- **src/**: Source code directory containing Python scripts for data processing, model training, evaluation, and utility functions.

## Data Exploration
Explore the dataset to understand its structure and statistical summaries:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv('../datasets/housing_prices_boston.csv')

# Display information about columns
df.info()

# Describe statistical summary
df.describe()

# Correlation matrix
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix Heatmap')
plt.show()
```

## Data Preprocessing
Prepare the data by removing duplicates and null values, then split it into training and test sets:
```python
from data_prepocessing import load_data, split_data, preprocess_data
from utils import save_dataframe_as_csv

# Load dataset
df = load_data("housing_prices_boston.csv")
# Preprocess data
df = preprocess_data(df)
# Select feature columns
feature_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# Select label column
label_column = 'MEDV'
# Shuffle data
df = df.sample(frac=1).reset_index(drop=True)

# Split data into training and test sets
x_train, x_test, y_train, y_test = split_data(df, feature_columns, label_column)

# Save dataframes as CSV
save_dataframe_as_csv(x_train, "../datasets/x_train.csv")
save_dataframe_as_csv(y_train, "../datasets/y_train.csv")
save_dataframe_as_csv(x_test, "../datasets/x_test.csv")
save_dataframe_as_csv(y_test, "../datasets/y_test.csv")
```
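`load_data`, `preprocess_data`, and `split_data` are defined under `src/`; the repository's own scripts are the reference, but a minimal sketch consistent with how they are called above (and with the preprocessing steps described here) might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sketch only: signatures and defaults are inferred from the calls above,
# not copied from the repository's src/ scripts.
def load_data(filename, folder='../datasets'):
    return pd.read_csv(f'{folder}/{filename}')

def preprocess_data(df):
    # Remove duplicate rows and rows containing null values
    return df.drop_duplicates().dropna()

def split_data(df, feature_columns, label_column, test_size=0.2):
    x = df[feature_columns]
    y = df[label_column]
    return train_test_split(x, y, test_size=test_size, random_state=42)
```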
## Model Training
Train the Random Forest Regressor model:
```python
from sklearn.ensemble import RandomForestRegressor
from model_training import train_model
from utils import save_model

# Train model
model = train_model(x_train, y_train, estimators=100)

# Save trained model
save_model(model, '../models/random_forest.joblib')
```
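`train_model` lives in `src/model_training.py`; a minimal sketch matching the call `train_model(x_train, y_train, estimators=100)` (any other settings are assumptions) could be:

```python
from sklearn.ensemble import RandomForestRegressor

# Sketch only: mirrors the call above; extra hyperparameters are assumptions,
# not the repository's exact configuration.
def train_model(x_train, y_train, estimators=100):
    model = RandomForestRegressor(n_estimators=estimators, random_state=42)
    model.fit(x_train, y_train)
    return model
```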
## Model Evaluation
Evaluate the trained model using Mean Squared Error (MSE):
```python
from sklearn.metrics import mean_squared_error
from model_evaluation import evaluate_model
from data_prepocessing import load_data
from utils import load_model

# Load trained model
model = load_model('../models/random_forest.joblib')

# Load test data
x_test = load_data("x_test.csv")
y_test = load_data("y_test.csv")

# Evaluate model
mse_test = evaluate_model(model, x_test, y_test)
print(f"Test MSE: {mse_test}")
```
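`evaluate_model` comes from `src/model_evaluation.py`; given that the notebook reports a test MSE, a minimal sketch consistent with the call above would be:

```python
from sklearn.metrics import mean_squared_error

# Sketch only: assumes the helper predicts on the test set and returns the MSE.
def evaluate_model(model, x_test, y_test):
    predictions = model.predict(x_test)
    return mean_squared_error(y_test, predictions)
```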
## Utils
Utility functions for saving and loading dataframes, models, confusion matrices, and reports:
```python
from utils import save_confusion_matrix, save_report, save_dataframe_as_csv, save_model, load_model

# Example: Save confusion matrix
save_confusion_matrix(cm, "confusion_matrix.png")
# Example: Save classification report
save_report(report, "classification_report.txt")
# Example: Save dataframe as CSV
save_dataframe_as_csv(df, "data.csv")
# Example: Save trained model
save_model(model, "model.joblib")
# Example: Load trained model
loaded_model = load_model("model.joblib")
```
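`save_model` and `load_model` persist models to `.joblib` files, so they most likely wrap joblib; a minimal sketch (not the repository's exact code):

```python
import joblib

# Sketch only: assumes the helpers delegate directly to joblib.
def save_model(model, path):
    joblib.dump(model, path)

def load_model(path):
    return joblib.load(path)
```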
## Results
### Train
- Mean Squared Error: 2.24
- Mean Absolute Error (MAE): 0.89
- Root Mean Squared Error (RMSE): 1.50
- R-squared: 0.97
- Adjusted R-squared: 0.97

### Validation
- Mean Squared Error: 15.66
- Mean Absolute Error (MAE): 2.61
- Root Mean Squared Error (RMSE): 3.96
- R-squared: 0.74
- Adjusted R-squared: 0.65

### Test
- Mean Squared Error: 12.65
- Mean Absolute Error (MAE): 2.48
- Root Mean Squared Error (RMSE): 3.56
- R-squared: 0.84
- Adjusted R-squared: 0.80
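For reference, these metrics can be computed with scikit-learn; adjusted R-squared has no built-in helper but follows from R-squared, the sample count n, and the feature count p. A minimal sketch, reusing the model and test split from the sections above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Illustrative computation of the reported metrics; `model`, `x_test`, and
# `y_test` come from the training and evaluation steps above.
y_pred = model.predict(x_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Adjusted R-squared penalizes R-squared for the number of features p
n, p = x_test.shape
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```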
## Contributing
Contributions are welcome! For major changes, please open an issue first to discuss what you would like to change.
## License
This project is licensed under the MIT License - see the LICENSE file for details.