Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/dmarks84/ind_project_california-housing-data--kaggle

Independent Project - Kaggle Dataset: I worked on the California Housing dataset, performing data cleaning and preparation, exploratory data analysis, feature engineering, regression model building, and model evaluation.
cross-validation data-modeling data-reporting data-visualization eda folium grid-search matplotlib model-evaluation numpy pandas pca python seaborn sklearn statistics supervised-ml unsupervised-ml

Host: GitHub
URL: https://github.com/dmarks84/ind_project_california-housing-data--kaggle
Owner: dmarks84
License: bsd-3-clause
Created: 2024-01-30T23:53:22.000Z (11 months ago)
Default Branch: main
Last Pushed: 2024-10-18T21:06:58.000Z (3 months ago)
Last Synced: 2024-11-05T19:22:17.259Z (about 2 months ago)
Topics: cross-validation, data-modeling, data-reporting, data-visualization, eda, folium, grid-search, matplotlib, model-evaluation, numpy, pandas, pca, python, seaborn, sklearn, statistics, supervised-ml, unsupervised-ml
Language: Jupyter Notebook
Homepage: https://www.kaggle.com/code/danpmarks/california-housing-prices-1990
Size: 2.02 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Ind_Project_California-Housing-Data--Kaggle

## Screenshot
![screenshot](https://github.com/dmarks84/Ind_Project_California-Housing-Data--Kaggle/blob/main/housing_screenshot.png?raw=true)

## Summary
I worked on the California Housing Data dataset. The dataset provided 9 features that I used to try to predict the median housing value. I performed extended exploratory data analysis to decide how best to encode the categorical data and scale the numeric features. I also used PCA to add features and K-means clustering to identify potentially meaningful cluster identifiers. This information was fed into several regression models/estimators, all of which were then run through grid search and cross-validation to determine the best model on split training and validation data.
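The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the data is synthetic, and the column names, the number of PCA components, the number of clusters, and the parameter grid are all assumptions chosen for the example. The score reported is R², which is scikit-learn's default for regressors; the metric used in the project is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): a few housing-style columns, not the real dataset.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, n),
    "housing_median_age": rng.integers(1, 52, n).astype(float),
    "latitude": rng.uniform(32.5, 42.0, n),
    "longitude": rng.uniform(-124.3, -114.3, n),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
y = 40_000 * X["median_income"] + rng.normal(0, 20_000, n)

numeric = ["median_income", "housing_median_age", "latitude", "longitude"]
categorical = ["ocean_proximity"]

# Scale numeric columns and one-hot encode the categorical one (dense output).
encode_and_scale = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    sparse_threshold=0,
)

# Keep the prepared columns and append PCA components plus
# K-means distances to each cluster center as extra features.
add_features = FeatureUnion([
    ("original", FunctionTransformer()),  # identity transform
    ("pca", PCA(n_components=2)),
    ("kmeans", KMeans(n_clusters=4, n_init=10, random_state=0)),
])

pipe = Pipeline([
    ("prep", encode_and_scale),
    ("features", add_features),
    ("model", GradientBoostingRegressor(random_state=0)),
])

# Hypothetical grid: only the tree depth is searched here.
param_grid = {"model__max_depth": [2, 4, 8]}

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

val_score = search.score(X_val, y_val)  # R^2 on held-out validation data
print("best params:", search.best_params_)
print("validation R^2:", round(val_score, 3))
```

Putting the preprocessing and feature steps inside the pipeline means they are refit on each cross-validation fold, so the scaler, PCA, and K-means never see the held-out fold.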

## Results
### Model Selected & Scoring
Of the models considered, the best was a **Gradient Boosted Decision Tree with max depth set to 8** (with default parameters otherwise [e.g., learning rate = 0.1]). The score on validation data was **~0.84**, which is a respectable score, but there is likely room for improvement. The score on the training data was higher (0.95); this gap of roughly 0.11 suggests some overfitting, which is one place that improvement could come from.
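One way to check for overfitting is to compare training and validation scores side by side. The sketch below does this with cross-validation for a gradient boosted regressor at max depth 8. It is an illustration only: the data comes from scikit-learn's `make_regression` (an assumption, not the housing data), so the printed numbers will not match the scores reported above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

# Synthetic regression problem with 9 features (stand-in data, an assumption).
X, y = make_regression(n_samples=400, n_features=9, noise=25.0, random_state=0)

model = GradientBoostingRegressor(max_depth=8, learning_rate=0.1, random_state=0)

# return_train_score=True reports the score on each training fold as well,
# so the two can be compared directly.
cv = cross_validate(model, X, y, cv=5, return_train_score=True)

train_score = cv["train_score"].mean()
val_score = cv["test_score"].mean()
gap = train_score - val_score

print(f"train R^2:      {train_score:.3f}")
print(f"validation R^2: {val_score:.3f}")
print(f"gap:            {gap:.3f}")
```

A large gap between the two scores is the usual sign of overfitting; a high training score on its own does not rule it out.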

## Skills (Developed & Applied)
Programming, Python, Statistics, NumPy, Pandas, Matplotlib, Scikit-learn, Dataframes, Data Modeling, EDA, Data Visualization, Data Reporting, Classification, Supervised ML, Cross Validation, Grid Search, Unsupervised ML, PCA, Seaborn, Folium