Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dmarks84/ind_project_california-housing-data--kaggle
Independent Project - Kaggle Dataset: I worked on the California Housing dataset, performing data cleaning and preparation, exploratory data analysis, feature engineering, regression model building, and model evaluation.
- Host: GitHub
- URL: https://github.com/dmarks84/ind_project_california-housing-data--kaggle
- Owner: dmarks84
- License: bsd-3-clause
- Created: 2024-01-30T23:53:22.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-03-28T22:14:50.000Z (7 months ago)
- Last Synced: 2024-03-29T22:40:55.648Z (7 months ago)
- Topics: cross-validation, data-modeling, data-reporting, data-visualization, eda, folium, grid-search, matplotlib, model-evaluation, numpy, pandas, pca, python, seaborn, sklearn, statistics, supervised-ml, unsupervised-ml
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/code/danpmarks/california-housing-prices-1990
- Size: 2.02 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Ind_Project_California-Housing-Data--Kaggle
## Screenshot
![screenshot](https://github.com/dmarks84/Ind_Project_California-Housing-Data--Kaggle/blob/main/housing_screenshot.png?raw=true)
## Summary
I worked on the California Housing Data dataset. The dataset provided 9 features that I used to try to predict the median house value. I performed extended exploratory data analysis to decide how best to encode the categorical data and scale the numeric features. I also used PCA to add features, and K-means clustering to identify potentially meaningful cluster identifiers. This information was fed into several regression models/estimators, all of which were run through grid search and cross-validation to determine the best model on split training and validation data.
## Results
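The feature-engineering step described in the Summary (adding PCA components and a K-means cluster identifier to the scaled features) could look roughly like the sketch below. It assumes scikit-learn and is not the notebook's actual code: the function name `add_pca_and_cluster_features`, the component and cluster counts, and the random stand-in data are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def add_pca_and_cluster_features(X_train, X_valid, n_components=2, n_clusters=5, seed=0):
    """Append PCA components and a K-means cluster label to scaled features.

    Every transformer is fit on the training split only and then applied to
    the validation split, so no validation information leaks into training.
    """
    scaler = StandardScaler().fit(X_train)
    train_scaled = scaler.transform(X_train)
    valid_scaled = scaler.transform(X_valid)

    pca = PCA(n_components=n_components, random_state=seed).fit(train_scaled)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(train_scaled)

    def augment(scaled):
        # Columns: scaled originals, then PCA components, then one cluster-id column.
        return np.hstack([
            scaled,
            pca.transform(scaled),
            kmeans.predict(scaled).reshape(-1, 1),
        ])

    return augment(train_scaled), augment(valid_scaled)


# Random stand-in data (assumption): 8 numeric features, not the real dataset.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))
X_valid = rng.normal(size=(50, 8))

X_train_aug, X_valid_aug = add_pca_and_cluster_features(X_train, X_valid)
print(X_train_aug.shape, X_valid_aug.shape)
```

The cluster identifier is kept as a single integer column here for brevity. Tree-based models can split on it directly, but a linear model would usually need it one-hot encoded, since the cluster numbers carry no ordering.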
### Model Selected & Scoring
Of the models considered, the best was a **Gradient Boosted Decision Tree with max depth set to 8** (with default parameters otherwise, e.g., learning rate = 0.1). The score on validation data was **~0.84**, which is respectable but likely leaves room for improvement. The score on the training data was higher (0.95), and that gap suggests the model is overfitting to some degree.
## Skills (Developed & Applied)
Programming, Python, Statistics, Numpy, Pandas, Matplotlib, Scikit-learn, Dataframes, Data Modeling, EDA, Data Visualization, Data Reporting, Classification, Supervised ML, Cross Validation, Grid search, Unsupervised ML, PCA, Seaborn, Folium
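As one illustration of the grid search and cross-validation skills listed above, the sketch below tunes `max_depth` for a scikit-learn `GradientBoostingRegressor` and then compares training and validation scores. It is a minimal example under stated assumptions, not the notebook's code: the data is synthetic (`make_regression`), the grid values, fold count, and `n_estimators=50` are hypothetical, and R^2 is assumed as the metric because it is the default `score` for scikit-learn regressors.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): not the California Housing dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid: only the tree depth is tuned, other parameters stay fixed.
param_grid = {"max_depth": [2, 4, 8]}
search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid,
    cv=3,           # 3-fold cross-validation on the training split
    scoring="r2",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training split with the best depth.
best_model = search.best_estimator_
train_score = best_model.score(X_train, y_train)
valid_score = best_model.score(X_valid, y_valid)

print("best max_depth:", search.best_params_["max_depth"])
print(f"train R^2: {train_score:.2f}, validation R^2: {valid_score:.2f}")
```

Comparing the two printed scores is a quick overfitting check: a training score well above the validation score suggests the model is fitting noise in the training split.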