Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sim98b/housing
Predicting California House block's price
https://github.com/sim98b/housing
Last synced: 9 days ago
JSON representation
Predicting California House block's price
- Host: GitHub
- URL: https://github.com/sim98b/housing
- Owner: Sim98B
- Created: 2023-09-01T08:44:27.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-11-06T16:51:56.000Z (3 months ago)
- Last Synced: 2024-11-06T17:44:27.636Z (3 months ago)
- Language: Jupyter Notebook
- Size: 9.18 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
**Predicting California House blocks'price.**
![Final results on test data](https://github.com/Sim98B/Housing/blob/main/Cover%20Picture.png?raw=true)
Full dataset is available on Kaggle: https://www.kaggle.com/datasets/kathuman/housing**Problem Setting**
1. Every instance refers to an house block in california.
2. Features refers to both structural and population characteristics of every block.
3. The block's median price is the **target variable**.**Analysis**
1. There were 20640 instances and 10 features
2. 207 null-values were detected and dropped, they were just the **1.003%**.
3. **5.21%** of the whole dataset are outlier, instances outside the interquartile range * 1.5 and were removed; we worked with 19369 instances
4. Multicollinearity was detected between four features: total rooms, total bedrooms, block population and number of housholders per block. To avoid introducing this multicollinearity 2 new features were created: bedrooms density and household density which didn't strongly correlate with any other feature**Modeling the problem**
1. The whole dataset was split into train, test and validation, respectively **80%, 10% and 10%**
2. First, several models were trained with default parameters to select the top 3
3. Then hyper parameters of these models were tuned to get the best performance
4. An ensemble was implemented to outperform basic regressor**Conclusion**
1. Ensmeble seems to be the best model, the one that should be implemented achieving an explained variance on never_seen_data of **80.188%**
2. Ensmeble predictions' residuals don't follow a normal distribution; there is a general tendency for prices to be underestimated. The main problem is the understimation of lower prices bacause it could lead to a negative price's estimation
3. Despite this this model has a mean error of **42578$**, which as percentage is a mean of **16.228%** error