https://github.com/mikekeith52/simulate_data
Using Boston, which is a dataset of only about 500 records, simulate 29000+ observations with added variables.
- Host: GitHub
- URL: https://github.com/mikekeith52/simulate_data
- Owner: mikekeith52
- Created: 2020-06-21T00:31:40.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-10-23T12:40:34.000Z (over 4 years ago)
- Last Synced: 2025-02-09T08:17:20.637Z (5 months ago)
- Language: Jupyter Notebook
- Size: 12.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data Simulation
Using the [Boston dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html#:~:text=The%20Boston%20Housing%20Dataset,the%20area%20of%20Boston%20Mass.&text=It%20was%20obtained%20from%20the,the%20literature%20to%20benchmark%20algorithms.), a new dataset of over 29,000 records is simulated. It includes created variables: median_income, which is highly correlated with median housing prices, and population and families, which are highly correlated with one another. A sketch of this step is shown below.
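A minimal sketch of how such correlated columns could be added, assuming a local copy of the Boston data in `boston.csv` with its `medv` (median home value) column; the column names, scales, and noise levels here are illustrative, not the repository's actual code:

```python
import numpy as np
import pandas as pd

# Assumed local copy of the Boston data with a `medv` column.
boston = pd.read_csv("boston.csv")
rng = np.random.default_rng(42)

# median_income: scale medv and add modest noise so the correlation with
# median housing prices stays high but is not perfect.
boston["median_income"] = boston["medv"] * 1_000 + rng.normal(0, 2_000, len(boston))

# population and families: generate one series, then derive the other from it
# with noise so the two are highly correlated with each other.
boston["population"] = rng.integers(5_000, 50_000, len(boston))
boston["families"] = (boston["population"] / 3 + rng.normal(0, 500, len(boston))).round()
```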
The new records were simulated with [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html), and several transformations were applied with numpy.random functions to manipulate the outcomes.
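Since SMOTE expects class labels, one common workaround for inflating a regression dataset is to bin the target into temporary classes and request a larger sample count per class. The sketch below, continuing from the previous one, illustrates that idea; the bin count, per-class targets, and jitter are assumptions, not the notebook's actual parameters:

```python
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

# Temporary classes so SMOTE can interpolate new rows within each bin.
y_bins = pd.qcut(boston["medv"], q=4, labels=False)
target_per_class = 7_500  # ~30,000 rows total across 4 bins (illustrative)

sm = SMOTE(
    sampling_strategy={c: target_per_class for c in np.unique(y_bins)},
    random_state=42,
)
X_big, _ = sm.fit_resample(boston, y_bins)  # synthetic rows via interpolation

# Post-hoc transformations with numpy.random, e.g. jittering a column.
rng = np.random.default_rng(0)
X_big["median_income"] *= rng.uniform(0.9, 1.1, size=len(X_big))
```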
Finally, a "sales" variable is added to the dataset, representing a company's average sales in a county over a given set of months. It is a linear combination of the other variables (all weights assigned through trial and error until the outcome looked right) with random error applied. Sales below a certain cutoff are lifted so that a small peak forms in the variable's left tail, making its distribution somewhat bimodal. This is done so that more advanced regression techniques, such as neural networks, can predict the outcome better than linear regression, which favors normally distributed data.
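A hedged sketch of how a variable like this could be constructed, continuing from the sketches above; the weights, error scale, and cutoff below are placeholders, since the real weights were chosen by trial and error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hand-weighted linear combination of other columns plus a random error term.
sales = (
    0.5 * X_big["median_income"]
    + 0.02 * X_big["population"]
    + 0.1 * X_big["families"]
    + rng.normal(0, 1_000, len(X_big))
)

# Lift sales below a cutoff toward a second, smaller mode so the overall
# distribution becomes somewhat bimodal (cutoff and spread are illustrative).
cutoff = sales.quantile(0.10)
low = sales < cutoff
sales[low] = cutoff * 0.8 + rng.normal(0, 200, low.sum())
X_big["sales"] = sales
```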
The two resulting datasets, new.csv and existing.csv, are random draws of about 7,000 and 21,000 observations respectively from the created dataset. A random county_id variable is added to both, and the sales variable is dropped from new.csv.
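A possible way to produce this split, assuming the simulated data lives in the `X_big` DataFrame from the earlier sketches; the county_id range and random seeds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Shuffle, then carve off roughly 7,000 rows for new.csv and keep the rest
# for existing.csv; attach a random county_id to both.
shuffled = X_big.sample(frac=1, random_state=2).reset_index(drop=True)
shuffled["county_id"] = rng.integers(1, 101, len(shuffled))  # illustrative ID range

new_df = shuffled.iloc[:7_000].drop(columns="sales")  # sales withheld from new.csv
existing_df = shuffled.iloc[7_000:]

new_df.to_csv("new.csv", index=False)
existing_df.to_csv("existing.csv", index=False)
```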
This project was in service of one of my [Apress publications](https://github.com/mikekeith52/PythonRegression). Link to the video: [link](https://link.springer.com/video/10.1007/978-1-4842-6583-3).