https://github.com/bugdaryan/intro2ds_final
- Host: GitHub
- URL: https://github.com/bugdaryan/intro2ds_final
- Owner: bugdaryan
- Created: 2021-12-26T13:08:59.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-12-26T13:10:10.000Z (over 4 years ago)
- Last Synced: 2025-01-10T17:48:47.654Z (over 1 year ago)
- Language: Python
- Size: 548 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Final project for the Intro to DS subject
This repo contains a Python implementation of data cleaning and dataset splitting.
The input data is the `data.csv` file.
## Data cleaning
We drop a column if it meets any of the following conditions:
- The column has only one unique value.
- More than 80% of its values are missing.
- The column contains strings and has more than 60 unique values.
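The three rules above could be sketched with pandas roughly as follows. This is a minimal illustration, not the repo's actual code: the function name `drop_uninformative_columns` and its parameters are hypothetical, and treating every non-numeric column as a "string" column is an assumption.

```python
import pandas as pd


def drop_uninformative_columns(df, max_missing_ratio=0.8, max_unique_strings=60):
    """Return a copy of df without columns that match any drop rule.

    Hypothetical helper; thresholds mirror the 80% and 60 from the README.
    """
    to_drop = []
    for col in df.columns:
        series = df[col]
        n_unique = series.nunique(dropna=True)  # NaN is not counted as a value
        missing_ratio = series.isna().mean()    # fraction of missing rows
        # Assumption: any non-numeric column is treated as a string column.
        is_string = not pd.api.types.is_numeric_dtype(series)

        if n_unique <= 1:
            to_drop.append(col)                 # constant (or empty) column
        elif missing_ratio > max_missing_ratio:
            to_drop.append(col)                 # mostly missing
        elif is_string and n_unique > max_unique_strings:
            to_drop.append(col)                 # high-cardinality strings
    return df.drop(columns=to_drop)
```

Returning a new frame instead of dropping in place keeps the original data available for inspection.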
## Column type detection
Here we identify categorical and numerical columns as follows:
- If a column contains only numbers and its number of unique values is more than 60, it is identified as numeric.
- Otherwise, it is identified as categorical.
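A minimal sketch of this rule, assuming the data is held in a pandas DataFrame. The function name `detect_column_types` and the parameter `min_unique_numeric` are hypothetical names chosen for illustration.

```python
import pandas as pd


def detect_column_types(df, min_unique_numeric=60):
    """Split column names into (numeric, categorical) lists.

    Hypothetical helper: a column is numeric only if its dtype is numeric
    AND it has more than `min_unique_numeric` distinct values.
    """
    numeric_cols, categorical_cols = [], []
    for col in df.columns:
        series = df[col]
        is_number = pd.api.types.is_numeric_dtype(series)
        if is_number and series.nunique() > min_unique_numeric:
            numeric_cols.append(col)
        else:
            categorical_cols.append(col)
    return numeric_cols, categorical_cols
```

Note that under this rule a numeric column with few distinct values (for example a 0/1 flag) ends up in the categorical list.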
## Missing values
For each numeric column, we fill missing values with the mean of that column.
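Mean imputation might look like the sketch below. The helper name `fill_numeric_with_mean` is hypothetical, and `numeric_cols` is assumed to be the list of numeric column names produced by the type detection step.

```python
import pandas as pd


def fill_numeric_with_mean(df, numeric_cols):
    """Return a copy of df with NaNs in numeric_cols replaced by the column mean."""
    out = df.copy()
    for col in numeric_cols:
        # Series.mean() skips NaN by default, so the mean uses observed values only.
        out[col] = out[col].fillna(out[col].mean())
    return out
```

Columns not listed in `numeric_cols` are left untouched, so missing categorical values stay missing.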
## Categorical preprocessing
Because we have many categorical values, and some columns have high cardinality, we keep only the top 70% of categorical values and assign `Other` to the rest.
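The README does not say exactly how "top 70%" is measured. The sketch below assumes one possible reading: rank the distinct categories of each column by frequency and keep the most frequent 70% of them. Keeping the categories that together cover 70% of the rows would be an equally plausible reading. The function name `collapse_rare_categories` and its parameters are hypothetical.

```python
import pandas as pd


def collapse_rare_categories(df, categorical_cols, keep_ratio=0.7, other_label="Other"):
    """Replace the least frequent categories in each column with other_label.

    Assumption: "top 70%" means the most frequent 70% of distinct categories.
    """
    out = df.copy()
    for col in categorical_cols:
        values = out[col].astype(object)   # allow mixing in the string label
        counts = values.value_counts()     # sorted by frequency, NaN excluded
        n_keep = max(1, int(len(counts) * keep_ratio))
        keep = set(counts.index[:n_keep])
        # Keep frequent categories and missing values; relabel everything else.
        mask = values.isin(keep) | values.isna()
        out[col] = values.where(mask, other_label)
    return out
```

Collapsing rare categories before encoding limits how many indicator columns the encoding step creates.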
## Dummy encoding
We go over the categorical columns and apply dummy encoding to all of them.
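One common way to do this in pandas is `pd.get_dummies`, shown in the sketch below. Whether the repo uses this function, and whether it drops one level per column, is not stated in the README, so the wrapper name `dummy_encode` and the `drop_first` default are assumptions.

```python
import pandas as pd


def dummy_encode(df, categorical_cols, drop_first=False):
    """Replace each categorical column with one indicator column per category.

    Hypothetical wrapper; set drop_first=True to drop one level per column
    and avoid perfectly collinear indicators.
    """
    return pd.get_dummies(df, columns=categorical_cols, drop_first=drop_first)
```

For a column `c` with values `a` and `b`, this produces columns named `c_a` and `c_b`, placed after the columns that were not encoded.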
## Splitting
We split the data 0.8 (train) / 0.2 (test) and save the two parts to the `train.csv` and `test.csv` files.
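A minimal sketch of the split using only pandas. The README does not say whether rows are shuffled or which seed is used, so the random sampling and the `seed` parameter here are assumptions, and `split_and_save` is a hypothetical name.

```python
import pandas as pd


def split_and_save(df, train_frac=0.8, seed=0,
                   train_path="train.csv", test_path="test.csv"):
    """Randomly split df into train/test parts and write both to CSV.

    Assumes df has a unique index, so that dropping the sampled
    index labels leaves exactly the remaining rows.
    """
    train = df.sample(frac=train_frac, random_state=seed)  # random 80% of rows
    test = df.drop(train.index)                            # the other 20%
    train.to_csv(train_path, index=False)
    test.to_csv(test_path, index=False)
    return train, test
```

Fixing the seed makes the split reproducible across runs.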