https://github.com/bugdaryan/intro2ds_final

Host: GitHub
URL: https://github.com/bugdaryan/intro2ds_final
Owner: bugdaryan
Created: 2021-12-26T13:08:59.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2021-12-26T13:10:10.000Z (over 4 years ago)
Last Synced: 2025-01-10T17:48:47.654Z (over 1 year ago)
Language: Python
Size: 548 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Final project for the Intro to DS course

This repo contains a Python implementation of data cleaning and train/test splitting for a dataset.

The input data is the `data.csv` file.

## Data cleaning

Here we drop a column if it meets any of the following conditions:
- The column has only one unique value.
- More than 80% of its values are missing (relative to the number of rows).
- The column contains strings and has more than 60 unique values.
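
The three drop rules above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: it assumes pandas, treats any non-numeric column as a "string" column, and the name `drop_uninformative_columns` is hypothetical.

```python
import pandas as pd

MAX_MISSING_FRACTION = 0.8  # "higher than 80%" threshold from the README
MAX_STRING_UNIQUES = 60     # "more than 60" threshold from the README


def drop_uninformative_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without the columns that match any drop rule."""
    to_drop = []
    for name in df.columns:
        column = df[name]
        n_unique = column.nunique()            # missing values are not counted
        missing_fraction = column.isna().mean()
        # Assumption: any non-numeric column counts as "contains strings".
        is_string = not pd.api.types.is_numeric_dtype(column)
        if (
            n_unique <= 1  # a single unique value (or an entirely empty column)
            or missing_fraction > MAX_MISSING_FRACTION
            or (is_string and n_unique > MAX_STRING_UNIQUES)
        ):
            to_drop.append(name)
    return df.drop(columns=to_drop)
```

A call such as `drop_uninformative_columns(pd.read_csv("data.csv"))` would return the reduced frame and leave the original untouched.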

## Column type detection

Here we identify categorical and numeric columns as follows:
- If a column contains only numbers and its number of unique values is more than 60, it is identified as numeric.
- Otherwise, it is identified as categorical.
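
One way to express this rule in code is sketched below. It assumes pandas and reads "contains only numbers" as "has a numeric dtype"; the function name `detect_column_types` is hypothetical.

```python
import pandas as pd

NUMERIC_MIN_UNIQUES = 60  # "more than 60" threshold from the README


def detect_column_types(df: pd.DataFrame) -> tuple[list[str], list[str]]:
    """Return (numeric_columns, categorical_columns) for df."""
    numeric, categorical = [], []
    for name in df.columns:
        column = df[name]
        # Assumption: "contains only numbers" means a numeric dtype.
        is_numeric = pd.api.types.is_numeric_dtype(column)
        if is_numeric and column.nunique() > NUMERIC_MIN_UNIQUES:
            numeric.append(name)
        else:
            categorical.append(name)
    return numeric, categorical
```

Under this reading, a numeric column with few distinct values (for example a 1 to 5 rating) ends up in the categorical list.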

## Missing values

Here we go over the numeric columns and fill the missing values in each one with that column's mean.
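
A minimal sketch of mean imputation, assuming pandas and a list of numeric column names produced by the type-detection step. The name `fill_numeric_with_mean` and its parameters are hypothetical.

```python
import pandas as pd


def fill_numeric_with_mean(df: pd.DataFrame, numeric_columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with NaNs in the given columns replaced by the column mean."""
    filled = df.copy()
    for name in numeric_columns:
        # Series.mean() skips missing values by default,
        # so the mean is computed over the observed entries only.
        filled[name] = filled[name].fillna(filled[name].mean())
    return filled
```

Columns that are not listed, such as categorical ones, keep their missing values.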

## Categorical preprocessing

Because we have a lot of categorical values, and some of them have high cardinality, we keep only the top 70% of categorical values and assign `Other` to the rest.
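
The README does not say how the "top 70%" is measured, so the sketch below picks one plausible reading and should be taken as an assumption: within a column, distinct values are ranked by frequency and the most frequent 70% of them (rounded up) are kept. Another reading would keep the values that cover 70% of the rows. The code assumes pandas, and the name `collapse_rare_categories` is hypothetical.

```python
import math

import pandas as pd

KEEP_FRACTION = 0.7  # "top 70%" from the README


def collapse_rare_categories(column: pd.Series, keep_fraction: float = KEEP_FRACTION) -> pd.Series:
    """Replace the least frequent values of a column with the label "Other"."""
    # value_counts() sorts by descending frequency and ignores missing values.
    counts = column.value_counts()
    # Assumption: "top 70%" refers to the share of distinct values, ranked by frequency.
    n_keep = math.ceil(len(counts) * keep_fraction)
    kept = set(counts.index[:n_keep])
    # Cast to object so that numeric categories can also hold the "Other" label.
    as_object = column.astype(object)
    # Values outside the kept set become "Other"; missing values stay missing.
    return as_object.where(as_object.isin(kept) | as_object.isna(), "Other")
```

Applying it to each categorical column, for example `df[name] = collapse_rare_categories(df[name])`, caps the number of indicator columns that a later encoding step produces.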

## Dummy encoding

We go over the categorical columns and apply dummy encoding to all of them.
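
As an illustration, dummy encoding can be done with `pandas.get_dummies`. Whether the repo uses pandas for this step, and whether it drops one level per column, is not stated in the README, so both are assumptions here. The wrapper name `dummy_encode` is hypothetical.

```python
import pandas as pd


def dummy_encode(df: pd.DataFrame, categorical_columns: list[str]) -> pd.DataFrame:
    """Replace each listed column with 0/1 indicator columns."""
    # get_dummies creates one column per distinct value, named "<column>_<value>";
    # columns that are not listed pass through unchanged.
    # Assumption: every level is kept; pass drop_first=True for k-1 indicators per column.
    return pd.get_dummies(df, columns=categorical_columns, dtype=int)
```

For a frame with a `city` column holding `x` and `y`, the result has `city_x` and `city_y` columns in place of `city`.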

## Splitting

We split the data 0.8 (train) / 0.2 (test) and save the two parts to the `train.csv` and `test.csv` files.
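
A sketch of the split, assuming pandas. The README does not say whether rows are shuffled first, so the shuffle and the fixed seed below are assumptions, and the function name and parameters are hypothetical.

```python
import pandas as pd


def split_and_save(
    df: pd.DataFrame,
    train_path: str = "train.csv",
    test_path: str = "test.csv",
    train_fraction: float = 0.8,
    seed: int = 0,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Shuffle df, split it into train and test parts, and write both to CSV."""
    # Assumption: rows are shuffled before splitting; the seed makes the split repeatable.
    shuffled = df.sample(frac=1, random_state=seed)
    n_train = round(len(shuffled) * train_fraction)
    train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]
    # index=False keeps the shuffled row labels out of the saved files.
    train.to_csv(train_path, index=False)
    test.to_csv(test_path, index=False)
    return train, test
```

Slicing the shuffled frame by position guarantees that the two parts do not overlap, even if the original index contains duplicates.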
