https://github.com/wanadzhar913/classification-ml-with-shap

Exploratory data analysis & modeling exercise with Logistic Regression, Imbalanced-learn, RobustScaler and SHAP (SHapley Additive exPlanations). The dataset consists of categorical and continuous features with a binary target.

Host: GitHub
URL: https://github.com/wanadzhar913/classification-ml-with-shap
Owner: wanadzhar913
Created: 2023-10-28T17:16:16.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2023-10-30T09:09:11.000Z (almost 2 years ago)
Last Synced: 2025-01-07T00:46:33.445Z (9 months ago)
Language: Jupyter Notebook
Homepage:
Size: 2.48 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

#### **TLDR**

#### **Folder Structure**
```
├── README.md
├── data
│ ├── processed_training.csv
│ ├── processed_validation.csv
│ ├── training2.csv
│ └── validation2.csv
├── notebooks
│ ├── 1_EDA.ipynb
│ ├── 2_Feature_Engineering_&_Treatment.ipynb
│ ├── 3_Model_Building.ipynb
│ └── helper_functions.py
```

#### **Outcomes of EDA (Exploratory Data Analysis)**
- Because we have no information about what these variables represent, we cannot carry out business-specific analyses. We therefore focus on understanding the underlying distribution of each feature in the dataset.
- Neither dataset contains duplicate rows.
- In terms of missing values, more than half of the values in feature `v16` are missing in both the training and validation sets, the most of any feature in the dataset.
- Among the categorical features, several have highly imbalanced classes, e.g., `classLabel` (training), `v12`, `v8` (training), `v4` & `V1`. Some also contain extremely rare values (<1% of the feature's rows), namely the value *l* for `v4` and the value *o* for `v12`.
- Most continuous features are positively skewed, with the exception of `v5` & `v6`. Most of these positively skewed features have outliers.
- In terms of correlation between numerical columns, features `v15` & `v13` are perfectly correlated with each other (perfect multicollinearity). Hence, we'll randomly drop one of them when we build our classification model.
- For the categorical columns, Cramér's V reveals that `v17` has an extremely strong association with the target, `classLabel` (Cramér's V of ~0.98). This is also grounds for dropping it due to redundancy.
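
The two association checks mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the code from the notebooks: the function names `pearson_r` and `cramers_v` and the toy values are hypothetical, and the Cramér's V shown here is the plain (uncorrected) version, so the notebooks may have computed their figure differently.

```python
from collections import Counter
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))          # observed count of each (x, y) pair
    rows, cols = Counter(xs), Counter(ys)  # marginal counts
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n  # count expected under independence
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0


# Toy data (hypothetical, not taken from the repository's CSV files).
num_a = [1.0, 2.0, 3.0, 4.0]
num_b = [2.0, 4.0, 6.0, 8.0]          # exact multiple of num_a
cat = ["t", "t", "f", "f"]
target = ["yes", "yes", "no", "no"]   # fully determined by cat

r = pearson_r(num_a, num_b)
v = cramers_v(cat, target)
print(f"pearson r = {r:.2f}, Cramér's V = {v:.2f}")
```

Both functions accept any equal-length sequences, so they should also work on DataFrame columns once missing values have been dropped. A Pearson coefficient of exactly ±1 between two numeric columns, or a Cramér's V close to 1 between a categorical column and the target, flags the kind of redundancy described in the last two bullets.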
