https://github.com/yixin0829/multi-label-wine-quality-classification
Multi-label wine classification ML project trained using Kaggle wine quality dataset :bar_chart:
analysis classification-algorithm data-science exploratory-data-analysis machine-learning python sklearn
- Host: GitHub
- URL: https://github.com/yixin0829/multi-label-wine-quality-classification
- Owner: yixin0829
- Created: 2020-09-13T15:58:27.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-01-17T01:41:32.000Z (over 5 years ago)
- Last Synced: 2025-05-31T15:28:32.763Z (about 1 year ago)
- Topics: analysis, classification-algorithm, data-science, exploratory-data-analysis, machine-learning, python, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 24.9 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# multi_label_wine_quality_classification :bar_chart:
This is a project where I practiced training several multi-label wine quality classifiers with the one-vs-all method.
The workflow includes EDA (exploratory analysis, data visualization), data preprocessing (feature selection with a chi-square test, oversampling minority classes with synthetic data, feature scaling), and training different classification models (logistic regression, linear support vector machine (SVM), kernel SVM, and K-NN).
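As a rough illustration of the one-vs-all idea (not the notebook's actual code), the sketch below fits one binary logistic regression per class using scikit-learn's `OneVsRestClassifier`. The data comes from `make_classification` and is only a stand-in; the sample count, feature count, and number of classes are assumptions, not properties of the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the wine data (assumption): 300 samples, 6 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# One-vs-all: fit one binary "this class vs. the rest" classifier per class.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

# One fitted binary estimator per class; prediction picks the most confident one.
print(len(ovr.estimators_))
print(ovr.predict(X[:5]))
```

Any binary classifier can be swapped in for `LogisticRegression` here, which is what makes the wrapper convenient when comparing several models.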
**Feel free to click into the .ipynb notebook for detailed analysis.**
## EDA
The dataset is extremely skewed, with minority classes (i.e. wine quality scores) such as '3' and '8' making up less than 1% of the total population. We can see this by plotting a histogram of the 'quality' column.
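The same skew can be checked numerically. The sketch below counts class shares on a made-up label list; the counts are invented for illustration and are not the real dataset's numbers.

```python
from collections import Counter

# Toy quality labels (assumption): a made-up, heavily skewed sample,
# not the real dataset's counts.
quality = [5] * 470 + [6] * 410 + [7] * 84 + [4] * 30 + [3] * 3 + [8] * 3

counts = Counter(quality)
total = len(quality)

# Print each class with its share of the population, largest class first.
for label, n in counts.most_common():
    print(f"quality {label}: {n:4d} samples ({n / total:.1%})")

# Classes below a 1% share are the ones treated as minorities in this sketch.
minority = sorted(label for label, n in counts.items() if n / total < 0.01)
print("below 1%:", minority)
```

With the data loaded into a pandas DataFrame (here a hypothetical `df`), `df['quality'].value_counts(normalize=True)` gives the same shares.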

A heatmap gives a clearer view of the correlations between features:
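The matrix behind such a heatmap is the pairwise Pearson correlation between columns. A minimal sketch with NumPy follows; the three columns are synthetic and deliberately constructed (an assumption for illustration), so the values say nothing about the real wine data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): three made-up columns, two of them related.
n = 200
alcohol = rng.normal(10.0, 1.0, n)
density = 1.0 - 0.002 * alcohol + rng.normal(0.0, 0.0005, n)  # built to move against alcohol
ph = rng.normal(3.3, 0.15, n)  # generated independently of the other two

names = ["alcohol", "density", "pH"]
data = np.column_stack([alcohol, density, ph])

# Pearson correlation between columns; this is the matrix a heatmap colors in.
corr = np.corrcoef(data, rowvar=False)

for name, row in zip(names, corr):
    print(f"{name:>8}", np.round(row, 2))
```

Strongly positive or negative off-diagonal entries point to redundant features, which is useful context for the feature selection step.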

We can further visualize the relations between features and wine quality. Notice that features like "pH", "chlorides", and "residual sugar" have almost no impact on classifying the quality of the wine.

## Preprocessing
* Feature selection using chi-square test
* Drop irrelevant features
* Split dataset
* Apply SMOTE to oversample the minority classes by generating synthetic training data using K-NN. Note that the testing data is not oversampled.
* Feature scaling
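The steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's code: the data is synthetic, `k=3` is an arbitrary choice, and `smote_like` is a hypothetical, simplified helper written here to show the interpolation idea behind SMOTE. The README does not say which SMOTE implementation the notebook uses; libraries such as imbalanced-learn provide a complete one.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic, non-negative stand-in data (assumption): 5 features,
# class 0 is the majority and class 1 the minority.
X = rng.uniform(0.0, 10.0, size=(220, 5))
y = np.array([0] * 200 + [1] * 20)
X[y == 1, 0] += 5.0  # make feature 0 informative so chi-square has something to find

# 1) Chi-square feature selection, dropping the lowest-scoring features.
selector = SelectKBest(chi2, k=3)
X_sel = selector.fit_transform(X, y)

# 2) Split before oversampling so the test set contains only real samples.
X_train, X_test, y_train, y_test = train_test_split(
    X_sel, y, test_size=0.25, stratify=y, random_state=0
)

def smote_like(X_min, n_new, k=5, rng=rng):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    base = rng.integers(0, len(X_min), n_new)
    picked = neighbours[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# 3) Oversample the minority class in the training split only.
X_min = X_train[y_train == 1]
n_new = int((y_train == 0).sum() - (y_train == 1).sum())
X_syn = smote_like(X_min, n_new)
X_bal = np.vstack([X_train, X_syn])
y_bal = np.concatenate([y_train, np.ones(n_new, dtype=y_train.dtype)])

# 4) Fit the scaler on training data, then apply it to both splits.
scaler = StandardScaler().fit(X_bal)
X_bal_s, X_test_s = scaler.transform(X_bal), scaler.transform(X_test)

print(np.bincount(y_bal), np.bincount(y_test))
```

Splitting before oversampling and fitting the scaler on the training split only keeps information from the test set out of training, which is why the test data is left untouched.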
## Result
Because the dataset is skewed, F1-score is used as the performance metric. After applying the synthetic minority oversampling technique (SMOTE), the K-NN model's weighted-average F1-score increased notably from 0.52 to 0.67, and its accuracy went from 51% to 65%. The other models (logistic regression, linear SVM, and kernel SVM) did not improve as expected.
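To make the metric concrete, here is a small pure-Python sketch of a weighted-average F1-score, where each class's F1 is weighted by its number of true samples. The labels are a toy example (an assumption), chosen so that a classifier ignoring the minority class still reaches high accuracy.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)

# Toy labels (assumption): 9 majority-class wines and 1 minority-class wine.
y_true = [5] * 9 + [8]
y_pred = [5] * 10  # a classifier that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy    = {accuracy:.3f}")
print(f"weighted F1 = {weighted_f1(y_true, y_pred):.3f}")
```

scikit-learn's `f1_score(y_true, y_pred, average="weighted")` computes the same support-weighted average, and its per-class report exposes the F1 of each minority class directly.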
### Logistic Regression

### Linear SVM & Kernel SVM

### K-NN (Rapid Prototype)

### K-NN (Final)
