https://github.com/kmock930/drug-consumption-machine-learning-analysis
This project contains code and reports for the course CSI5155 at the University of Ottawa (taught by Professor Dr. Herna Viktor).
area-under-curve bagging boosting decision-tree ensemble-model gradient-boosting knn machine-learning ml-evaluation ml-pipeline mlp random-forest receiver-operating-characteristic semi-supervised-learning shap-analysis supervised-learning svm unsupervised-learning xai
- Host: GitHub
- URL: https://github.com/kmock930/drug-consumption-machine-learning-analysis
- Owner: kmock930
- Created: 2024-10-15T02:41:54.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-09T15:02:54.000Z (over 1 year ago)
- Last Synced: 2025-04-05T18:12:53.202Z (about 1 year ago)
- Topics: area-under-curve, bagging, boosting, decision-tree, ensemble-model, gradient-boosting, knn, machine-learning, ml-evaluation, ml-pipeline, mlp, random-forest, receiver-operating-characteristic, semi-supervised-learning, shap-analysis, supervised-learning, svm, unsupervised-learning, xai
- Language: Jupyter Notebook
- Homepage:
- Size: 105 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Problem Statement
## Aims
1. Convert the multi-class problems into binary classification tasks.
2. Predict whether a person is a consumer of chocolate and whether a person is a consumer of magic mushrooms.
3. Choose the best and worst classifiers for each dataset.
4. Explain the AI models in a scientific manner that is convincing to non-technical people.
5. Implement models with Semi-Supervised Learning.
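As a rough illustration of aim 3, the sketch below ranks several classifiers by cross-validated ROC AUC and picks the best and worst. It is not the notebooks' code: the data is synthetic (`make_classification`), the six-model line-up is an assumption inferred from the repository topics, and `rank_by_auc` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one binarized dataset (NOT the real drug data).
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Assumed line-up: the README does not name the six classifiers.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "svm": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "mlp": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)
    ),
}


def rank_by_auc(models, X, y, cv=5):
    """Return (name, mean ROC AUC) pairs sorted from best to worst."""
    scores = {
        name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        for name, model in models.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_auc(candidates, X, y)
best_name, best_auc = ranking[0]
worst_name, worst_auc = ranking[-1]
print(f"best: {best_name} ({best_auc:.3f}), worst: {worst_name} ({worst_auc:.3f})")
```

The distance-based and neural models are wrapped with a scaler because they are sensitive to feature scale; the tree-based models are left unscaled.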
## Preview
**Comparing a Pipeline of 6 classifiers on 2 datasets**


**Explainable AI**

**Semi-Supervised Learning**

# Dataset: Drug Consumption Analysis Dataset
The dataset can be found at this link: https://archive.ics.uci.edu/dataset/373/drug+consumption+quantified.
## Description of the Dataset
- Contains a row identifier, 12 features describing the user data, and 18 classification problems related to using 18 different drugs.
- For each drug, it indicates whether a person has 'never used', 'used over a decade ago', 'used in the last decade', 'used in the last year', 'used in the last month', 'used in the last week', or 'used in the last day'.
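The seven usage levels above can be collapsed into a binary user / non-user target, as the first aim requires. The following is a minimal sketch only: the cut-off (`first_user_level`) is an assumption chosen for illustration, not necessarily the one used in the notebooks, and `to_binary` is a hypothetical helper.

```python
# Ordered from least to most recent use, as listed in the dataset description.
LEVELS = [
    "never used",
    "used over a decade ago",
    "used in the last decade",
    "used in the last year",
    "used in the last month",
    "used in the last week",
    "used in the last day",
]


def to_binary(label, first_user_level="used in the last decade"):
    """Map a seven-level label to 1 (user) or 0 (non-user).

    The default cut-off is an illustrative assumption.
    """
    cutoff = LEVELS.index(first_user_level)
    return int(LEVELS.index(label) >= cutoff)


labels = ["never used", "used in the last week", "used over a decade ago"]
binary = [to_binary(label) for label in labels]
print(binary)  # [0, 1, 0]
```

Moving the cut-off changes the class balance, so it is worth checking the resulting label counts for each drug before modelling.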
# Implementation Details
- Split the Dataset and Perform Feature Engineering.
- Perform Supervised Learning using a pipeline of 6 classifiers.
- Identify potential issues in the dataset / the classifier itself.
- Provide results from Evaluation with some useful plots and metrics.
- Summarize the analysis in a report.
- Explain whether certain classifiers make trustworthy predictions by calculating SHAP values and producing visualization plots.
- Prepare labelled and unlabelled data, then implement and compare different semi-supervised learning algorithms based on the gradient boosting classifier from Assignment 1.
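The last step can be sketched with scikit-learn's `SelfTrainingClassifier` wrapped around a gradient boosting baseline. This is an illustrative sketch, not the project's implementation: the data is synthetic, and the 70% unlabelled share and the 0.9 confidence threshold are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic stand-in data (NOT the real drug consumption dataset).
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hide roughly 70% of the training labels; scikit-learn marks unlabelled rows with -1.
rng = np.random.default_rng(0)
unlabelled = rng.random(len(y_train)) < 0.7
y_semi = y_train.copy()
y_semi[unlabelled] = -1

# Baseline: supervised training on the labelled rows only.
baseline = GradientBoostingClassifier(random_state=0)
baseline.fit(X_train[~unlabelled], y_train[~unlabelled])

# Self-training: iteratively pseudo-label confident unlabelled rows and refit.
self_trained = SelfTrainingClassifier(
    GradientBoostingClassifier(random_state=0), threshold=0.9
)
self_trained.fit(X_train, y_semi)

auc_baseline = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
auc_self = roc_auc_score(y_test, self_trained.predict_proba(X_test)[:, 1])
print(f"baseline AUC: {auc_baseline:.3f}, self-training AUC: {auc_self:.3f}")
```

Self-training does not always beat the labelled-only baseline, so comparing both on the same held-out test set is what makes the result meaningful.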
# Project Structure
- Reports in `.pdf` format are located at the root level.
- The project report is inside the project folder: "Project - Semi Supervised Learning/KeycodeExplaination.pdf"
- Expand the folders at the root level to view the code.
- This project branches out the analysis into 9 notebooks:
1. Modelling - please check the file `CSI5155 Assignment 1 Modelling Part- Kelvin Mock 300453668.ipynb`
2. Evaluation - please check the file `CSI5155 Assignment 1 Evaluation Part - Kelvin Mock 300453668.ipynb`
3. Calculation of SHAP Values - please check the file `CSI5155 Assignment 2 - Kelvin Mock 300453668.ipynb`
4. Visualizing the SHAP Values - please check the file `CSI5155 Assignment 2 Plots - Kelvin Mock 300453668.ipynb`
5. Baseline Model (Gradient Boosting classifier) - `CSI5155 Project - baseline.ipynb`
6. Self Training method applied on baseline model - `CSI5155 Project - Self Training.ipynb`
7. Co-Training method applied on baseline model - `CSI5155 Project - Co Training.ipynb`
8. Semi-Boost method applied on baseline model - `CSI5155 Project - Semi Boost.ipynb`
9. Label Spreading method applied on baseline model - `CSI5155 Project - Label Spreading.ipynb`
- Models are dumped into several `.pkl` files at different phases so that later notebooks can reuse them without retraining.
- The training sets and test sets are also dumped into several `.pkl` files.
- The `choc` directory contains the dumped files for the Chocolate dataset (which is split from the original dataset).
- The `mushrooms` directory contains the dumped files for the Mushrooms dataset (which is also split from the original dataset).
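Dumping and reloading such artifacts can be done with Python's standard `pickle` module. The sketch below is a generic example; the helper names and the file name `example_split.pkl` are hypothetical and do not refer to actual files in the repository.

```python
import pickle
import tempfile
from pathlib import Path


def dump_artifact(obj, path):
    """Pickle obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)


def load_artifact(path):
    """Load a previously pickled object from path."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Round trip inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "choc" / "example_split.pkl"  # hypothetical file name
    dump_artifact({"X_train": [[0.1, 0.2]], "y_train": [1]}, target)
    restored = load_artifact(target)

print(restored)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.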