https://github.com/imosudi/model_training
Breast Cancer Diagnosis: Logistic Regression, Random Forest, k-NN and Decision Tree classifier models with feature importance analysis. Includes data exploration, train/test splitting, feature scaling, cross-validation, and model evaluation metrics with confusion matrices and decision boundary visualisation.
classification data-science decision-tree educational feature-importance k-nearest-neighbors linear-regression machine-learning model-evaluation python3 random-forest scikit-learn
- Host: GitHub
- URL: https://github.com/imosudi/model_training
- Owner: imosudi
- License: bsd-3-clause
- Created: 2026-04-18T17:25:46.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-18T20:37:22.000Z (2 months ago)
- Last Synced: 2026-04-18T21:29:44.577Z (2 months ago)
- Topics: classification, data-science, decision-tree, educational, feature-importance, k-nearest-neighbors, linear-regression, machine-learning, model-evaluation, python3, random-forest, scikit-learn
- Language: Python
- Homepage:
- Size: 125 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Basic AI/ML Model Training
Educational machine learning project covering classical ML and TensorFlow classification workflows, evaluation, visualisation, and model serialisation.
## Overview
This repository contains hands-on classification examples built with scikit-learn and TensorFlow. It covers data preprocessing, model training, cross-validation, reporting, visualisation, and model export.
The Breast Cancer Diagnosis project compares Logistic Regression, Random Forest, k-NN, Decision Tree, and a TensorFlow neural network on the Breast Cancer Wisconsin dataset. The workflow includes data exploration, train/test splitting, feature scaling, cross-validation, classification reports, ROC-AUC, confusion matrices, learning curves, feature importance analysis, and training-vs-validation plots.
## Projects
### 1. Breast Cancer Diagnosis (`cancer/`)
**Dataset:** Breast Cancer Wisconsin (569 samples, 30 features)
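The dataset is bundled with scikit-learn, so it can be inspected without downloading anything. The snippet below is a minimal sketch that assumes `sklearn.datasets.load_breast_cancer` as the source; the repository's own `data_load.py` may load or preprocess the data differently.

```python
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin (Diagnostic) dataset shipped with scikit-learn.
data = load_breast_cancer()
X, y = data.data, data.target

# Basic shape and class information.
print(X.shape)                 # (569, 30)
print(list(data.target_names)) # ['malignant', 'benign']
print(int((y == 0).sum()), int((y == 1).sum()))  # class counts: malignant, benign
```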
**Files:**
- `serialise_models.py` - Main model serialisation script
- `data_load.py` - Data loading and preprocessing utilities
- `trainings.py` - Training functions and pipelines
- `validations.py` - Model validation and cross-validation
- `visualisations.py` - Plotting and visualisation functions
- `reports.py` - Report generation and metrics calculation
- `outputs/` - Directory for generated plots and model files
**Models:**
- Logistic Regression
- Random Forest
- k-Nearest Neighbors (k-NN)
- Decision Tree
- TensorFlow dense neural network
**Features:**
- Full dataset exploration and statistical summary
- Train/test splitting with stratification
- Feature scaling for Logistic Regression, k-NN, and TensorFlow
- TensorFlow training with model summary, epoch logs, validation tracking, and early stopping
- Cross-validation for all models, including manual TensorFlow CV
- Learning curves for all models
- Comprehensive evaluation metrics:
  - Accuracy
  - Classification reports
  - Confusion matrices
  - ROC-AUC
- Feature importance analysis:
  - Random Forest and Decision Tree: built-in importances
  - Logistic Regression: absolute coefficients
  - k-NN and TensorFlow: permutation importance
- Unified training-history plots for train vs validation loss and accuracy
- Model serialisation:
  - scikit-learn models saved as `.pkl`
  - TensorFlow model saved as `.keras`
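The three feature-importance strategies listed above can be sketched with scikit-learn alone. This is an illustrative example, not the code in `reports.py`: the hyperparameters (`n_estimators=100`, `n_neighbors=5`, `n_repeats=5`) and variable names are assumptions, and the TensorFlow model is left out to keep the sketch light.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)

# Tree-based models expose impurity-based importances directly.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
rf_importance = rf.feature_importances_

# For a linear model on standardised features, |coefficient| is a rough proxy.
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
lr_importance = np.abs(logreg.named_steps["logisticregression"].coef_[0])

# k-NN has no built-in importances, so measure the score drop
# when each feature is shuffled on held-out data.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
perm = permutation_importance(knn, X_test, y_test, n_repeats=5, random_state=42)
knn_importance = perm.importances_mean

# Show the three highest-ranked features per method.
for name, scores in [
    ("Random Forest", rf_importance),
    ("Logistic Regression", lr_importance),
    ("k-NN", knn_importance),
]:
    top = np.argsort(scores)[::-1][:3]
    print(name, [str(data.feature_names[i]) for i in top])
```

The three score vectors are on different scales, so they are best compared by ranking rather than by raw value.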
**Generated outputs include:**
- `training_validation_curves.png`
- Per-model learning curves
- Per-model confusion matrices
- Per-model feature importance plots
- Serialised model artifacts in `cancer/outputs/models/`
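As a rough illustration of the `.pkl` round trip for a scikit-learn model, the sketch below uses the standard-library `pickle` module. Whether the repository uses `pickle` or `joblib` is not stated here, so treat that choice as an assumption; the file name `decision_tree.pkl` is hypothetical, and a temporary directory stands in for `cancer/outputs/models/`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# A temporary directory stands in for cancer/outputs/models/ in this sketch.
out_dir = Path(tempfile.mkdtemp()) / "models"
out_dir.mkdir(parents=True, exist_ok=True)
model_path = out_dir / "decision_tree.pkl"  # hypothetical file name

# Save the fitted model to disk.
with model_path.open("wb") as fh:
    pickle.dump(model, fh)

# Load it back, as a later script or service would.
with model_path.open("rb") as fh:
    restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

For the TensorFlow model, the Keras equivalent is `model.save(...)` with a path ending in `.keras`, followed by `keras.models.load_model(...)` to restore it.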
### 2. Single Model Training (`one/train_iris.py`)
Training pipeline for a single machine learning model, using Iris classification as the example.
### 3. Multi-Model Comparison (`three/`)
Advanced model comparison and evaluation framework.
---
## Project Structure
```
model_training/
├── cancer/ # Breast cancer classification project
│ ├── serialise_models.py # Model serialisation script
│ ├── data_load.py # Data loading utilities
│ ├── trainings.py # Training functions
│ ├── validations.py # Validation methods
│ ├── visualisations.py # Plotting functions
│ ├── reports.py # Report generation
│ └── outputs/ # Generated files and plots
├── one/ # Single model training
│ └── train_iris.py
├── three/ # Multi-model comparison
├── requirements.txt # Python dependencies
├── README.md # This file
└── LICENSE # Project license
```
---
## Core Concepts Covered
- **Data Exploration:** Shape, class distribution, summary statistics, pairplot visualisation
- **Train/Test Splitting:** Stratified splits to preserve class proportions
- **Feature Scaling:** StandardScaler for distance-based and neural-network models
- **Cross-Validation:** k-fold CV for robust model evaluation
- **Model Comparison:** Side-by-side evaluation of multiple algorithms
- **Deep Learning Basics:** Dense neural networks with TensorFlow/Keras
- **Evaluation Metrics:**
  - Accuracy
  - Confusion matrices
  - Classification reports (precision, recall, F1-score, support)
  - AUC-ROC score
- **Feature Importance:** Understanding which features drive predictions
- **Visualisation:** Training-validation curves, confusion matrices, learning curves, feature importance plots
- **Serialisation:** Exporting sklearn and TensorFlow models for reuse
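Several of these concepts (feature scaling, k-fold cross-validation, and side-by-side model comparison) fit into one short sketch. The example below is an assumption-laden illustration rather than the repository's `validations.py`: the 5-fold setting and the hyperparameters are chosen for brevity, and the TensorFlow network is omitted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so each fold fits the scaler
# on its own training portion only (no leakage from the validation fold).
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the standard deviation across folds gives a sense of how stable each model's accuracy is, which a single train/test split cannot show.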
---
## Requirements
Clone the repository, create a virtual environment, and install the dependencies:
```bash
git clone git@github.com:imosudi/model_training.git
cd model_training
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
## Usage
Run the Breast Cancer diagnosis example:
```bash
python cancer/serialise_models.py
```
Run the Iris classification example:
```bash
python one/train_iris.py
```
The Breast Cancer diagnosis command (`python cancer/serialise_models.py`) trains the models, generates reports and visualisations, and writes serialised artifacts to `cancer/outputs/models/`.
---
## Educational Value
These scripts are designed as learning resources for:
- Understanding how different classifiers work
- Learning proper ML workflow (explore → split → scale → train → evaluate)
- Interpreting model outputs and evaluation metrics
- Comparing algorithm performance
- Extracting actionable insights from feature importance
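The explore → split → scale → train → evaluate workflow can be condensed into a single script. The following is a minimal sketch using Logistic Regression only; the 80/20 split and `max_iter=1000` are assumptions made for illustration, not values taken from the repository.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Explore: shape and class distribution.
data = load_breast_cancer()
X, y = data.data, data.target
print(X.shape, {str(n): int((y == k).sum()) for k, n in enumerate(data.target_names)})

# Split: stratified so both sets keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale: fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Train.
clf = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

# Evaluate: per-class report, confusion matrix, and ROC-AUC from probabilities.
y_pred = clf.predict(X_test_s)
y_prob = clf.predict_proba(X_test_s)[:, 1]
cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

print(classification_report(y_test, y_pred, target_names=data.target_names))
print(cm)
print(f"ROC-AUC: {auc:.3f}")
```

In the confusion matrix, rows are true classes and columns are predicted classes, so the off-diagonal cells count the misclassified samples.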
---
## Notes
- All random states are fixed (42) for reproducibility
- Stratified splitting keeps the class proportions consistent between the train and test sets
- Feature scaling is crucial for distance-based models and the TensorFlow model
- Cross-validation provides robust performance estimates
- Confusion matrices reveal which classes are confused with each other
- Feature importance helps understand model decisions
- TensorFlow falls back to the CPU when no CUDA-capable GPU and driver are available
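The notes on fixed random states and stratification can be checked directly. The sketch below assumes an 80/20 split (the repository's actual ratio is not stated here) and compares the share of the positive class across the full, train, and test sets, then repeats the split to confirm it is reproducible.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)


def split(seed=42):
    """Stratified 80/20 split with a fixed seed (ratio is an assumption)."""
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)


X_train, X_test, y_train, y_test = split()

# Share of class 1 (benign) in each set; stratification keeps these close.
full_rate = float(np.mean(y))
train_rate = float(np.mean(y_train))
test_rate = float(np.mean(y_test))
print(f"full={full_rate:.3f} train={train_rate:.3f} test={test_rate:.3f}")

# Same seed, same split: the second call returns identical test rows.
_, X_test_again, _, y_test_again = split()
reproducible = bool(np.array_equal(X_test, X_test_again))
print(reproducible)
```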
## License
This project is licensed under the **BSD 3-Clause License** - see the [LICENSE](./LICENSE) file for details.
```
BSD 3-Clause License
Copyright (c) 2026, Mosudi Isiaka, IoT and Smart Systems, FH Technikum Wien
All rights reserved.
```
---
## Author
**Mosudi Isiaka O.**
📧 [mosudi.isiaka@gmail.com](mailto:mosudi.isiaka@gmail.com) | [FH Technikum Wien email](mailto:io24m006@technikum-wien.at)
🌐 [https://mioemi.com](https://mioemi.com)
💻 [https://github.com/imosudi](https://github.com/imosudi)
---