https://github.com/imosudi/model_training

Breast Cancer Diagnosis: Logistic Regression, Random Forest, k-NN and Decision Tree classifier models with feature importance analysis. Includes data exploration, train/test splitting, feature scaling, cross-validation, and model evaluation metrics with confusion matrices and decision boundary visualisation

classification data-science decision-tree educational feature-importance k-nearest-neighbors linear-regression machine-learning model-evaluation python3 random-forest scikit-learn


Host: GitHub
URL: https://github.com/imosudi/model_training
Owner: imosudi
License: bsd-3-clause
Created: 2026-04-18T17:25:46.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-04-18T20:37:22.000Z (2 months ago)
Last Synced: 2026-04-18T21:29:44.577Z (2 months ago)
Topics: classification, data-science, decision-tree, educational, feature-importance, k-nearest-neighbors, linear-regression, machine-learning, model-evaluation, python3, random-forest, scikit-learn
Language: Python
Homepage:
Size: 125 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Basic AI/ML Model Training

Educational machine learning project covering classical ML and TensorFlow classification workflows, evaluation, visualisation, and model serialisation.

## Overview

This repository contains hands-on classification examples built with scikit-learn and TensorFlow. It covers data preprocessing, model training, cross-validation, reporting, visualisation, and model export.

Breast Cancer Diagnosis now compares Logistic Regression, Random Forest, k-NN, Decision Tree, and a TensorFlow neural network on the Breast Cancer Wisconsin dataset. The workflow includes data exploration, train/test splitting, feature scaling, cross-validation, classification reports, ROC-AUC, confusion matrices, learning curves, feature importance analysis, and training-vs-validation plots.

[![data-science](https://img.shields.io/badge/-data--science-informational?style=flat)](#) [![machine-learning](https://img.shields.io/badge/-machine--learning-blue?style=flat)](#) [![tensorflow](https://img.shields.io/badge/-TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white)](#) [![keras](https://img.shields.io/badge/-Keras-D00000?style=flat&logo=keras&logoColor=white)](#) [![scikit-learn](https://img.shields.io/badge/-scikit--learn-F7931E?style=flat&logo=scikit-learn&logoColor=white)](#) [![python3](https://img.shields.io/badge/-python3-3776AB?style=flat&logo=python&logoColor=white)](#) [![pandas](https://img.shields.io/badge/-pandas-150458?style=flat&logo=pandas&logoColor=white)](#) [![numpy](https://img.shields.io/badge/-NumPy-013243?style=flat&logo=numpy&logoColor=white)](#) [![matplotlib](https://img.shields.io/badge/-Matplotlib-11557C?style=flat)](#) [![seaborn](https://img.shields.io/badge/-Seaborn-4C72B0?style=flat)](#) [![classification](https://img.shields.io/badge/-classification-red?style=flat)](#) [![model-evaluation](https://img.shields.io/badge/-model--evaluation-teal?style=flat)](#) [![cross-validation](https://img.shields.io/badge/-cross--validation-0A9396?style=flat)](#) [![roc-auc](https://img.shields.io/badge/-ROC--AUC-7B2CBF?style=flat)](#) [![feature-importance](https://img.shields.io/badge/-feature--importance-blueviolet?style=flat)](#) [![random-forest](https://img.shields.io/badge/-random--forest-brightgreen?style=flat)](#) [![linear-regression](https://img.shields.io/badge/-linear--regression-orange?style=flat)](#) [![decision-tree](https://img.shields.io/badge/-decision--tree-yellow?style=flat)](#) [![k-nearest-neighbors](https://img.shields.io/badge/-k--nearest--neighbors-green?style=flat)](#) [![educational](https://img.shields.io/badge/-educational-purple?style=flat)](#)

## Projects

### 1. Breast Cancer Diagnosis (`cancer/`)

**Dataset:** Breast Cancer Wisconsin (569 samples, 30 features)
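
As a quick check of those figures, the sketch below loads the copy of the dataset that ships with scikit-learn and prints its shape and class counts. It assumes `load_breast_cancer` is an acceptable stand-in; the repository's own `data_load.py` may load or preprocess the data differently.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of the dataset is used here,
# which may differ from what data_load.py does in the repository.
data = load_breast_cancer()
X, y = data.data, data.target

print("shape:", X.shape)  # (n_samples, n_features)

# np.bincount counts how many samples carry each integer label (0 and 1).
class_counts = np.bincount(y)
for label, count in zip(data.target_names, class_counts):
    print(f"{label}: {count}")
```

In scikit-learn's encoding, label 0 is `malignant` and label 1 is `benign`, so the classes are moderately imbalanced.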

**Files:**

- `serialise_models.py` - Main model serialisation script

- `data_load.py` - Data loading and preprocessing utilities

- `trainings.py` - Training functions and pipelines

- `validations.py` - Model validation and cross-validation

- `visualisations.py` - Plotting and visualisation functions

- `reports.py` - Report generation and metrics calculation

- `outputs/` - Directory for generated plots and model files

**Models:**

- Logistic Regression

- Random Forest

- k-Nearest Neighbors (k-NN)

- Decision Tree

- TensorFlow dense neural network

**Features:**

- Full dataset exploration and statistical summary

- Train/test splitting with stratification

- Feature scaling for Logistic Regression, k-NN, and TensorFlow

- TensorFlow training with model summary, epoch logs, validation tracking, and early stopping

- Cross-validation for all models, including manual TensorFlow CV

- Learning curves for all models

- Comprehensive evaluation metrics:

  - Accuracy

  - Classification reports

  - Confusion matrices

  - ROC-AUC

- Feature importance analysis:

  - Random Forest and Decision Tree: built-in importances

  - Logistic Regression: absolute coefficients

  - k-NN and TensorFlow: permutation importance

- Unified training-history plots for train vs validation loss and accuracy

- Model serialisation:

  - scikit-learn models saved as `.pkl`

  - TensorFlow model saved as `.keras`
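
The three feature-importance approaches listed above can be sketched with scikit-learn alone. This is an illustration under assumed settings, not the repository's code: the 80/20 split, `n_estimators=100`, `n_neighbors=5`, `max_iter=1000` and `n_repeats=10` are all choices made for the example, and the TensorFlow model is left out.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
# Assumed 80/20 stratified split; random_state=42 follows the README's notes.
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, stratify=data.target, random_state=42
)
scaler = StandardScaler().fit(X_train)  # fit on training data only
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Tree-based models expose impurity-based importances directly.
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
rf_importance = rf.feature_importances_

# For a linear model on standardised features, |coefficient| is a rough proxy.
lr = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
lr_importance = np.abs(lr.coef_[0])

# k-NN has no built-in importances, so measure the score drop when a
# single feature column is shuffled (permutation importance).
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_s, y_train)
perm = permutation_importance(knn, X_test_s, y_test, n_repeats=10, random_state=42)
knn_importance = perm.importances_mean

for name, scores in [
    ("Random Forest", rf_importance),
    ("Logistic Regression", lr_importance),
    ("k-NN", knn_importance),
]:
    top3 = np.argsort(scores)[::-1][:3]
    print(name, [str(data.feature_names[i]) for i in top3])
```

The three score vectors are on different scales, so they are best compared by ranking rather than by raw value.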

**Generated outputs include:**

- `training_validation_curves.png`

- Per-model learning curves

- Per-model confusion matrices

- Per-model feature importance plots

- Serialised model artifacts in `cancer/outputs/models/`
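
A minimal sketch of the `.pkl` round trip for a scikit-learn model follows. It assumes the standard-library `pickle` module (the repository might use `joblib` instead), and the file name `decision_tree.pkl` is hypothetical. A temporary directory stands in for `cancer/outputs/models/`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=42).fit(X, y)

# Temporary directory used instead of cancer/outputs/models/;
# "decision_tree.pkl" is a hypothetical file name for illustration.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "decision_tree.pkl"
    with path.open("wb") as fh:
        pickle.dump(model, fh)
    with path.open("rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("round trip OK:", same_predictions)
```

Pickle files should only be loaded from trusted sources and with a compatible scikit-learn version. For the TensorFlow model, Keras offers `model.save(...)` and `keras.models.load_model(...)` for the `.keras` format.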

### 2. Single Model Training (`one/train_iris.py`)

Training pipeline for a single classifier on the Iris dataset.

### 3. Multi-Model Comparison (`three/`)

Advanced model comparison and evaluation framework.

---

## Project Structure

```
model_training/
├── cancer/                    # Breast cancer classification project
│   ├── serialise_models.py    # Model serialisation script
│   ├── data_load.py           # Data loading utilities
│   ├── trainings.py           # Training functions
│   ├── validations.py         # Validation methods
│   ├── visualisations.py      # Plotting functions
│   ├── reports.py             # Report generation
│   └── outputs/               # Generated files and plots
├── one/                       # Single model training
│   └── train_iris.py
├── three/                     # Multi-model comparison
├── requirements.txt           # Python dependencies
├── README.md                  # This file
└── LICENSE                    # Project license
```

---

## Core Concepts Covered

- **Data Exploration:** Shape, class distribution, summary statistics, pairplot visualisation

- **Train/Test Splitting:** Stratified splits to preserve class proportions

- **Feature Scaling:** StandardScaler for distance-based and neural-network models

- **Cross-Validation:** k-fold CV for robust model evaluation

- **Model Comparison:** Side-by-side evaluation of multiple algorithms

- **Deep Learning Basics:** Dense neural networks with TensorFlow/Keras

- **Evaluation Metrics:**

  - Accuracy

  - Confusion matrices

  - Classification reports (precision, recall, F1-score, support)

  - AUC-ROC score

- **Feature Importance:** Understanding which features drive predictions

- **Visualisation:** Training-validation curves, confusion matrices, learning curves, feature importance plots

- **Serialisation:** Exporting sklearn and TensorFlow models for reuse
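
The cross-validation and model-comparison ideas above can be combined in a few lines. The sketch below is not taken from the repository: it assumes 5 stratified folds, default-like hyperparameters, and accuracy as the metric, and it covers only the four scikit-learn models. Scaling is applied only to Logistic Regression and k-NN, matching the feature list earlier in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Putting the scaler inside the pipeline means each CV fold fits it on
# that fold's training part only, which avoids leaking test statistics.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Assumed setup: 5 stratified, shuffled folds with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the standard deviation next to the mean shows how much each model's score varies from fold to fold.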

---

## Requirements

```bash
git clone git@github.com:imosudi/model_training.git
cd model_training
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

## Usage

Run the Breast Cancer diagnosis example:

```bash
python cancer/serialise_models.py
```

This command trains the models, generates reports and visualisations, and writes serialised artifacts to `cancer/outputs/models/`.

Run the Iris classification example:

```bash
python one/train_iris.py
```

---

## Educational Value

These scripts are designed as learning resources for:

- Understanding how different classifiers work

- Learning proper ML workflow (explore → split → scale → train → evaluate)

- Interpreting model outputs and evaluation metrics

- Comparing algorithm performance

- Extracting actionable insights from feature importance
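
The explore → split → scale → train → evaluate workflow named above can be traced end to end with a single model. This is a hedged sketch rather than the repository's pipeline: Logistic Regression, the 80/20 split and `max_iter=1000` are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Explore: load the data and look at its shape.
data = load_breast_cancer()
X, y = data.data, data.target
print("shape:", X.shape)

# Split: stratify so both sets keep the original class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale: fit the scaler on the training set only, then apply it to both.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Train.
clf = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)

# Evaluate: hard predictions for accuracy and the confusion matrix,
# predicted probabilities of class 1 for ROC-AUC.
y_pred = clf.predict(X_test_s)
y_prob = clf.predict_proba(X_test_s)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
auc = roc_auc_score(y_test, y_prob)

print(f"accuracy: {accuracy:.3f}  ROC-AUC: {auc:.3f}")
print(cm)
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

In the confusion matrix, the off-diagonal cells are the misclassifications, which is where a diagnostic model's errors should be inspected first.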

---

## Notes

- All random states are fixed (42) for reproducibility

- Stratified splitting keeps the class proportions the same in the train and test sets

- Feature scaling is crucial for distance-based models and the TensorFlow model

- Cross-validation averages performance over several splits, giving a more stable estimate than a single train/test split

- Confusion matrices reveal which classes are confused with each other

- Feature importance helps understand model decisions

- TensorFlow uses CPU if CUDA drivers are not available
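
To illustrate the note on stratified splitting, the sketch below compares the share of class 1 (`benign` in scikit-learn's encoding) in the full dataset with the share in a stratified and a non-stratified split. The 80/20 split size and the helper name `benign_share` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)


def benign_share(labels):
    """Fraction of samples labelled 1 (hypothetical helper for this example)."""
    return float(np.mean(labels))


# Same seed and split size; the only difference is the stratify argument.
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
_, _, y_train_r, y_test_r = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"full dataset:    {benign_share(y):.3f}")
print(f"stratified test: {benign_share(y_test_s):.3f}")
print(f"plain test:      {benign_share(y_test_r):.3f}")
```

With stratification the test-set share stays within rounding of the full-dataset share, while a plain random split is free to drift.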

## License

This project is licensed under the **BSD 3-Clause License** - see the [LICENSE](./LICENSE) file for details.

```

BSD 3-Clause License

Copyright (c) 2026, Mosudi Isiaka, IoT and Smart Systems, FH Technikum Wien

All rights reserved.

```

---

## Author

**Mosudi Isiaka O.**  

📧 [mosudi.isiaka@gmail.com](mailto:mosudi.isiaka@gmail.com)  | [FH Technikum Wien email](mailto:io24m006@technikum-wien.at)  

🌐 [https://mioemi.com](https://mioemi.com)   

💻 [https://github.com/imosudi](https://github.com/imosudi)

---
