This repository contains implementations of various machine learning algorithms from scratch, including Multi-Layer Perceptron (MLP), Gaussian Mixture Models (GMM), Principal Component Analysis (PCA), Autoencoders, and Variational Autoencoders.
- Host: GitHub
- URL: https://github.com/mayankmittal29/tensortinker_statistical_methods_in_ai
- Owner: mayankmittal29
- Created: 2025-03-17T17:31:56.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-03-18T18:26:44.000Z (4 months ago)
- Last Synced: 2025-04-01T15:47:16.424Z (3 months ago)
- Topics: autoencoder-mnist, cupy, gmm-clustering, image-segmentation, matplotlib-pyplot, mlp-classifier, mlp-regressor, mnist-dataset, numpy, pandas, pca, python3, pytorch, roc-auc, seaborn, torch, variational-autoencoder
- Language: Jupyter Notebook
- Size: 21.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# 🧠 TensorTinker Statistical Methods in AI 🧠



This repository contains implementations of various machine learning algorithms from scratch, including Multi-Layer Perceptron (MLP), Gaussian Mixture Models (GMM), Principal Component Analysis (PCA), Autoencoders, and Variational Autoencoders.
## 📑 Table of Contents
- [Project Overview](#-project-overview)
- [1. Multi-Layer Perceptron](#-1-multi-layer-perceptron)
  - [1.1 MLP Multi-Class Classifier](#-11-mlp-multi-class-classifier)
  - [1.2 MLP Regressor for Price Prediction in Bangalore](#-12-mlp-regressor-for-price-prediction-in-bangalore)
  - [1.3 Multi-Label News Article Classification](#-13-multi-label-news-article-classification)
- [2. Gaussian Mixture Model](#-2-gaussian-mixture-model)
- [3. Principal Component Analysis](#-3-principal-component-analysis)
  - [3.1 Explained Variance and Lossy Reconstruction](#-31-explained-variance-and-lossy-reconstruction)
  - [3.2 Classification Performance with vs without dimensionality reduction](#-32-classification-performance-with-vs-without-dimensionality-reduction)
- [4. Autoencoder](#-4-autoencoder)
- [5. Variational Autoencoder](#-5-variational-autoencoder)
- [Installation Instructions](#-installation-instructions)
- [Usage](#-usage)
- [Results](#-results)

## 🔍 Project Overview
This project implements different statistical machine learning methods from scratch to solve various real-world problems. The main focus is on understanding the underlying mathematics and algorithms of these methods and implementing them without using existing libraries (except for PyTorch for autoencoders).
## 🧠 1. Multi-Layer Perceptron
### 🔢 1.1 MLP Multi-Class Classifier
#### Problem Statement
Implementing a Multi-Layer Perceptron for classifying handwritten symbols from historical manuscripts in the SYMBOL dataset.

#### Dataset
- Images folder containing all handwritten symbol images
- 10-fold cross-validation setup with train.csv and test.csv in each fold
- Each row contains: path to image, symbol ID, and LaTeX representation

#### Implementation Details
- Custom MLP class with configurable hyperparameters:
  - Learning rates: [0.01, 0.001]
  - Activation functions (Sigmoid, Tanh, ReLU implemented from scratch)
  - Optimizers (SGD, Batch GD, Mini-Batch GD implemented from scratch)
  - Number and size of hidden layers: [[128], [64, 32], [128, 64, 32]]
  - Batch sizes: [32, 64, 128]
- Methods for forward and backward propagation
- Training process with various configurations (a minimal from-scratch sketch follows below)
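The sketch below illustrates the core idea of such a from-scratch MLP: a NumPy forward/backward pass for a single hidden layer with sigmoid activations, softmax output, and mini-batch gradient descent. Class and method names are illustrative, not the repository's actual API.

```python
import numpy as np

class TinyMLPClassifier:
    """Minimal one-hidden-layer MLP: sigmoid hidden layer, softmax output."""
    def __init__(self, n_in, n_hidden, n_classes, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_classes))
        self.b2 = np.zeros(n_classes)
        self.lr = lr

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(self, X):
        self.h = self._sigmoid(X @ self.W1 + self.b1)             # hidden activations
        logits = self.h @ self.W2 + self.b2
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        self.probs = exp / exp.sum(axis=1, keepdims=True)          # softmax
        return self.probs

    def backward(self, X, y_onehot):
        n = X.shape[0]
        d_logits = (self.probs - y_onehot) / n                     # softmax + cross-entropy gradient
        dW2 = self.h.T @ d_logits
        db2 = d_logits.sum(axis=0)
        d_h = d_logits @ self.W2.T * self.h * (1 - self.h)         # sigmoid derivative
        dW1 = X.T @ d_h
        db1 = d_h.sum(axis=0)
        # plain (mini-batch) gradient-descent update
        self.W2 -= self.lr * dW2; self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1; self.b1 -= self.lr * db1

    def fit(self, X, y_onehot, epochs=10, batch_size=32):
        for _ in range(epochs):
            idx = np.random.permutation(len(X))
            for start in range(0, len(X), batch_size):
                batch = idx[start:start + batch_size]
                self.forward(X[batch])
                self.backward(X[batch], y_onehot[batch])
```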
#### Hyperparameter Tuning with 10-Fold Validation
- Learning rate and epochs optimization
- Different hidden layer configurations
- Comparison of activation functions and optimizers
- Performance metrics: accuracy, precision, recall

#### Results and Visualizations

##### Best configuration: Sigmoid, SGD, layers=[64, 32], lr=0.01
##### Mean accuracy: 0.6238 ± 0.0064

### 🏠 1.2 MLP Regressor for Price Prediction in Bangalore
#### Problem Statement
Building an MLP regressor to predict housing prices in Bangalore based on features like location, size, and amenities.

#### Dataset
- Bangalore housing price dataset with various features
- Requires extensive preprocessing due to missing values and outliers

#### Data Preprocessing Steps
1. Handling missing values and outliers
2. Feature selection and engineering
3. Normalization and standardization
4. Visualizations such as pie charts, scatter plots, box plots, and a correlation heatmap
5. Train-validation-test split

#### Model Implementation
- Same MLP architecture as the classifier but adapted for regression
- Mean Squared Error (MSE) as the loss function
- Configurable hyperparameters

#### Evaluation Metrics
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared (R²); a minimal computation sketch follows below
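A small sketch of how these regression metrics can be computed from scratch with NumPy (function name is illustrative):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and R^2 for a regression model's predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "RMSE": rmse, "R2": r2}
```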

#### Results

All other tuning results are in Q2.2_files/q2.2_report.csv, along with the cleaned dataset CSV.

### 📰 1.3 Multi-Label News Article Classification
#### Problem Statement
Developing an MLP model to tag news articles with multiple topics simultaneously.

#### Data Preprocessing
- Parsing CSV files and handling multi-label data
- Computing TF-IDF features from scratch (limited to ~5000 features; see the sketch below)
- Multi-label binarization
- Train-validation split
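A rough sketch of building TF-IDF features from scratch, with the vocabulary capped at the most frequent terms. The weighting variant and function name are assumptions, not necessarily what the notebooks use.

```python
import numpy as np
from collections import Counter

def tfidf_features(docs, max_features=5000):
    """Build a document-term TF-IDF matrix from a list of tokenised documents."""
    # keep only the most frequent terms across the corpus
    term_counts = Counter(t for doc in docs for t in doc)
    vocab = {t: i for i, (t, _) in enumerate(term_counts.most_common(max_features))}

    tf = np.zeros((len(docs), len(vocab)))
    df = np.zeros(len(vocab))
    for d, doc in enumerate(docs):
        counts = Counter(t for t in doc if t in vocab)
        for t, c in counts.items():
            tf[d, vocab[t]] = c / max(len(doc), 1)    # term frequency
        for t in counts:
            df[vocab[t]] += 1                         # document frequency
    idf = np.log(len(docs) / (1.0 + df)) + 1.0        # smoothed inverse document frequency
    return tf * idf, vocab

# example: X, vocab = tfidf_features([["oil", "prices", "rise"], ["election", "results"]])
```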
#### Model Implementation
- MLP with output neurons for each possible label
- Binary cross-entropy loss for multi-label classification
- Forward and backward propagation with support for multiple outputs

#### Hyperparameter Tuning
- Learning rate, epochs, and architecture variations
- Different activation functions and optimizers

#### Evaluation Metrics
- Accuracy
- Hamming Loss
- Precision, Recall, F1-score (for multi-label); a small computation sketch follows below
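For reference, a minimal sketch of these multi-label metrics computed directly on 0/1 indicator matrices of shape (n_samples, n_labels); the micro-averaging choice is an assumption.

```python
import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Hamming loss and micro-averaged precision/recall/F1 for binary label matrices."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    hamming = np.mean(Y_true != Y_pred)               # fraction of wrongly predicted labels
    tp = np.sum((Y_pred == 1) & (Y_true == 1))
    fp = np.sum((Y_pred == 1) & (Y_true == 0))
    fn = np.sum((Y_pred == 0) & (Y_true == 1))
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return {"hamming_loss": hamming, "precision": precision, "recall": recall, "f1": f1}
```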

#### Results

Results for all tuned hyperparameters are in Q2.3_files/results_2.3.csv.

## 🔔 2. Gaussian Mixture Model
#### Problem Statement
Implementing GMM from scratch and using it to segment gray matter, white matter, and cerebrospinal fluid (CSF) from brain MRI images.

#### GMM Implementation
- Expectation-Maximization (EM) algorithm
- Component initialization
- Convergence criteria
- Posterior probability calculation (see the EM sketch below)
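A condensed sketch of the EM loop for a from-scratch GMM with full covariances, written in plain NumPy. Initialization, regularization, and the convergence criterion are simplified assumptions, not the repository's exact choices.

```python
import numpy as np

def gmm_em(X, k, n_iter=100, tol=1e-6, seed=0):
    """Fit a k-component Gaussian mixture to X (n_samples x n_features) with EM."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(n, k, replace=False)]                 # initialise means from data points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf

    for _ in range(n_iter):
        # E-step: responsibilities (posterior probability of each component)
        log_p = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
            log_p[:, j] = np.log(weights[j]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)

        # M-step: update weights, means, and covariances from the responsibilities
        nk = resp.sum(axis=0)
        weights = nk / n
        means = resp.T @ X / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)

        ll = log_norm.sum()                                    # convergence check on log-likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, covs, resp
```

For tissue segmentation, each voxel intensity is assigned to the component with the highest responsibility.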
#### Brain Tissue Segmentation
- Using GMM to segment the MRI image sald_031764_img.nii
- Visualization of segmentation results
- Comparison with the original segmentation

#### Segmentation Results (after applying GMM to the brain MRI scan)
#### Original Axial and its Segmented View


#### Original Coronal and its Segmented View


#### Original Sagittal and its Segmented View


#### Analysis

#### Misclassification Analysis
Analysis of regions with highest misclassification based on intensity distributions and GMM model characteristics.

## 🔍 3. Principal Component Analysis
### 📉 3.1 Explained Variance and Lossy Reconstruction
#### Implementation Details
- PCA implementation from scratch using NumPy
- Covariance matrix computation
- Eigenvector and eigenvalue calculation
- Projection and reconstruction (a minimal sketch follows below)
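A minimal sketch of these from-scratch PCA steps (function names are illustrative):

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the mean, top principal components, and explained-variance ratio of X (rows = samples)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (len(X) - 1)                   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigen-decomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:n_components]]
    explained = eigvals[order] / eigvals.sum()       # explained-variance ratio
    return mean, components, explained

def pca_transform(X, mean, components):
    return (X - mean) @ components                   # project onto the principal axes

def pca_reconstruct(Z, mean, components):
    return Z @ components.T + mean                   # lossy reconstruction back to pixel space
```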
#### Dataset
- MNIST dataset with 1000 randomly sampled images (uniform class distribution)

#### Dimensionality Reduction
- Projecting data to 500, 300, 150, 30, 25, 20, 15, 10, 5, and 2 dimensions

#### Visualization and Analysis


#### Image Reconstruction

### 📊 3.2 Classification Performance with vs without dimensionality reduction
#### Experimental Setup
- 40K random samples from MNIST train set and full test set (10K samples)
- MLP classifier with 2-3 fully connected layers
- Dimensionality reduction with PCA to 500, 300, 150, and 30 dimensions (see the sketch below)
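A sketch of one evaluation run, assuming the `pca_fit`/`pca_transform` helpers from the PCA sketch above and a small PyTorch MLP; layer sizes and the training loop are illustrative, not the repository's exact code.

```python
import torch
import torch.nn as nn

def evaluate_with_pca(X_train, y_train, X_test, y_test, n_components, epochs=20):
    """Project data with the from-scratch PCA, then train and score a small MLP classifier."""
    mean, comps, _ = pca_fit(X_train, n_components)        # helpers from the PCA sketch above
    Xtr = torch.tensor(pca_transform(X_train, mean, comps), dtype=torch.float32)
    Xte = torch.tensor(pca_transform(X_test, mean, comps), dtype=torch.float32)
    ytr, yte = torch.tensor(y_train), torch.tensor(y_test)

    model = nn.Sequential(nn.Linear(n_components, 128), nn.ReLU(),
                          nn.Linear(128, 64), nn.ReLU(),
                          nn.Linear(64, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                                 # full-batch training for brevity
        opt.zero_grad()
        loss = loss_fn(model(Xtr), ytr)
        loss.backward()
        opt.step()

    with torch.no_grad():
        acc = (model(Xte).argmax(dim=1) == yte).float().mean().item()
    return acc
```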
#### Performance Metrics
- Accuracy
- Precision
- Recall

#### Results
| Configuration | Accuracy | Precision | Recall |
|---|---|---|---|
| Baseline (no PCA) | 0.9746 | 0.9746 | 0.9746 |
| 500 PCA components | 0.9332 | 0.9335 | 0.9332 |
| 300 PCA components | 0.9550 | 0.9550 | 0.9550 |
| 150 PCA components | 0.9681 | 0.9683 | 0.9681 |
| 30 PCA components | 0.9804 | 0.9804 | 0.9804 |
| 25 PCA components | 0.9736 | 0.9736 | 0.9736 |
| 20 PCA components | 0.9728 | 0.9729 | 0.9728 |
| 15 PCA components | 0.9678 | 0.9679 | 0.9678 |
| 10 PCA components | 0.9358 | 0.9360 | 0.9358 |
| 5 PCA components | 0.7698 | 0.7741 | 0.7698 |
| 2 PCA components | 0.4727 | 0.4615 | 0.4727 |

#### Analysis
- Discussion on how PCA helps mitigate the curse of dimensionality
- Cases where PCA might not be effective
- Limitations of PCA's variance maximization assumption

## 🔄 4. Autoencoder
#### Problem Statement
Implementing an autoencoder for anomaly detection in MNIST digits.

#### Implementation Details
- PyTorch implementation of encoder and decoder networks
- Training on normal data (digits matching last digit of roll number)
- Testing on a mix of normal and anomalous digits (a minimal model sketch follows below)
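A compact sketch of the kind of model used here: a fully connected PyTorch encoder/decoder on flattened 28×28 digits, with the per-sample reconstruction error serving as the anomaly score. Layer sizes are assumptions, not copied from the notebooks.

```python
import torch
import torch.nn as nn

class DenseAutoencoder(nn.Module):
    """Fully connected autoencoder with a configurable bottleneck dimension."""
    def __init__(self, bottleneck_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 256), nn.ReLU(),
            nn.Linear(256, bottleneck_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 256), nn.ReLU(),
            nn.Linear(256, 784), nn.Sigmoid(),         # pixel values in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_errors(model, x):
    """Per-sample MSE reconstruction error, used as the anomaly score."""
    with torch.no_grad():
        recon = model(x)
    return ((recon - x) ** 2).mean(dim=1)
```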
#### Reconstruction Error Analysis
### For dimension 8





#### Performance Metrics for Bottleneck Dimension = 8
##### Optimal Threshold: -24.726648
##### Precision: 0.6364
##### Recall: 0.5631
##### F1-Score: 0.5975
##### AUC-ROC: 0.9185
##### Accuracy: 0.9018

### For dimension 16





#### Performance Metrics for Bottleneck Dimension = 16
##### Optimal Threshold: -9.943974
##### Precision: 0.7848
##### Recall: 0.6833
##### F1-Score: 0.7305
##### AUC-ROC: 0.9662
##### Accuracy: 0.9018

### For dimension 32





#### Performance Metrics for Bottleneck Dimension = 32
##### Optimal Threshold: -9.622127
##### Precision: 0.6948
##### Recall: 0.7301
##### F1-Score: 0.7120
##### AUC-ROC: 0.9640
##### Accuracy: 0.9018

#### Anomaly Detection
- Threshold selection based on reconstruction error distribution
- Performance evaluation with precision, recall, and F1-score (a threshold-selection sketch follows below)
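One way to pick a threshold from the reconstruction-error distribution is to sweep the observed errors and keep the value that maximises F1 on a validation set (a sketch; the repository may use a different criterion, e.g. one derived from the ROC curve):

```python
import numpy as np

def best_threshold(errors, is_anomaly):
    """Sweep reconstruction errors as thresholds and return the F1-optimal one."""
    errors, is_anomaly = np.asarray(errors), np.asarray(is_anomaly, bool)
    best_t, best_f1 = None, -1.0
    for t in np.unique(errors):
        pred = errors >= t                            # flag high-error samples as anomalies
        tp = np.sum(pred & is_anomaly)
        fp = np.sum(pred & ~is_anomaly)
        fn = np.sum(~pred & is_anomaly)
        f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```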
#### Hyperparameter Tuning
- Testing 3 different bottleneck dimensions
- Comparison using AUC-ROC score

#### Results


#### The optimal bottleneck dimension appears to be 16, with the highest AUC (0.966) and a good balance of precision (0.785) and recall (0.683)
## 🧬 5. Variational Autoencoder
#### Problem Statement
Implementing and analyzing a Variational Autoencoder (VAE) on the MNIST dataset.

#### Implementation Details
- PyTorch implementation of VAE with encoder, reparameterization, and decoder
- Binary cross-entropy loss for reconstruction
- KL divergence for latent space regularization (a minimal sketch follows below)
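A minimal sketch of these pieces: an encoder producing μ and log σ², the reparameterisation trick, and a BCE + KL loss. Layer sizes and the latent dimension are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)
        self.fc_logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 784), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    """Binary cross-entropy reconstruction term plus KL divergence to N(0, I)."""
    bce = F.binary_cross_entropy(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kl
```

The ablation studies below correspond to dropping the `bce` or the `kl` term from this loss.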
#### Latent Space Visualization
#### Ablation Studies
1. Training without reconstruction loss
2. Training without KL divergence loss
#### Latent Space Sampling
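New digits can be generated by sampling latent vectors from the standard normal prior and passing them through the decoder (assuming the `VAE` sketch above):

```python
import torch

def sample_digits(vae, n=16):
    """Draw z ~ N(0, I) and decode to 28x28 images."""
    vae.eval()
    with torch.no_grad():
        z = torch.randn(n, vae.fc_mu.out_features)    # sample from the latent prior
        images = vae.dec(z).view(n, 28, 28)
    return images
```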


#### Loss Function Comparison
- Binary cross-entropy vs. MSE reconstruction loss
- Visual comparison of generated samples

#### Generated samples from the full VAE with BCE loss and KL loss

#### Generated samples from the VAE trained without the reconstruction loss but with the KL loss
#### Generated samples from the VAE trained with BCE loss but without the KL loss
#### Generated samples from the full VAE with MSE loss and KL loss


## 📦 Installation Instructions
```bash
# Clone the repository
git clone https://github.com/mayankmittal29/tensortinker_statistical_methods_in_ai.git
cd tensortinker_statistical_methods_in_ai
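# Install the main Python dependencies (list assumed from the repository topics; adjust as needed)
pip install numpy pandas matplotlib seaborn torch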
```

⭐ Feel free to star this repository if you find it useful! ⭐
📝 For any questions or issues, please open an issue on GitHub.