https://github.com/rajnandinithopte/machine-learning_knn-classification
KNN classification project for predicting abnormalities in a vertebral column dataset using various distance metrics, EDA, hyperparameter tuning, learning curves, and weighted voting to optimize performance.
- Host: GitHub
- URL: https://github.com/rajnandinithopte/machine-learning_knn-classification
- Owner: rajnandinithopte
- Created: 2025-02-03T07:40:37.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-02-03T20:59:42.000Z (10 months ago)
- Last Synced: 2025-02-03T21:35:04.134Z (10 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 1.23 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Machine Learning: KNN Classification on Vertebral Column Data
## 🔷 Overview
This project applies **K-Nearest Neighbors (KNN) classification** to the **Vertebral Column Data Set**, a biomedical dataset that categorizes spinal conditions into **normal** and **abnormal** classes. The implementation involves **data preprocessing, exploratory analysis, model training, evaluation, and experimentation** with different distance metrics and voting methods.
---
## 🔷 Dataset Description
The **Vertebral Column Data Set** is obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Vertebral+Column). It consists of **310 samples** with **6 biomechanical attributes** extracted from radiographic images of the spine.
### 🔶 Features in the Dataset
- Pelvic incidence
- Pelvic tilt
- Lumbar lordosis angle
- Sacral slope
- Pelvic radius
- Grade of spondylolisthesis
Each row represents a **patient's spinal measurements**, and the **target variable** classifies the patient as:
- **Normal (0)**
- **Abnormal (1)** (includes conditions like herniated discs and spondylolisthesis)
---
## 🔷 Libraries Used
To execute this project, the following Python libraries were used:
- **pandas** – Data manipulation and preprocessing
- **numpy** – Numerical computations
- **matplotlib** – Data visualization
- **seaborn** – Statistical data visualization
- **scikit-learn** – Machine learning algorithms (KNN classifier, distance metrics, model evaluation)
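The exact import cell lives in the notebook; a minimal set of imports consistent with the list above might look like this (a sketch, not copied from the repository):

```python
# Core stack assumed throughout the sketches below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces for classification and evaluation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
```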
## 🔷 Steps Taken to Accomplish the Project
### 🔶 1. Data Preprocessing and Exploratory Data Analysis (EDA)
- Converted categorical class labels into binary labels (Normal=0, Abnormal=1).
- Conducted scatterplot analysis to visualize relationships between independent variables.
- Created boxplots to identify outliers and distribution differences across the two classes.
- Split the dataset into training (first 70 rows of Class 0 and first 140 rows of Class 1) and test sets (remaining data), as sketched below.
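A minimal sketch of the label encoding and this fixed split, assuming the UCI data has been loaded into a DataFrame with a string `class` column (the file and column names here are assumptions, not the notebook's actual code):

```python
import pandas as pd

# Hypothetical file/column names for the UCI vertebral column data
df = pd.read_csv("vertebral_column.csv")

# Binary labels: Normal = 0, Abnormal = 1
df["label"] = (df["class"] == "Abnormal").astype(int)

# Fixed split: first 70 Class-0 rows and first 140 Class-1 rows form the training set
normal, abnormal = df[df["label"] == 0], df[df["label"] == 1]
train = pd.concat([normal.iloc[:70], abnormal.iloc[:140]])
test = pd.concat([normal.iloc[70:], abnormal.iloc[140:]])

feature_cols = df.columns.drop(["class", "label"])
X_train, y_train = train[feature_cols], train["label"]
X_test, y_test = test[feature_cols], test["label"]
```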
### 🔶 2. K-Nearest Neighbors (KNN) Classification
- Implemented KNN with Euclidean distance using either a custom algorithm or scikit-learn's implementation.
- Evaluated test errors for different values of k within `{208, 205, ..., 7, 4, 1}` (see the sketch after this list).
- Selected the optimal k by plotting training and test errors vs. k.
- Computed performance metrics for the optimal k:
- Confusion Matrix
- True Positive Rate (Recall)
- True Negative Rate
- Precision
- F1-Score
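A hedged sketch of this sweep and the metrics at the chosen k, using scikit-learn and the `X_train`/`X_test` variables from the split sketch above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, f1_score

ks = list(range(208, 0, -3))  # k = 208, 205, ..., 7, 4, 1
train_err, test_err = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(X_train, y_train)
    train_err.append(1 - knn.score(X_train, y_train))
    test_err.append(1 - knn.score(X_test, y_test))

# Optimal k = the one minimizing test error (plotting train/test error vs. k omitted here)
best_k = ks[int(np.argmin(test_err))]
y_pred = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train).predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TPR (recall):", tp / (tp + fn))
print("TNR:", tn / (tn + fp))
print("Precision:", precision_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
```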
### 🔶 3. Learning Curve Analysis
- Investigated the effect of training size on KNN performance.
- Trained the model using different training set sizes `N = {10, 20, 30, …, 210}`.
- Selected the optimal k dynamically for each training size.
- Plotted the learning curve (test error vs. training set size) to analyze model generalization, as in the sketch below.
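One plausible implementation, reusing `normal`, `abnormal`, `feature_cols`, and the fixed test set from the split sketch; the per-class subsampling rule (one third from Class 0) and the k-grid are assumptions, since the README does not pin them down:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

sizes = range(10, 211, 10)
curve = []
for N in sizes:
    n0 = N // 3  # assumed class balance of each training subset
    sub = pd.concat([normal.iloc[:n0], abnormal.iloc[: N - n0]])
    Xs, ys = sub[feature_cols], sub["label"]

    # Re-select the best k for this N over an assumed grid, keeping k <= N
    errs = [1 - KNeighborsClassifier(n_neighbors=k).fit(Xs, ys).score(X_test, y_test)
            for k in range(1, min(N, 196) + 1, 5)]
    curve.append(min(errs))

plt.plot(list(sizes), curve, marker="o")
plt.xlabel("Training set size N")
plt.ylabel("Best test error")
plt.title("KNN learning curve")
plt.show()
```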
### 🔶 4. Experimentation with Different Distance Metrics
- Replaced Euclidean distance with alternative distance measures:
  - Minkowski Distance (`p = 1` → Manhattan; `log₁₀(p) ∈ {0.1, 0.2, …, 1}`; `p → ∞` → Chebyshev)
  - Mahalanobis Distance (accounting for feature correlations)
- Compared test errors across distance metrics and summarized results in a table (see the sketch below).
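With scikit-learn these variants are mostly a matter of the `metric` argument; a sketch under the same variable names as above (brute-force search for Mahalanobis and a covariance matrix estimated from the training set are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def test_error(**knn_kwargs):
    knn = KNeighborsClassifier(n_neighbors=best_k, **knn_kwargs).fit(X_train, y_train)
    return 1 - knn.score(X_test, y_test)

results = {
    "manhattan (p=1)": test_error(metric="minkowski", p=1),
    "chebyshev (p->inf)": test_error(metric="chebyshev"),
    "mahalanobis": test_error(metric="mahalanobis", algorithm="brute",
                              metric_params={"VI": np.linalg.inv(np.cov(X_train, rowvar=False))}),
}

# Minkowski with log10(p) in {0.1, 0.2, ..., 1.0}, i.e. p from 10^0.1 up to 10
for log_p in np.arange(0.1, 1.01, 0.1):
    results[f"minkowski p=10^{log_p:.1f}"] = test_error(metric="minkowski", p=10 ** log_p)

for name, err in results.items():
    print(f"{name:22s} test error = {err:.3f}")
```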
### 🔶 5. Weighted Voting in KNN
- Implemented distance-weighted voting, where closer neighbors contribute more to the decision.
- Compared performance with Euclidean, Manhattan, and Chebyshev distances.
- Identified the best test error with `k ∈ {1, 6, 11, …, 196}`, as sketched below.
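In scikit-learn, distance-weighted voting corresponds to `weights="distance"`; a minimal sketch of the comparison over the stated k-grid:

```python
from sklearn.neighbors import KNeighborsClassifier

for metric in ["euclidean", "manhattan", "chebyshev"]:
    errs = {}
    for k in range(1, 197, 5):  # k in {1, 6, 11, ..., 196}
        knn = KNeighborsClassifier(n_neighbors=k, weights="distance", metric=metric)
        errs[k] = 1 - knn.fit(X_train, y_train).score(X_test, y_test)
    k_best = min(errs, key=errs.get)
    print(f"{metric}: best k = {k_best}, test error = {errs[k_best]:.3f}")
```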
### 🔶 6. Final Evaluation
- Reported lowest training error achieved across all experiments.
- Summarized findings on the best k-value, distance metric, and voting method.
---
## 📌 **Note**
This repository contains a **Jupyter Notebook** detailing each step, along with **results and visualizations**.