https://github.com/rajnandinithopte/machine-learning_knn-classification
KNN classification project for predicting abnormalities in a vertebral column dataset using various distance metrics, EDA, hyperparameter tuning, learning curves, and weighted voting to optimize performance.
- Host: GitHub
- URL: https://github.com/rajnandinithopte/machine-learning_knn-classification
- Owner: rajnandinithopte
- Created: 2025-02-03T07:40:37.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-03T20:59:42.000Z (4 months ago)
- Last Synced: 2025-02-03T21:35:04.134Z (4 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 1.23 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Machine Learning: KNN Classification on Vertebral Column Data
## 🔷 Overview
This project applies **K-Nearest Neighbors (KNN) classification** to the **Vertebral Column Data Set**, a biomedical dataset that categorizes spinal conditions into **normal** and **abnormal** classes. The implementation involves **data preprocessing, exploratory analysis, model training, evaluation, and experimentation** with different distance metrics and voting methods.

---
## 🔷 Dataset Description
The **Vertebral Column Data Set** is obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Vertebral+Column). It consists of **310 samples** with **6 biomechanical attributes** extracted from radiographic images of the spine.

### 🔶 Features in the Dataset
- Pelvic incidence
- Pelvic tilt
- Lumbar lordosis angle
- Sacral slope
- Pelvic radius
- Grade of spondylolisthesis

Each row represents a **patient's spinal measurements**, and the **target variable** classifies the patient as:
- **Normal (0)**
- **Abnormal (1)** (includes conditions like herniated discs and spondylolisthesis)

---
## 🔷 Libraries Used
To execute this project, the following Python libraries were used:

- **pandas** – Data manipulation and preprocessing
- **numpy** – Numerical computations
- **matplotlib** – Data visualization
- **seaborn** – Statistical data visualization
- **scikit-learn** – Machine learning algorithms (KNN classifier, distance metrics, model evaluation)

## 🔷 Steps Taken to Accomplish the Project
### 🔶 1. Data Preprocessing and Exploratory Data Analysis (EDA)
- Converted categorical class labels into binary labels (Normal=0, Abnormal=1).
- Conducted scatterplot analysis to visualize relationships between independent variables.
- Created boxplots to identify outliers and distribution differences across the two classes.
- Split the dataset into training (first 70 rows of Class 0 and first 140 rows of Class 1) and test sets (remaining data), as sketched below.
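A minimal sketch of the loading and split step, assuming the UCI file `column_2C.dat` (space-separated, with `AB`/`NO` class labels) is in the working directory; the column names below are illustrative, not taken from the notebook:

```python
# Minimal loading/split sketch. Assumes the UCI file "column_2C.dat"
# (space-separated, six features plus an AB/NO class label).
import pandas as pd

cols = ["pelvic_incidence", "pelvic_tilt", "lumbar_lordosis_angle",
        "sacral_slope", "pelvic_radius", "spondylolisthesis_grade", "class"]
df = pd.read_csv("column_2C.dat", sep=r"\s+", header=None, names=cols)

# Binary labels: Normal = 0, Abnormal = 1.
df["class"] = df["class"].map({"NO": 0, "AB": 1})

# Training set: first 70 Class-0 rows and first 140 Class-1 rows; rest is test.
train = pd.concat([df[df["class"] == 0].iloc[:70],
                   df[df["class"] == 1].iloc[:140]])
test = df.drop(train.index)
X_train, y_train = train.drop(columns="class"), train["class"]
X_test, y_test = test.drop(columns="class"), test["class"]
```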
### 🔶 2. K-Nearest Neighbors (KNN) Classification
- Implemented KNN with Euclidean distance using either a custom algorithm or scikit-learn's implementation.
- Evaluated test errors for k in `{208, 205, ..., 7, 4, 1}` (decreasing in steps of 3).
- Selected the optimal k by plotting training and test errors vs. k.
- Computed performance metrics for the optimal k:
  - Confusion Matrix
  - True Positive Rate (Recall)
  - True Negative Rate
  - Precision
  - F1-Score
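A hedged sketch of the k sweep and the metrics at the chosen k, using scikit-learn's `KNeighborsClassifier` and the variable names from the split sketch above:

```python
# Sweep k over {208, 205, ..., 4, 1} and score each model on the test set.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

test_errors = {}
for k in range(208, 0, -3):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    test_errors[k] = 1 - knn.fit(X_train, y_train).score(X_test, y_test)

best_k = min(test_errors, key=test_errors.get)
y_pred = (KNeighborsClassifier(n_neighbors=best_k)
          .fit(X_train, y_train).predict(X_test))

# Confusion matrix and the derived rates for the optimal k.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("recall (TPR):     ", recall_score(y_test, y_pred))
print("specificity (TNR):", tn / (tn + fp))
print("precision:        ", precision_score(y_test, y_pred))
print("F1-score:         ", f1_score(y_test, y_pred))
```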
### 🔶 3. Learning Curve Analysis
- Investigated the effect of training size on KNN performance.
- Trained the model using different training set sizes `N = {10, 20, 30, …, 210}`.
- Selected the optimal k dynamically for each training size.
- Plotted the learning curve (test error vs. training set size) to analyze model generalization.
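A sketch of the learning-curve loop; the exact per-class proportions at each training size N and the grid of k values searched are assumptions, not read from the notebook:

```python
# Learning-curve sketch (names follow the split sketch above). The per-class
# proportions at each size N are an assumed reading of the protocol.
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

sizes = range(10, 211, 10)
curve = []
for n in sizes:
    sub = pd.concat([train[train["class"] == 0].iloc[:n // 3],
                     train[train["class"] == 1].iloc[:n - n // 3]])
    Xs, ys = sub.drop(columns="class"), sub["class"]
    # Pick the best k dynamically for this training size.
    best_err = min(
        1 - KNeighborsClassifier(n_neighbors=k).fit(Xs, ys).score(X_test, y_test)
        for k in range(1, n, 5)
    )
    curve.append(best_err)
# `curve` can then be plotted against `sizes` (test error vs. training size).
```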
### 🔶 4. Experimentation with Different Distance Metrics
- Replaced Euclidean distance with alternative distance measures:
  - Minkowski Distance `(p = 1 → Manhattan, log₁₀(p) ∈ {0.1, 0.2, …, 1}, p → ∞ → Chebyshev)`
  - Mahalanobis Distance (accounting for feature correlations)
- Compared test errors across distance metrics and summarized the results in a table.
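A sketch of the metric swap (`best_k` and the splits come from the sketches above); note that scikit-learn's Mahalanobis metric needs the inverse training covariance passed through `metric_params`:

```python
# Metric-swap sketch: same classifier, different distance functions.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def test_error(**kwargs):
    knn = KNeighborsClassifier(n_neighbors=best_k, **kwargs)
    return 1 - knn.fit(X_train, y_train).score(X_test, y_test)

err_manhattan = test_error(metric="minkowski", p=1)
err_minkowski = {lp: test_error(metric="minkowski", p=10 ** lp)
                 for lp in [i / 10 for i in range(1, 11)]}  # log10(p) grid
err_chebyshev = test_error(metric="chebyshev")
err_mahalanobis = test_error(metric="mahalanobis", algorithm="brute",
                             metric_params={"VI": np.linalg.inv(np.cov(X_train.T))})
```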
### 🔶 5. Weighted Voting in KNN
- Implemented distance-weighted voting, where closer neighbors contribute more to the decision.
- Compared performance with Euclidean, Manhattan, and Chebyshev distances.
- Identified the best test error with `k ∈ {1, 6, 11, …, 196}`.
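In scikit-learn, distance-weighted voting is the single argument `weights="distance"`; a sketch of the comparison across the three metrics:

```python
# Distance-weighted KNN: each neighbor's vote is weighted by 1/distance,
# so closer neighbors contribute more to the decision.
from sklearn.neighbors import KNeighborsClassifier

best_weighted = {}
for metric in ["euclidean", "manhattan", "chebyshev"]:
    errs = [1 - KNeighborsClassifier(n_neighbors=k, metric=metric,
                                     weights="distance")
                .fit(X_train, y_train).score(X_test, y_test)
            for k in range(1, 197, 5)]  # k = 1, 6, 11, ..., 196
    best_weighted[metric] = min(errs)
```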
### 🔶 6. Final Evaluation
- Reported the lowest training error achieved across all experiments.
- Summarized findings on the best k-value, distance metric, and voting method.

---
## 📌 **Note**
This repository contains a **Jupyter Notebook** detailing each step, along with **results and visualizations**.