https://github.com/ashithapallath/knn-distance-measures
This project compares k-NN performance using different distance metrics. Euclidean, Manhattan, and Minkowski achieved 100% accuracy, making them ideal for numerical data. Cosine Similarity performed well (93.33%), while Hamming and Jaccard were ineffective (33.33%).
- Host: GitHub
- URL: https://github.com/ashithapallath/knn-distance-measures
- Owner: ashithapallath
- Created: 2025-02-05T04:35:50.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-02-05T04:42:12.000Z (8 months ago)
- Last Synced: 2025-02-12T18:01:54.224Z (8 months ago)
- Topics: cosine-similarity, distance-measures, euclidean-distances, hamming-distance, jaccard-similarity, knn, manhattan-distance, minkowski
- Language: Jupyter Notebook
- Homepage:
- Size: 89.8 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# k-NN Classification with Different Distance Measures
## Overview
This project implements a **k-Nearest Neighbors (k-NN) classifier** on the **Iris dataset** using different distance measures. The goal is to compare the classification performance of the various distance metrics and determine the most effective one; a short sketch of how each measure can be computed follows the list below.

## Distance Measures Used
1. **Euclidean Distance (L2 Norm)**
   - Measures the straight-line distance between points.
   - Commonly used for continuous numerical data.
2. **Manhattan Distance (L1 Norm)**
   - Measures distance along axes at right angles.
   - Useful when movement is restricted to a grid-like path.
3. **Minkowski Distance** (Generalized Form)
   - A generalization of Euclidean and Manhattan distances.
   - A parameter *p* controls the norm: *p* = 1 gives Manhattan and *p* = 2 gives Euclidean distance.
4. **Cosine Similarity**
   - Measures the cosine of the angle between vectors.
   - Often used in text and document similarity.
5. **Hamming Distance**
   - Measures the number of positions at which two strings differ.
   - Used for categorical data and error detection.
6. **Jaccard Similarity**
   - Measures the size of the intersection over the union of sets.
   - Often applied to categorical and binary data.
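The notebook's own code is not reproduced in this README, so the following is a minimal sketch of how each of these measures can be computed with NumPy; the vectors are invented purely for illustration.

```python
import numpy as np

# Illustrative continuous feature vectors (invented, not taken from the dataset).
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.2, 2.9, 4.3, 1.3])

euclidean = np.sqrt(np.sum((a - b) ** 2))            # L2 norm
manhattan = np.sum(np.abs(a - b))                    # L1 norm
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)    # generalized form, here with p = 3
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Illustrative binary vectors for the set/string-based measures.
u = np.array([1, 0, 1, 1])
v = np.array([1, 1, 0, 1])

hamming = np.mean(u != v)                            # fraction of positions that differ
jaccard_sim = np.sum(u & v) / np.sum(u | v)          # |intersection| / |union|

print(euclidean, manhattan, minkowski, cosine_sim, hamming, jaccard_sim)
```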
## Implementation
- The Iris dataset was loaded and split into **training (80%)** and **testing (20%)** sets.
- Standardization was applied for Euclidean, Manhattan, Minkowski, and Cosine similarity.
- Binarization was applied for Jaccard and Hamming distances.
- A **k-NN classifier (k=5)** was trained for each distance measure.
- Accuracy scores were computed for comparison (a sketch of this pipeline is shown below).
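The following is a minimal sketch of the pipeline described above, assuming scikit-learn's `KNeighborsClassifier` with its built-in metric options; the random seed and the binarization threshold (0 on the standardized features) are assumptions rather than values taken from the repository.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)            # 80/20 split; the seed is an assumption

# Standardized features for the geometric/angular metrics.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Binarized features for Hamming and Jaccard (thresholding at 0 is an assumption).
X_train_b, X_test_b = (X_train_s > 0).astype(int), (X_test_s > 0).astype(int)

configs = {
    "Euclidean": (X_train_s, X_test_s, dict(metric="euclidean")),
    "Manhattan": (X_train_s, X_test_s, dict(metric="manhattan")),
    "Minkowski": (X_train_s, X_test_s, dict(metric="minkowski", p=3)),
    "Cosine":    (X_train_s, X_test_s, dict(metric="cosine")),
    "Hamming":   (X_train_b, X_test_b, dict(metric="hamming")),
    "Jaccard":   (X_train_b, X_test_b, dict(metric="jaccard")),
}

# Train a k-NN classifier (k = 5) for each distance measure and report accuracy.
for name, (Xtr, Xte, kwargs) in configs.items():
    knn = KNeighborsClassifier(n_neighbors=5, **kwargs).fit(Xtr, y_train)
    print(f"{name} Accuracy: {accuracy_score(y_test, knn.predict(Xte)):.4f}")
```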
## Results

```
Euclidean Accuracy: 1.0000
Manhattan Accuracy: 1.0000
Minkowski Accuracy: 1.0000
Cosine Accuracy: 0.9333
Hamming Accuracy: 0.3333
Jaccard Accuracy: 0.3333
```

## Best Performing Distance Measure
The **Euclidean, Manhattan, and Minkowski distances** all achieved perfect accuracy (**1.0000**). These metrics are well-suited for continuous numerical datasets like the **Iris dataset**, where feature relationships are best captured by geometric distance.

### Reasoning
- **Euclidean Distance** is ideal when all features have equal importance and are continuous.
- **Manhattan Distance** is robust to outliers and works well for structured grid-like data, though it performed equally well here.
- **Minkowski Distance** generalizes both Euclidean and Manhattan, achieving the same accuracy at **p=3**.
- **Cosine Similarity** was slightly lower (**0.9333**), as it focuses on direction rather than magnitude (see the small example below).
- **Hamming and Jaccard Distances** performed poorly (**0.3333**) because the Iris features are continuous; binarizing them discards most of the information, leaving accuracy at roughly chance level for the three classes.
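To make the direction-versus-magnitude point concrete, here is a small toy example (the vectors are invented, not taken from the Iris data):

```python
import numpy as np

# Two vectors pointing in the same direction but with very different magnitudes.
a = np.array([1.0, 2.0, 3.0])
b = 10 * a

cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(cosine_sim)   # 1.0   -> identical direction, so cosine sees no difference
print(euclidean)    # ~33.7 -> Euclidean still reports a large gap in magnitude
```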
## Conclusion
For numerical datasets like **Iris**, **Euclidean Distance** is the most effective measure due to its ability to directly compare feature magnitudes.

For other datasets:
- Use **Manhattan Distance** when robustness to outliers is needed.
- Use **Cosine Similarity** for high-dimensional sparse data.
- Use **Jaccard and Hamming** for categorical or binary data.

## How to Run the Code
1. Install dependencies: `pip install numpy pandas scikit-learn`
2. Run the Python script: `python DistanceMeasures.py`