https://github.com/tutkufurkan/machine-learning---clustering-models
Comprehensive Machine Learning clustering tutorial with K-Means and Hierarchical Clustering implementations. Features synthetic data generation, Elbow Method optimization, dendrogram visualization, and detailed algorithm comparisons. Built with Python, scikit-learn, and Plotly.
clustering clustering-analysis data-science data-visualization dendrogram hierarchical-clustering k-means kaggle machine-learning matplotlib python scikit-learn scipy synthetic-data unsupervised-learning ward-linkage
- Host: GitHub
- URL: https://github.com/tutkufurkan/machine-learning---clustering-models
- Owner: tutkufurkan
- License: apache-2.0
- Created: 2025-11-07T20:19:51.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-12T12:22:48.000Z (7 months ago)
- Last Synced: 2025-11-12T13:28:05.578Z (7 months ago)
- Topics: clustering, clustering-analysis, data-science, data-visualization, dendrogram, hierarchical-clustering, k-means, kaggle, machine-learning, matplotlib, python, scikit-learn, scipy, synthetic-data, unsupervised-learning, ward-linkage
- Language: Jupyter Notebook
- Homepage: https://tutkufurkan.com
- Size: 654 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Machine Learning Clustering Models Tutorial
## Overview
A comprehensive tutorial on **unsupervised machine learning clustering techniques** using Python. Learn K-Means and Hierarchical Clustering with synthetic data, mathematical explanations, interactive visualizations, and detailed performance comparisons.
## 🎮 Interactive Demo
**[Run the Interactive Notebook on Kaggle](https://www.kaggle.com/code/dandrandandran2093/machine-learning-clustering-models)**
## Table of Contents
- [What is Clustering?](#what-is-clustering)
- [Clustering Algorithms](#clustering-algorithms)
- [Dataset](#dataset)
- [Installation](#installation)
- [Usage](#usage)
- [Algorithm Comparison](#algorithm-comparison)
- [Key Insights](#key-insights)
- [References](#references)
## What is Clustering?
**Clustering** is an unsupervised learning technique that groups similar data points together without predefined labels. Unlike supervised learning, clustering discovers hidden patterns in unlabeled data.
### Supervised vs Unsupervised Learning
| Type | Has Labels? | Examples | Goal |
|------|-------------|----------|------|
| **Supervised** | ✅ Yes | Classification, Regression | Predict labels |
| **Unsupervised** | ❌ No | Clustering | Discover patterns |
**Common Use Cases:**
- Customer segmentation
- Gene expression analysis
- Image segmentation
- Document clustering
- Anomaly detection
## Clustering Algorithms
### 1. K-Means Clustering
**Concept**: Partitions data into K clusters by minimizing within-cluster variance.
**Algorithm:**
1. Choose K (number of clusters)
2. Initialize K random centroids
3. Assign points to nearest centroid
4. Update centroids (mean of assigned points)
5. Repeat until convergence
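**Formula:**
$$\text{Minimize: } \sum_{i=1}^{K}\sum_{x \in C_i}||x - \mu_i||^2$$
Here $C_i$ is the set of points assigned to cluster $i$ and $\mu_i$ is its centroid (the mean of those points).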
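The five steps and the objective above can be sketched in plain NumPy. This is a minimal illustration, not the notebook's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example, and it uses simple random initialization from the data points.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Index of the closest centroid for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Minimal K-Means loop; returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid
        labels = nearest_centroid(points, centroids)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = nearest_centroid(points, centroids)
    # Objective from the formula: squared distance of each point to its centroid
    wcss = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Toy data (illustrative only): two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels, centroids, wcss = kmeans_sketch(pts, k=2)
print(centroids)
print(f"WCSS: {wcss:.1f}")
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds smarter initialization (k-means++) and multiple restarts.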
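**Elbow Method**: Plot K against WCSS (within-cluster sum of squares, the quantity minimized above) to find a suitable number of clusters. WCSS keeps shrinking as K grows, so look for the "elbow point" where the decrease slows sharply.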
**Advantages:**
- Fast and efficient
- Scalable to large datasets
- Simple to implement
**Disadvantages:**
- Must specify K beforehand
- Sensitive to initialization
- Assumes spherical clusters
### 2. Hierarchical Clustering
**Concept**: Builds a hierarchy of clusters without specifying K beforehand. Creates a dendrogram (tree structure) showing relationships.
**Algorithm (Agglomerative):**
1. Start with each point as its own cluster
2. Merge two closest clusters
3. Repeat until one cluster remains
4. Cut dendrogram at desired height to get K clusters
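The steps above map onto two SciPy calls: `linkage` performs the merges and `fcluster` cuts the resulting tree. The sketch below uses small toy data and a cut height of 10, both chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Toy data (an assumption for illustration): three tight 2-D groups of 20 points
points = np.vstack([
    rng.normal((0, 0), 0.5, (20, 2)),
    rng.normal((10, 0), 0.5, (20, 2)),
    rng.normal((0, 10), 0.5, (20, 2)),
])

# Steps 1-3: start from single points and merge until one cluster remains.
# Each row of Z records one merge: (cluster a, cluster b, distance, new size).
Z = linkage(points, method="ward")
print(Z.shape)  # one row per merge, so n_points - 1 rows

# Step 4, option A: ask for a fixed number of clusters
labels_k = fcluster(Z, t=3, criterion="maxclust")

# Step 4, option B: cut the tree at a chosen height (merge distance)
labels_h = fcluster(Z, t=10.0, criterion="distance")

print(len(set(labels_k)), len(set(labels_h)))
```

Cutting by height is what reading a dendrogram amounts to: a horizontal line at that height crosses one branch per cluster.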
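**Formula (single linkage):**
$$\text{Distance: } d(C_i, C_j) = \min_{x \in C_i, y \in C_j} ||x - y||$$
This is the single-linkage distance; the other linkage methods listed below define $d(C_i, C_j)$ differently.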
**Linkage Methods:**
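- **Ward**: Merges the pair of clusters whose union increases total within-cluster variance the least (the scikit-learn default, used in this tutorial)
- **Single**: Minimum distance between two points, one from each cluster
- **Complete**: Maximum distance between two points, one from each cluster
- **Average**: Mean distance over all cross-cluster pairs of points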
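To make the linkage definitions concrete, the sketch below computes each one by hand for two tiny clusters. The helper `cluster_distance` is hypothetical (written for this example, not a library function), and for Ward it returns the increase in within-cluster sum of squares caused by the merge, which is the quantity Ward linkage keeps small; libraries may report a rescaled version of it.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(a, b, method="single"):
    """Dissimilarity between clusters a and b, arrays of shape (n_points, n_features)."""
    pairwise = cdist(a, b)  # Euclidean distance for every pair (x in a, y in b)
    if method == "single":
        return float(pairwise.min())
    if method == "complete":
        return float(pairwise.max())
    if method == "average":
        return float(pairwise.mean())
    if method == "ward":
        # Increase in within-cluster sum of squares if a and b were merged
        n_a, n_b = len(a), len(b)
        gap_sq = float(((a.mean(axis=0) - b.mean(axis=0)) ** 2).sum())
        return n_a * n_b / (n_a + n_b) * gap_sq
    raise ValueError(f"unknown method: {method}")

# Two clusters on a line: {0, 1} and {4, 6}
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [6.0, 0.0]])

for method in ("single", "complete", "average", "ward"):
    print(method, cluster_distance(a, b, method))
# single 3.0, complete 6.0, average 4.5, ward 20.25
```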
**Advantages:**
- No need to specify K
- Dendrogram visualization
- Captures hierarchical relationships
**Disadvantages:**
- Slow (O(n³) complexity)
- Not suitable for large datasets
- Merge decisions are irreversible
## Dataset
**Synthetic Data Generation**: 3 clusters with Gaussian distribution
| Cluster | Location | Mean (x, y) | Points (K-Means) | Points (Hierarchical) |
|---------|----------|-------------|------------------|-----------------------|
| 1 | Bottom-left | (25, 25) | 1,000 | 100 |
| 2 | Top-right | (55, 60) | 1,000 | 100 |
| 3 | Bottom-right | (55, 15) | 1,000 | 100 |
**Total**: 3,000 points for K-Means / 300 points for Hierarchical
**Why different sizes?** Hierarchical is computationally expensive (O(n³)), so we use a smaller dataset for reasonable runtime.
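One way to see the cost gap is to count the pairwise distances that agglomerative clustering has to consider at the start. The helper name below is made up for this illustration.

```python
def n_pairwise_distances(n_points):
    """Number of distinct point pairs among n_points."""
    return n_points * (n_points - 1) // 2

for n in (300, 3000):
    print(f"{n} points -> {n_pairwise_distances(n):,} pairwise distances")
# 300 points -> 44,850 pairwise distances
# 3000 points -> 4,498,500 pairwise distances
```

Ten times as many points means roughly a hundred times as many pairs, before any merging work is done.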
## Installation
### Option 1: Kaggle (Recommended) ⭐
**[Open on Kaggle](https://www.kaggle.com/code/dandrandandran2093/machine-learning-clustering-models)** - Everything pre-configured!
### Option 2: Local
```bash
# Clone repository
git clone https://github.com/sekertutku/Machine-Learning---Clustering-Models.git
cd Machine-Learning---Clustering-Models
# Install dependencies
pip install -r requirements.txt
# Run notebook
jupyter notebook machine-learning-clustering-models.ipynb
```
## Usage
### Quick Start
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # seeded so results are reproducible

def make_data(n_per_cluster):
    """Three Gaussian clusters (std 5) centered at (25, 25), (55, 60), (55, 15)."""
    centers = [(25, 25), (55, 60), (55, 15)]
    x = np.concatenate([rng.normal(cx, 5, n_per_cluster) for cx, _ in centers])
    y = np.concatenate([rng.normal(cy, 5, n_per_cluster) for _, cy in centers])
    return pd.DataFrame({"x": x, "y": y})

# K-Means on 3,000 points (1,000 per cluster)
data = make_data(1000)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(data)
print(f"Centroids:\n{kmeans.cluster_centers_}")

# Hierarchical on 300 points (100 per cluster), matching the Dataset section
small_data = make_data(100)
hierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')
h_clusters = hierarchical.fit_predict(small_data)

# Dendrogram of the same 300 points
linkage_matrix = linkage(small_data, method='ward')
dendrogram(linkage_matrix)
plt.show()
```
### Elbow Method
```python
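# Find optimal K (uses `data`, `KMeans`, and `plt` from the Quick Start)
wcss = []
for k in range(1, 15):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS for this K

# Plot WCSS against K and look for the elbow
plt.plot(range(1, 15), wcss, marker='o')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()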
```
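The elbow can also be read off numerically by looking at how much of the previous WCSS each extra cluster removes. The sketch below is self-contained and regenerates data with the same cluster centers as the Dataset section; the variable names and the choice of K = 1 to 7 are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
centers = [(25, 25), (55, 60), (55, 15)]
# 1,000 points per cluster, standard deviation 5 in both coordinates
data = np.vstack([rng.normal(c, 5, (1000, 2)) for c in centers])

ks = list(range(1, 8))
wcss = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(data).inertia_
    for k in ks
]

# Fraction of the previous WCSS removed by going from K-1 to K clusters
drops = {k: (prev - cur) / prev for k, prev, cur in zip(ks[1:], wcss[:-1], wcss[1:])}
for k, drop in drops.items():
    print(f"K={k}: WCSS drops by {drop:.1%}")
```

On data like this, the drop is large up to K=3 and much smaller afterwards, which is the numerical counterpart of the bend in the plot.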
## Algorithm Comparison
### Performance Summary
| Feature | K-Means | Hierarchical |
|---------|---------|--------------|
| **Speed** | ⚡ Fast | 🐢 Slow |
| **Dataset Size** | Large (3,000 points) | Small (300 points) |
| **K Selection** | Must specify (Elbow Method) | From dendrogram |
| **Scalability** | ✅ 10,000+ points | ⚠️ < 5,000 points |
| **Visualization** | Centroids | Dendrogram tree |
| **Complexity** | O(n×K×iterations) | O(n³) |
| **Cluster Shape** | Spherical | Depends on linkage (Ward favors compact clusters; single linkage can follow elongated shapes) |
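The cluster-shape row can be checked on data that is deliberately non-spherical. The sketch below uses scikit-learn's `make_moons` toy dataset, which is not part of this tutorial's notebook, and scores each result against the known labels with the adjusted Rand index (1.0 means a perfect match). Treat it as an illustration under those assumptions rather than a benchmark.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are clearly not spherical
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Hierarchical (ward)": AgglomerativeClustering(n_clusters=2, linkage="ward"),
    "Hierarchical (single)": AgglomerativeClustering(n_clusters=2, linkage="single"),
}

# Adjusted Rand index between the true moon labels and each clustering
scores = {
    name: adjusted_rand_score(y_true, model.fit_predict(X))
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: ARI = {score:.2f}")
```

Single linkage chains neighboring points together, so it can trace each moon, while K-Means splits the plane with a straight boundary between its two centroids.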
### When to Use
**K-Means:**
- ✅ Large datasets (10,000+ points)
- ✅ Speed is critical
- ✅ Production systems
- ✅ Spherical clusters expected

**Hierarchical:**
- ✅ Unknown number of clusters
- ✅ Small/medium datasets (< 5,000 points)
- ✅ Need to visualize hierarchy
- ✅ Exploratory analysis
## Key Insights
**✅ Both Algorithms Succeeded:**
- K-Means: 3,000 points processed efficiently
- Hierarchical: 300 points with clear dendrogram
- Elbow Method confirmed K=3
- Dendrogram showed 3-cluster structure
**Best Practices:**
- Use Elbow Method for K-Means optimization
- Use Dendrogram for Hierarchical K selection
- Scale features before clustering
- Start with K-Means for large data
- Use Hierarchical for exploratory analysis
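The advice to scale features matters most when columns use very different units. The sketch below uses hypothetical income and age features (not from the tutorial's dataset) in which the two groups differ only by age; without scaling, the much larger income values dominate the Euclidean distances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features: income in dollars (no group structure) and
# age in years (two clear groups around 25 and 60)
income = rng.normal(50_000, 10_000, 200)
age = np.concatenate([rng.normal(25, 2, 100), rng.normal(60, 2, 100)])
X = np.column_stack([income, age])
truth = np.repeat([0, 1], 100)  # which age group each row belongs to

def agreement(labels):
    """Share of rows matching the age groups, ignoring label order."""
    match = float(np.mean(labels == truth))
    return max(match, 1 - match)

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_labels = make_pipeline(
    StandardScaler(),  # zero mean, unit variance per feature
    KMeans(n_clusters=2, n_init=10, random_state=0),
).fit_predict(X)

print(f"Unscaled agreement with age groups: {agreement(raw_labels):.2f}")
print(f"Scaled agreement with age groups:   {agreement(scaled_labels):.2f}")
```

Wrapping the scaler and the model in one pipeline keeps the same scaling applied whenever the model is reused on new data.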
**⚠️ Common Pitfalls:**
- Using Hierarchical on large datasets (too slow!)
- Not scaling features (distance-based algorithms need it)
- Choosing K randomly (use Elbow/Dendrogram)
- Ignoring domain knowledge
## Requirements
```
numpy>=1.24.0
pandas>=2.0.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
plotly>=5.15.0
scipy>=1.11.0
jupyter>=1.0.0
```
## Contributing
Contributions welcome! Please open an issue first to discuss major changes.
**Ideas:**
- Add DBSCAN algorithm
- Implement Silhouette Score
- Add real-world datasets
- Create interactive Plotly visualizations
## License
Apache License 2.0 - see LICENSE file for details.
## References
### Course
- **Udemy**: MACHINE LEARNING by DATAI TEAM
### Documentation
- [Scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)
- [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
- [Hierarchical Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)
- [SciPy Dendrogram](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html)
**My Machine Learning Series:**
- **Clustering Models** - [[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-clustering-models) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Clustering-Models) *(Current)*
- **Advanced Topics** - [[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-advanced-topics) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Advanced-Topics)
- **Classification Models** - [[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-classifications-models) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Classifications-Models)
- **Regression Models** - [[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-regression-models) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Regression-Models)
## Acknowledgments
- DATAI TEAM for the machine learning course
- Scikit-learn and SciPy developers
- Open-source community
---
## Connect
- Open an issue for questions
- Connect on [Kaggle](https://www.kaggle.com/dandrandandran2093)
- Visit [tutkufurkan.com](https://www.tutkufurkan.com/)
- Star ⭐ if helpful!
---
**Happy Clustering! 🎯**
[tutkufurkan.com](https://www.tutkufurkan.com/)