{"id":32992776,"url":"https://github.com/tutkufurkan/machine-learning---clustering-models","last_synced_at":"2025-11-14T11:00:36.069Z","repository":{"id":323850836,"uuid":"1091986941","full_name":"tutkufurkan/Machine-Learning---Clustering-Models","owner":"tutkufurkan","description":"Comprehensive Machine Learning clustering tutorial with K-Means and Hierarchical Clustering implementations. Features synthetic data generation, Elbow Method optimization, dendrogram visualization, and detailed algorithm comparisons. Built with Python, scikit-learn, and Plotly.","archived":false,"fork":false,"pushed_at":"2025-11-12T12:22:48.000Z","size":670,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-12T13:28:05.578Z","etag":null,"topics":["clustering","clustering-analysis","data-science","data-visualization","dendrogram","hierarchical-clustering","k-means","kaggle","machine-learning","matplotlib","python","scikit-learn","scipy","synthetic-data","unsupervised-learning","ward-linkage"],"latest_commit_sha":null,"homepage":"https://tutkufurkan.com","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tutkufurkan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-07T20:19:51.000Z","updated_at":"2025-11-12T12:22:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tutkufurkan/Machine-Learning---Clustering-Models","commit_stats":null,"previous_names":["tutkufurkan/machine-learning---clustering-models"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tutkufurkan/Machine-Learning---Clustering-Models","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tutkufurkan%2FMachine-Learning---Clustering-Models","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tutkufurkan%2FMachine-Learning---Clustering-Models/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tutkufurkan%2FMachine-Learning---Clustering-Models/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tutkufurkan%2FMachine-Learning---Clustering-Models/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tutkufurkan","download_url":"https://codeload.github.com/tutkufurkan/Machine-Learning---Clustering-Models/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tutkufurkan%2FMachine-Learning---Clustering-Models/sbom","scorecard":null,"host":{"name":"GitHub","url":"https:/
/github.com","kind":"github","repositories_count":284194067,"owners_count":26963045,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-13T02:00:06.582Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","clustering-analysis","data-science","data-visualization","dendrogram","hierarchical-clustering","k-means","kaggle","machine-learning","matplotlib","python","scikit-learn","scipy","synthetic-data","unsupervised-learning","ward-linkage"],"created_at":"2025-11-13T10:00:51.572Z","updated_at":"2025-11-14T11:00:36.041Z","avatar_url":"https://github.com/tutkufurkan.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Machine Learning Clustering Models 
Tutorial\n\n[![Python](https://img.shields.io/badge/Python-3.x-blue.svg)](https://www.python.org/)\n[![Scikit-learn](https://img.shields.io/badge/Scikit--learn-Latest-orange.svg)](https://scikit-learn.org/)\n[![Plotly](https://img.shields.io/badge/Plotly-Latest-blue.svg)](https://plotly.com/)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Kaggle](https://img.shields.io/badge/Kaggle-Notebook-20BEFF?logo=kaggle\u0026logoColor=white)](https://www.kaggle.com/code/dandrandandran2093/machine-learning-clustering-models)\n[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/sekertutku/Machine-Learning---Clustering-Models)\n\n## Overview\n\nA comprehensive tutorial on **unsupervised machine learning clustering techniques** using Python. Learn K-Means and Hierarchical Clustering with synthetic data, mathematical explanations, interactive visualizations, and detailed performance comparisons.\n\n## 🎮 Interactive Demo\n\n**👉 [Run the Interactive Notebook on Kaggle](https://www.kaggle.com/code/dandrandandran2093/machine-learning-clustering-models)**\n\n## Table of Contents\n\n- [What is Clustering?](#what-is-clustering)\n- [Clustering Algorithms](#clustering-algorithms)\n- [Dataset](#dataset)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Algorithm Comparison](#algorithm-comparison)\n- [Key Insights](#key-insights)\n- [References](#references)\n\n## What is Clustering?\n\n**Clustering** is an unsupervised learning technique that groups similar data points together without predefined labels. Unlike supervised learning, clustering discovers hidden patterns in unlabeled data.\n\n### Supervised vs Unsupervised Learning\n\n| Type | Has Labels? 
| Examples | Goal |\n|------|-------------|----------|------|\n| **Supervised** | ✅ Yes | Classification, Regression | Predict labels |\n| **Unsupervised** | ❌ No | Clustering | Discover patterns |\n\n**Common Use Cases:**\n- 🛒 Customer segmentation\n- 🧬 Gene expression analysis\n- 📸 Image segmentation\n- 📄 Document clustering\n- 🔍 Anomaly detection\n\n## Clustering Algorithms\n\n### 1. K-Means Clustering\n\n**Concept**: Partitions data into K clusters by minimizing within-cluster variance.\n\n**Algorithm:**\n1. Choose K (number of clusters)\n2. Initialize K random centroids\n3. Assign points to nearest centroid\n4. Update centroids (mean of assigned points)\n5. Repeat until convergence\n\n**Formula:**\n$$\\text{Minimize: } \\sum_{i=1}^{K}\\sum_{x \\in C_i}||x - \\mu_i||^2$$\n\n**Elbow Method**: Plot K vs WCSS to find optimal number of clusters. Look for the \"elbow point\" where WCSS decrease slows down.\n\n**Advantages:**\n- ⚡ Fast and efficient\n- 📊 Scalable to large datasets\n- 🎯 Simple to implement\n\n**Disadvantages:**\n- 🎲 Must specify K beforehand\n- 🔄 Sensitive to initialization\n- ⭕ Assumes spherical clusters\n\n### 2. Hierarchical Clustering\n\n**Concept**: Builds a hierarchy of clusters without specifying K beforehand. Creates a dendrogram (tree structure) showing relationships.\n\n**Algorithm (Agglomerative):**\n1. Start with each point as its own cluster\n2. Merge two closest clusters\n3. Repeat until one cluster remains\n4. 
Cut dendrogram at desired height to get K clusters\n\n**Formula (single linkage):**\n$$\\text{Distance: } d(C_i, C_j) = \\min_{x \\in C_i, y \\in C_j} ||x - y||$$\n\n**Linkage Methods:**\n- **Ward**: Merges the pair of clusters that gives the smallest increase in within-cluster variance (most common; used in this tutorial)\n- **Single**: Minimum distance between points of the two clusters\n- **Complete**: Maximum distance between points of the two clusters\n- **Average**: Average distance between points of the two clusters\n\n**Advantages:**\n- 🌳 No need to specify K up front\n- 📊 Dendrogram visualization\n- 🔗 Captures hierarchical relationships\n\n**Disadvantages:**\n- 🐢 Slow (O(n³) complexity)\n- 💾 Not suitable for large datasets\n- 🔒 Merge decisions are irreversible\n\n## Dataset\n\n**Synthetic Data Generation**: 3 clusters with Gaussian distribution\n\n| Cluster | Location | Mean (x, y) | Points (K-Means) | Points (Hierarchical) |\n|---------|----------|-------------|------------------|-----------------------|\n| 1 | Bottom-left | (25, 25) | 1,000 | 100 |\n| 2 | Top-right | (55, 60) | 1,000 | 100 |\n| 3 | Bottom-right | (55, 15) | 1,000 | 100 |\n\n**Total**: 3,000 points for K-Means / 300 points for Hierarchical\n\n**Why different sizes?** Hierarchical is computationally expensive (O(n³)), so we use a smaller dataset for reasonable runtime.\n\n## Installation\n\n### Option 1: Kaggle (Recommended) ⭐\n\n👉 **[Open on Kaggle](https://www.kaggle.com/code/dandrandandran2093/machine-learning-clustering-models)** - Everything pre-configured!\n\n### Option 2: Local\n\n```bash\n# Clone repository\ngit clone https://github.com/sekertutku/Machine-Learning---Clustering-Models.git\ncd Machine-Learning---Clustering-Models\n\n# Install dependencies\npip install -r requirements.txt\n\n# Run notebook\njupyter notebook machine-learning-clustering-models.ipynb\n```\n\n## Usage\n\n### Quick Start\n\n```python\nimport numpy as np\nimport pandas as pd\nfrom sklearn.cluster import KMeans, AgglomerativeClustering\nfrom scipy.cluster.hierarchy import dendrogram, linkage\nimport matplotlib.pyplot as plt\n\n# Generate data\nx = np.concatenate([np.random.normal(25, 5, 1000), \n                    
np.random.normal(55, 5, 1000),\n                    np.random.normal(55, 5, 1000)])\ny = np.concatenate([np.random.normal(25, 5, 1000),\n                    np.random.normal(60, 5, 1000),\n                    np.random.normal(15, 5, 1000)])\ndata = pd.DataFrame({\"x\": x, \"y\": y})\n\n# K-Means (n_init set explicitly so results match across scikit-learn versions)\nkmeans = KMeans(n_clusters=3, n_init=10, random_state=42)\nclusters = kmeans.fit_predict(data)\nprint(f\"Centroids:\\n{kmeans.cluster_centers_}\")\n\n# Hierarchical\nhierarchical = AgglomerativeClustering(n_clusters=3, linkage='ward')\nh_clusters = hierarchical.fit_predict(data)\n\n# Dendrogram\nlinkage_matrix = linkage(data, method='ward')\ndendrogram(linkage_matrix)\nplt.show()\n```\n\n### Elbow Method\n\n```python\n# Find optimal K\nwcss = []\nfor k in range(1, 15):\n    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)\n    kmeans.fit(data)\n    wcss.append(kmeans.inertia_)\n\n# Plot\nplt.plot(range(1, 15), wcss, marker='o')\nplt.xlabel('K')\nplt.ylabel('WCSS')\nplt.title('Elbow Method')\nplt.show()\n```\n\n## Algorithm Comparison\n\n### Performance Summary\n\n| Feature | K-Means | Hierarchical |\n|---------|---------|--------------|\n| **Speed** | ⚡ Fast | 🐢 Slow |\n| **Dataset Size** | Large (3,000 points) | Small (300 points) |\n| **K Selection** | Must specify (Elbow Method) | From dendrogram |\n| **Scalability** | ✅ 10,000+ points | ⚠️ \u003c 5,000 points |\n| **Visualization** | Centroids | Dendrogram tree |\n| **Complexity** | O(n×K×iterations) | O(n³) |\n| **Cluster Shape** | Spherical | Depends on linkage (Ward favors compact clusters) |\n\n### When to Use\n\n**K-Means:**\n- ✅ Large datasets (10,000+ points)\n- ✅ Speed is critical\n- ✅ Production systems\n- ✅ Spherical clusters expected\n\n**Hierarchical:**\n- ✅ Unknown number of clusters\n- ✅ Small/medium datasets (\u003c 5,000 points)\n- ✅ Need to visualize hierarchy\n- ✅ Exploratory analysis\n\n## Key Insights\n\n**✅ Both Algorithms Succeeded:**\n- K-Means: 3,000 points processed efficiently\n- Hierarchical: 300 points with clear dendrogram\n- Elbow Method 
confirmed K=3\n- Dendrogram showed 3-cluster structure\n\n**📊 Best Practices:**\n- Use Elbow Method for K-Means optimization\n- Use Dendrogram for Hierarchical K selection\n- Scale features before clustering\n- Start with K-Means for large data\n- Use Hierarchical for exploratory analysis\n\n**⚠️ Common Pitfalls:**\n- Using Hierarchical on large datasets (too slow!)\n- Not scaling features (distance-based algorithms need it)\n- Choosing K randomly (use Elbow/Dendrogram)\n- Ignoring domain knowledge\n\n## Requirements\n\n```\nnumpy\u003e=1.24.0\npandas\u003e=2.0.0\nscikit-learn\u003e=1.3.0\nmatplotlib\u003e=3.7.0\nplotly\u003e=5.15.0\nscipy\u003e=1.11.0\njupyter\u003e=1.0.0\n```\n\n## Contributing\n\nContributions welcome! Please open an issue first to discuss major changes.\n\n**Ideas:**\n- Add DBSCAN algorithm\n- Implement Silhouette Score\n- Add real-world datasets\n- Create interactive Plotly visualizations\n\n## License\n\nApache License 2.0 - see LICENSE file for details.\n\n## References\n\n### Course\n- **Udemy**: MACHINE LEARNING by DATAI TEAM\n\n### Documentation\n- [Scikit-learn Clustering](https://scikit-learn.org/stable/modules/clustering.html)\n- [K-Means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)\n- [Hierarchical Clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)\n- [SciPy Dendrogram](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html)\n\n**My Machine Learning Series:**\n\n- 🔍 **Clustering Models** - [[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-clustering-models) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Clustering-Models) *(Current)*\n\n- 🚀 **Advanced Topics** - [[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-advanced-topics) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Advanced-Topics)\n\n- 🎯 **Classification Models** - 
[[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-classifications-models) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Classifications-Models)\n\n- 📈 **Regression Models** - [[Kaggle]](https://www.kaggle.com/code/dandrandandran2093/machine-learning-regression-models) [[GitHub]](https://github.com/tutkufurkan/Machine-Learning---Regression-Models)\n\n## Acknowledgments\n\n- DATAI TEAM for the machine learning course\n- Scikit-learn and SciPy developers\n- Open-source community\n\n---\n\n## 📞 Connect\n\n- Open an issue for questions\n- Connect on [Kaggle](https://www.kaggle.com/dandrandandran2093)\n- Visit [tutkufurkan.com](https://www.tutkufurkan.com/)\n- Star ⭐ if helpful!\n\n---\n\n**Happy Clustering! 🎯🔍**\n\n🌐 [tutkufurkan.com](https://www.tutkufurkan.com/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftutkufurkan%2Fmachine-learning---clustering-models","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftutkufurkan%2Fmachine-learning---clustering-models","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftutkufurkan%2Fmachine-learning---clustering-models/lists"}