Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/saritaphd/hierarchial-clustering
Hierarchical clustering manual with example
- Host: GitHub
- URL: https://github.com/saritaphd/hierarchial-clustering
- Owner: SaritaPhD
- License: mit
- Created: 2023-10-18T09:04:48.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-18T09:16:27.000Z (about 1 year ago)
- Last Synced: 2023-10-18T10:28:45.258Z (about 1 year ago)
- Topics: hierarchical-clustering, python
- Language: Jupyter Notebook
- Homepage:
- Size: 310 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Hierarchial-Clustering
## 1. Data Preparation
Begin with a dataset containing your observations (data points) and the features (variables) that describe them. Ensure your data is in a suitable format for clustering, with numeric values or variables that can be converted to numeric.
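A minimal sketch of this step, assuming a tiny synthetic dataset; the standardization via scikit-learn's `StandardScaler` is one common choice, not something this guide prescribes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A small illustrative dataset: rows are observations, columns are features.
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0],
              [8.0, 8.0],
              [1.0, 0.6],
              [9.0, 11.0]])

# Features on very different scales can dominate the distance computation,
# so standardizing to zero mean and unit variance is a common first step.
X_scaled = StandardScaler().fit_transform(X)
```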
## 2. Distance Metric
Choose a distance metric that measures the dissimilarity between data points. Common distance metrics include:
- Euclidean Distance: Measures the straight-line distance between two points in Euclidean space.
- Manhattan Distance: Measures the sum of the absolute differences between corresponding coordinates of two points.
- Cosine Similarity: Measures the cosine of the angle between two non-zero vectors; for clustering it is typically converted to a cosine distance (1 - similarity).
- Others: You can explore other distance metrics based on your data's characteristics (a quick comparison is sketched below).
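A small sketch comparing these metrics with `scipy.spatial.distance.pdist` on made-up data; note that SciPy's `"cosine"` metric already returns the cosine distance:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [5.0, 8.0]])

# pdist returns the condensed pairwise-distance vector; squareform
# expands it into the full symmetric distance matrix.
for metric in ("euclidean", "cityblock", "cosine"):  # cityblock = Manhattan
    D = squareform(pdist(X, metric=metric))
    print(metric)
    print(np.round(D, 3))
```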
## 3. Linkage Method
The linkage method determines how the distance between clusters is calculated when merging. Common linkage methods include:
- Single Linkage: The distance between two clusters is defined as the minimum distance between any two data points in the clusters. It often results in elongated, chain-like clusters.
- Complete Linkage: The distance between two clusters is defined as the maximum distance between any two data points in the clusters. It tends to create more spherical clusters.
- Average Linkage: The distance is defined as the average of all pairwise distances between data points in the two clusters.
- Ward's Method: Merges the pair of clusters that yields the smallest increase in total within-cluster variance; it tends to produce compact, similarly sized clusters and is less sensitive to outliers than single linkage.
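To see how the linkage choice affects the merge structure, here is a brief sketch using `scipy.cluster.hierarchy.linkage` on made-up data (in SciPy, Ward's method requires Euclidean distances):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Each row of the linkage matrix Z records one merge:
# (cluster a, cluster b, merge distance, size of the new cluster).
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method, metric="euclidean")
    print(method, "-> final merge distance:", round(float(Z[-1, 2]), 3))
```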
## 4. Dendrogram
Perform hierarchical clustering using your chosen distance metric and linkage method. This process creates a dendrogram, a tree-like diagram showing the hierarchy of clusters: each leaf represents a data point, and the branches indicate the merging of clusters.
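A minimal plotting sketch, assuming `matplotlib` is available, that builds the linkage tree and draws the dendrogram with SciPy:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
Z = linkage(X, method="ward")

# Leaves are the original observations; the height of each junction is
# the dissimilarity at which the two clusters below it were merged.
dendrogram(Z, labels=[f"pt{i}" for i in range(len(X))])
plt.ylabel("merge distance")
plt.show()
```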
## 5. Dendrogram Interpretation
Analyze the dendrogram to understand how the clusters are formed. The height at which branches merge corresponds to the dissimilarity between the merged clusters: lower branches represent finer-grained clusters, while higher branches indicate broader ones.
## 6. Cutting the Dendrogram
Decide how many clusters you want by cutting the dendrogram at an appropriate height. The choice of cut-off depends on the problem and your objectives; common heuristics include the elbow method or cutting just below the largest vertical gap between merges.
## 7. Assigning Data Points to Clusters
Assign each data point to a cluster based on the cut-off you chose in the dendrogram. The resulting groups are your final clusters (steps 6 and 7 are sketched together below).
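In SciPy, cutting the tree and assigning labels happen in a single call to `fcluster`; a short sketch (the cut height of 5.0 is arbitrary, chosen only for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])
Z = linkage(X, method="ward")

# Cut by height: every merge above t=5.0 is undone, and the groups
# that remain below the cut become the final clusters.
labels_by_height = fcluster(Z, t=5.0, criterion="distance")

# Alternatively, request a fixed number of clusters directly.
labels_by_count = fcluster(Z, t=2, criterion="maxclust")

print(labels_by_height)
print(labels_by_count)
```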
## Advanced Concepts
- Handling Large Datasets: Hierarchical clustering can be computationally intensive for large datasets (standard agglomerative algorithms need the full pairwise-distance matrix, so memory grows quadratically with the number of points). Consider clustering a representative sample, adding connectivity constraints, or using an optimized implementation.
- Evaluating Cluster Quality: Employ internal and external validation metrics to assess the quality of your clusters. Common metrics include the silhouette score, Davies-Bouldin index, and adjusted Rand index (a silhouette example is sketched after this list).
- Hierarchical Clustering in Python or R: Implement hierarchical clustering using libraries such as scipy or scikit-learn in Python, or hclust in R. These libraries offer functions for hierarchical clustering and visualization.
- Agglomerative vs. Divisive Clustering: While agglomerative clustering starts with individual data points and merges them, divisive clustering begins with a single cluster and divides it into smaller clusters. The choice between these approaches depends on the problem.
- Visualization: Create dendrograms, heatmaps, or other visualizations to better understand your clusters and their relationships.
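As an illustration of the scikit-learn route and of internal validation, a short sketch that fits `AgglomerativeClustering` and scores the result with the silhouette coefficient (the choice of two clusters here is arbitrary, for illustration only):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# scikit-learn's agglomerative implementation; linkage="ward" is used here,
# but "single", "complete", and "average" are also supported.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Silhouette ranges from -1 to 1; higher values indicate better-separated clusters.
print("silhouette:", round(float(silhouette_score(X, labels)), 3))
```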
Hierarchical clustering is a flexible and powerful method for grouping data points, and the choice of distance metric and linkage method can significantly impact the results. Experimentation and a deep understanding of your data and problem domain are key to successful hierarchical clustering.