Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/niloth-p/phylogenetic-tree-construction-using-agglomerative-and-divisive-hierarchical-clustering
The prog is written to construct the phylogenetic tree (dendrogram) based on DNA/Protein sequences of species given in a dataset using Agglomerative and Divisive Hierarchical Clustering and to compare Agglomerative and Divisive methods
https://github.com/niloth-p/phylogenetic-tree-construction-using-agglomerative-and-divisive-hierarchical-clustering
agglomerative-clustering dendrogram divisive-clustering hierarchical-clustering min-linkage needleman-wunsch-algorithm phylogenetic-tree proximity-matrix
Last synced: about 1 month ago
JSON representation
The prog is written to construct the phylogenetic tree (dendrogram) based on DNA/Protein sequences of species given in a dataset using Agglomerative and Divisive Hierarchical Clustering and to compare Agglomerative and Divisive methods
- Host: GitHub
- URL: https://github.com/niloth-p/phylogenetic-tree-construction-using-agglomerative-and-divisive-hierarchical-clustering
- Owner: Niloth-p
- License: mit
- Created: 2018-04-30T09:46:48.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-08-08T22:25:47.000Z (over 4 years ago)
- Last Synced: 2023-10-04T12:45:00.842Z (about 1 year ago)
- Topics: agglomerative-clustering, dendrogram, divisive-clustering, hierarchical-clustering, min-linkage, needleman-wunsch-algorithm, phylogenetic-tree, proximity-matrix
- Language: Python
- Size: 124 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Phylogenetic Tree using Agglomerative and Divisive Hierarchical clustering
## Description:
The prog is written to construct the phylogenetic tree (dendrogram) based on DNA/Protein sequences of species
given in a dataset using Agglomerative and Divisive Hierarchical Clustering and to compare Agglomerative and Divisive
methods. I have used min linkage to calculate the proximities.
## Dataset used:
Human Gene DNA Sequences
http://genome.crg.es/datasets/ggalhsapgenes2005/hg16.311.putative.cds.fa## How to run:
python Clusterer.pyFor using some other dataset, change the global variable 'filename'
with the name of your dataset.
And change the global variable 'first_time' to create and store the pickle file.## PseudoCode/Explanation of the algo:
### Read data, preprocessing
(also store the data in an ordered dictionary,
so that a list of exclusive keys can be obtained - to get
all pairs of pts for calculating proximity matrix)
### Needleman Wunsch algo - for distance calculation
### Calc Z - proximity matrix, dump pickle### Clustering algo - Agglomerative
Set numOfClusters = N
Set clusters = list of keys
Loop function call to clustering method until numOfClusters = 1
Choose min value in Z -> get i and j indices
Append to matrix Z
#1 - Cluster A
#2 - CLuster B
#3 - Proximity value
#4 - numOfpts in new cluster formed i.e Total pts in clusters A & B
Append 1 to 4 to a list, and append that list to Z
Update numOfClusters -> --1
Update Clusters -> remove A or B
Update Proximity Matrix by Ward's method using Lance Williams Formula
Draw dendrogram### Divisive clustering algo
Set numOfClusters = 1
Set clusters = all points as 1 cluster
Loop function call to clustering method until numOfClusters = N
Choose the cluster to split
Get the max dissimilarity for each cluster i.e max distance between 2 points in cluster
Find the cluster with max dissimilarity - this is the cluster to split first
Find ele to form the new cluster i.e find the elemenet in the cluster
whose sum of distances to all other points in the cluster is maximum
Remove the ele to new cluster A from old cluster B
Split eles in old original cluster B between the 2 clusters A and B
Find ele with max(avdisttoB - avdisttoA)
max - the max value
avdist - average distance (sum of distances / num of points)
If avdisttoB < avdisttoA, skip to next step
Add the ele to cluster A and remove from cluster B
Keep finding eles 1 by 1 and adding them to cluster A, and removing from cluster B
Do it 1 by 1, bcoz the value of avdist will vary with addition/removal of each point
Update clusters by updating the old cluster B, and adding new cluster A
Update numOfClusters += 1
Add [clustA, clustB] to a temp list which will be needed for computing linkage matrix
Get the linkage matrix X such that
#1 - num
#2 - num + 1
#3 - Proximity value
#4 - [len(clustA), len(clustB)]
Get #4 from the temp list after reversing its order
Draw dendrogram