https://github.com/tahiri-lab/phydbscan
phyDBSCAN: Building alternative phylogenetic trees using DBSCAN and Robinson and Foulds distance
- Host: GitHub
- URL: https://github.com/tahiri-lab/phydbscan
- Owner: tahiri-lab
- License: mit
- Created: 2022-10-22T19:42:20.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2025-08-04T18:33:15.000Z (11 months ago)
- Last Synced: 2026-01-06T01:46:39.480Z (6 months ago)
- Topics: bioinformatics, classification, clustering, consensus-tree, dbscan, phylogeny, robinson-foulds, supertree
- Language: C++
- Size: 4.52 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
phyDBSCAN
Building alternative phylogenetic trees using DBSCAN
# About the project
phyDBSCAN clusters phylogenetic trees with the DBSCAN algorithm. Instead of point coordinates,
it takes a matrix of pairwise distances between the trees (the Robinson and Foulds distance) as input.
To find out more about the project, the ideas for improvement, the difficulties encountered and
the changes still to be made, please read the attached "phyDBSCAN_Project_Report.pdf".
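To illustrate how DBSCAN can run on distances alone, here is a minimal sketch that works on a precomputed distance matrix. It is not the code used by phyDBSCAN: the names `Matrix`, `regionQuery` and `dbscanPrecomputed` are hypothetical, and the sketch assumes that a point counts itself toward the minimum number of points and that `<=` is used for the epsilon test. Implementations differ on both conventions, so the labels may not match the program's output on every matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical alias

// Indices of all points within eps of point p (p itself is included).
std::vector<std::size_t> regionQuery(const Matrix& dist, std::size_t p, double eps) {
    std::vector<std::size_t> out;
    for (std::size_t q = 0; q < dist.size(); ++q)
        if (dist[p][q] <= eps) out.push_back(q);
    return out;
}

// Returns one label per point: 1, 2, ... for clusters and 0 for noise.
std::vector<int> dbscanPrecomputed(const Matrix& dist, double eps, std::size_t minPts) {
    const std::size_t n = dist.size();
    for (const auto& row : dist) assert(row.size() == n);  // matrix must be square

    const int UNVISITED = -1, NOISE = 0;
    std::vector<int> label(n, UNVISITED);
    int cluster = 0;

    for (std::size_t p = 0; p < n; ++p) {
        if (label[p] != UNVISITED) continue;
        std::vector<std::size_t> nb = regionQuery(dist, p, eps);
        if (nb.size() < minPts) { label[p] = NOISE; continue; }  // not a core point

        ++cluster;                       // p is a core point: start a new cluster
        label[p] = cluster;
        std::queue<std::size_t> frontier;
        for (std::size_t q : nb) frontier.push(q);

        while (!frontier.empty()) {
            std::size_t q = frontier.front();
            frontier.pop();
            if (label[q] == NOISE) label[q] = cluster;  // noise becomes a border point
            if (label[q] != UNVISITED) continue;        // already assigned
            label[q] = cluster;
            std::vector<std::size_t> nbq = regionQuery(dist, q, eps);
            if (nbq.size() >= minPts)                   // q is a core point: keep expanding
                for (std::size_t r : nbq) frontier.push(r);
        }
    }
    return label;
}
```

Because only `dist[p][q]` is ever read, any tree-to-tree distance can be plugged in without embedding the trees in a coordinate space.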
# Installation
Place your distance matrix in the "resources/input_data.txt" file, then build the project using one of the two methods below.
### Using Makefile:
Use the provided Makefile to build the project:
```
make
```
To run the project, execute:
```
./phyDBSCAN input.txt output.csv
```
To clean the project, execute:
```
make clean
```
### Using CMakeLists:
Alternatively, if you are using the CLion IDE, you can build the project with CMake. Here are the steps:
1. Launch CLion and open the project
2. Go to Run -> Edit Configurations
3. Click on the "+" button and select "CMake"
4. In the "Name" field, enter "phyDBSCAN" and fill in the remaining fields as shown in the following image:

   [Screenshot of the CLion configuration dialog]
5. Click on "Apply" and "OK", then run the project
# Examples of use
To test the program, we took a matrix from the "resources/input_simulation_dataset.txt" file and copied it into "resources/input_data.txt".
The input data set (a distance matrix) used in this example is:
```
0 0.4 0.4 0.4 0.4 1 1 1 1 1 0.8 1 1 1 1 0.8 0.8 0.6 0.8 0.8
0.4 0 0.4 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1 0.8 1 1 0.8 0.8 0.8 0.8 0.8 0.8
0.4 0.4 0 0.8 0.8 1 1 1 1 1 1 1 0.8 0.8 1 0.8 0.8 0.8 0.8 0.8
0.4 0.8 0.8 0 0.6 1 1 1 1 1 0.8 1 1 1 1 0.6 0.6 0.4 0.6 0.6
0.4 0.8 0.8 0.6 0 1 1 1 1 1 0.6 0.8 0.8 0.8 0.8 1 1 0.8 1 1
1 0.8 1 1 1 0 0.4 0.4 0.4 0.4 1 0.8 1 1 0.8 1 1 0.8 1 1
1 0.8 1 1 1 0.4 0 0.6 0.4 0.6 1 0.6 1 1 0.6 1 1 0.8 1 1
1 0.8 1 1 1 0.4 0.6 0 0.6 0.6 1 0.8 1 1 0.8 1 1 0.8 1 1
1 0.8 1 1 1 0.4 0.4 0.6 0 0.6 1 0.8 1 1 0.8 1 1 0.8 1 1
1 0.8 1 1 1 0.4 0.6 0.6 0.6 0 1 0.8 1 1 0.8 1 1 0.8 1 1
0.8 1 1 0.8 0.6 1 1 1 1 1 0 0.4 0.4 0.4 0.4 1 1 0.8 1 1
1 0.8 1 1 0.8 0.8 0.6 0.8 0.8 0.8 0.4 0 0.4 0.4 0 1 1 1 1 1
1 1 0.8 1 0.8 1 1 1 1 1 0.4 0.4 0 0 0.4 1 1 1 1 1
1 1 0.8 1 0.8 1 1 1 1 1 0.4 0.4 0 0 0.4 1 1 1 1 1
1 0.8 1 1 0.8 0.8 0.6 0.8 0.8 0.8 0.4 0 0.4 0.4 0 1 1 1 1 1
0.8 0.8 0.8 0.6 1 1 1 1 1 1 1 1 1 1 1 0 0.4 0.4 0.4 0.4
0.8 0.8 0.8 0.6 1 1 1 1 1 1 1 1 1 1 1 0.4 0 0.4 0.6 0.6
0.6 0.8 0.8 0.4 0.8 0.8 0.8 0.8 0.8 0.8 0.8 1 1 1 1 0.4 0.4 0 0.6 0.6
0.8 0.8 0.8 0.6 1 1 1 1 1 1 1 1 1 1 1 0.4 0.6 0.6 0 0.6
0.8 0.8 0.8 0.6 1 1 1 1 1 1 1 1 1 1 1 0.4 0.6 0.6 0.6 0
```
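A matrix in this format can be loaded and sanity-checked before clustering. The sketch below is an illustration only and is not taken from the phyDBSCAN sources: `readDistanceMatrix` and `isValidDistanceMatrix` are hypothetical helpers, and the reader assumes the input contains nothing but whitespace-separated matrix rows (no header line).

```cpp
#include <cmath>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical alias

// Reads one matrix row per non-empty line; throws if the result is not square.
Matrix readDistanceMatrix(std::istream& in) {
    Matrix m;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::vector<double> row;
        double value;
        while (fields >> value) row.push_back(value);
        if (!row.empty()) m.push_back(row);
    }
    for (const auto& row : m)
        if (row.size() != m.size())
            throw std::runtime_error("distance matrix is not square");
    return m;
}

// Checks for a zero diagonal and symmetry, up to a small tolerance.
bool isValidDistanceMatrix(const Matrix& m, double tol = 1e-9) {
    for (std::size_t i = 0; i < m.size(); ++i) {
        if (std::fabs(m[i][i]) > tol) return false;
        for (std::size_t j = i + 1; j < m.size(); ++j)
            if (std::fabs(m[i][j] - m[j][i]) > tol) return false;
    }
    return true;
}
```

To use it on a file, pass a `std::ifstream` opened on the matrix file to `readDistanceMatrix`.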
In the "input_simulated_data.txt" file, the first line of this dataset is:
20 8 4 0 50
The first number (20) is the number of points (trees) in the dataset. The third number (4) is the expected number of clusters, which is used to calculate the ARI (Adjusted Rand Index).
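The ARI measures how well two partitions of the same items agree, corrected for chance: it equals 1 when the partitions are identical up to a renaming of the clusters. The sketch below computes it from the contingency table of two label vectors using the standard formula. It is an illustration, not phyDBSCAN's own implementation: `adjustedRandIndex` and `choose2` are hypothetical names, and returning 1 for degenerate inputs is an assumed convention.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Number of unordered pairs among k items.
static double choose2(double k) { return k * (k - 1.0) / 2.0; }

// Adjusted Rand Index between two labelings of the same items.
double adjustedRandIndex(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    const double n = static_cast<double>(a.size());
    if (n < 2) return 1.0;  // assumed convention for trivial input

    std::map<std::pair<int, int>, int> cell;  // contingency table n_ij
    std::map<int, int> rowSum, colSum;        // cluster sizes in a and in b
    for (std::size_t i = 0; i < a.size(); ++i) {
        ++cell[{a[i], b[i]}];
        ++rowSum[a[i]];
        ++colSum[b[i]];
    }

    double sumCell = 0.0, sumRow = 0.0, sumCol = 0.0;
    for (const auto& kv : cell) sumCell += choose2(kv.second);
    for (const auto& kv : rowSum) sumRow += choose2(kv.second);
    for (const auto& kv : colSum) sumCol += choose2(kv.second);

    const double expected = sumRow * sumCol / choose2(n);  // agreement expected by chance
    const double maxIndex = 0.5 * (sumRow + sumCol);       // largest possible agreement
    if (maxIndex == expected) return 1.0;  // both partitions trivial (assumed convention)
    return (sumCell - expected) / (maxIndex - expected);
}
```

Only the grouping matters, not the label values, so a found partition can be compared directly with a reference partition that uses different cluster numbers.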
The output of the program is stored in the output.csv file as follows:
```
DBSCAN;0.490000;3;20;8;4;50;1.000000;(1<>1<>1<>1<>1<>2<>2<>2<>2<>2<>3<>3<>3<>3<>3<>4<>4<>4<>4<>4);462
```
- DBSCAN: method used for the clustering
- 0.490000: value of epsilon
- 3: minimum number of points
- 20: number of trees in the matrix
- 8: number of leaves in each tree
- 4: number of clusters we expect to find
- 50: noise (differences between the trees within a cluster)
- 1.000000: ARI
- (1<>1<>...<>4): partition, that is, the cluster number assigned to each tree, separated by "<>"
- 462: time the program took to calculate the clusters and the ARI for the matrix
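For downstream processing, a result line in this layout can be split back into its fields. The following sketch assumes exactly the ten ";"-separated fields described above, with the partition wrapped in parentheses and its labels separated by "<>". The `ResultLine` struct and the `split` and `parseResultLine` helpers are hypothetical and not part of phyDBSCAN; the time field is kept as a plain number because its unit is not stated here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct ResultLine {  // hypothetical container for one output line
    std::string method;
    double epsilon = 0.0;
    int minPoints = 0, numTrees = 0, numLeaves = 0, expectedClusters = 0;
    double noise = 0.0;
    double ari = 0.0;
    std::vector<int> partition;  // cluster label of each tree, in input order
    long elapsed = 0;            // unit as written by the program
};

// Splits s on every occurrence of sep (empty pieces are kept).
std::vector<std::string> split(const std::string& s, const std::string& sep) {
    std::vector<std::string> parts;
    std::size_t start = 0, pos;
    while ((pos = s.find(sep, start)) != std::string::npos) {
        parts.push_back(s.substr(start, pos - start));
        start = pos + sep.size();
    }
    parts.push_back(s.substr(start));
    return parts;
}

ResultLine parseResultLine(const std::string& line) {
    const std::vector<std::string> f = split(line, ";");
    if (f.size() != 10) throw std::runtime_error("expected 10 ';'-separated fields");

    ResultLine r;
    r.method = f[0];
    r.epsilon = std::stod(f[1]);
    r.minPoints = std::stoi(f[2]);
    r.numTrees = std::stoi(f[3]);
    r.numLeaves = std::stoi(f[4]);
    r.expectedClusters = std::stoi(f[5]);
    r.noise = std::stod(f[6]);
    r.ari = std::stod(f[7]);

    std::string p = f[8];  // strip the surrounding parentheses, if present
    if (p.size() >= 2 && p.front() == '(' && p.back() == ')') p = p.substr(1, p.size() - 2);
    for (const std::string& tok : split(p, "<>"))
        if (!tok.empty()) r.partition.push_back(std::stoi(tok));

    r.elapsed = std::stol(f[9]);
    return r;
}
```

Applied to the example line above, `partition` holds 20 labels, one per tree.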
# Contact
Please email us with any questions or feedback.