Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/seofernando25/dbscan_go
Implementation of a concurrent DBSCAN algorithm
- Host: GitHub
- URL: https://github.com/seofernando25/dbscan_go
- Owner: seofernando25
- Created: 2022-03-09T23:51:11.000Z (almost 3 years ago)
- Default Branch: master
- Last Pushed: 2022-08-12T03:17:46.000Z (over 2 years ago)
- Last Synced: 2024-10-23T05:18:24.966Z (3 months ago)
- Language: Jupyter Notebook
- Size: 8.65 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Concurrent DBSCAN GO
This is a simple implementation of the [DBSCAN algorithm](https://en.wikipedia.org/wiki/DBSCAN) that uses binary space partitioning to speed up the neighborhood search.

[![asciicast](https://asciinema.org/a/VO2BaDVczne5g3Tb2fss30Uqn.svg)](https://asciinema.org/a/VO2BaDVczne5g3Tb2fss30Uqn)
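For reference, the core of DBSCAN can be sketched in a few lines of single-threaded Go. This is the textbook algorithm rather than this repository's concurrent code, and the names (`Labels`, the `neighbors` callback) are illustrative; in this project the neighborhood query is the part answered by the BSP tree described below.

```go
package dbscan

const (
	unvisited = 0
	noise     = -1
)

// Labels assigns each of the n points a cluster id (1..k) or noise (-1).
// neighbors(i) must return the indices of every point within eps of
// point i (here that query would be answered by the BSP tree).
func Labels(n, minPts int, neighbors func(i int) []int) []int {
	labels := make([]int, n) // all points start unvisited (0)
	cluster := 0
	for i := 0; i < n; i++ {
		if labels[i] != unvisited {
			continue
		}
		nbs := neighbors(i)
		if len(nbs) < minPts {
			labels[i] = noise // may become a border point of a later cluster
			continue
		}
		// i is a core point: start a new cluster and expand it.
		cluster++
		labels[i] = cluster
		queue := append([]int(nil), nbs...)
		for len(queue) > 0 {
			j := queue[0]
			queue = queue[1:]
			if labels[j] == noise {
				labels[j] = cluster // noise reachable from a core point becomes a border point
			}
			if labels[j] != unvisited {
				continue
			}
			labels[j] = cluster
			if jn := neighbors(j); len(jn) >= minPts {
				queue = append(queue, jn...) // j is also a core point: keep expanding
			}
		}
	}
	return labels
}
```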
## How to run
Usage: `./dbscan [input file] [epsilon] [minPts] [maxJobSize] [threadN]`

The arguments default to:

`./dbscan data.csv 0.0003 5 1000 [number of CPU cores on your computer]`

## Visualizing the results
The program will output 2 files called `clusters.csv` and `points.csv`.
You can use a tool such as [Google My Maps](https://www.google.com/maps/d/u/0/) or the included `visualize.ipynb` notebook to visualize the clusters (requires `jupyter`, `python`, `pandas`, and `plotly`).

## About the space partitioning
The program uses binary space partitioning to speed up neighborhood search by dividing the data into a grid of cells. This is similar to a [QuadTree](https://en.wikipedia.org/wiki/Quadtree) but with only 2 children per node.
![rectangle](assets/rectangles.jpg)
This allows us to query all neighbors in `O(n log n)` time instead of `O(n^2)` time.
![BSP DEMO](assets/bsp.png)
Example demonstrating the use of binary space partitioning.
We don't need to check all the points; we only need to check the points inside certain rectangles, as the sketch below illustrates.
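As a rough illustration of that pruning (hypothetical types and names, not this repository's actual API), a range query can skip any sub-tree whose rectangle lies farther than `eps` from the query point:

```go
package bsp

import "math"

// Point is a 2D point; Node is one rectangle of the partition.
type Point struct{ X, Y float64 }

type Node struct {
	Min, Max    Point  // bounding box of this node
	Value       *Point // stored point, if any
	Left, Right *Node  // nil until the node is subdivided
}

// Neighbors appends every stored point within eps of q to out.
func (n *Node) Neighbors(q Point, eps float64, out []Point) []Point {
	if n == nil || n.boxDist(q) > eps {
		return out // the whole rectangle is out of range: prune it
	}
	if n.Value != nil && dist(*n.Value, q) <= eps {
		out = append(out, *n.Value)
	}
	out = n.Left.Neighbors(q, eps, out)
	return n.Right.Neighbors(q, eps, out)
}

// boxDist is the distance from q to this node's rectangle
// (zero when q lies inside it).
func (n *Node) boxDist(q Point) float64 {
	dx := math.Max(0, math.Max(n.Min.X-q.X, q.X-n.Max.X))
	dy := math.Max(0, math.Max(n.Min.Y-q.Y, q.Y-n.Max.Y))
	return math.Hypot(dx, dy)
}

func dist(a, b Point) float64 { return math.Hypot(a.X-b.X, a.Y-b.Y) }
```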
### Building the tree
- Given a set of points, compute the bounding box of the points (that bounding box becomes the root of the tree)
- For each point we add:
  - If the node is empty and has not been subdivided, set the value of the node to the point.
  - Else, if the node has not been subdivided:
    - Subdivide the node into 2 sub-trees, using the width/height ratio of its bounding box to decide whether to split vertically or horizontally.
    - Add both the node's existing point and the new point to the correct sub-trees (notice the recursive calls).
  - Else, the node has already been subdivided, so add the point to the correct sub-tree (notice the recursive call for tree traversal; see the sketch after this list).
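A minimal sketch of those insertion rules, continuing the hypothetical `bsp` package from the previous snippet (again illustrative, not this repository's actual code):

```go
package bsp // continues the sketch above, reusing Point and Node

// Insert adds p to the tree rooted at n.
func (n *Node) Insert(p Point) {
	switch {
	case n.Left == nil && n.Value == nil:
		// Empty, unsubdivided node: store the point here.
		n.Value = &p
	case n.Left == nil:
		// Occupied leaf: subdivide, then reinsert the existing point
		// and the new one (the recursive calls mentioned above).
		// Exact duplicate points would recurse forever in this sketch.
		n.subdivide()
		old := *n.Value
		n.Value = nil
		n.child(old).Insert(old)
		n.child(p).Insert(p)
	default:
		// Already subdivided: descend into the matching sub-tree.
		n.child(p).Insert(p)
	}
}

// subdivide halves the bounding box along its longer side, so wide
// boxes split vertically and tall boxes split horizontally.
func (n *Node) subdivide() {
	if n.Max.X-n.Min.X >= n.Max.Y-n.Min.Y {
		mid := (n.Min.X + n.Max.X) / 2
		n.Left = &Node{Min: n.Min, Max: Point{mid, n.Max.Y}}
		n.Right = &Node{Min: Point{mid, n.Min.Y}, Max: n.Max}
	} else {
		mid := (n.Min.Y + n.Max.Y) / 2
		n.Left = &Node{Min: n.Min, Max: Point{n.Max.X, mid}}
		n.Right = &Node{Min: Point{n.Min.X, mid}, Max: n.Max}
	}
}

// child picks the sub-tree whose half of the box contains p.
func (n *Node) child(p Point) *Node {
	if n.Left.Max.X < n.Max.X { // split was vertical: compare X
		if p.X < n.Left.Max.X {
			return n.Left
		}
	} else if p.Y < n.Left.Max.Y { // split was horizontal: compare Y
		return n.Left
	}
	return n.Right
}
```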
## About the algorithm
- Load the data
- Transform the data into a BSP tree
- Create N worker threads to process clusters in parallel
- Given a maximum job size, divide the data into jobs until every job satisfies the size constraint
- The N workers will process X jobs until there are no more jobs to process
- In the meantime, the main thread will collect the results from the workers
- After all the workers have finished, the main thread will merge the results (single-threaded; it doesn't take advantage of the tree structure; a worker-pool sketch follows this list)
- The program will output the clusters and points to two CSV files
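A minimal sketch of how such a worker pool might look in Go, with hypothetical `Job` and `clusterJob` stand-ins for the repository's actual types:

```go
package main

import (
	"fmt"
	"sync"
)

type Point struct{ X, Y float64 }

// Job is one chunk of at most maxJobSize points carved out of the tree.
type Job struct{ Points []Point }

// clusterJob stands in for running DBSCAN on one job's points and
// returning a cluster label per point.
func clusterJob(j Job) []int { return make([]int, len(j.Points)) }

// runWorkers fans the jobs out to threadN goroutines while the caller
// (the main thread) collects the partial results as they arrive.
func runWorkers(jobs []Job, threadN int) [][]int {
	jobCh := make(chan Job)
	results := make(chan []int)

	var wg sync.WaitGroup
	for w := 0; w < threadN; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range jobCh {
				results <- clusterJob(job)
			}
		}()
	}

	go func() {
		for _, j := range jobs {
			jobCh <- j
		}
		close(jobCh) // no more jobs: workers drain and exit
		wg.Wait()    // then the results channel can be closed
		close(results)
	}()

	var partial [][]int // partial clusterings, still to be merged
	for r := range results {
		partial = append(partial, r)
	}
	return partial
}

func main() {
	jobs := []Job{{Points: []Point{{0, 0}, {0.0001, 0.0001}}}}
	fmt.Println(len(runWorkers(jobs, 4)), "partial results collected")
}
```

The partial clusterings returned here still have to be reconciled wherever a cluster straddles a job boundary, which is the sequential merge step noted above.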
## Results
![vis](assets/vis.png)
With the settings:
```sh
Input file: ./data.csv
Epsilon: 0.0003
MinPts: 5
MaxJobSize: 5000
ThreadN: 12
```

I managed to find ~600 clusters in 15 seconds.
The merge process takes a lot longer, about 30 seconds, but it's not part of the assessment. Merging lowered the count to ~500 clusters.

Lowering the max job size improves performance significantly at the cost of precision. With a maxJobSize of 1000 I managed to find ~1000 clusters in about 2 seconds, a huge improvement, although merging then took 40 seconds, 10 seconds slower than before.

The algorithm is by no means perfect, but it improves clustering speed significantly compared to the region-based approach. Unfortunately, the merge process is a bit buggy and not parallelized.