https://github.com/michaelb/point-clustering
Regroup points in a nth-dimension space if they are closer than a certain distance
clustering-algorithm dimensions
- Host: GitHub
- URL: https://github.com/michaelb/point-clustering
- Owner: michaelb
- Created: 2020-03-10T18:58:20.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-05-10T10:31:35.000Z (about 6 years ago)
- Last Synced: 2026-03-01T13:46:05.248Z (4 months ago)
- Topics: clustering-algorithm, dimensions
- Language: Python
- Homepage:
- Size: 20.4 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Project Algo
===========
Authors: Michael Bleuez and a friend who may want to remain anonymous
--------------------------------------------
**Goal**:
The project aims to find the sizes of the "clusters" within a set of points.
(A cluster is a connected component: two points are 'in contact' iff they are within a given distance of each other.)
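
To make the definition concrete, here is a minimal brute-force sketch. It is not the project's code: the function name `cluster_sizes_naive` is hypothetical, "within a given distance" is assumed to mean Euclidean distance `<=` the threshold, and a small union-find is used to group the points.

```python
import math
from collections import Counter


def cluster_sizes_naive(points, distance):
    """Return cluster sizes (largest first) by comparing every pair of points."""
    parent = list(range(len(points)))  # union-find forest: each point starts alone

    def find(i):
        # Walk up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Assumption: two points are 'in contact' when dist <= distance.
            if math.dist(points[i], points[j]) <= distance:
                parent[find(i)] = find(j)

    sizes = Counter(find(i) for i in range(len(points)))
    return sorted(sizes.values(), reverse=True)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
    print(cluster_sizes_naive(pts, 0.6))
```

Note that the first and third points end up in the same cluster even though they are 1.0 apart: they are linked through the middle point, which is what "connected component" implies.
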
**Performance**:
* A performance table (program versions, input format and execution time) is available in math/perfs.ods
* Complexity is roughly __O(n.log(n).a^k)__, with n the number of points, a~2.2 and k the dimension of the input space;
however, actual execution time varies a lot depending on properties of the input:
1. how interlinked the points are (big clusters are detrimental in general), i.e. how large the distance is relative to the number of points
2. randomness of the distribution: uniformly distributed points allow faster resolution, **to a large extent**
* Real-world speed: at this point of the project, our algorithm can process any reasonable (random-like, 2D) input of size 20k in ~0.5 s (i5 4210U 1.7 GHz, SATA SSD).
It is really hard to create a non-random distribution that is truly the *worst* possible, but we have been able to slow the algorithm down to 60 s (still 20k points).
For reference, a 100% naïve algorithm takes up to 8 minutes to solve (any) 20k-sized input.
**Terminology**:
* cluster: a "connected component"; implemented as a class of objects. A cluster object holds references to the points it contains, and each point knows which cluster it belongs to
* quadrillage: a grid that divides the space into cells ("cases")
* points: at creation, each point is given a reference to its own cluster object (containing only that point at first). Merging is done via the merge method of the cluster object
* density: how 'crowded' the space is, relative to the given *distance*. For example, sets of the same density have clusters with the same ratio (size of cluster)/(total number of points)
* a-types: inputs where the points are quite sparse (relative to the given distance); an a-type input will contain only a few tuples and a plethora of singletons
* b-types: inputs that contain too many points relative to the given distance, so the result is usually one extra-large cluster and a few others
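
The terms above can be tied together in a short sketch. This is an illustration under assumptions, not the repository's implementation: the names `Cluster`, `Point` and `cluster_sizes_grid` are hypothetical, the grid cells are assumed to have a side equal to the distance, and "within" is read as Euclidean distance `<=` the threshold.

```python
import itertools
import math
from collections import defaultdict


class Cluster:
    """Hypothetical cluster object: it lists its points, and they point back to it."""

    def __init__(self, point):
        self.points = [point]

    def merge(self, other):
        # Move the smaller cluster's points into the larger one.
        if other is self:
            return self
        if len(self.points) >= len(other.points):
            big, small = self, other
        else:
            big, small = other, self
        for p in small.points:
            p.cluster = big
        big.points.extend(small.points)
        small.points = []
        return big


class Point:
    def __init__(self, coords):
        self.coords = tuple(coords)
        self.cluster = Cluster(self)  # each point starts alone in its own cluster


def cluster_sizes_grid(coords_list, distance):
    """Return cluster sizes (largest first); `distance` must be > 0."""
    points = [Point(c) for c in coords_list]
    if not points:
        return []
    k = len(points[0].coords)
    # Quadrillage: bucket the points into cells of side `distance`, so that
    # two points in contact are always in the same or in adjacent cells.
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(x / distance) for x in p.coords)].append(p)
    offsets = list(itertools.product((-1, 0, 1), repeat=k))  # 3^k neighbours
    for cell, members in grid.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for q in grid.get(neighbour, ()):  # .get() never adds a cell
                for p in members:
                    if (p.cluster is not q.cluster
                            and math.dist(p.coords, q.coords) <= distance):
                        p.cluster.merge(q.cluster)
    clusters = {id(p.cluster): p.cluster for p in points}
    return sorted((len(c.points) for c in clusters.values()), reverse=True)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0)]
    print(cluster_sizes_grid(pts, 0.6))
```

Merging the smaller cluster into the larger one keeps the number of point reassignments low. The skipped distance test for points already in the same cluster is a small saving on dense, b-type-like inputs, although every pair in neighbouring cells is still visited, so this sketch does not remove the slowdown on such inputs.
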