https://github.com/georgesittas/similarity-search
Implementation and survey of similarity search methods that rely on dimensionality reduction (e.g. LSH), D-dimensional vector clustering
https://github.com/georgesittas/similarity-search
clustering k-means-clustering k-nearest-neighbours lsh randomized-projection similarity-search
Last synced: 6 months ago
JSON representation
Implementation and survey of similarity search methods that rely on dimensionality reduction (e.g. LSH), D-dimensional vector clustering
- Host: GitHub
- URL: https://github.com/georgesittas/similarity-search
- Owner: georgesittas
- Created: 2022-08-14T15:19:24.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-03-09T02:47:38.000Z (over 2 years ago)
- Last Synced: 2025-03-24T21:05:10.473Z (7 months ago)
- Topics: clustering, k-means-clustering, k-nearest-neighbours, lsh, randomized-projection, similarity-search
- Language: C++
- Homepage:
- Size: 2.33 MB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Similarity Search
This project contains implementations for the following dimensionality reduction methods and their application to
the domains of similarity search (i.e. approximate k-Nearest Neighbors) and clustering:- [Locality Sensitive Hashing (stable distributions)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing)
- [Randomized Projections into Hamming Hypercubes](https://www.researchgate.net/publication/311842520_Practical_linear-space_Approximate_Near_Neighbors_in_high_dimension)```
Clustering through reverse assignment using range queries
---------------------------------------------------------1. Index all data points of interest in either data structure (LSH hash tables or a single Hypercube hash table)
2. Initialize all cluster centroids
3. At each iteration, for each centroid execute a ranged query around it and assign all neighboring points to it
4. The initial radii can be computed as min(dist between centers)/2 and after each iteration they are doubled
5. Repeat the above steps until the algorithm converges to a fix point, or certain criteria are met
6. If there are any unassigned points, we can assign them using the classic K-Means algorithm
7. Output computed clusters
```For clustering, we used the initialization scheme of [K-Means++](https://en.wikipedia.org/wiki/K-means%2B%2B) and
we implemented the classic [Lloyd's algorithm](https://en.wikipedia.org/wiki/K-means_clustering), comparing it to
the above methods based on the well-known [Silhouette](https://en.wikipedia.org/wiki/Silhouette_(clustering)) metric.
See the [project documentation, benchmark results](https://github.com/GeorgeSittas/Similarity-Search/blob/main/report.pdf)
and the related [bibliography](https://github.com/GeorgeSittas/Similarity-Search/blob/main/details.pdf).## Contributors
• [George Sittas (Jo)](https://github.com/GeorgeSittas)\
• [Dimitra Kousta (Demesta)](https://github.com/Demesta)