https://github.com/gatenlp/cluster-embeddings
Simple script to create clusters from embeddings in word2vec format
- Host: GitHub
- URL: https://github.com/gatenlp/cluster-embeddings
- Owner: GateNLP
- Created: 2017-09-01T15:37:19.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-10-18T10:30:25.000Z (over 8 years ago)
- Last Synced: 2025-05-06T13:45:32.613Z (about 1 year ago)
- Language: Python
- Size: 5.86 KB
- Stars: 10
- Watchers: 13
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Tool for clustering embeddings
So far, this is just a simple script that runs several sklearn clustering algorithms
on embeddings in word2vec format.
Some results on usefulness:
* usefulness for POS tagging: https://github.com/GateNLP/exp-lf-pos
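
For readers unfamiliar with the input, the word2vec text format mentioned above starts with a header line giving the vocabulary size and the vector dimension, followed by one line per token containing the token and its space-separated vector components. The following is a minimal sketch of a reader for that layout; the function name `read_word2vec_text` and the sample data are hypothetical and are not taken from `cluster-embs.py`, which may load embeddings differently.

```python
import io


def read_word2vec_text(stream):
    """Read embeddings in word2vec text format from an open text stream.

    Hypothetical helper for illustration only.
    Returns (words, vectors): a list of tokens and a parallel list of
    float lists, one per token.
    """
    # Header line: "<number of tokens> <vector dimension>"
    header = stream.readline().split()
    n_words, dim = int(header[0]), int(header[1])

    words, vectors = [], []
    for line in stream:
        # Some files (e.g. fastText .vec) end each line with a trailing space.
        line = line.rstrip()
        if not line:
            continue  # ignore blank lines
        parts = line.split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} components, got {len(parts) - 1}")
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

    if len(words) != n_words:
        raise ValueError(f"header announced {n_words} tokens, found {len(words)}")
    return words, vectors


# Tiny in-memory sample: 3 tokens, 2 dimensions (made-up values).
sample = io.StringIO("3 2\nking 0.5 0.1\nqueen 0.4 0.2\napple -0.3 0.9\n")
words, vectors = read_word2vec_text(sample)
print(words)
```

The returned list of vectors can be converted with `numpy.asarray` before being handed to a clustering algorithm.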
## Usage
Get usage information by running:
* `python3 ./python/cluster-embs.py -h`
Currently the following clustering algorithms are supported (using the sklearn back-end). For
some of the algorithms, information about the clusters is written to a file with extension `info.json`:
* MiniBatchKMeans (default): k-means clustering using mini-batch SGD
  Info: `cluster_centers`, `inertia`, `counts`, `n_iter`
* KMeans: k-means clustering
  Info: `cluster_centers`, `inertia`, `n_iter`
* AgglomerativeClusteringWard: agglomerative clustering using Ward linkage
* AgglomerativeClusteringAverageEuclidean: agglomerative clustering using average linkage with Euclidean distance
* Birch
* SpectralClustering: spectral clustering with RBF affinity
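
As a rough illustration of how the algorithm names in the list above could map onto sklearn estimators, here is a minimal sketch. The helper names `make_estimator` and `cluster_info`, the `n_init` and seed values, and the random demo data are assumptions made for this example, not the script's actual code. Older sklearn releases exposed a `counts_` attribute on `MiniBatchKMeans` that recent releases no longer provide, so the sketch derives per-cluster counts from the labels instead, which may differ from what the script writes.

```python
import json

import numpy as np
from sklearn.cluster import (
    AgglomerativeClustering,
    Birch,
    KMeans,
    MiniBatchKMeans,
    SpectralClustering,
)


def make_estimator(name, k, seed=0):
    """Hypothetical mapping from the README's algorithm names to estimators."""
    if name == "MiniBatchKMeans":
        return MiniBatchKMeans(n_clusters=k, n_init=3, random_state=seed)
    if name == "KMeans":
        return KMeans(n_clusters=k, n_init=10, random_state=seed)
    if name == "AgglomerativeClusteringWard":
        return AgglomerativeClustering(n_clusters=k, linkage="ward")
    if name == "AgglomerativeClusteringAverageEuclidean":
        # Average linkage; the default distance is Euclidean.
        return AgglomerativeClustering(n_clusters=k, linkage="average")
    if name == "Birch":
        return Birch(n_clusters=k)
    if name == "SpectralClustering":
        return SpectralClustering(n_clusters=k, affinity="rbf", random_state=seed)
    raise ValueError(f"unknown algorithm: {name}")


def cluster_info(model):
    """Collect JSON-serializable cluster information from a fitted k-means model."""
    info = {}
    if isinstance(model, (KMeans, MiniBatchKMeans)):
        info["cluster_centers"] = model.cluster_centers_.tolist()
        info["inertia"] = float(model.inertia_)
        info["n_iter"] = int(model.n_iter_)
    if isinstance(model, MiniBatchKMeans):
        # Assumption: number of points assigned to each cluster, computed
        # from the labels as a stand-in for the removed `counts_` attribute.
        counts = np.bincount(model.labels_, minlength=model.n_clusters)
        info["counts"] = counts.tolist()
    return info


# Demo on synthetic data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 5)) for c in (-2.0, 0.0, 2.0)])

model = make_estimator("MiniBatchKMeans", k=3).fit(X)
info = cluster_info(model)
print(json.dumps(sorted(info)))

ward = make_estimator("AgglomerativeClusteringWard", k=3).fit(X)
```

The agglomerative, Birch, and spectral estimators have no centroids or inertia, which is consistent with only the two k-means variants listing `Info` fields above.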
For MiniBatchKMeans and KMeans, the 20 elements most similar to each of the k cluster centroids
are stored in a file with the extension `mostsimilar.json`.
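
The idea behind the most-similar lists can be sketched as follows. This example assumes cosine similarity between each centroid and every embedding; the similarity measure actually used by the script is not stated here, and the function name `most_similar_to_centroids` and the toy vectors are hypothetical.

```python
import numpy as np


def most_similar_to_centroids(words, vectors, centroids, top_n=20):
    """For each centroid, return the top_n (word, similarity) pairs.

    Assumes cosine similarity; hypothetical helper for illustration.
    """
    vectors = np.asarray(vectors, dtype=float)
    centroids = np.asarray(centroids, dtype=float)

    # Normalize rows to unit length; the floor avoids division by zero.
    v_norm = np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)
    c_norm = np.maximum(np.linalg.norm(centroids, axis=1, keepdims=True), 1e-12)
    sims = (centroids / c_norm) @ (vectors / v_norm).T  # shape (k, n_words)

    n = min(top_n, len(words))
    result = []
    for row in sims:
        best = np.argsort(-row)[:n]  # indices sorted by descending similarity
        result.append([(words[i], float(row[i])) for i in best])
    return result


# Toy data: four 2-dimensional embeddings and two centroids (made-up values).
toy_words = ["a", "b", "c", "d"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_centroids = [[1.0, 0.0], [0.0, 1.0]]

neighbours = most_similar_to_centroids(toy_words, toy_vectors, toy_centroids, top_n=2)
print([[w for w, _ in group] for group in neighbours])
```

With real embeddings, `centroids` would be the `cluster_centers` of a fitted k-means model and `top_n` would stay at 20.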
## Clustering times
| embeddings | machine | algorithm | k | elapsed time, total (h:mm:ss) | elapsed time, clustering (h:mm:ss) |
|------------|---------|-----------|---|-------------------------------|------------------------------------|
| fasttext wiki.en.vec | derwent | KMeans | 100 | 5:21:16 | ??? |
| fasttext wiki.en.vec | derwent | MiniBatchKMeans | 100 | 0:16:37 | 0:00:38 |
| fasttext wiki.en.vec | derwent | MiniBatchKMeans | 500 | 0:22:18 | 0:02:22 |
| fasttext wiki.bg.vec | derwent | MiniBatchKMeans | 100 | 0:02:27 | 0:00:13 |
| fasttext wiki.bg.vec | derwent | MiniBatchKMeans | 500 | 0:04:42 | 0:01:18 |