https://github.com/darenr/wordnet-clusters

Clustering a set of word/tags using K-Means with word2vec or wordnet distance

Host: GitHub
URL: https://github.com/darenr/wordnet-clusters
Owner: darenr
License: mit
Created: 2016-02-27T20:20:45.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2019-03-05T06:08:09.000Z (over 6 years ago)
Last Synced: 2025-05-07T23:38:13.557Z (about 1 month ago)
Topics: clustering, k-means-clustering, k-means-implementation-in-python, tags, word2vec, wordnet-clusters
Language: Python
Homepage:
Size: 216 KB
Stars: 26
Watchers: 2
Forks: 5
Open Issues: 2
Metadata Files:
- Readme: README.md

# Tag Clustering using `wordnet` and `word2vec` distance metrics

This project clusters a set of `wordnet` synsets using `k-means`. The `wordnet` pair-wise distance (semantic relatedness) of word senses is computed with the [edge-counting method of Wu & Palmer (1994)](https://pdfs.semanticscholar.org/6eff/221e1cf5ae28ce7dcb60515d028b98e37aa5.pdf) and then mapped to a Euclidean distance, so that K-means can converge while preserving the original pair-wise relationships.
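As a rough illustration of the Wu & Palmer measure, the sketch below computes `2 * depth(lcs) / (depth(a) + depth(b))` on a small hand-built taxonomy, where `lcs` is the deepest common ancestor of the two nodes. The taxonomy, the function names, and the `1 - similarity` distance mapping are all assumptions made for this example; the project itself works on `wordnet` synsets and may map similarity to distance differently.

```python
# Toy taxonomy (child -> parent). Hypothetical data, not taken from WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "canine": "animal",
    "feline": "animal",
    "dog": "canine",
    "wolf": "canine",
    "cat": "feline",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_of_a)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def to_distance(similarity):
    """Assumed mapping from a similarity in (0, 1] to a distance in [0, 1)."""
    return 1.0 - similarity


dog_wolf = wu_palmer("dog", "wolf")  # share the ancestor "canine"
dog_cat = wu_palmer("dog", "cat")    # share only the ancestor "animal"
```

In this toy tree, `dog` and `wolf` end up closer together than `dog` and `cat`, which is the ordering a clustering step needs the distance to preserve.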

Setting `use_wordnet = False` switches the distance metric between words to the `word2vec` similarity value computed from a `GloVe` model, `glove.6B.300d_word2vec.txt`. The model file must be in the [word2vec format](https://radimrehurek.com/gensim/scripts/glove2word2vec.html).
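The word2vec text format differs from the raw GloVe format by a header line holding the vocabulary size and the vector dimension, followed by one word and its vector per line. The sketch below parses a tiny made-up sample in that layout and compares words by cosine similarity, which is the conventional similarity for such vectors. The sample vectors and the helper names `parse_word2vec_text` and `cosine_similarity` are assumptions for illustration, not part of the project.

```python
import math

# Tiny stand-in for a word2vec-format text file: a "<vocab_size> <dimensions>"
# header, then one "<word> <v1> <v2> ..." line per word. The vectors are made up.
SAMPLE = """3 3
dog 1.0 0.0 0.0
puppy 0.8 0.6 0.0
car 0.0 0.0 1.0
"""


def parse_word2vec_text(text):
    """Parse word2vec text-format content into a {word: vector} dictionary."""
    lines = text.strip().splitlines()
    vocab_size, dims = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        word, *values = line.split()
        vectors[word] = [float(v) for v in values]
    # Check the body against the counts declared in the header line.
    if len(vectors) != vocab_size:
        raise ValueError("vocabulary size does not match the header")
    if any(len(v) != dims for v in vectors.values()):
        raise ValueError("vector dimension does not match the header")
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vectors = parse_word2vec_text(SAMPLE)
dog_puppy = cosine_similarity(vectors["dog"], vectors["puppy"])
dog_car = cosine_similarity(vectors["dog"], vectors["car"])
```

For the real 300-dimensional model a library loader is the practical choice; the manual parser here only shows what the format contains.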

The `extras` folder contains proof-of-concept experiments.

# To Use:

- Create a newline-delimited file with a list of `wordnet` senses (e.g. `data/example_tags.txt`).
- To use `wordnet`, set `use_wordnet=True`; to use `word2vec`, set `use_wordnet=False`.
- ```python generate-tag-clusters.py data/example_tags.txt 25 0.7```
- 25 is the number of clusters to segment the list of `wordnet` senses into.
- 0.7 is the similarity threshold; below this value, words are considered not similar.
- Results are placed in the `results` folder as a JSON file.
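The input conventions above can be sketched as follows: a newline-delimited tags file, then a cluster count and a similarity threshold passed as positional arguments. The helpers `load_tags` and `parse_arguments` are hypothetical and are not taken from `generate-tag-clusters.py`. The sense names written to the temporary file are illustrative and assume names in the style of `dog.n.01`.

```python
import os
import tempfile


def load_tags(path):
    """Read one sense per line, ignoring blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def parse_arguments(argv):
    """Interpret [tags_file, cluster_count, similarity_threshold]."""
    tags_file, cluster_count, threshold = argv
    return tags_file, int(cluster_count), float(threshold)


# Hypothetical input file; the real example lives at data/example_tags.txt.
with tempfile.TemporaryDirectory() as workdir:
    tags_path = os.path.join(workdir, "example_tags.txt")
    with open(tags_path, "w", encoding="utf-8") as handle:
        handle.write("dog.n.01\ncat.n.01\n\ncar.n.01\n")

    # Mirrors the positional arguments of the command shown above.
    path, k, threshold = parse_arguments([tags_path, "25", "0.7"])
    tags = load_tags(path)
```

Converting the numeric arguments up front means a malformed cluster count or threshold fails before any clustering work starts.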
