https://github.com/pabvald/semantic-similarity
Comparison of methods based on pre-trained Word2Vec, GloVe and FastText vectors to measure the semantic similarity between sentence pairs
bachelor-thesis embeddings evaluation fasttext gensim-library glove semantic-similarity spacy word-embeddings word2vec
- Host: GitHub
- URL: https://github.com/pabvald/semantic-similarity
- Owner: pabvald
- License: agpl-3.0
- Created: 2020-04-16T10:45:49.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-03-28T08:29:39.000Z (about 2 years ago)
- Last Synced: 2025-01-27T22:39:46.302Z (3 months ago)
- Topics: bachelor-thesis, embeddings, evaluation, fasttext, gensim-library, glove, semantic-similarity, spacy, word-embeddings, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 76.9 MB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# Semantic Similarity Methods
Comparison of methods based on pre-trained Word2Vec, GloVe and FastText vectors to measure the semantic similarity between sentence pairs.

## Content
- `data/`
  - `datasets/`
    - `get_datasets.bash`: script that downloads the datasets used in the evaluation; it is a modified version of the one provided in the [SentEval](https://github.com/facebookresearch/SentEval) toolkit.
    - `tokenizer.vec`
  - `embedding/`
    - `fasttext/get_fasttext_embeddings.bash`: script that downloads the FastText word vectors used.
    - `gloVe/`
      - `2word2vec.py`: converts the GloVe vector set to Word2Vec format.
      - `get_glove_embeddings.bash`: script that downloads the GloVe word vectors used.
    - `word2vec/get_word2vec_embeddings.bash`: script that downloads the Word2Vec word vectors used.
  - `frequencies.tsv`
- `evaluation.ipynb`: Jupyter Notebook containing the evaluation.
- `load.py`: contains a set of functions to load and preprocess the different datasets used. The code is based on what can be found in [SentEval](https://github.com/facebookresearch/SentEval).

## Evaluation
To run the evaluation code contained in the Jupyter Notebook [evaluation.ipynb](./evaluation.ipynb), follow the steps below.
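For orientation, the sketch below illustrates the kind of computation such an evaluation compares: scoring a sentence pair from word vectors. It is not taken from the notebook. It assumes one common baseline (averaging the word vectors of each sentence and taking the cosine similarity), and `TOY_VECTORS`, `sentence_vector`, `cosine` and `similarity` are hypothetical names, with invented 3-dimensional values standing in for real pre-trained vectors.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real Word2Vec, GloVe or FastText vectors have far more dimensions.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "sleeps": [0.1, 0.9, 0.0],
    "runs": [0.2, 0.8, 0.1],
    "drives": [0.1, 0.3, 0.8],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the in-vocabulary tokens of a sentence."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        return None  # no known word: no sentence vector can be built
    dim = len(vectors[tokens[0]])
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(sentence_1, sentence_2, vectors):
    """Score a sentence pair; returns 0.0 when a sentence has no known word."""
    u = sentence_vector(sentence_1, vectors)
    v = sentence_vector(sentence_2, vectors)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)
```

With these toy values, `similarity("cat sleeps", "dog runs", TOY_VECTORS)` is higher than `similarity("cat sleeps", "car drives", TOY_VECTORS)`. Whitespace tokenization is a simplification chosen to keep the sketch dependency-free.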
### 1. Installing Python 3.7 and its virtual environment tool
First, install Python 3.7 and its virtual environment tool. On Ubuntu, both are available from the deadsnakes PPA:
```
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7
sudo apt install python3.7-venv
```

### 2. Creating and Activating a Python 3.7 Virtual Environment
Second, create a Python 3.7 virtual environment inside this repository:

```
python3.7 -m venv .venv
```
and activate it:
```
source .venv/bin/activate
```

### 3. Installing the dependencies
Once the virtual environment is activated, install the dependencies using the following command:
```
pip install -r requirements.txt
```

### 4. Downloading vector sets
To reproduce the evaluation in [evaluation.ipynb](./evaluation.ipynb), you must first download the Word2Vec, GloVe and FastText word vector sets. Each of these sets is of considerable size and may take several minutes to download.

#### 4.1. Downloading the Word2Vec set
From the root of this repository (*semantic_similarity/*), run the following commands:

```
cd data/embedding/word2vec
chmod +x get_word2vec_embeddings.bash
./get_word2vec_embeddings.bash
```

#### 4.2. Downloading the GloVe set
From the root of this repository (*semantic_similarity/*), run the following commands:

```
cd data/embedding/glove
chmod +x get_glove_embeddings.bash
./get_glove_embeddings.bash
python 2word2vec.py
```

#### 4.3. Downloading the FastText set
From the root of this repository (*semantic_similarity/*), run the following commands:

```
cd data/embedding/fasttext
chmod +x get_fasttext_embeddings.bash
./get_fasttext_embeddings.bash
```

### 5. Downloading the datasets
The datasets must also be downloaded. From the root of this repository (*semantic_similarity/*), run the following commands:
```
cd data/datasets
chmod +x get_datasets.bash
./get_datasets.bash
```

### 6. Starting Jupyter Notebook
Start Jupyter Notebook with the following command, then open the *evaluation.ipynb* file:
```
jupyter-notebook
```

Once you have finished using Jupyter Notebook, press `Ctrl + C` in the terminal where you started it to stop it. Finally, deactivate the virtual environment using the following command:
```
deactivate
```

## Dependencies
```
gensim==3.8.2
jupyter==1.0.0
notebook==6.0.3
numpy==1.18.3
Orange3==3.25.0
pandas==1.0.3
sklearn==0.0
spacy==2.2.4
```
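
To check an environment against these pins, a small helper can compare them with the output of `pip freeze`. The following is a minimal sketch, not part of the repository: `parse_pins` and `find_mismatches` are hypothetical names, and `EXAMPLE_FREEZE` is made-up output used only to show the call.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a {normalized_name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue  # skip blank lines, comments and unpinned entries
        name, version = line.split("==", 1)
        # Normalize case and underscores so 'Orange3' and 'orange3' match.
        pins[name.strip().lower().replace("_", "-")] = version.strip()
    return pins


def find_mismatches(required_text, installed_text):
    """Return {name: (required, installed_or_None)} for every unmet pin."""
    required = parse_pins(required_text)
    installed = parse_pins(installed_text)
    return {
        name: (version, installed.get(name))
        for name, version in required.items()
        if installed.get(name) != version
    }


# The pins listed above.
REQUIRED = """
gensim==3.8.2
jupyter==1.0.0
notebook==6.0.3
numpy==1.18.3
Orange3==3.25.0
pandas==1.0.3
sklearn==0.0
spacy==2.2.4
"""

# Hypothetical `pip freeze` output, invented for illustration.
EXAMPLE_FREEZE = """
gensim==3.8.2
numpy==1.19.0
"""

if __name__ == "__main__":
    mismatches = find_mismatches(REQUIRED, EXAMPLE_FREEZE)
    for name, (want, have) in sorted(mismatches.items()):
        print(f"{name}: required {want}, installed {have or 'missing'}")
```

In practice, the second argument would be the text captured from `pip freeze` inside the activated virtual environment.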