https://github.com/pabvald/semantic-similarity

Comparison of methods based on pre-trained Word2Vec, GloVe and FastText vectors to measure the semantic similarity between sentence pairs

bachelor-thesis embeddings evaluation fasttext gensim-library glove semantic-similarity spacy word-embeddings word2vec

Host: GitHub
URL: https://github.com/pabvald/semantic-similarity
Owner: pabvald
License: agpl-3.0
Created: 2020-04-16T10:45:49.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2023-03-28T08:29:39.000Z (over 2 years ago)
Last Synced: 2025-01-27T22:39:46.302Z (5 months ago)
Topics: bachelor-thesis, embeddings, evaluation, fasttext, gensim-library, glove, semantic-similarity, spacy, word-embeddings, word2vec
Language: Jupyter Notebook
Homepage:
Size: 76.9 MB
Stars: 6
Watchers: 2
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

# Semantic Similarity Methods
Comparison of methods based on pre-trained Word2Vec, GloVe and FastText vectors to measure the semantic similarity between sentence pairs
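
As a rough illustration of what scoring a sentence pair with word vectors can look like, the sketch below averages the word vectors of each sentence and takes the cosine similarity of the two averages. This is a common baseline, assumed here for illustration only; it is not necessarily one of the methods evaluated in the notebook. The `toy_vectors` table and all function names are hypothetical stand-ins for real pre-trained vectors.

```python
import math

# Hypothetical 3-dimensional vectors standing in for pre-trained embeddings.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "sleeps": [0.0, 1.0, 0.0],
    "runs": [0.0, 0.8, 0.2],
    "car": [0.0, 0.0, 1.0],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the tokens found in `vectors`; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(sentence_a, sentence_b, vectors):
    """Similarity of two lower-cased, whitespace-tokenised sentences.

    Returns None when either sentence contains no known token.
    """
    vec_a = sentence_vector(sentence_a.lower().split(), vectors)
    vec_b = sentence_vector(sentence_b.lower().split(), vectors)
    if vec_a is None or vec_b is None:
        return None
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    print(sentence_similarity("cat sleeps", "dog runs", toy_vectors))
    print(sentence_similarity("cat sleeps", "car", toy_vectors))
```

With real embeddings, the lookup table would be replaced by the loaded Word2Vec, GloVe or FastText vectors.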

## Content

- `data/`
    - `datasets/`
        - `get_datasets.bash`: script that downloads the datasets used in the evaluation. It is a modified version of the one provided in the [SentEval](https://github.com/facebookresearch/SentEval) toolkit.
        - `tokenizer.vec`
    - `embedding/`
        - `fasttext/get_fasttext_embeddings.bash`: script that downloads the FastText word vectors used.
        - `gloVe/`
            - `2word2vec.py`: converts the GloVe vector set to Word2Vec format.
            - `get_glove_embeddings.bash`: script that downloads the GloVe word vectors used.
        - `word2vec/get_word2vec_embeddings.bash`: script that downloads the Word2Vec word vectors used.
    - `frequencies.tsv`

- `evaluation.ipynb`: Jupyter Notebook containing the evaluation.

- `load.py`: functions to load and preprocess the datasets used. The code is based on code from the SentEval toolkit.

## Evaluation

To run the evaluation in the Jupyter Notebook [evaluation.ipynb](./evaluation.ipynb), follow these steps:

### 1. Installing Python 3.7 and its virtual environment tool
Install Python 3.7 and its `venv` module. The commands below use `apt` and the deadsnakes PPA, so they apply to Ubuntu-based systems:
```
sudo apt update
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7
sudo apt install python3.7-venv
```

### 2. Creating and activating a Python 3.7 virtual environment
Create a Python 3.7 virtual environment inside this repository:

```
python3.7 -m venv .venv
```
and activate it:
```
source .venv/bin/activate
```
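
To confirm that the activated interpreter is the one intended, a small check like the following can be run with `python`. It is a minimal sketch using only the standard library; the function names are hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment while sys.base_prefix
    # points at the base installation; outside a venv the two are equal.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_interpreter():
    """Summarise the interpreter version and whether a virtual environment is active."""
    version = "{}.{}".format(sys.version_info.major, sys.version_info.minor)
    return {"python": version, "virtualenv": in_virtualenv(), "prefix": sys.prefix}


if __name__ == "__main__":
    info = describe_interpreter()
    print(info)
    if info["python"] != "3.7":
        print("Note: these steps target Python 3.7.")
```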

### 3. Installing the dependencies
Once the virtual environment is activated, install the dependencies using the following command:
```
pip install -r requirements.txt
```

### 4. Downloading vector sets
To reproduce the evaluation in [evaluation.ipynb](./evaluation.ipynb), you must first download the Word2Vec, GloVe and FastText word vector sets. Each set is large and may take several minutes to download.

#### 4.1. Downloading the Word2Vec set
From the repository root (*semantic-similarity/*), run the following commands:

```
cd data/embedding/word2vec
chmod +x get_word2vec_embeddings.bash
./get_word2vec_embeddings.bash
```

#### 4.2. Downloading the GloVe set
From the repository root (*semantic-similarity/*), run the following commands:

```
cd data/embedding/glove
chmod +x get_glove_embeddings.bash
./get_glove_embeddings.bash
python 2word2vec.py
```
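
The contents of `2word2vec.py` are not shown here, so the following is only a sketch of what such a conversion involves, not the repository's script. The GloVe text format has one `word v1 v2 ...` line per vector and no header, while the Word2Vec text format starts with a `<vocabulary size> <dimensions>` header line. The function name and the toy file names are hypothetical.

```python
import os
import tempfile


def glove_to_word2vec(glove_path, word2vec_path):
    """Copy a GloVe text file to Word2Vec text format by prepending the header line.

    Returns (vocab_size, dimensions).
    """
    vocab_size = 0
    dimensions = 0
    # First pass: count the vectors and read the dimensionality from the first line.
    with open(glove_path, "r", encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue
            if vocab_size == 0:
                dimensions = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    # Second pass: write the header, then the vector lines unchanged.
    with open(glove_path, "r", encoding="utf-8") as src, \
            open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write("{} {}\n".format(vocab_size, dimensions))
        for line in src:
            if line.strip():
                dst.write(line if line.endswith("\n") else line + "\n")
    return vocab_size, dimensions


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "toy_glove.txt")     # hypothetical file name
        dst_path = os.path.join(tmp, "toy_word2vec.txt")  # hypothetical file name
        with open(src_path, "w", encoding="utf-8") as f:
            f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
        print(glove_to_word2vec(src_path, dst_path))
        with open(dst_path, encoding="utf-8") as f:
            print(f.read())
```

gensim 3.x also ships a helper for this conversion in `gensim.scripts.glove2word2vec`; whether the repository's script relies on it is not stated.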

#### 4.3. Downloading the FastText set
From the repository root (*semantic-similarity/*), run the following commands:

```
cd data/embedding/fasttext
chmod +x get_fasttext_embeddings.bash
./get_fasttext_embeddings.bash
```

### 5. Downloading the datasets
The datasets must also be downloaded. From the repository root (*semantic-similarity/*), run the following commands:
```
cd data/datasets
sudo chmod +x get_datasets.bash
./get_datasets.bash
```
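
Before opening the notebook, it may be worth checking that each download step actually left files behind. The sketch below, run from the repository root, lists what each directory used in steps 4 and 5 contains besides helper scripts. The names of the downloaded files are not given in this README, so the check is only a heuristic: it assumes that anything ending in `.bash` or `.py` is a helper script and that everything else was downloaded. The function names are hypothetical.

```python
import os

# Directories used by the download steps, relative to the repository root.
DOWNLOAD_DIRS = [
    "data/embedding/word2vec",
    "data/embedding/glove",
    "data/embedding/fasttext",
    "data/datasets",
]

# Assumption: files with these suffixes are helper scripts, not downloads.
SCRIPT_SUFFIXES = (".bash", ".py")


def downloaded_entries(directory):
    """List non-script, non-hidden entries in `directory`; empty list if it is missing."""
    if not os.path.isdir(directory):
        return []
    return sorted(
        name for name in os.listdir(directory)
        if not name.endswith(SCRIPT_SUFFIXES) and not name.startswith(".")
    )


def check_downloads(root="."):
    """Map each download directory to the non-script entries found in it."""
    return {d: downloaded_entries(os.path.join(root, d)) for d in DOWNLOAD_DIRS}


if __name__ == "__main__":
    for directory, entries in check_downloads().items():
        status = "ok" if entries else "nothing downloaded yet?"
        print("{}: {} ({} entries)".format(directory, status, len(entries)))
```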

### 6. Starting Jupyter Notebook
Return to the repository root, start Jupyter Notebook with the following command, and open *evaluation.ipynb*:
```
jupyter-notebook
```

When you have finished, press `Ctrl + C` in the terminal where Jupyter Notebook is running to stop it. Finally, deactivate the virtual environment:
```
deactivate
```

## Dependencies
```
gensim==3.8.2
jupyter==1.0.0
notebook==6.0.3
numpy==1.18.3
Orange3==3.25.0
pandas==1.0.3
sklearn==0.0
spacy==2.2.4
```
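
To check that the active environment matches these pins, the installed versions can be compared against the list above. This is a minimal sketch: the function names are hypothetical, and on Python 3.7 it assumes the `importlib_metadata` backport is available, because `importlib.metadata` only joined the standard library in Python 3.8.

```python
try:
    from importlib import metadata  # standard library from Python 3.8
except ImportError:
    import importlib_metadata as metadata  # backport; assumed installed on Python 3.7

# The pins listed in the Dependencies section.
PINS = """
gensim==3.8.2
jupyter==1.0.0
notebook==6.0.3
numpy==1.18.3
Orange3==3.25.0
pandas==1.0.3
sklearn==0.0
spacy==2.2.4
"""


def parse_pins(text):
    """Parse `name==version` lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def compare_with_installed(pins):
    """Return {name: (pinned, installed)} for packages that are missing or differ.

    `installed` is None when the package is not installed.
    """
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            report[name] = (pinned, installed)
    return report


if __name__ == "__main__":
    mismatches = compare_with_installed(parse_pins(PINS))
    if not mismatches:
        print("All pinned versions are installed.")
    for name, (pinned, installed) in sorted(mismatches.items()):
        print("{}: pinned {}, installed {}".format(name, pinned, installed))
```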
