https://github.com/miladfa7/persian-word-embedding
Persian Word Embedding using FastText, BERT, GPT and GloVe | Persian word embedding with various methods
- Host: GitHub
- URL: https://github.com/miladfa7/persian-word-embedding
- Owner: miladfa7
- Created: 2023-07-17T20:15:33.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2023-07-17T21:16:37.000Z (about 2 years ago)
- Last Synced: 2023-07-17T21:40:12.275Z (about 2 years ago)
- Topics: bert, fasttext-embeddings, gpt, persian, persian-nlp, word-embeddings, word-vectors
- Homepage:
- Size: 6.84 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Persian Word Embedding
Persian Word Embedding using FastText, BERT, GPT and GloVe

### 1. How to use FastText Embedding
1.1 How to install fasttext:
```bash
pip install fasttext
pip install huggingface_hub
```

1.2 Here is how to load and use the pre-trained vectors:
```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the pre-trained Persian fastText vectors from the Hugging Face Hub
model_path = hf_hub_download(repo_id="facebook/fasttext-fa-vectors", filename="model.bin")
model = fasttext.load_model(model_path)

model.words
['رسانه', 'بورس', 'اعضای', 'دیده', 'عملکرد', 'ویرایش', 'سفارش', 'کارشناسی', 'کلاه', 'کمتر', ...]

len(model.words)
2000000

model['زندگی']
array([ 4.89417791e-01, 1.60882145e-01, -2.25947708e-01, -2.94273376e-01,
-1.04577184e-01, 1.17962055e-01, 1.34821936e-01, -2.41778508e-01, ...])
```

1.3 Here is how to use this model to query the **nearest neighbors** of a Persian word vector:
```python
model.get_nearest_neighbors("بورس", k=5)
[(0.6276253461837769, 'سهام شاخص'),
(0.6252498626708984, 'معاملات'),
(0.6190851330757141, 'بهادار'),
(0.6184772253036499, 'اقتصادبورس'),
(0.6100088357925415, 'بورسهر')]
```
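1.4 The similarity behind `get_nearest_neighbors` can also be reproduced directly from the word vectors. A minimal sketch using NumPy cosine similarity (the two words below are only illustrative examples, not part of the original README):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = model.get_word_vector("بورس")      # "stock exchange"
v2 = model.get_word_vector("معاملات")   # "transactions"
print(cosine_similarity(v1, v2))        # higher score = more similar words
```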
-----------------------

### 2. How to use BERT (ParsBERT) Embedding
2.1 How to install Hugging Face Transformers:
```bash
pip install transformers
```
2.2 Here is how to load and use a pre-trained model:

```python
import torch
from transformers import BertTokenizer, BertModel

model_name = 'HooshvareLab/bert-fa-zwnj-base'  # Specify the BERT model variant
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

text = "جز عشق نبود هیچ دم ساز مرا نی اول "
tokenizer.tokenize(text)
['جز', 'عشق', 'نبود', 'هیچ', 'دم', 'ساز', 'مرا', 'نی', 'اول']
```
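The tokens above are only the surface pieces; the model itself consumes integer IDs with the special `[CLS]`/`[SEP]` markers added. A short sketch to inspect them, reusing the same `tokenizer` and `text`:

```python
token_ids = tokenizer.encode(text, add_special_tokens=True)
print(token_ids)                                   # integer vocabulary IDs, including special tokens
print(tokenizer.convert_ids_to_tokens(token_ids))  # ['[CLS]', 'جز', ..., '[SEP]']
```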
2.3 Here is how to get the **word embeddings** from BERT:
```python
encoded_input = tokenizer.encode_plus(
text,
add_special_tokens=True,
padding='max_length',
truncation=True,
max_length=150, # Specify the desired maximum length of the sequence
return_tensors='pt'
)
input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']
with torch.no_grad():
outputs = model(input_ids, attention_mask=attention_mask)
words_embedding = outputs.last_hidden_state

words_embedding
tensor([[[ 0.2545, -0.3399, 0.0990, ..., 0.3291, 0.3309, 1.2594],
[ 0.5799, -0.1835, -0.1979, ..., 0.7980, -0.3029, -0.1636],
[ 0.4741, -0.1815, -0.0451, ..., 1.8211, 0.1717, -0.3972],
...,
[-0.3178, -0.9737, 0.5525, ..., 0.4877, 0.1396, 0.7577],
[ 0.1801, -0.8703, 0.2300, ..., 0.4041, 0.4268, 0.5552],
        [-0.4429, -0.3841,  0.8476,  ...,  0.3903,  0.8899,  1.8148]]])

words_embedding.size()
torch.Size([1, 150, 768])  # batch_size, max_len, embedding_dim
```
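Each of the 150 rows corresponds to one token position, most of which are `[PAD]` because of `padding='max_length'`. A minimal sketch for pairing the real tokens with their vectors, reusing `input_ids` and `words_embedding` from above:

```python
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
for token, vector in zip(tokens, words_embedding[0]):
    if token != tokenizer.pad_token:   # skip the padded positions
        print(token, vector[:5])       # first 5 of the 768 dimensions
```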
2.4 Here is how to get a **sentence embedding** from BERT:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sentence_embedding = torch.mean(words_embedding, dim=1).to(device)  # Shape: [1, 768]
sentence_embedding = sentence_embedding.squeeze(0)
sentence_embedding
tensor([ 8.5537e-02, -7.5624e-01,  1.9884e-01, -7.9048e-01, -1.6724e+00,
-1.0927e+00, -3.7952e-01, -5.0552e-01, -6.3537e-01, 1.5239e+00,
-8.8235e-01, 4.4737e-01, -5.0677e-01, -9.2339e-01, -8.2049e-01,
3.1416e-03, -1.5347e-01, -5.0761e-01, -1.2381e+00, 1.3580e-01,
...
])
```
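Note that with `padding='max_length'` the plain mean above also averages over the `[PAD]` positions. A sketch of an alternative that weights the mean by `attention_mask`, so only real tokens contribute (names reused from section 2.3):

```python
mask = attention_mask.unsqueeze(-1).float()          # [1, 150, 1], 1 for real tokens, 0 for padding
summed = (words_embedding * mask).sum(dim=1)         # sum over non-padding positions only
counts = mask.sum(dim=1).clamp(min=1e-9)             # number of real tokens
sentence_embedding = (summed / counts).squeeze(0)    # Shape: [768]
```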
### 3. How to use GPT (ParsGPT) Embedding

```python
from transformers import AutoTokenizer, AutoModel

# tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
# model = AutoModel.from_pretrained('bolbolzaban/gpt2-persian')

tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/gpt2-fa')
model = AutoModel.from_pretrained('HooshvareLab/gpt2-fa')

text = 'ای یوسف خوش نام ما خوش می روی بر بام ما'
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# Extract the embeddings
embeddings = output.last_hidden_state

embeddings
tensor([[[ 0.9679, -0.3543, -1.5806, ..., -0.6302, -1.1486, -0.1004],
[-2.0463, -2.9409, -2.5625, ..., -2.0037, 2.5055, -0.9767],
[-0.0377, -4.3028, 1.2818, ..., -3.4329, 0.9164, -1.9161],
...,
[-1.3821, -0.9317, 0.9138, ..., 0.7705, -0.4418, -0.7426],
[ 0.0521, -2.3572, 0.0921, ..., -2.0423, -0.1339, 1.7548],
[-0.8144, -1.0173, 1.5099, ..., -0.4598, -0.7072, 2.4239]]],
       grad_fn=<...>)

embeddings.size()
torch.Size([1, 12, 768])  # batch_size, sequence_length, embedding_dim
```
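As with BERT, the GPT token embeddings can be pooled into a single sentence vector. A minimal sketch under the same setup, using simple mean pooling (wrapping the forward pass in `torch.no_grad()` also avoids the `grad_fn` bookkeeping seen above):

```python
import torch

with torch.no_grad():
    output = model(**encoded_input)
sentence_embedding = output.last_hidden_state.mean(dim=1).squeeze(0)  # Shape: [768]
```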