https://github.com/miladfa7/persian-word-embedding
Persian Word Embedding using FastText, BERT, GPT and GloVe | Persian word embedding with various methods
- Host: GitHub
- URL: https://github.com/miladfa7/persian-word-embedding
- Owner: miladfa7
- Created: 2023-07-17T20:15:33.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2023-07-17T21:16:37.000Z (about 2 years ago)
- Last Synced: 2023-07-17T21:40:12.275Z (about 2 years ago)
- Topics: bert, fasttext-embeddings, gpt, persian, persian-nlp, word-embeddings, word-vectors
- Homepage:
- Size: 6.84 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Persian Word Embedding
Persian Word Embedding using FastText, BERT, GPT and GloVe

### 1. How to use FastText Embedding
1.1 How to install fasttext:
```bash
pip install fasttext
pip install huggingface_hub
```

1.2 Here is how to load and use the pre-trained vectors:
```python
import fasttext
from huggingface_hub import hf_hub_download

# Download the pre-trained Persian fastText vectors from the Hugging Face Hub
model_path = hf_hub_download(repo_id="facebook/fasttext-fa-vectors", filename="model.bin")
model = fasttext.load_model(model_path)

model.words
['رسانه', 'بورس', 'اعضای', 'دیده', 'عملکرد', 'ویرایش', 'سفارش', 'کارشناسی', 'کلاه', 'کمتر', ...]

len(model.words)
2000000

model['زندگی']
array([ 4.89417791e-01, 1.60882145e-01, -2.25947708e-01, -2.94273376e-01,
-1.04577184e-01, 1.17962055e-01, 1.34821936e-01, -2.41778508e-01, ...])
```

1.3 Here is how to use this model to query the **nearest neighbors** of a Persian word vector:
```python
model.get_nearest_neighbors("بورس", k=5)
[(0.6276253461837769, 'سهام شاخص'),
(0.6252498626708984, 'معاملات'),
(0.6190851330757141, 'بهادار'),
(0.6184772253036499, 'اقتصادبورس'),
(0.6100088357925415, 'بورسهر')]
```
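1.4 The similarity behind `get_nearest_neighbors` can also be reproduced directly from the word vectors. A minimal sketch using NumPy cosine similarity (the two words below are only illustrative examples, not part of the original README):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = model.get_word_vector("بورس")      # "stock exchange"
v2 = model.get_word_vector("معاملات")   # "transactions"
print(cosine_similarity(v1, v2))        # higher score = more similar words
```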
-----------------------

### 2. How to use BERT (ParsBERT) Embedding
2.1 How to install Hugging Face Transformers:
```bash
pip install transformers
```
2.2 Here is how to load and use a pre-trained model:

```python
import torch
from transformers import BertTokenizer, BertModel

model_name = 'HooshvareLab/bert-fa-zwnj-base'  # Specify the BERT model variant
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

text = "جز عشق نبود هیچ دم ساز مرا نی اول "
tokenizer.tokenize(text)
['جز', 'عشق', 'نبود', 'هیچ', 'دم', 'ساز', 'مرا', 'نی', 'اول']
```
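The tokens above are only the surface pieces; the model itself consumes integer IDs with the special `[CLS]`/`[SEP]` markers added. A short sketch to inspect them, reusing the same `tokenizer` and `text`:

```python
token_ids = tokenizer.encode(text, add_special_tokens=True)
print(token_ids)                                   # integer vocabulary IDs, including special tokens
print(tokenizer.convert_ids_to_tokens(token_ids))  # ['[CLS]', 'جز', ..., '[SEP]']
```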
2.3 Here is how to get the **word embeddings** from BERT:
```python
encoded_input = tokenizer.encode_plus(
text,
add_special_tokens=True,
padding='max_length',
truncation=True,
max_length=150, # Specify the desired maximum length of the sequence
return_tensors='pt'
)
input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']
with torch.no_grad():
outputs = model(input_ids, attention_mask=attention_mask)
words_embedding = outputs.last_hidden_state

words_embedding
tensor([[[ 0.2545, -0.3399, 0.0990, ..., 0.3291, 0.3309, 1.2594],
[ 0.5799, -0.1835, -0.1979, ..., 0.7980, -0.3029, -0.1636],
[ 0.4741, -0.1815, -0.0451, ..., 1.8211, 0.1717, -0.3972],
...,
[-0.3178, -0.9737, 0.5525, ..., 0.4877, 0.1396, 0.7577],
[ 0.1801, -0.8703, 0.2300, ..., 0.4041, 0.4268, 0.5552],
        [-0.4429, -0.3841,  0.8476,  ...,  0.3903,  0.8899,  1.8148]]])

words_embedding.size()
torch.Size([1, 150, 768])  # batch_size, max_len, embedding_dim
```
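Each of the 150 rows corresponds to one token position, most of which are `[PAD]` because of `padding='max_length'`. A minimal sketch for pairing the real tokens with their vectors, reusing `input_ids` and `words_embedding` from above:

```python
tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
for token, vector in zip(tokens, words_embedding[0]):
    if token != tokenizer.pad_token:   # skip the padded positions
        print(token, vector[:5])       # first 5 of the 768 dimensions
```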
2.4 Here is how to get a **sentence embedding** from BERT:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sentence_embedding = torch.mean(words_embedding, dim=1).to(device)  # Shape: [1, 768]
sentence_embedding = sentence_embedding.squeeze(0)
sentence_embedding
tensor([ 8.5537e-02, -7.5624e-01,  1.9884e-01, -7.9048e-01, -1.6724e+00,
-1.0927e+00, -3.7952e-01, -5.0552e-01, -6.3537e-01, 1.5239e+00,
-8.8235e-01, 4.4737e-01, -5.0677e-01, -9.2339e-01, -8.2049e-01,
3.1416e-03, -1.5347e-01, -5.0761e-01, -1.2381e+00, 1.3580e-01,
...
])
```
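Note that with `padding='max_length'` the plain mean above also averages over the `[PAD]` positions. A sketch of an alternative that weights the mean by `attention_mask`, so only real tokens contribute (names reused from section 2.3):

```python
mask = attention_mask.unsqueeze(-1).float()          # [1, 150, 1], 1 for real tokens, 0 for padding
summed = (words_embedding * mask).sum(dim=1)         # sum over non-padding positions only
counts = mask.sum(dim=1).clamp(min=1e-9)             # number of real tokens
sentence_embedding = (summed / counts).squeeze(0)    # Shape: [768]
```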
### 3. How to use GPT (ParsGPT) Embedding

```python
from transformers import AutoTokenizer, AutoModel

# tokenizer = AutoTokenizer.from_pretrained('bolbolzaban/gpt2-persian')
# model = AutoModel.from_pretrained('bolbolzaban/gpt2-persian')

tokenizer = AutoTokenizer.from_pretrained('HooshvareLab/gpt2-fa')
model = AutoModel.from_pretrained('HooshvareLab/gpt2-fa')

text = 'ای یوسف خوش نام ما خوش می روی بر بام ما'
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

# Extract the embeddings
embeddings = output.last_hidden_state

embeddings
tensor([[[ 0.9679, -0.3543, -1.5806, ..., -0.6302, -1.1486, -0.1004],
[-2.0463, -2.9409, -2.5625, ..., -2.0037, 2.5055, -0.9767],
[-0.0377, -4.3028, 1.2818, ..., -3.4329, 0.9164, -1.9161],
...,
[-1.3821, -0.9317, 0.9138, ..., 0.7705, -0.4418, -0.7426],
[ 0.0521, -2.3572, 0.0921, ..., -2.0423, -0.1339, 1.7548],
[-0.8144, -1.0173, 1.5099, ..., -0.4598, -0.7072, 2.4239]]],
       grad_fn=<...>)

embeddings.size()
torch.Size([1, 12, 768])  # batch_size, sequence_length, embedding_dim
```
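As with BERT, the GPT token embeddings can be pooled into a single sentence vector. A minimal sketch under the same setup, using simple mean pooling (wrapping the forward pass in `torch.no_grad()` also avoids the `grad_fn` bookkeeping seen above):

```python
import torch

with torch.no_grad():
    output = model(**encoded_input)
sentence_embedding = output.last_hidden_state.mean(dim=1).squeeze(0)  # Shape: [768]
```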