https://github.com/inzva/Turkish-GloVe
Türkçe GloVe - Repository for Turkish GloVe Word Embeddings
turkish-nlp
- Host: GitHub
- URL: https://github.com/inzva/Turkish-GloVe
- Owner: inzva
- License: mit
- Created: 2020-12-27T12:58:25.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2023-04-12T10:49:46.000Z (about 2 years ago)
- Last Synced: 2024-10-11T12:46:05.826Z (8 months ago)
- Topics: turkish-nlp
- Language: Jupyter Notebook
- Homepage:
- Size: 227 KB
- Stars: 66
- Watchers: 17
- Forks: 6
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- turkish-nlp-resources - TurkishGloVe
README
# TurkishGloVe
Türkçe GloVe - Repository for Turkish GloVe Word Embeddings

## Training
We used the official GloVe repository both to train the word embeddings and to evaluate them.
GloVe GitHub Repository

## Download pre-trained word vectors
1. 570K Vocab, cased, 300d vectors, 1.6 GB Text, 2.6 GB Binary link
2. 253K Vocab, uncased, 300d vectors, 720 MB Text, 1.2 GB Binary link
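
The text downloads can be read without any extra dependencies. The sketch below assumes the text files follow the usual GloVe output layout (one token per line followed by space-separated floats); the function name `load_glove_text` and the tiny in-memory sample are illustrative only and are not part of this repository.

```python
import io


def load_glove_text(lines, dim=300):
    """Parse GloVe text-format lines into a {word: [float, ...]} dict.

    Assumes each line is `<token> <v1> ... <v_dim>` separated by single spaces.
    """
    vectors = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split from the right so exactly `dim` numeric fields are taken;
        # whatever remains on the left is the token itself.
        parts = line.rsplit(" ", dim)
        if len(parts) != dim + 1:
            continue  # skip rows that do not have `dim` values
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory stand-in for a vectors file (3 dimensions instead of 300).
sample = io.StringIO("kedi 0.1 0.2 0.3\nköpek 0.2 0.1 0.4\n")
vectors = load_glove_text(sample, dim=3)
print(sorted(vectors))
print(vectors["kedi"])
```

For a downloaded file, the same function could be called on `open(path, encoding="utf-8")` with `dim=300`. Holding several hundred thousand 300d vectors as Python lists is memory-hungry, so an array-based loader is likely a better fit for real use.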
## Corpus
The corpus was collected from Common Crawl data covering January-December 2018.
This corpus has 2,736B tokens.
Corpus size: 5.4 GB
Corpus Link \
Paper Link

## Intrinsic Evaluation
This benchmark dataset is used for intrinsic evaluation on the analogy task.
The analogy questions cover three categories: synonyms, capitals, and antonyms.
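
As a rough illustration of how an analogy question is scored, the sketch below answers "a is to b as c is to ?" by picking the word whose vector is closest (by cosine similarity) to `b - a + c`, then reports the percentage answered correctly. It is a simplified stand-in for the official GloVe evaluation script, not a copy of it: the function names, the toy 2-d vectors, and the choice to skip questions containing out-of-vocabulary words are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Percentage of (a, b, c, expected) questions answered correctly.

    Questions with any out-of-vocabulary word are skipped (an assumption).
    """
    answered = [q for q in questions if all(w in vectors for w in q)]
    if not answered:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in answered)
    return 100.0 * correct / len(answered)


# Toy 2-d vectors, invented for illustration only.
toy = {
    "almanya": [1.0, 0.0],
    "berlin": [1.0, 1.0],
    "fransa": [2.0, 0.0],
    "paris": [2.0, 1.0],
    "kedi": [-1.0, -1.0],
}
questions = [("almanya", "berlin", "fransa", "paris")]
accuracy = analogy_accuracy(toy, questions)
print(accuracy)
```

The exhaustive search over the vocabulary is fine for a toy example but slow for a vocabulary of several hundred thousand words, where a vectorised matrix product would be the usual choice.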
Benchmark Dataset Link

### Results
| Semantic Evaluation | Antonyms Analogy Task | Capitals Analogy Task | Synonyms Analogy Task | Total Accuracy |
|:-------------------:|:-------------------------:|:---------------------:|:---------------------:|:------------------:|
| GloVe Uncased | 21.70 | 47.74 | 19.48 | 27.88 |

## Extrinsic Evaluation
This dataset is used for extrinsic evaluation on text categorization.
The dataset has 7 different classes.

### Accuracy
| | SVC | Logistic Regression |
|:----------------:|:---------:|:-------------------:|
| GloVe Cased | 0.89306 | 0.89959 |
| GloVe Uncased | 0.89956 | 0.90530 |

### Precision
| | SVC | Logistic Regression |
|:---------------:|:---------:|:-------------------:|
| GloVe Cased | 0.89388 | 0.89864 |
| GloVe Uncased | 0.90015 | 0.90619 |

### Recall
| | SVC | Logistic Regression |
|:---------------:|:---------:|:-------------------:|
| GloVe Cased | 0.89306 | 0.89796 |
| GloVe Uncased | 0.89959 | 0.90531 |

We used the given machine learning techniques with default hyperparameters in scikit-learn.
Text Categorization Dataset Link
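
The README does not state how documents were turned into classifier inputs. One common approach, shown here purely as an assumption, is to average the word vectors of the tokens in each document. The helper name `document_vector` and the toy vectors are hypothetical.

```python
def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each `col` holds one coordinate
    # from every known token.
    return [sum(col) / len(known) for col in zip(*known)]


# Toy 2-d vectors, invented for illustration only.
toy_vectors = {"kedi": [1.0, 0.0], "köpek": [0.0, 1.0]}
features = document_vector("kedi köpek bilinmeyen".split(), toy_vectors, dim=2)
print(features)
```

Feature rows built this way could then be passed to scikit-learn's `SVC()` and `LogisticRegression()` constructed with no arguments, which matches the default-hyperparameter setup described above.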
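## Examples
The snippets below assume `model` is an already loaded word-vector model that exposes a gensim-style `most_similar` method.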
```
# Analogy: almanya : berlin :: fransa : ? (Germany : Berlin :: France : ?)
model.most_similar(positive=['fransa', 'berlin'], negative=['almanya'])
```
```
# Analogy: gelmek : geliyor :: gitmek : ? (to come : is coming :: to go : ?)
model.most_similar(positive=['geliyor', 'gitmek'], negative=['gelmek'])
```
```
# Nearest neighbours of "kedi" (cat)
model.most_similar("kedi")
```
## References
https://cs224d.stanford.edu/lecture_notes/notes2.pdf \
https://nlp.stanford.edu/pubs/glove.pdf