https://github.com/spacyturk/spacyturk

spaCyTurk - trained models & pipelines for Turkish

floret nlp nlp-library spacy turkish-nlp

Last synced: 8 months ago

Host: GitHub
URL: https://github.com/spacyturk/spacyturk
Owner: spacyturk
License: mit
Created: 2022-06-17T09:08:31.000Z (over 3 years ago)
Default Branch: master
Last Pushed: 2022-06-24T20:52:31.000Z (over 3 years ago)
Last Synced: 2025-02-02T06:51:08.597Z (8 months ago)
Topics: floret, nlp, nlp-library, spacy, turkish-nlp
Language: Python
Homepage:
Size: 6.84 KB
Stars: 19
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

## spaCyTurk - trained spaCy models for Turkish

spaCyTurk is a library providing trained [spaCy](https://spacy.io) models for the Turkish language.

### Available Models 

**Trained floret vectors for Turkish**

The floret vectors were trained on the deduplicated version of the [OSCAR-2109](https://oscar-corpus.com/post/oscar-v21-09/) Turkish corpus. After sentence segmentation, removal of non-Turkish sentences, and tokenization, the final corpus has a size of 30GB and contains 4327M tokens.

For more details, see the ***[article](https://medium.com/@bediiaydogan/training-floret-vectors-for-turkish-b3c516c1570f?source=friends_link&sk=fdb74dcf19a83a98a3284f41430a4462)*** describing the parameter selection and evaluation process.

>**training parameters:** model=cbow, dim=300, minn=4, maxn=6, hashCount=2, minCount=5, ws=5, neg=10, lr=0.05, epoch=5

Two models **(tr_floret_web_md, tr_floret_web_lg)** are available with bucket sizes of 50000 and 200000 respectively.
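To make the roles of `minn`, `maxn`, `hashCount`, and the bucket size more concrete, here is a minimal sketch of how a word can be mapped to rows of a fixed-size vector table. It is an illustration only: the helper names `char_ngrams` and `bucket_rows` are hypothetical, the MD5-based hash is a stand-in for floret's own hash function, and treating the full word as one more entry next to its n-grams is an assumption about floret mode.

```python
import hashlib


def char_ngrams(word, minn=4, maxn=6):
    """Return the word wrapped in < > markers plus its character n-grams."""
    wrapped = f"<{word}>"
    grams = [wrapped]  # assumption: the full word shares the table with its subwords
    upper = min(maxn, len(wrapped) - 1)  # avoid repeating the full word
    for n in range(minn, upper + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def bucket_rows(ngram, bucket=50000, hash_count=2):
    """Map one n-gram to hash_count row indices in a table with `bucket` rows."""
    rows = []
    for seed in range(hash_count):
        # stand-in hash; floret uses its own hash function
        digest = hashlib.md5(f"{seed}:{ngram}".encode("utf-8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % bucket)
    return rows


grams = char_ngrams("kitaplar")
rows = {gram: bucket_rows(gram) for gram in grams}
for gram, idx in rows.items():
    print(gram, idx)
```

Under these assumptions, every word, including one never seen during training, resolves to some rows of the table, and the table never grows beyond `bucket` rows regardless of vocabulary size.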

Model performance was evaluated on the following downstream NLP tasks:

* Named Entity Recognition, **NER**
* Part of Speech Tagging, **POS**
* Offensive Language Identification, **OLI**
* Movie Sentiment Analysis, **MSA**

| Vectors                         |  NER  |  POS  |  OLI  |  MSA  | Model Size |
| --------------------------------| ----: | ----: | ----: | ----: | ---------: |
| none                            | 90.19 | 82.60 | 61.07 | 75.63 |          - |
| fastText (~3.4M vectors/keys)   | 92.36 | 92.49 | 69.83 | 75.62 |      4.1GB |
| tr_floret_web_md (bucket 50K)   | 92.87 | 93.02 | 73.55 | 76.98 |       60MB |
| tr_floret_web_lg (bucket 200K)  | 93.05 | 93.51 | 74.00 | 77.28 |      240MB |
| BERT                            | 95.71 | 96.42 | 79.37 | 80.87 |      444MB |

**Evaluation metrics:** micro f1-score for NER, accuracy for POS, macro f1-score for OLI and MSA.
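The Model Size column for the vector-based rows is consistent with a simple estimate: number of rows times 300 dimensions times 4 bytes per value. The sketch below works through that arithmetic. The 4-byte (float32) storage and the idea that the vector table dominates the model size are assumptions, and `vector_table_mb` is a hypothetical helper, not part of spaCyTurk.

```python
def vector_table_mb(rows, dim=300, bytes_per_value=4):
    """Estimate the size of a dense vector table in megabytes (10^6 bytes)."""
    return rows * dim * bytes_per_value / 1_000_000


sizes = {
    "fastText (~3.4M keys)": vector_table_mb(3_400_000),
    "tr_floret_web_md": vector_table_mb(50_000),
    "tr_floret_web_lg": vector_table_mb(200_000),
}

for name, mb in sizes.items():
    print(f"{name}: {mb:.0f} MB")
```

Under these assumptions the estimates come out at 4080 MB, 60 MB, and 240 MB, in line with the 4.1GB, 60MB, and 240MB listed in the table.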

### Installation & Usage

Trained models can be installed directly from [Hugging Face Hub](https://huggingface.co/spacyturk). Alternatively, you can install `spacyturk` from [PyPI](https://pypi.org/project/spacyturk/) and download models through its API. This is the recommended way since the downloader performs version compatibility checks.

 

```bash
pip install spacyturk
```

```python
import spacy
import spacyturk

# download a spaCyTurk model; replace "model_name" with
# "tr_floret_web_md" or "tr_floret_web_lg"
spacyturk.download("model_name")

# print info about the spaCyTurk installation and models
spacyturk.info()

# load the downloaded model using spaCy
nlp = spacy.load("model_name")
```
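Once a model is loaded, its vectors are available through the usual spaCy token attributes. The following is a minimal sketch, assuming `tr_floret_web_md` has already been downloaded. The helpers `load_model` and `describe_vectors` and the sample sentence are illustrative and not part of spaCyTurk.

```python
def load_model(model_name="tr_floret_web_md"):
    """Return the loaded spaCy pipeline, or None if it cannot be loaded."""
    try:
        import spacy
        return spacy.load(model_name)
    except (ImportError, OSError) as err:
        # spaCy raises OSError when the named model is not installed
        print(f"Could not load {model_name}: {err}")
        return None


def describe_vectors(nlp, text):
    """Return (token text, has_vector, vector length) for each token."""
    doc = nlp(text)
    return [(token.text, token.has_vector, len(token.vector)) for token in doc]


if __name__ == "__main__":
    nlp = load_model()
    if nlp is not None:
        for row in describe_vectors(nlp, "Kitapları masaya bıraktım."):
            print(row)
```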

Alternatively, download models through the CLI:

```bash
# downloads the spaCyTurk model
python -m spacyturk download model_name
```
