https://github.com/sinaahmadi/KurdishTokenization

Tokenization resources for Kurdish (Sorani & Kurmanji dialects)

kurdish kurdish-language-processing kurmanji natural-language-processing nlp sorani tokenization

Last synced: about 1 year ago

Host: GitHub
URL: https://github.com/sinaahmadi/KurdishTokenization
Owner: sinaahmadi
License: other
Created: 2020-11-04T16:26:43.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-06-22T08:13:05.000Z (about 2 years ago)
Last Synced: 2024-11-14T12:50:39.991Z (over 1 year ago)
Topics: kurdish, kurdish-language-processing, kurmanji, natural-language-processing, nlp, sorani, tokenization
Language: Lex
Homepage: https://aclanthology.org/2020.vardial-1.11/
Size: 5.81 MB
Stars: 7
Watchers: 2
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-kurdish - Kurdish Tokenization
awesome-kurdish - KurdishTokenization - A Tokenization System for the Kurdish Language (Sorani & Kurmanji dialects). (Natural Language Processing / Libraries and Tools)

README

# Kurdish Tokenization
## A Tokenization System for the Kurdish Language (Sorani & Kurmanji dialects)

This repository contains the data for the tokenization system described in the paper "[A Tokenization System for the Kurdish Language](https://sinaahmadi.github.io/docs/articles/ahmadi2020tokenization.pdf)". The paper proposes an approach to tokenizing the Sorani and Kurmanji dialects of Kurdish using a lexicon and a morphological analyzer. The tokenizer is available as a module in the [Kurdish Language Processing Toolkit (KLPT)](https://github.com/sinaahmadi/klpt).

### Gold-standard Datasets
In addition to the tokenization tool, we provide a gold-standard dataset in the [data folder](https://github.com/sinaahmadi/KurdishTokenization/tree/master/data) containing 100 Sorani and Kurmanji sentences in the [Text Corpus Format](https://weblicht.sfs.uni-tuebingen.de/webservices/Helmut-Schmid-Text-Corpus-Format.pdf). These sentences are manually tokenized and therefore can be used for evaluation purposes.
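
As a rough illustration of how the gold-standard sentences could be used for evaluation, the sketch below computes token-level precision, recall and F1 by comparing the character spans of predicted and gold tokens. The function names (`token_spans`, `evaluate`) and the sample token lists are hypothetical, and reading the tokens out of the Text Corpus Format files is left out. This is not necessarily the evaluation protocol used in the paper.

```python
from typing import List, Set, Tuple


def token_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Map tokens to (start, end) character offsets, ignoring spaces."""
    spans = set()
    offset = 0
    for token in tokens:
        # Spaces inside a multi-word token do not count towards its length,
        # so both tokenizations cover the same underlying characters.
        length = len(token.replace(" ", ""))
        spans.add((offset, offset + length))
        offset += length
    return spans


def evaluate(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching token spans."""
    gold_spans = token_spans(gold)
    pred_spans = token_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the gold annotation keeps the compound as one token,
# while the hypothetical system output splits it in two.
gold = ["ji", "bo", "riswa kirin"]
predicted = ["ji", "bo", "riswa", "kirin"]
precision, recall, f1 = evaluate(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Comparing spans rather than token strings keeps the score meaningful when the two tokenizations disagree on where a compound ends.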

### Annotated Lexicons
We also provide a set of manually annotated lexicons for this tool, which are continually updated and extended. These lexicons contain Kurdish word lemmata along with hyphen-separated multi-word expressions. The current version contains lexicographic data provided by the [FreeDict project](https://freedict.org/) and [Wîkîferheng, the Kurdish Wiktionary](https://ku.wiktionary.org/). The transliteration of the Arabic-based script of Kurdish into the Latin-based one is carried out using [Wergor](https://github.com/sinaahmadi/wergor). If you would like to help enrich these resources, please follow the instructions of the [Kurdish Language Processing Toolkit (KLPT)](https://github.com/sinaahmadi/klpt).

The following shows two lemmata in the Kurmanji lexicon where the possible writings of a compound word-form are provided in the `token_forms` field.

    "riswa": []
    "riswa-kirin": {
        "token_forms": ["riswakirin", "riswa kirin"]
    }
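
One way the `token_forms` field might be used is as a lookup from surface variants to the hyphenated lexicon entry. The sketch below builds such a reverse index and greedily matches the longest variant in a whitespace-split sentence. The in-memory `lexicon` dictionary mirrors the two entries above, while `build_index`, `tokenize` and the sample input are hypothetical. The tokenizer described in the paper also relies on a morphological analyzer, which is not modeled here.

```python
from typing import Dict, List, Tuple

# Assumed in-memory form of the two lexicon entries shown above.
lexicon: Dict[str, object] = {
    "riswa": [],
    "riswa-kirin": {"token_forms": ["riswakirin", "riswa kirin"]},
}


def build_index(lexicon: Dict[str, object]) -> Dict[Tuple[str, ...], str]:
    """Map every written variant, split into words, to its lexicon entry."""
    index: Dict[Tuple[str, ...], str] = {}
    for lemma, entry in lexicon.items():
        # Entries without variants are stored as an empty list in the example.
        forms = entry.get("token_forms", []) if isinstance(entry, dict) else []
        for form in [lemma, *forms]:
            index[tuple(form.split())] = lemma
    return index


def tokenize(sentence: str, index: Dict[Tuple[str, ...], str]) -> List[str]:
    """Greedy longest-match lookup over whitespace-separated words."""
    words = sentence.split()
    longest = max((len(key) for key in index), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so "riswa kirin" wins over "riswa".
        for size in range(min(longest, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + size])
            if candidate in index:
                tokens.append(index[candidate])
                i += size
                break
        else:
            # Word not in the lexicon: keep it unchanged.
            tokens.append(words[i])
            i += 1
    return tokens


index = build_index(lexicon)
result = tokenize("ew riswa kirin", index)  # made-up input for illustration
print(result)
```

Both the spaced and the joined spelling resolve to the same entry, so downstream components only need to handle the hyphenated form.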

### For researchers
If you would like to extend the current study, the trained models can be found in the [models](https://github.com/sinaahmadi/KurdishTokenization/tree/master/models) directory. Please use the corresponding libraries to import the models into your pipelines. The outputs of the models are also available in the [experiments](https://github.com/sinaahmadi/KurdishTokenization/tree/master/experiments) folder.

### Contribute
Are you interested in this project? Please follow the instructions of the [Kurdish Language Processing Toolkit (KLPT)](https://github.com/sinaahmadi/klpt) to get involved. Open-source is fun! 😊

### Cite this paper

Please consider citing [this paper](https://sinaahmadi.github.io/docs/articles/ahmadi2020tokenization.pdf) if you use any part of the data or the tool ([`bib` file](https://sinaahmadi.github.io/bibliography/ahmadi2020tokenization.txt)):

    @inproceedings{ahmadi2020tokenization,
        title={{A Tokenization System for the Kurdish Language}},
        author={Ahmadi, Sina},
        booktitle={Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020)},
        pages={},
        year={2020}
    }

### License

The annotated resources by Sina Ahmadi are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which means:

- **You are free to share**, copy and redistribute the material in any medium or format and also adapt, remix, transform, and build upon the material for any purpose, **even commercially**.
- **You must give appropriate credit**, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- If you remix, transform, or build upon the material, **you must distribute your contributions under the same license as the original**.
