https://github.com/sinaahmadi/KurdishTokenization
Tokenization resources for Kurdish (Sorani & Kurmanji dialects)
- Host: GitHub
- URL: https://github.com/sinaahmadi/KurdishTokenization
- Owner: sinaahmadi
- License: other
- Created: 2020-11-04T16:26:43.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2024-06-22T08:13:05.000Z (6 months ago)
- Last Synced: 2024-08-04T01:17:56.395Z (4 months ago)
- Topics: kurdish, kurdish-language-processing, kurmanji, natural-language-processing, nlp, sorani, tokenization
- Language: Lex
- Homepage: https://aclanthology.org/2020.vardial-1.11/
- Size: 5.81 MB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-kurdish - Kurdish Tokenization
README
# Kurdish Tokenization
## A Tokenization System for the Kurdish Language (Sorani & Kurmanji dialects)
This repository contains the data for the tokenization system described in the paper "[A Tokenization System for the Kurdish Language](https://sinaahmadi.github.io/docs/articles/ahmadi2020tokenization.pdf)". The paper proposes an approach to tokenizing the Sorani and Kurmanji dialects of Kurdish using a lexicon and a morphological analyzer. The tokenizer is available as a module in the [Kurdish Language Processing Toolkit (KLPT)](https://github.com/sinaahmadi/klpt).
### Gold-standard Datasets
In addition to the tokenization tool, we provide a gold-standard dataset in the [data folder](https://github.com/sinaahmadi/KurdishTokenization/tree/master/data) containing 100 Sorani and Kurmanji sentences in the [Text Corpus Format](https://weblicht.sfs.uni-tuebingen.de/webservices/Helmut-Schmid-Text-Corpus-Format.pdf). These sentences were tokenized manually and can therefore be used for evaluation.
### Annotated Lexicons
We also provide a set of manually annotated lexicons for this tool, which are continually being updated and extended. These lexicons contain Kurdish lemmata, including multi-word expressions written with hyphens. The current version contains lexicographic data provided by the [FreeDict project](https://freedict.org/) and [Wîkîferheng, the Kurdish Wiktionary](https://ku.wiktionary.org/). Transliteration between the Arabic-based and Latin-based scripts of Kurdish is carried out using [Wergor](https://github.com/sinaahmadi/wergor). If you would like to help enrich these resources, please follow the instructions of the [Kurdish Language Processing Toolkit (KLPT)](https://github.com/sinaahmadi/klpt).

The following shows two lemmata in the Kurmanji lexicon; the possible written forms of a compound word-form are listed in the `token_forms` field.

    "riswa": [],
    "riswa-kirin": {
        "token_forms": ["riswakirin", "riswa kirin"]
    }

### For researchers
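As a possible starting point for experiments, the `token_forms` field from the lexicon excerpt above can be used to map the written variants of a compound back to its lemma. The following is a minimal Python sketch, not the tokenizer described in the paper: the inline lexicon fragment, the function names `build_variant_index` and `group_tokens`, the `max_len` parameter, and the greedy longest-match strategy are all assumptions made for illustration, and the actual lexicon files may be organized differently.

```python
import json

# Hypothetical lexicon fragment mirroring the two entries shown above;
# the real lexicon files may be structured differently.
LEXICON_JSON = """
{
  "riswa": [],
  "riswa-kirin": {"token_forms": ["riswakirin", "riswa kirin"]}
}
"""


def build_variant_index(lexicon):
    """Map every written variant of an entry back to its lemma key."""
    index = {}
    for lemma, info in lexicon.items():
        index[lemma] = lemma
        # In the fragment, entries without variants are stored as an empty list.
        if isinstance(info, dict):
            for form in info.get("token_forms", []):
                index[form] = lemma
    return index


def group_tokens(words, index, max_len=3):
    """Greedy longest match: merge adjacent words that form a known variant."""
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # A single word is always accepted, so the loop always advances.
            if n == 1 or candidate in index:
                out.append(index.get(candidate, candidate))
                i += n
                break
    return out


index = build_variant_index(json.loads(LEXICON_JSON))
# Illustrative input only: a whitespace-split word sequence.
print(group_tokens("ew riswa kirin".split(), index))
```

In this sketch, the two-word sequence `riswa kirin` and the single word `riswakirin` both resolve to the lemma key `riswa-kirin`, while words that are not in the index are passed through unchanged.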
If you would like to extend the current study, the trained models can be found in the [models](https://github.com/sinaahmadi/KurdishTokenization/tree/master/models) directory. Please use the corresponding libraries to import the models into your pipelines. The outputs of the models are also available in the [experiments](https://github.com/sinaahmadi/KurdishTokenization/tree/master/experiments) folder.
### Contribute
Are you interested in this project? Please follow the instructions of the [Kurdish Language Processing Toolkit (KLPT)](https://github.com/sinaahmadi/klpt) to get involved. Open-source is fun! 😊
### Cite this paper
If you use any part of the data or the tool, please consider citing [this paper](https://sinaahmadi.github.io/docs/articles/ahmadi2020tokenization.pdf) ([`bib` file](https://sinaahmadi.github.io/bibliography/ahmadi2020tokenization.txt)):

    @inproceedings{ahmadi2020tokenization,
        title={{A Tokenization System for the Kurdish Language}},
        author={Ahmadi, Sina},
        booktitle={Proceedings of the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2020)},
        pages={},
        year={2020}
    }

### License
The annotated resources by Sina Ahmadi are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, which means:
- **You are free to share**, copy and redistribute the material in any medium or format, and to adapt, remix, transform, and build upon the material for any purpose, **even commercially**.
- **You must give appropriate credit**, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- If you remix, transform, or build upon the material, **you must distribute your contributions under the same license as the original**.