Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alexeyev/awesome-azeri-nlp
Azerbaijani language processing software, models and datasets.
List: awesome-azeri-nlp
awesome-list azeri morphology natural-language-processing stemming turkic
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/alexeyev/awesome-azeri-nlp
- Owner: alexeyev
- Created: 2020-02-21T12:01:49.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-11-07T12:39:59.000Z (about 1 year ago)
- Last Synced: 2024-10-27T21:10:24.552Z (about 2 months ago)
- Topics: awesome-list, azeri, morphology, natural-language-processing, stemming, turkic
- Language: Shell
- Homepage:
- Size: 103 KB
- Stars: 26
- Watchers: 5
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- ultimate-awesome - awesome-azeri-nlp - Azerbaijani language processing software, models and datasets. (Other Lists / Monkey C Lists)
README
# Awesome Azeri NLP [![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)](https://github.com/sindresorhus/awesome)
A curated list of awesome Azerbaijani language processing software, models and datasets. Inspired by [awesome-ML](https://github.com/josephmisiti/awesome-machine-learning).
The main focus is on **open source** tools, **downloadable** data and research **papers with code**.
If you want to contribute to this list (please do), send me a pull request.
Also, a listed repository should be tagged as deprecated if:

* The repository's owners explicitly say that "this library is not maintained".
* It has not been committed to for a long time (2~3 years).

## Table of Contents
- [Awesome Azeri NLP](#awesome-azeri-nlp)
- [Table of Contents](#table-of-contents)
- [Datasets](#data)
- [Pretrained models](#pretrained-models)
- [Methods/Software](#software)
- [Morphology](#morphology-s)
- [Syntax](#syntax-s)
- [Online Demos](#demos)
- [Miscellaneous](#misc)

## Datasets

#### Raw text
* [University of Leipzig corpus collection](https://cls.corpora.uni-leipzig.de/en?corpusLanguage=aze#tblselect) — Newscrawl (2011, 2013) and Wikipedia (misc) datasets
* [Helsinki University corpus](http://www.ling.helsinki.fi/uhlcs/readme-all/README-turkic-lgs.html#C21) — New Testament in the Azerbaijani language
* [Latest **azwiki** dump](https://dumps.wikimedia.org/azwiki/latest/): [**download** directly](https://dumps.wikimedia.org/azwiki/latest/azwiki-latest-pages-articles.xml.bz2)
* [Azeri at An Crúbadán](http://crubadan.org/languages/az) — 8M+ words, Latin script
* [**az-corpus-nlp**](https://github.com/raminrahimzada/az-corpus-nlp) — 2000+ texts, Latin script
* [azWaC: Azerbaijani corpus from the web](https://www.sketchengine.eu/azwac-azerbaijani-corpus/) — SketchEngine-hosted corpus crawled from the web in 2012, ~94 million words
* [Domrachev-Sudoplatova scraped corpus](https://github.com/svetlana21/Nutch_parser/) — 2189398 words, 100560 sentences
* [Azerbaijani Named Entity Recognition (NER) Dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset) — A dataset for training and evaluating NER models in Azerbaijani, including annotated text data with various named entities.

**Several corpora are also mentioned in research works:**

* S. Mammadova, G. Azimova, and A. Fatullayev. 2010. Text corpora and its role in development of the linguistic technologies for the Azerbaijani language. In The Third International Conference Problems of Cybernetics and Informatics.
* Baisa, Vít, and Vít Suchomel. "Large corpora for Turkic languages and unsupervised morphological analysis." Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA). 2012. [**SketchEngine corpora?**]
* C. Biemann, S. Bordag, G. Heyer, U. Quasthoff, and C. Wolff. 2004. Language-independent methods for compiling monolingual lexical data. Computational linguistics and intelligent text processing, pages 217–228.
* Domrachev M. A., Sudoplatova S. N. Testing Methods for Automatic Detection of Morpheme Boundaries in the Azerbaijani Language. Vestnik NSU. Series: Linguistics and Intercultural Communication, 2018, vol. 16, no. 2, p. 34–47. (in Russ.) [Downloadable corpus](https://github.com/svetlana21/Nutch_parser/)
* Özenç B., Ehsani R., Solak E. Moraz: an open-source morphological analyzer for Azerbaijani Turkish // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. – 2018. – P. 25-29. [**BBC Azerbaijan**]

#### Syntax
* [UD_Azerbaijani-TueCL](https://github.com/UniversalDependencies/UD_Azerbaijani-TueCL): a treebank that contains a total of ~110 sentences including 20 Cairo sentences, and ~90 sentences suggested by UD Turkic Group; part of the UD Turkic Treebank. Translations of all the sentences are available in English, Turkish and Kyrgyz languages
* [UD project comments](https://universaldependencies.org/tr/) on difficulties in Turkish language processing, which may shed light on why parsing Azeri is hard as well

#### Machine-readable dictionaries
TODO

#### Summarization
* [AZ summarization](https://github.com/derintelligence/az-summarization) — articles and titles, available on request

#### Translation
* [AZ-EN parallel corpus](https://github.com/derintelligence/en-az-parallel-corpus) — 68K+ sentences, available on request

#### Sentiment
Mentioned in:
* [N. Gasimli's MS thesis](https://www.academia.edu/32330261/Analysis_of_the_use_of_Twitter_in_Azerbaijan) "Analysis of the use of Twitter in Azerbaijan" — 2194+700 tweets
* [Mammad Hajili's 160K customer reviews with scores and upvotes](https://huggingface.co/datasets/hajili/azerbaijani_review_sentiment_classification)
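
The azwiki dump listed under Raw text is a bzip2-compressed MediaWiki XML export, so it can be read without unpacking it to disk. The following is a minimal sketch using only the Python standard library; the function name `iter_pages` is hypothetical, and it assumes the usual `<page>` / `<title>` / `<text>` layout of such exports. It yields raw wiki markup, so any cleanup of templates and links is left to a separate step.

```python
import bz2
import xml.etree.ElementTree as ET


def _local(tag):
    """Drop the XML namespace prefix, which varies between export versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path_or_file):
    """Yield (title, wikitext) pairs from a bz2-compressed MediaWiki XML dump.

    Accepts a file name or a binary file object. Pages are streamed one at a
    time, so the whole dump is never held in memory.
    """
    with bz2.open(path_or_file, "rb") as stream:
        title, text = None, ""
        for _, elem in ET.iterparse(stream, events=("end",)):
            name = _local(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # free the finished page subtree
```

Typical use would be `for title, text in iter_pages("azwiki-latest-pages-articles.xml.bz2"): ...`, after downloading the file from the link above.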
## Pretrained models
* [Polyglot morfessor](https://github.com/aboSamoor/polyglot/blob/master/docs/MorphologicalAnalysis.rst) — pretrained [morfessor](http://www.cis.hut.fi/cis/projects/morpho/) model, number 53
* [fastText](https://fasttext.cc/docs/en/crawl-vectors.html) — 300-dimensional fastText vectors provided by the authors

## Methods/Software

#### Morphology
* [Azmorph](http://wiki.apertium.org/wiki/Azmorph) — morphological analyzer for Azerbaijani (Azerbaycan dili), said to be in pre-ALPHA state; however, it was [used for web corpora preparation](https://www.sketchengine.eu/wp-content/uploads/Large_Corpora_for_turkic_2012.pdf)
* [Wiktionary word forms extraction](https://github.com/svetlana21/az_morphology) — Python code on GitHub
* [MorAz](https://github.com/berkeozenc/MorAz) — open-source morphological analyzer, [paper](https://www.aclweb.org/anthology/D18-2005v1.pdf), [demo](http://ddil.isikun.edu.tr/moraz/), [related slides on AZ morphology](http://fsmnlp2017.cs.umu.se/wp-content/uploads/2017/08/RaziehEhsani.pdf).

**Mentioned in papers:**
* [POS-tagging](https://www.researchgate.net/publication/334074082_Part-of-Speech_Tagging_for_Azerbaijani_Language) paper — Mammadov, S., Rustamov, S., Mustafali, A., Sadigov, Z., Mollayev, R., & Mammadov, Z. (2018, October). Part-of-Speech Tagging for Azerbaijani Language. In 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT) (pp. 1-6). IEEE. [**Probable implementation: [aznlp repo](https://github.com/aznlp/azerbaijani-language-pos-tagger)**]
* [Stemming paper, 2019](https://jpit.az/en/journals/227/) — Alizadeh, M. B. H., & Seyyedi, S. A. H. (2019). AUTO STEMMING OF AZERBAIJANI LANGUAGE. Problems of Information Technology, 59-66.
* [N. Gasimli's MS thesis](https://www.academia.edu/32330261/Analysis_of_the_use_of_Twitter_in_Azerbaijan) "Analysis of the use of Twitter in Azerbaijan" — [Zemberek](https://github.com/ahmetaa/zemberek-nlp) is extended for Azerbaijani, though the thesis states that a lot of effort is still required for it to work properly for Azeri.
## Online Demos
* [Cyrillic ⇄ Latin conversion](http://www.ismanov.com/Projects/CyrLatConverter/index.php) — PHP-based online tool
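
For offline use, script conversion of this kind can be approximated with a plain character table. The sketch below is not the linked tool's implementation; it assumes the later Azerbaijani Cyrillic alphabet (the variant with Ј) maps letter for letter onto the modern Latin alphabet, ignores the apostrophe, and leaves every other character untouched. The names `to_latin` and `to_cyrillic` are hypothetical.

```python
# Assumed one-to-one letter mapping, written as "cyrillic:latin" pairs.
_LOWER = (
    "а:a б:b в:v г:q ғ:ğ д:d е:e ә:ə ж:j з:z и:i ы:ı ј:y к:k ҝ:g л:l "
    "м:m н:n о:o ө:ö п:p р:r с:s т:t у:u ү:ü ф:f х:x һ:h ч:ç ҹ:c ш:ş"
)
# Uppercase is listed explicitly: И -> İ and Ы -> I cannot be derived
# with str.upper(), which follows the default (non-Turkic) casing rules.
_UPPER = (
    "А:A Б:B В:V Г:Q Ғ:Ğ Д:D Е:E Ә:Ə Ж:J З:Z И:İ Ы:I Ј:Y К:K Ҝ:G Л:L "
    "М:M Н:N О:O Ө:Ö П:P Р:R С:S Т:T У:U Ү:Ü Ф:F Х:X Һ:H Ч:Ç Ҹ:C Ш:Ş"
)


def _pairs(spec):
    """Turn 'x:y z:w' into {'x': 'y', 'z': 'w'}."""
    return dict(item.split(":") for item in spec.split())


_CYR_TO_LAT = {**_pairs(_LOWER), **_pairs(_UPPER)}
_LAT_TO_CYR = {lat: cyr for cyr, lat in _CYR_TO_LAT.items()}


def to_latin(text):
    """Convert Azerbaijani Cyrillic letters to Latin; other characters pass through."""
    return text.translate(str.maketrans(_CYR_TO_LAT))


def to_cyrillic(text):
    """Convert Azerbaijani Latin letters to Cyrillic; other characters pass through."""
    return text.translate(str.maketrans(_LAT_TO_CYR))
```

A table lookup is enough here because, under the stated assumption, no letter depends on its neighbours; text that mixes in Russian-only letters or loanword spellings would need extra rules.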
## Miscellaneous
* [Turkic languages-related resources](http://ddi.itu.edu.tr/en/toolsandresources) compiled by Dr. Gülşen Eryiğit and her team at Istanbul Technical University
* [Azerbaijani corpora data review](https://www.elibrary.ru/item.asp?id=37146771&)
* [Dilmanc](http://dilmanc.az/en/project/about) — government-funded Azerbaijani language-related initiative
* [Dilmanc EAMT paper](http://dilmanc.az/pdf/EAMT-2008-Fatullayev.pdf) on MT peculiarities
* [Apertium page](http://wiki.apertium.org/wiki/Azerbaijani) — a list of various online language-related resources
* [AZNLP GitHub](https://github.com/aznlp) — a repo hub with various language-related software: stemmer, POS-tagger
* [MozillaAZ community spellchecker](https://github.com/mozillaz/spellchecker) — spellchecker plugin