https://github.com/navalnica/be_nlp_speech_resources

# Belarusian NLP and Speech Processing resources

This repository contains links to Belarusian Natural Language and Speech Processing resources and datasets.

It is inspired by a similar project with Ukrainian Speech Processing resources: [egorsmkv/speech-recognition-uk](https://github.com/egorsmkv/speech-recognition-uk)

### TODOs:
* add a detailed description to each list item
* evaluate models on benchmarks and log their performance

# 🎙 Speech-to-Text

## 🎙💡 Implementations

* wav2vec2 trained on Common Voice 8 + kenlm language model trained on Common Voice 8:
  * Model: [ales/wav2vec2-cv-be](https://huggingface.co/ales/wav2vec2-cv-be)
  * Demo: [ales/wav2vec2-cv-be-lm](https://huggingface.co/spaces/ales/wav2vec2-cv-be-lm)
  * Code: [navalnica/wav2vec2-belarusian](https://github.com/navalnica/wav2vec2-belarusian)

* whisper:
  * original [openai/whisper](https://github.com/openai/whisper) models
  * Whisper models fine-tuned on Belarusian Common Voice 11 dataset:
    * Whisper Small:
      * Model: [ales/whisper-small-belarusian](https://huggingface.co/ales/whisper-small-belarusian)
      * test WER on CommonVoice11: `6.79`
      * Demo: [ales/whisper-small-belarusian-demo](https://huggingface.co/spaces/ales/whisper-small-belarusian-demo)
      * Code: [navalnica/whisper-finetuning-be](https://github.com/navalnica/whisper-finetuning-be)
    * Whisper Base:
      * Model: [ales/whisper-base-belarusian](https://huggingface.co/ales/whisper-base-belarusian)
      * Code: [navalnica/whisper-finetuning-be](https://github.com/navalnica/whisper-finetuning-be)

* Nvidia NeMo models:
  * [nvidia/stt_be_conformer_ctc_large](https://huggingface.co/nvidia/stt_be_conformer_ctc_large)
    * [huggingface self-reported metric] test WER on CommonVoice10: `4.8`
  * [nvidia/stt_be_conformer_transducer_large](https://huggingface.co/nvidia/stt_be_conformer_transducer_large)
    * [huggingface self-reported metric] test WER on CommonVoice10: `3.8`
  * [nvidia/stt_be_fastconformer_hybrid_large_pc](https://huggingface.co/nvidia/stt_be_fastconformer_hybrid_large_pc)
    * [huggingface self-reported metric] test WER on CommonVoice12: `2.72`
    * [huggingface self-reported metric] test WER P&C CommonVoice12: `3.87`

* ESPnet:
  * [espnet/belarusian_commonvoice_blstm](https://huggingface.co/espnet/belarusian_commonvoice_blstm)

## 🎙📊 Benchmarks

Model comparisons grouped by dataset. TODO
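
Until the comparison is filled in, the WER figures quoted in the Implementations list are the only metric available. As a reminder of what such a figure measures, here is a minimal sketch of word error rate: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It assumes the scores above are percentages and that text is split on whitespace with no normalization; the function name `word_error_rate` is illustrative and is not taken from any of the linked repositories, which may compute the metric differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length * 100."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("я люблю беларускую мову", "я люблю беларускую мова"))  # 25.0
```

Published scores usually depend on text normalization (casing, punctuation), so numbers computed this way are comparable only when every model is scored with the same preprocessing.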

## 🎙📚 Datasets

* [Common Voice](https://commonvoice.mozilla.org/en/datasets). Speech recognition dataset
* Dataset from [knihi.com](https://knihi.com/none/Korpus_bielaruskaha_maulennia_dla_trenirouki_niejronnych_sietak_zip.html). TODO: what is the type of dataset?
* [google/fleurs](https://huggingface.co/datasets/google/fleurs/viewer/be_by/train)
* ssrlab: TODO. Speech recognition dataset

------

# 📢 Text-to-Speech

## 📢💡 Implementations

* CoquiAI implementations
  * [jhlfrfufyfn/bel-tts](https://github.com/jhlfrfufyfn/bel-tts). GlowTTS + HifiGan
    * [Code](https://github.com/jhlfrfufyfn/bel-tts)
    * [Model](https://huggingface.co/jhlfrfufyfn/bel-tts)
    * [Demo on HuggingFace](https://huggingface.co/spaces/jhlfrfufyfn/bel-tts)
    * [Demo on a custom web page](https://nikuchin.fun/tts). The source code for the demo page: [here](https://github.com/jhlfrfufyfn/bel-tts-server)
  * [alex73/belarusian-tts](https://github.com/alex73/belarusian-tts). CoquiAI implementation by Yurii Paniv (@robinhad).

    The original repository and models were deleted; only this fork is available now.

---

# 📝 NLP

## POS-tagging
* [KoichiYasuoka/roberta-small-belarusian-upos](https://huggingface.co/KoichiYasuoka/roberta-small-belarusian-upos)
* [stanfordnlp/stanza-be](https://huggingface.co/stanfordnlp/stanza-be)
* [poritski/YABC_Tagger](https://github.com/poritski/YABC_Tagger). Rule-based POS-tagger and lemmatizer.

  Written in Perl.
  Uses [poritski/YABC](https://github.com/poritski/YABC) as a grammar base (?)
* [volchek/beltagger](https://github.com/volchek/beltagger).
  An improved version of the [poritski/YABC_Tagger](https://github.com/poritski/YABC_Tagger) rule-based POS-tagger and lemmatizer.

  Cross-platform, written in C++.

  Known issues:
  * requires input data to be encoded in Windows-1251; UTF-8 is not supported
  * the tagset is not fully compatible with BNKorpus's tagset and grammar base
  * the grammar base used is incomplete. [Belarus/GrammarDB](https://github.com/Belarus/GrammarDB) is a better source of paradigms but is not incorporated yet
  * the suffix table calculation script has not been ported from Perl to C++
  * the code depends on the Boost library
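
Because of the Windows-1251 requirement listed under the known issues, UTF-8 text has to be re-encoded before it is passed to the tagger. The following is a minimal sketch of that conversion using only the Python standard library. The helper names `to_cp1251` and `convert_file` are hypothetical, and how the tagger itself is invoked is not covered here.

```python
from pathlib import Path


def to_cp1251(text: str) -> bytes:
    """Encode text as Windows-1251.

    Raises UnicodeEncodeError if the text contains a character that
    Windows-1251 cannot represent.
    """
    return text.encode("cp1251", errors="strict")


def convert_file(src: Path, dst: Path) -> None:
    """Read a UTF-8 file and write its contents re-encoded as Windows-1251."""
    dst.write_bytes(to_cp1251(src.read_text(encoding="utf-8")))


if __name__ == "__main__":
    sample = "Ўсе людзі нараджаюцца свабоднымі"
    encoded = to_cp1251(sample)
    # Windows-1251 is a single-byte encoding, so both lengths match.
    print(len(sample), len(encoded))
```

Windows-1251 covers the Belarusian Cyrillic letters, including `і` and `ў`, but strict encoding will fail on characters outside the code page, which makes unsupported input visible early instead of silently corrupting it.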

## Other
* [pkasila/bel-sklony](https://github.com/pkasila/bel-sklony) - a web page for declining Belarusian nouns. Demo: [sklony.pkasila.net](https://sklony.pkasila.net/)

## Masked Language Modeling
* [KoichiYasuoka/roberta-small-belarusian](https://huggingface.co/KoichiYasuoka/roberta-small-belarusian)

## 📝📚 Datasets

* [oscar](https://huggingface.co/datasets/oscar)
* [mc4](https://huggingface.co/datasets/mc4)
* [poritski/YABC](https://github.com/poritski/YABC) - Experimental Corpus of the Belarusian Language (Эксперыментальны корпус беларускай мовы, ЭКБМ)
* [Belarus/GrammarDB](https://github.com/Belarus/GrammarDB) - Grammar Database of Belarusian language
* [tsimafeip/Translator](https://github.com/tsimafeip/Translator) - Dataset with Russian-Belarusian translation pairs
* Universal Dependencies dataset:
  * [Page](https://universaldependencies.org/treebanks/be_hse/index.html)
  * [GitHub Repository](https://github.com/UniversalDependencies/UD_Belarusian-HSE)
* [Tatoeba Belarusian sentences](https://tatoeba.org/en/sentences/show_all_in/bel/none)

---

# 🧍‍♀️🧍 Communities and platforms:
* [corpus.by](https://www.corpus.by)
* [ssrlab.by](https://ssrlab.by)
* [bnkorpus.info](https://bnkorpus.info)
* [Belarus](https://github.com/Belarus) organization on GitHub
* [nlproc.by](https://github.com/nlprocby) community on GitHub

---
# 🦔 Unsorted
* nothing for now