https://github.com/navalnica/be_nlp_speech_resources
Links to Belarusian NLP and Speech resources
- Host: GitHub
- URL: https://github.com/navalnica/be_nlp_speech_resources
- Owner: navalnica
- Created: 2022-10-12T13:56:12.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-17T18:48:13.000Z (3 months ago)
- Last Synced: 2024-09-17T23:13:53.146Z (3 months ago)
- Topics: asr, belarus, belarusian, belarusian-language, natural-language-processing, nlp, speech, speech-processing, speech-recognition, speech-synthesis, speech-to-text, stt, text-to-speech, tts
- Size: 39.1 KB
- Stars: 30
- Watchers: 5
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
# Belarusian NLP and Speech Processing resources
This repository contains links to Belarusian Natural Language and Speech Processing resources and datasets.
It is inspired by a similar project for Ukrainian speech processing resources: [egorsmkv/speech-recognition-uk](https://github.com/egorsmkv/speech-recognition-uk)
### TODOs:
* add a detailed description to each list item
* evaluate models on benchmarks and log their performance

# Speech-to-Text
## Implementations
* wav2vec2 trained on Common Voice 8 + kenlm language model trained on Common Voice 8:
    * Model: [ales/wav2vec2-cv-be](https://huggingface.co/ales/wav2vec2-cv-be)
    * Demo: [ales/wav2vec2-cv-be-lm](https://huggingface.co/spaces/ales/wav2vec2-cv-be-lm)
    * Code: [navalnica/wav2vec2-belarusian](https://github.com/navalnica/wav2vec2-belarusian)
* whisper:
    * original [openai/whisper](https://github.com/openai/whisper) models
    * Whisper models fine-tuned on Belarusian Common Voice 11 dataset:
        * Whisper Small:
            * Model: [ales/whisper-small-belarusian](https://huggingface.co/ales/whisper-small-belarusian)
            * test WER on CommonVoice11: `6.79`
            * Demo: [ales/whisper-small-belarusian-demo](https://huggingface.co/spaces/ales/whisper-small-belarusian-demo)
            * Code: [navalnica/whisper-finetuning-be](https://github.com/navalnica/whisper-finetuning-be)
        * Whisper Base:
            * Model: [ales/whisper-base-belarusian](https://huggingface.co/ales/whisper-base-belarusian)
            * Code: [navalnica/whisper-finetuning-be](https://github.com/navalnica/whisper-finetuning-be)
* Nvidia NeMo models:
    * [nvidia/stt_be_conformer_ctc_large](https://huggingface.co/nvidia/stt_be_conformer_ctc_large)
        * [huggingface self-reported metric] test WER on CommonVoice10: `4.8`
    * [nvidia/stt_be_conformer_transducer_large](https://huggingface.co/nvidia/stt_be_conformer_transducer_large)
        * [huggingface self-reported metric] test WER on CommonVoice10: `3.8`
    * [nvidia/stt_be_fastconformer_hybrid_large_pc](https://huggingface.co/nvidia/stt_be_fastconformer_hybrid_large_pc)
        * [huggingface self-reported metric] test WER on CommonVoice12: `2.72`
        * [huggingface self-reported metric] test WER P&C CommonVoice12: `3.87`
* ESPnet:
    * [espnet/belarusian_commonvoice_blstm](https://huggingface.co/espnet/belarusian_commonvoice_blstm)

## Benchmarks
Model comparisons grouped by dataset. TODO
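
Until the comparison is filled in, it may help to recall what the WER figures quoted for the models in this list measure. The sketch below is a minimal pure-Python word error rate: the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and it applies no text normalization (casing, punctuation), which real evaluations usually do. It returns a fraction; the figures quoted in this list look like percentages, but that is an assumption to check against each model card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of three reference words is missing from the hypothesis.
    print(word_error_rate("добры дзень усім", "добры дзень"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.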
## Datasets
* [Common Voice](https://commonvoice.mozilla.org/en/datasets). Speech recognition dataset
* Dataset from [knihi.com](https://knihi.com/none/Korpus_bielaruskaha_maulennia_dla_trenirouki_niejronnych_sietak_zip.html). TODO: what is the type of dataset?
* [google/fleurs](https://huggingface.co/datasets/google/fleurs/viewer/be_by/train)
* ssrlab: TODO. Speech recognition dataset

------
# Text-to-Speech
## Implementations
* CoquiAI implementations
    * [jhlfrfufyfn/bel-tts](https://github.com/jhlfrfufyfn/bel-tts). GlowTTS + HifiGan
        * [Code](https://github.com/jhlfrfufyfn/bel-tts)
        * [Model](https://huggingface.co/jhlfrfufyfn/bel-tts)
        * [Demo on HuggingFace](https://huggingface.co/spaces/jhlfrfufyfn/bel-tts)
        * [Demo on a custom web page](https://nikuchin.fun/tts). The source code for the demo page is [here](https://github.com/jhlfrfufyfn/bel-tts-server)
    * [alex73/belarusian-tts](https://github.com/alex73/belarusian-tts). CoquiAI implementation by Yurii Paniv (@robinhad).
      The original repo and models were deleted; only a fork is available now.

---
# NLP
## POS-tagging
* [KoichiYasuoka/roberta-small-belarusian-upos](https://huggingface.co/KoichiYasuoka/roberta-small-belarusian-upos)
* [stanfordnlp/stanza-be](https://huggingface.co/stanfordnlp/stanza-be)
* [poritski/YABC_Tagger](https://github.com/poritski/YABC_Tagger). Rule-based POS-tagger and lemmatizer.
  Written in Perl.
  Uses [poritski/YABC](https://github.com/poritski/YABC) as a grammar base (?)
* [volchek/beltagger](https://github.com/volchek/beltagger).
  An improved version of the [poritski/YABC_Tagger](https://github.com/poritski/YABC_Tagger) rule-based POS-tagger and lemmatizer.
  Cross-platform, written in C++.
  Known issues:
    * requires input data to be encoded in Windows-1251; does not support UTF-8
    * the tagset is not fully compatible with BNKorpus's tagset and grammar base
    * the grammar base used is incomplete. [Belarus/GrammarDB](https://github.com/Belarus/GrammarDB) is a better source of paradigms but is not incorporated yet
    * the suffix table calculation script is not ported from Perl to C++
    * the code uses the Boost library
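
Since beltagger is listed as accepting only Windows-1251 input, UTF-8 text has to be re-encoded before tagging. The sketch below shows one way to do that with Python's standard `cp1251` codec, which covers the Belarusian Cyrillic letters. The function names and file names are hypothetical, and it assumes nothing about how beltagger itself is invoked or what encoding its output uses.

```python
from pathlib import Path


def utf8_to_cp1251(src: Path, dst: Path) -> None:
    """Re-encode a UTF-8 text file as Windows-1251."""
    text = src.read_bytes().decode("utf-8")
    # Strict encoding raises UnicodeEncodeError for any character that
    # Windows-1251 cannot represent, instead of silently dropping it.
    dst.write_bytes(text.encode("cp1251"))


def cp1251_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a Windows-1251 text file as UTF-8."""
    text = src.read_bytes().decode("cp1251")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    Path("input_utf8.txt").write_bytes("Прывітанне, свет! Ў і ў.".encode("utf-8"))
    utf8_to_cp1251(Path("input_utf8.txt"), Path("input_cp1251.txt"))
    cp1251_to_utf8(Path("input_cp1251.txt"), Path("roundtrip_utf8.txt"))
    print(Path("roundtrip_utf8.txt").read_bytes().decode("utf-8"))
```

Working on raw bytes avoids platform-dependent newline translation, so the round trip leaves line endings untouched.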
## Other
* [pkasila/bel-sklony](https://github.com/pkasila/bel-sklony) - web page for Belarusian noun declension. Demo: [sklony.pkasila.net](https://sklony.pkasila.net/)
## Masked Language Modeling
* [KoichiYasuoka/roberta-small-belarusian](https://huggingface.co/KoichiYasuoka/roberta-small-belarusian)
## Datasets
* [oscar](https://huggingface.co/datasets/oscar)
* [mc4](https://huggingface.co/datasets/mc4)
* [poritski/YABC](https://github.com/poritski/YABC) - Experimental Corpus of the Belarusian Language (EKBM)
* [Belarus/GrammarDB](https://github.com/Belarus/GrammarDB) - Grammar Database of the Belarusian language
* [tsimafeip/Translator](https://github.com/tsimafeip/Translator) - Dataset of Russian-Belarusian translation pairs
* Universal Dependencies dataset:
    * [Page](https://universaldependencies.org/treebanks/be_hse/index.html)
    * [GitHub Repository](https://github.com/UniversalDependencies/UD_Belarusian-HSE)
* [Tatoeba Belarusian sentences](https://tatoeba.org/en/sentences/show_all_in/bel/none)

---
# Communities and platforms:
* [corpus.by](https://www.corpus.by)
* [ssrlab.by](https://ssrlab.by)
* [bnkorpus.info](https://bnkorpus.info)
* [Belarus](https://github.com/Belarus) organization on GitHub
* [nlproc.by](https://github.com/nlprocby) community on GitHub

---
# Unsorted
* nothing for now