low-resource-languages
A curated list of resources for the conservation, development, and documentation of low resource (human) languages.
https://github.com/RichardLitt/low-resource-languages
-
Afrikaans
-
Albanian
-
Utilities
- Apertium rules for Albanian - Machine Translation rules
- out-of-copyright-albanian-authors - Out-of-copyright authors scraped from the Albanian-language Wikipedia.
- Plis keyboard - The Plis keyboard is a keyboard or computer keyboard layout for the Albanian language.
- spell checking - Here you find a collection of Albanian words and information about them. Aspell, Ispell, and MySpell are included.
-
-
Alutiiq
-
Utilities
- wiinaq - Word Wiinaq is a [Kodiak Alutiiq](http://www.alutiiqlanguage.org/) dictionary web application with automatically generated ending tables and souped-up search capabilities. It is written in Python using Django.
-
-
Amharic
-
Utilities
- HornMorpho - Morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs
-
-
Android Applications
-
Software
- ojoVoz - A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. <!-- used to exist on https://github.com/ojovoz/ojoVoz_mobile -->
-
-
Annotation
-
Software
- Annotation page - Ethnographic tools for annotation.
- brat - brat rapid annotation tool (brat) for online text annotation.
- CLAM - Quickly and transparently transforms command-line NLP tools into RESTful webservices with an interface for human end-users.
- FoLiA: Format for Linguistic Annotation - A rich XML-based annotation format, suitable for the representation of linguistically annotated language resources.
- WebAnno - Web-based annotation tool for a wide range of linguistic annotations including various layers of morphological, syntactical, and semantic annotations. Distributed under Apache 2.0.
-
-
Audio automation
-
Software
- CMU Sphinx - Open source toolkit for speech recognition. PocketSphinx, SphinxTrain, Sphinx4, and sphinxbase.
- dejavu - Audio fingerprinting and recognition in Python.
- ELAN
- pyAudioAnalysis - Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications.
- SoX - SoX, the Swiss Army knife of sound processing programs.
-
-
Basque
-
Utilities
-
-
Bengali
-
Utilities
- Bangla-অঙ্কুর for Mac
- Ekushey
- Bengali Writer - "Bengali Writer" is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more. (Original project is on SourceForge: https://sourceforge.net/projects/bengaliwriter/).
- Lekho - A collection of tools and resources for using Bangla on computers (Original project is on SourceForge: https://sourceforge.net/projects/lekho/).
-
-
Chichewa
-
Utilities
- Chichewa - NLP resources for Chichewa.
-
-
Contribute
- @RichardLitt
- Wiki
- on this Wikipedia page - if you have another one, please suggest it!
- Firefox language packs - Listing all of these for each low resource language would be unhelpful, as would including all of the tools available for Basque noted in the [ACL Wiki](https://aclweb.org/aclwiki/Resources_for_Basque), which would mainly mean cataloguing tools from the [IXA group](http://ixa.si.ehu.es/produktuak?language=en), some of which are open source and some of which are not. Instead, view this list as a starting point for more research.
- DocToc
- the awesome lists collection
-
Corpora
-
FieldDB Webservices/Components/Plugins
- Common Crawl — web-languages - Crowd-sourced URL lists to steer the Common Crawl crawler toward under-resourced languages.
- Common Crawl — web-languages-code - Code and tooling for the Common Crawl web-languages project.
- OLDI — Open Language Data Initiative - Curated multilingual datasets (FLORES+, OLDI-Seed) covering ~400 language-script combinations for NLP research.
- WaxalNLP - Large-scale multilingual speech corpus covering 29 African languages for ASR and TTS research, created by Google.
-
-
FieldDB Webservices/Components/Plugins
-
Utilities
- Learning to map into a Universal POS tagset
- Unicodify - Converts 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII.
- AndroidLanguageLearningClientForFieldDB-sikuli - Sikuli tests for AndroidLanguageLearningClientForFieldDB.
- AuthenticationWebService - A node.js web service which manages users, corpora creation, and authentication.
- bower-fielddb-angular - A bower repository which hosts fielddb-angular components, bower install fielddb-angular --save.
- bower-fielddb - A bower repository which hosts fielddb core components, bower install fielddb --save.
- fielddb-spreadsheet-sikuli - Sikuli tests for the spreadsheet module [use](https://www.youtube.com/watch?v=pPN8e1m6RBU&feature=youtu.be).
- FieldDBActivityFeed - A fielddb activity feed widget which can be embedded in other codebases, websites etc [use](https://chrome.google.com/webstore/detail/lingsync-prototype/eeipnabdeimobhlkfaiohienhibfcfpa).
- FieldDBGlosser - A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. bower install fielddb-glosser --save.
- FieldDBLexicon - A lexicon browser/editor web widget for FieldDB databases.
- LanguageClassDashboard - App which provides a view of FieldDB corpora for language teachers [use](http://app.phophlo.ca/).
- LexiconWebService - A node.js ElasticSearch wrapper for indexing/training lexicons from corpora.
- LexiconWebServiceSample - A node.js web server which implements the fieldlinguist's lexicon API for the FieldDB project.
- Gargantua - Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010.
- ldc-kiy - Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation, _How to study a tone language_.
- low-resource-pos-tagging-2014 - [low-resource-pos-tagging-2014](https://github.com/dhgarrette/low-resource-pos-tagging-2014) Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation. _Dan Garrette and Jason Baldridge_. In Proceedings of NAACL 2013. And in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. _Dan Garrette, Jason Mielens, and Jason Baldridge_. In Proceedings of ACL 2013.
- orthotree - Linguistic family tree based on orthographic distance.
- type-supervised-tagging-2012emnlp - Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. _Dan Garrette and Jason Baldridge_. In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit [nlp](https://github.com/dhgarrette/nlp)
- visualizing-language - For visualizations of WALS and other typological databases.
- WALS-APiCS - Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics.
- CorpusWebService - Very simple node.js proxy to enable CORS requests for CouchDB.
- CorporaForFieldLinguistics - Small corpora from diverse language typologies, useful for testing scripts.
- startR
- lucenerevolution-2013 - Demo examples for linguistics in Lucene and Solr.
- berlin-buzzwords-2013 - Demo examples for Lucene, Solr, ElasticSearch and OpenNLP from Berlin Buzzwords 2013 talk.
- fontinline - Make inline stroke paths from an outline font.
- bible-corpus - A multilingual parallel corpus created from translations of the Bible.
- poio-corpus - The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others.
-
-
Galician
-
Apertium
- apertium-cat-glg - Apertium translation pair for Catalan and Galician
- apertium-dict-en-gl - English-Galician language pair for Apertium
- apertium-dict-es-gl - Spanish-Galician language pair for Apertium
- apertium-dict-pt-gl - Portuguese-Galician language pair for Apertium
- apertium-en-gl - Apertium translation pair for English and Galician
- apertium-es-gl - Apertium translation pair for Spanish and Galician
- apertium-glg - Apertium linguistic data for Galician
- Apertium-pt-gl.pt-gl-LMF - This is the LMF version of the Apertium bilingual dictionary for the Portuguese and Galician languages
- apertium-pt-gl - Apertium translation pair for Portuguese and Galician
-
Utilities
- CitiusTagger - A PoS-Tagger and Named Entity Classification tool for Portuguese, English, Galician, and Spanish
- Conshuga - Galician verb conjugator
- GalegoDroid - Galician Translator for Android
- an-metri-gal - Metrical analysis of verse text in the Galician language (gl-ES)
- android_gl_dict - Android Galician (gl_ES) Keyboard Dictionary
- aspell-gl - Galician dictionary for aspell
- CitiusSentiment - Sentiment analysis (opinion mining) for Portuguese, English, Spanish, and Galician
- corpora - A collection of corpora of Galician (or Galicia-related) words
- DepPattern - Dependency Syntactic Parsing for Portuguese, Spanish, English, and Galician, including MetaRomance parser
- DOGA_scraper - Galician Official journal scraper
- elFinder-language - Galician - Gallego / language for elFinder
- EuroWordNetLemon - EuroWordNet lemon lexicons generated from the LMF versions of the Multilingual Central Repository (MCR) EuroWordNet lexicons. It includes lexicons for Spanish, Catalan, Basque & Galician.
- galeXtra - Multiword Extractor for Portuguese, English, Spanish, Galician, French
- Galician-Dependency-Treebank - This Galician Dependency Treebank has been developed by transliterating and adapting lexically the Portuguese part (Bosque 7.3 by the Floresta sintá(c)tica project) of the CONLL-X 2006.
- Galician-Fuzzy-Text-watch - Based on Fuzzy Text International by Jesse Hallett, uses the Galician language to display time.
- galician-locale-for-mac - Galician locale for Mac OS X
- gl-syllabler - Split Galician-language words into syllables
- gl - Galician OmegaT Localisation
- hunspell-gl-ciencias - Project to develop a science and maths Galician-language Hunspell dictionary
- hunspell-gl - Galician hunspell dictionaries
- hyphen-gl - Galician hyphenation rules
- javagalician-java6 - The Java Galician Locale is an implementation of Java localization SPIs which will allow the Java VM to use the Galician Language (locales "gl" and "gl_ES"), one of the official languages of Spain, which is not included in Sun's JVM distribution.
- Linguakit - Multilingual toolkit for NLP: dependency parser, PoS tagger, NERC, multiword extractor, sentiment analysis, etc.
- ParlamentoGalicia - Project based on the information extracted from the transcriptions of the sessions held in the Galician Parliament
- rima - Find rhyming words in the Galician language.
- stopwords-gl - Galician stopwords collection
- texlive-babel-galician - TeXLive babel-galician package
- UD_Galician-CTG - The Galician UD treebank is based on the automatic parsing of the Galician Technical Corpus created at the University of Vigo by the TALG NLP research group.
- UD_Galician-TreeGal - The Galician-TreeGal is a treebank for Galician developed at LyS Group (Universidade da Coruña).
- UL_Galician-TreeGal - CoNLL-UL Repository for UD_Galician-TreeGal
-
-
Generic Repositories
-
Software
- LLM Proxy Babylon - An open-source proxy that bridges the quality gap for low-resource languages in LLMs by selectively pre-translating prompts to English at inference time. Measured quality improvements from 0.456 to 0.949 for Thai, with 70% token cost reduction. Supports AWS Bedrock and OpenAI.
- dictdb - dictionary database for language translation. **[archived]**
- lexdb - LexDB is a lexical cognate tracking database. It stores the full provenance of all lexemes and cognate judgements, and allows export into a number of nexus dialects. The database is written in the flexible python/django web framework. **[archived]**
- SayMore - A tool for making common Language Documentation tasks such as keeping all the resulting files and meta data organized, converting files to archive formats, and transcription. [Source](https://github.com/sillsdev/saymore).
- SPHERE Conversion Tools
- tasty-imitation-keyboard - A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies! **[archived]**
- WordBoundary - An experiment in the detection and segmentation of word boundaries. **[archived]**
-
- Indic NLP Library - Python library for common text processing and NLP tasks in Indian languages including tokenization, normalization, and transliteration.
- IndicTrans2 - Open-source translation models for all 22 scheduled languages of India.
- Living Tongues - Living Tongues Institute for Endangered Languages works to document, revitalize, and maintain endangered languages.
-
-
Georgian
-
Apertium
- awesome-georgia - A curated list of awesome libraries and packages specific/related to Georgia (country).
- Gadatsqvetilebebi - გადაწყვეტილებები ("decisions"); Web spider and corpora importer for public legal decisions.
- GeoWordsDatabase - Around 310 000 unique Georgian words https://bumbeishvili.github.io/GeoWordsDatabase/.
- Kartuli Speech Recognition - Building a speech recognition system for Georgian-speaking Android users. Codebase to turn any webpage from any alphabet into another alphabet; the default is to turn Latin letters into Kartuli. [use](https://chrome.google.com/webstore/detail/kartuli-glasses/ccmledaklimnhjchkcgideafpglhejja) "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes.".
- KartuliChromeExtension - A Chrome application that displays every English letter as a Georgian letter.
- QartuliDaBunebismetkveleba - An interactive mathematics and natural science textbook for 2nd and 3rd grade students.
- SakartvelosUzenaesiSasamartloSarke - Georgian Supreme Court mirror.
- SamartlosSakonstitutsioSasamartdoSarke - Constitutional Court mirror.
- translitit-latin-to-mkhedruli-georgian - A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript.
- translitit-mkhedruli-georgian-to-ipa - A ქართული (Mkhedruli Georgian) to IPA transliteration function written in JavaScript.
- Declensions - Methods to generate declensions for Georgian language
-
Fonts
- Stichoza/font-larisome - Iconic font for Georgian currency inspired by Font-Awesome (CSS).
- Lotuashvili/BPGNateli - Bower package for BPG Nateli font (CSS).
- thecotne/georgian-webfonts - Package for Georgian fonts (CSS).
-
Internationalization and Localization (i18n/l10n)
- natchkebiailia/NumberToWord - Convert numbers to localized strings (JavaScript).
- d0ragon/number-to-words-ka - Convert numbers to localized strings (PHP).
- dimakura/ka - Common functionality for Georgian projects (Ruby).
- dimakura/ka.js - Georgian language support for node and browser (JavaScript).
- akalongman/kautilities - Convert Georgian letters to Latin and vice-versa (PHP).
- Landish/RedactorJS-GE - Redactor WYSIWYG HTML Editor Georgian Language Pack (JavaScript).
- wenzhixin/bootstrap-table - Bootstrap table with extra features. l10n by [@Lotuashvili](https://github.com/Lotuashvili) and [@Stichoza](https://github.com/Stichoza).
- moment/moment - A lightweight date library (JavaScript).
- ioseb/geokbd - Georgian keyboard library (JavaScript).
-
-
Hausa
-
Internationalization and Localization (i18n/l10n)
- Hausa - Repository for Hausa NLP tools.
-
-
Hindi
-
Internationalization and Localization (i18n/l10n)
- hindi-morph - An open source morphological analyzer for Hindi.
-
-
Høgnorsk
-
Internationalization and Localization (i18n/l10n)
- hunspell-hn_NO - The beginnings of a spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpora.
-
-
Icelandic
-
Internationalization and Localization (i18n/l10n)
- IceNLP - IceNLP is an open source Natural Language Processing (NLP) toolkit for analyzing and processing Icelandic text. The toolkit is implemented in Java.
-
-
Inuktitut
-
Internationalization and Localization (i18n/l10n)
- InuktitutAlignerData - Scripts for alignment of laboratory speech production data.
- InuktitutComputing - Inuktitut Morphological Analyser, transcoder, transliterator, corpus tools, and lexical lists for working with Inuktitut. Usable online at http://inuktitutcomputing.ca/index.php.
-
-
Irish
-
Internationalization and Localization (i18n/l10n)
- aimsigh - Source for the now-defunct aimsigh.com Irish search engine.
- caighdean - Code for standardizing Irish language text.
- fleiscin - Irish hyphenation patterns for TeX https://cadhan.com/fleiscin/.
- GaelSpell - Sources for an Irish language spell checker.
- tesseract-gle-uncial - OCR for old Irish fonts.
-
-
Kinyarwanda
-
Internationalization and Localization (i18n/l10n)
- TurboTagger & TurboParser for Kinyarwanda (download)
- kin-morph-fst - Kinyarwanda morphological analyzer.
-
-
Kurdish
-
Internationalization and Localization (i18n/l10n)
- Kurlex - Morphological analyser and lexicon, written in the Alexina framework, licensed under the LGPL-LR.
- kurmanji-stemmer - NLTK-based Kurmanji stemmer
-
-
Language Specific Projects
-
Afrikaans
-
Albanian
- out-of-copyright-albanian-authors - Out-of-copyright authors scraped from the Albanian-language Wikipedia. **[archived]**
-
Galician
- GalegoDroid - Galician Translator for Android **[archived]**
-
Georgian
- Landish/RedactorJS-GE - Redactor WYSIWYG HTML Editor Georgian Language Pack (JavaScript). **[archived]**
-
Kurdish
- kurlex - Morphological analyser and lexicon, written in the Alexina framework, licensed under the LGPL-LR.
-
Zulu
-
-
Lingala
-
Internationalization and Localization (i18n/l10n)
-
-
Lushootseed
-
Internationalization and Localization (i18n/l10n)
- Lushootseed - Joshua Crowgey's work on Lushootseed http://students.washington.edu/jcrowgey/lushootseed/.
-
-
Malagasy
-
Internationalization and Localization (i18n/l10n)
-
-
Manx
-
Migmaq
-
Internationalization and Localization (i18n/l10n)
- migmaq-lessons - Repository for website building Mi'gmaq language lessons.
-
-
Minderico
-
Internationalization and Localization (i18n/l10n)
- fredericajordarzambarino - A web-based game for mobile devices in Minderico, based on the "Who Wants to be a Millionaire" TV show.
-
-
Nishnaabe
-
Internationalization and Localization (i18n/l10n)
- Ojibway-iphone-app - An iPhone app with audio and images for learning the Ojibway language.
- OjibwayMap - An iPhone app with audio and images for learning Ojibway language and culture.
- nishanimate - A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text.
-
-
On GitHub
-
Utilities
- FieldDB
- batumi - Speech recognition and natural language processing for low-resource languages
- BloomBooks
- unicode-cldr - Unicode Common Locale Data Repository (CLDR) Project http://cldr.unicode.org
- cmusphinx - Mirror of the SourceForge repositories
- dativebase - Tools for working with OLD.
- divvun - The Divvun group at UiT develops proofing tools, keyboard apps and other language technology solutions for indigenous and minority languages, especially the Sámi languages. [Website](http://divvun.no).
- GiellaLT - Home for keyboard layouts, lexicons and morphologies for indigenous and minority languages, especially for morphologically complex languages, using mainly rule-based technologies. The resources are used by Divvun (above) and Giellatekno (below) to build a number of tools for the language communities. Almost everything is open source.
- HFST - Helsinki Finite-State Technology. [Website](http://hfst.github.io/).
- hunspell
- keymanapp - [Website](https://keyman.com/).
- langtech - Language Technology Group, University of Melbourne
- lex4all
- longnow
- MontrealCorpusTools
- moses-smt - Statistical Machine Translation.
- mukurtucms
- NLTK - Natural Language Toolkit.
- PhonologicalCorpusTools
- Projet de recherche sur l'écriture - Crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics).
- prosodylab - Prosodylab at McGill University, Canada
- SIL International (Dev) - Another SIL organization, with many repositories.
- SIL International - SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization which provides software and tools tailored for use by field linguists and lexicographers working on endangered languages. A little-known fact is that much of its code is open sourced on GitHub, and SIL is happy to receive open source contributions and collaborate on open source projects.
- SIL NRSI - SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development.
- StanfordNLP
- ucsd-field-lab - University of California, San Diego
- UniversalDependencies - Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.
- utcompling - The University of Texas at Austin's Computational Linguistics Lab. [Website](http://www.utcompling.com).
-
-
Organizations
-
On GitHub
- AI4Bharat - Open-source datasets, tools, and models for Indian languages from IIT Madras, including IndicTrans2 (translation), Indic-TTS, IndicLID (language identification), and IndicVoices.
-
Other OSS Organizations
- African Languages Lab - Develops enterprise-grade language AI models (including Mansa LLM) supporting 30+ African languages for translation, transcription, and NLP.
- 7000 Languages - Creates free online language learning courses and materials in partnership with Indigenous, minority, and refugee communities.
-