low-resource-languages

A curated list of resources for the conservation, development, and documentation of low resource (human) languages.
https://github.com/RichardLitt/low-resource-languages

Organizations
- Other OSS Organizations
  - Gnani AI - Voice AI platform with speech-to-text (Vachana STT) and voice models supporting 15+ Indian languages, funded under the IndiaAI Mission.
  - Invisible Languages Project - University of Amsterdam research project studying the representation and visibility of the world's languages in LLMs and on the internet.
  - SILICON Stanford - Stanford Initiative on Language Inclusion and Conservation in Old and New Media, advancing digital inclusion for underrepresented and endangered languages.
  - Soket AI - Open-source Indian language models including Pragna-1B (4 Indian languages) and the Bhasha dataset series for training Indian language models.
  - Vakyansh / EkStep - Open-source speech-to-text models for Indic languages with 10,000+ hours of training data across 23 languages.
  - Wikitongues - Nonprofit preserving linguistic diversity through a language archive of 700+ languages and grants for endangered language revitalization projects.
Other OSS Organizations
- Utilities
  - Giellatekno - Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically rich languages with the development of practical applications. The group focuses on deep linguistic modeling and on highly efficient and robust computational analysis with wide empirical coverage. The code is kept in svn: all of it can be found [here](https://victorio.uit.no/langtech/trunk/langs/), sorted by language.
  - LOWLANDS - Parsing low-resource languages and domains. https://ccc.ku.dk/research/lowlands/
  - LTRC: Language Technologies Research Center IIIT Hyderabad
  - The Language Archive
  - How to Write a Spelling Corrector
  - ISO 639-3 code
Quechua
- Internationalization and Localization (i18n/l10n)
  - Morphology, spellchecker - XFST and FOMA, plus OpenOffice plugin.
Sami
- Internationalization and Localization (i18n/l10n)
  - Giellatekno
  - Oahpa! - A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, plus morphological and syntactic exercises generated from the morphological and syntactic tools.
  - Neahttadigisánit - A morphologically sensitive dictionary with a mode for 'social media input', which allows users to type a 'relaxed' version of the orthography (*acdnstz* will also be recognized as *áčđŋšŧž*). It also includes a JavaScript bookmarklet that offers click-to-read dictionary lookup. Also available for [other Uralic, and non-Uralic languages](http://dicts.uit.no/index.eng.html).
  - divvun-webdemo - Simple web demo for the Divvun grammar checker. [Website](https://gtweb.uit.no/gc/).
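
The 'relaxed' orthography lookup described in the Neahttadigisánit entry can be illustrated with a minimal sketch. This is not the project's implementation: the `RELAXED` table, the function names, and the toy lexicon are all hypothetical, and the approach shown (expanding each plain letter into its possible accented counterparts and intersecting with a word list) is only one simple way to get the described behavior.

```python
from itertools import product

# Hypothetical table: each plain letter may stand for itself or for the
# North Saami letter it is commonly typed in place of.
RELAXED = {
    "a": "aá", "c": "cč", "d": "dđ", "n": "nŋ",
    "s": "sš", "t": "tŧ", "z": "zž",
}


def candidates(typed):
    """Return every spelling the relaxed input could stand for."""
    options = [RELAXED.get(ch, ch) for ch in typed.lower()]
    return {"".join(combo) for combo in product(*options)}


def lookup(typed, lexicon):
    """Return the lexicon entries that match the relaxed input, sorted."""
    return sorted(candidates(typed) & set(lexicon))


# Toy lexicon for illustration only.
toy_lexicon = {"áhčči", "eadni"}
matches = lookup("ahcci", toy_lexicon)
```

The candidate set doubles with every ambiguous letter, so a production system would more likely fold the relaxation into the finite-state analyzer than enumerate spellings as done here.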
Scottish Gaelic
- Internationalization and Localization (i18n/l10n)
  - aspell-gd - Scottish Gaelic dictionary for aspell.
  - briathrachan - This is the source code to Briathrachan, a Gaelic-English dictionary app for iOS.
  - gaidhlig - NLP resources for Scottish Gaelic, mainly in support of gd2ga/ga2gd MT engines.
  - gd-fcfg - Context-free feature-based grammar of Scottish Gaelic in the NLTK format.
  - gdbank - Some tools and resources for natural language processing of Scottish Gaelic. https://www.tantallon.org.uk/cggblog/.
  - hunspell-gd - Files for building Scottish Gaelic spell checkers.
Secwepemctsín
- Internationalization and Localization (i18n/l10n)
  - secwepemctsnem - A project to help people learn Secwepemctsín.
Single language lexicography projects and utilities
- Utilities
  - Project for Free Electronic Dictionaries - for indigenous language dictionaries.
  - Webonary
  - WeSay - Allows language communities to build their own dictionaries. https://software.sil.org/wesay/ (by the SIL International).
Software
- Utilities
  - accentuate.us
  - Apertium - Open-source shallow-transfer machine translation systems, especially suitable for related language pairs. Includes the engine, maintenance tools, and open linguistic data for several language pairs.
  - CasualCon - research purposes), though [the maintainer] has been using it for his own research (and many others have). It can generate KWIC concordance lines, word clusters, collocation analysis, and word counts.
  - charlint
  - CMU Sphinx - Speaker-independent large vocabulary continuous speech recognizer released under a BSD-style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems.
  - dictdb - dictionary database for language translation.
  - langtech
  - lexdb - LexDB is a lexical cognate tracking database. It stores the full provenance of all lexemes and cognate judgements, and allows export into a number of nexus dialects. The database is written in the flexible python/django web framework.
  - LinGO Grammar Matrix - Broad-coverage, precision, implemented grammars for diverse languages.
  - Linguistica - structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed.
  - Minority Translate
  - NIST 2008 Open Machine Translation Evaluation
  - OpenDataKit - Open-source suite of tools that helps organizations author, field, and manage mobile data collection solutions.
  - SPHERE Conversion Tools
  - tasty-imitation-keyboard - A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies!
  - Field Linguist's Toolbox - Toolbox is a data management and analysis tool for field linguists. It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data.
  - transcriber - An HTML5 transcription tool for Aikuma
  - Word Generator
  - WordBoundary - An experiment in the detection and segmentation of word boundaries.
  - XKeyboardConfig - The non-arch keyboard configuration database for X Window. The goal is to provide the consistent, well-structured, frequently released open source of X keyboard configuration data for X Window System implementations (free, open source and commercial). The project is targeted to XKB-based systems.
  - ELAN
  - XTrans - Multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks, including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics.
  - Polyglot.js
  - Transifex - System that provides a user-friendly, project-oriented approach to translating `.po` files. Great for non-technical users, free for open-source projects, decent for minority languages; **however**, it can take a while to get a new language added to the Transifex system, because its ticketing system sometimes loses tickets. Provides translation memory, the ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared.
  - espeak - eSpeak is a compact open source software speech synthesizer for English and other languages, for Linux and Windows. http://espeak.sourceforge.net.
  - Ossian - Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision.
  - Common Language Resources and Technology Infrastructure Norway / Clarino - One of their projects (not clearly listed here) is about providing an online system for language analysis, so users can connect resources visually, dump in text, and get a result. Kind of like the Yahoo! Pipes but for language processing. Uses the [ABEL](https://www.uio.no/english/services/it/research/hpc/abel/) cluster.
  - ojoVoz - A mobile app for sending georeferenced image and voice recordings from an Android phone to an email address. For more information, please go to http://sautiyawakulima.net/ojovoz/.
  - 4lang - Concept dictionary using Eilenberg machines.
  - alignment-with-openfst - This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, dependency parsing.
  - ark-tweet-nlp - CMU ARK Twitter Part-of-Speech Tagger (_Fork_).
  - ArtOfReading - Index and processing scripts related to the Art Of Reading illustration collection.
  - bayesline - A Multinomial Bayesian Classification for Language Identification.
  - bible-corpus-tools - A collection of tools for reading/processing the multilingual Bible corpus.
  - BloomDesktop - Bloom Desktop is a hybrid c#/javascript/html/css Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia… https://bloomlibrary.org/.
  - BloomLibrary - Bloom Library Single Page App, using AngularJS & Bootstrap, Parse.com backend. https://bloomlibrary.org/.
  - brain - Neural networks in JavaScript.
  - Bristol Uni MT Morphology tools - This repo is a mirror of scripts previously available on http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/resources.jsp. Included: Ukwabelana - An open-source morphological Zulu corpus and EMMA: A Novel Evaluation Metric for Morphological Analysis.
  - brown-cluster - C++ implementation of the Brown word clustering algorithm.
  - cdec - Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms.
  - chorus - A version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed.
  - clam - Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
  - cnminlangwebcollect - Chinese minorities website languages detection and websites collection.
  - Cog - Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. http://sillsdev.github.io/cog/.
  - convertextract - Convert Excel, Word and PowerPoint files with non-Unicode text (like text requiring SIL fonts) into Unicode, while preserving original file's formatting.
  - CorpusTools - Phonological CorpusTools http://phonologicalcorpustools.github.io/CorpusTools/.
  - CTK - Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge: http://champollion.sourceforge.net).
  - DataTags - A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed. (_Fork_).
  - dataverse - A data repository framework to share and publish research data.
  - dative - A single-page application that interacts with multiple linguistic fieldwork web service databases. [Website](http://www.dative.ca).
  - DeepLearnToolbox - Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started.
  - Desmeme - Database and tools for exploring linguistic templates.
  - discoursegraphs - Python-based tool to convert and merge multilayer annotated linguistic data.
  - divvun-gramcheck - This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error-tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline.
  - divvun-keyboard - keyboard apps for iOS and Android with keyboard layouts for indigenous and minority languages
  - divvunspell - `hfst-ospell` (below) rewritten in Rust, for robust concurrency and memory management. In practical use it is about 10x faster than `hfst-ospell`. It uses the same zhfst files as `hfst-ospell`, which are available for all languages in the [GiellaLT](https://github.com/giellalt/) GitHub org (see below).
  - DLTK - Deutsch Language Tool Kit. [More](https://htmlpreview.github.io/?https://github.com/alvations/DLTK/blob/master/docs/index.html).
  - epitran - Grapheme to Phoneme conversion (G2P) for many low-resource languages.
  - ELDER: Endangered Language Data Electronic Repository - Endangered Language Data Electronic Repository: A web-based ontologically-compliant collaborative linguistic data cataloguing tool.
  - exsite9 - ExSite9 is a desktop application built to help researchers quickly tag their data files with descriptive metadata and then package the files and associated metadata for submission to a repository. ExSite9 also allows for the structural organisation of those files without actually moving their physical location on your local file storage, so you can organise your files and metadata correctly before packaging.
  - fast_align - Simple, fast unsupervised word aligner.
  - fastText - Library for fast text representation and classification.
  - FieldWorks - FieldWorks is a suite of software tools for language and cultural data, with support for complex scripts. https://software.sil.org/fieldworks/ FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you: elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, study morphology.
  - Franc - Natural language detection https://wooorm.com/franc/.
  - FwDocumentation - Developer documentation for FieldWorks (software tools for language and cultural data, with support for complex scripts).
  - FwLocalizations - Localizations for FieldWorks.
  - FwSupportTools - Additional tools for FieldWorks development.
  - Gaia - Gaia is an HTML5-based Phone UI for the Boot 2 Gecko Project. NOTE: For details of what branches are used for what releases, see [the wiki](https://wiki.mozilla.org/B2G). If you're interested in setting up a keyboard in a new language, see [this](https://developer.mozilla.org/en-US/docs/Archive/B2G_OS/Developing_Gaia/Customizing_the_keyboard).
  - giellakbd-android - A fork of LatinIME (by Google for Android), targeting marginalised languages that also deserve first-class status on mobile operating systems. Used by [kbdgen](https://github.com/divvun/kbdgen) (see elsewhere on this page).
  - giellakbd-ios - An open source reimplementation of Apple's native iOS keyboard with a specific focus on support for localised keyboards. Used by [kbdgen](https://github.com/divvun/kbdgen) (see elsewhere on this page).
  - giza-pp - GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model. This package also contains the source for the mkcls tool which generates the word classes necessary for training some of the alignment models.
  - gv-crawl - Global Voices bitext crawler for creating parallel corpora.
  - GlotLID - Fasttext language identification with support for more than 2000 labels.
  - Glottolog data - [Glottolog](https://glottolog.org) provides comprehensive reference information for the world's languages.
  - Gramadóir - Grammar checking engine that is designed for the rapid development of grammar checkers for minority languages and other languages with limited computational resources.
  - grind - An InDesign 5.5 plug-in designed to allow Graphite-enabled smart fonts to be used in Adobe InDesign. This project integrates SIL's Graphite 2 smart font technology with our own implementation of a paragraph composer plugin.
  - hermitcrab - HermitCrab.NET is a flexible morphological/phonological parser that takes an item-and-process approach.
  - hfst-ospell - HFST spell checker library and command line tool.
  - hfst-ospell-js - Node bindings for hfst-ospell.
  - hfst-optimized-lookup - HFST optimized-lookup standalone library and command line tool.
  - hundict - bilingual dictionary extractor from parallel corpora.
  - hunspell - Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding.
  - huntag - a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models.
  - icu-dotnet - C# wrapper for ICU4C.
  - icu4c - Mirror of svn project at http://source.icu-project.org/repos/icu/icu/. The FieldWorks branch has some FieldWorks specific enhancements.
  - iLanguage - A semi-unsupervised, language-independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. Input: a corpus. Uses compression, maximum entropy and field linguistics.
  - ipa-help - IPA Helps.
  - itweets-geodata - Geodata from Indigenous Tweets.
  - jQuery.ime - jQuery based input methods library.
  - kbdgen - Generate keyboards and keyboard layouts for various operating systems.
  - koreksyon - Tools for developing and implementing spell-checking and grammar-checking capabilities in low-resource languages.
  - l20n.js - L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural languages. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n. http://l20n.org.
  - langid.py - Stand-alone language identification system.
  - LEGO Unified Concepticon - Material relating to the LEGO Unified Concepticon.
  - Lex4All - pronunciation LEXicons for Any Low-resource Language http://lex4all.github.io/lex4all/.
  - LfMerge - Send/Receive for languageforge.org.
  - liblevenshtein - A library for generating Finite State Transducers based on Levenshtein Automata.
  - libpalaso - Palaso Library: A set of .Net libraries useful for developers of Language Software.
  - Lingpy - LingPy: Python library for quantitative tasks in historical linguistics http://lingpy.org.
  - long-press - jQuery plugin to ease the writing of accented or rare characters. http://toki-woki.net/lab/long-press/.
  - lrl - For work concerning low resource languages.
  - MacVoikko - An OS X spelling server based on Voikko.
  - Machine - Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages (used by FLEx).
  - Make-extensions - Scripts for generating hunspell spellchecking extensions.
  - mgiza - A word alignment tool based on the famous GIZA++, extended to support multi-threading, resume training and incremental training.
  - morfessor - Morfessor is a tool for unsupervised and semi-supervised morphological segmentation.
  - morpholm - Morphology-aware language models.
  - morph-test - A python script to run tests for generation and analysis of a morphological transducer built using the Giella infrastructure. Works with Hfst, Xerox' fst tools, and with Foma.
  - mosesdecoder - Moses, the machine translation system.
  - moz-l10n-tiers - Creates a pseudo-locale to evaluate string prioritization for l10n.
  - mukurtucms - The Mukurtu Content Management System (CMS) is an Internet-based platform designed to enable archiving of digital cultural resources.
  - mythes - MyThes is a simple thesaurus that uses a structured text data file and an index file with binary search to lookup words and phrases and return information on part of speech, meanings, and synonyms.
  - myWorkSafe - Smart & Simple Backup for Language Development Workers. http://software.sil.org/myworksafe/.
  - nabu - nabu is a digital media item management system that provides a catalog of audio and video items, metadata for these items, and information about the workflow status of the items. www.paradisec.org.au
  - Natural - *Javascript* general natural language facilities for node.
  - NLTK - *Python* Natural Language Tool Kit. NLTK Source http://www.nltk.org/.
  - node-panlex - node.js client for PanLex.
  - norma - A tool for automatic spelling normalization.
  - nplm - Fork of https://nlg.isi.edu/software/nplm/ with some efficiency tweaks and adaptation for use in mosesdecoder.
  - octothorpe - CouchDB-powered wiki thing.
  - OdtXslt - Perform XSLT transform on contents of a package (such as ODT, Docx, etc.).
  - old-webapp - Online Linguistic Database --- software for creating web applications to collaboratively document languages. http://www.onlinelinguisticdatabase.org.
  - old - The Online Linguistic Database (OLD): software for linguistic fieldwork. http://www.onlinelinguisticdatabase.org.
  - old-pyramid - Online Linguistic Database migrated to the Pyramid framework.
  - OmegaT-hfst-tokenizer - OmegaT-hfst-tokenizer provides fst-based tokenisation in OmegaT.
  - OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. [Website](https://opennlp.apache.org).
  - ops-devbox - Ansible playbook for a (linux) developer machine.
  - panlex-tools - This package contains scripts to transform lexical resources into a format suitable for importing into PanLex. Documentation may be found at https://dev.panlex.org.
  - pdsc-collection-viewer - Paradisec Collection Browser
  - paradigm - PARADIGM is a .Net (C#) implementation of Joseph E. Grimes' 1983 work entitled "Affix Positions and Cooccurrences: The PARADIGM Program".
  - pathway - Preparing language data for publication.
  - pdfdroplet - Library and GUI for imposition of PDF pages (e.g. 2-up) http://software.sil.org/pdfdroplet/.
  - pepper - Pepper is a pluggable, Java-based, open source converter framework for linguistic data.
  - phonology-assistant - Phonology Assistant is a discovery tool. Provided with a corpus of phonetic data, it automatically charts the sounds and through its searching capabilities, helps a user discover and test the rules of sound in a language.
  - pressagio - Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string.
  - PrimerPro - The purpose of PrimerPro is to assist the literacy worker in the development of primers for a given language.
  - pyDelphin - Python libraries for DELPH-IN (Friendly Fork).
  - RBGParser - Graph-based Dependency Parser.
  - Rosetta Pangloss - The Rosetta Project's Pangloss system.
  - salm - SALM: Suffix Array and its Applications in Empirical Language Processing by Joy.
  - Salt - A graph-based model to store and manipulate linguistic data.
  - saymore - A tool for making common Language Documentation tasks such as keeping all the resulting files and meta data organized, converting files to archive formats, and transcription.
  - Secwepemc-Facebook - Translate Facebook into unsupported languages.
  - SegParser - Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing.
  - SeedLing - Building and Using A Seed Corpus for the Human Language Project.
  - Skype in your language - Translate Skype into unsupported languages.
  - solid - Solid is a software tool that can be used to check, clean up, and convert Standard Format (e.g. Toolbox) lexicon data.
  - StandardFormatLib - Standard Format Library.
  - Stanford CoreNLP - Stanford CoreNLP: A Java suite of core NLP tools. https://stanfordnlp.github.io/CoreNLP/.
  - Stanford CoreNLP Python - Python wrapper for Stanford CoreNLP tools.
  - stanza - Stanford NLP group's shared Python tools.
  - str2ipa - Pronunciation dictionaries for languages with close-to-phonetic writing systems.
  - sugali - This is a legacy repository of the language identification project for many (many) languages project for the software project course, NLP projects for low-resource languages.
  - SuGarLike - Language Identification for Low Resource Languages (by Susanne, Guy and Liling).
  - SyllabiPy - Python interface for universal syllabification algorithms
  - TECkit - A Text Encoding Conversion toolkit.
  - teny - Tools for low-resource machine translation.
  - TeraDict - Translate English words into hundreds of languages!
  - Tesseract.js - Pure Javascript OCR for 62 Languages 📖🎉🖥 http://tesseract.projectnaptha.com/.
  - TexNLP - TexNLP: Texas Natural Language Processing tools.
  - TiMBL - Implements several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.
  - Toney - Tone Classification Software.
  - Toolbox Scripts for ELAN - Mirror of Alexander Koenig's Toolbox Scripts https://tla.mpi.nl/tools/tla-tools/elan/thirdparty/.
  - ToolsForFieldLinguistics - A collection of scripts and recipes for linguistics.
  - translitit-engine - A transliteration engine written in JavaScript.
  - Tsammalex data - [Tsammalex](https://tsammalex.clld.org) is a multilingual lexical database on plants and animals.
  - tweet2learn - An app to make it easier to use your native language on Twitter.
  - twitter_langid - A hierarchical character-word neural network for language identification.
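
Several entries above (bayesline, langid.py, Franc, GlotLID, SuGarLike) deal with language identification. As a rough illustration of the multinomial Bayesian approach named in the bayesline entry, here is a minimal sketch over character trigrams. It is not code from any listed project: the class and function names are hypothetical, and the uniform prior, add-one smoothing, and toy training strings are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Multinomial naive Bayes over character n-grams (uniform prior)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # label -> Counter of n-gram frequencies
        self.totals = {}    # label -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, label, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(label, Counter()).update(grams)
        self.totals[label] = self.totals.get(label, 0) + len(grams)
        self.vocab.update(grams)

    def score(self, label, text):
        """Log-likelihood of text under one language, add-one smoothed."""
        counts, total = self.counts[label], self.totals[label]
        vocab_size = len(self.vocab)
        return sum(
            math.log((counts[gram] + 1) / (total + vocab_size))
            for gram in char_ngrams(text, self.n)
        )

    def identify(self, text):
        """Return the label whose model gives the text the highest score."""
        return max(self.counts, key=lambda label: self.score(label, text))


# Toy training strings, far too small for real use.
identifier = NgramLanguageIdentifier()
identifier.train("en", "the quick brown fox jumps over the lazy dog")
identifier.train("gd", "tha an cù donn a' ruith thairis air an luch")
guess = identifier.identify("tha an luch")
```

For low-resource languages the hard part is usually finding enough clean training text per language rather than the classifier itself, which is why the corpora and crawlers listed above matter as much as the models.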

Programming Languages

Python 71, JavaScript 58, Java 29, C++ 25, C# 21, HTML 9, Perl 9, C 7, Shell 6, Ruby 6
