low-resource-languages

A curated list of resources for the conservation, development, and documentation of low resource (human) languages.
https://github.com/RichardLitt/low-resource-languages

Software
- Utilities
  - UniversalDependencies docs - Universal Dependencies online documentation http://universaldependencies.org/docs/.
  - UniversalDependencies tools - Various utilities for processing the data.
  - VocBench - Web-based, multilingual, editing and workflow tool that manages thesauri, authority lists and glossaries using SKOS-XL.
  - wavesurfer.js - Navigable waveform built on Web Audio and Canvas https://wavesurfer-js.org/ (Also has an ELAN plugin).
  - web-template - This is a web-based template that may be used to present language learning resources to aid language revitalization efforts. It includes a talking dictionary, and a phrasicon, containing sentences and phrases.
  - webcorpus - This project is a collection of scripts and programs for creating a webcorpus from crawled data.
  - wikt2dict - Wiktionary parser tool for many language editions.
  - wikipron - Retrieves IPA pronunciations for Wiktionary entries.
  - wordbyword - WordByWord is a free, open source, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe at CIDLeS with the support of the Foundation for Endangered Languages.
  - WSI4URLang - Word Sense Induction (WSI) for Under-resourced Languages (URLang).
  - xdxf_makedict - XDXF dictionary format and "makedict" dictionary converting software (official repository).
  - Keyboard - Virtual Keyboard using jQuery ~ https://mottie.github.io/Keyboard/.
  - Keyboards - Open Source Keyman keyboards.
  - Keyman - Keyman cross platform input methods. Keyman makes it possible for you to type in over 1,000 languages on Windows, iPhone, iPad, Android tablets and phones, and even instantly in your web browser. [Website](https://keyman.com/).
  - keyboardlayouteditor - Keyboard Layout Editor https://code.google.com/archive/p/keyboardlayouteditor/.
  - Keyboard layout editor - Keyboard Layout Editor http://www.keyboard-layout-editor.com
  - lipika-ime - Input Method Engine (IME) for Mac OS X with built-in support for all Indic Languages.
  - AGTK - AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. (Original project is on SourceForge: https://sourceforge.net/projects/agtk/).
  - gfl_syntax - Graph Fragment Language for Easy Syntactic Annotation https://www.cs.cmu.edu/~ark/FUDG/.
  - eopas - ETHNOER Online Presentation and Annotation System.
  - FLAT - FoLiA Linguistic Annotation Tool - FLAT is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia/), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.
  - graf-python - The library graf-python is an open source Python implementation to parse and write GrAF/XML files as described in ISO 24612. The parser of the library creates an annotation graph from the files. The user may then query the annotation graph via the API of graf-python.
  - kwaras - Tools for ELAN corpus management.
  - LDC Word Aligner - English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to generate over 1,000,000 tokens of annotated word alignment data from a variety of genres including broadcast, newswire and web-based sources. [Website](https://www.ldc.upenn.edu/language-resources/tools/ldc-word-aligner).
  - poio-analyzer - Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor lets users add morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt.
  - poio-api - Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F…
  - pyannotation - PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files.
  - spec - The official specification for the DLx linguistic data format. https://digitallinguistics.github.io/spec/.
  - FoLiA - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. http://proycon.github.io/folia/
  - Express-Lingua - An i18n middleware for the Express.js framework.
  - arctic-prompts - Generate prompts PDF for CMU ARCTIC dataset.
  - AudioWebService - a simple nodejs server which accepts upload of audio and runs it through praat.
  - AuToBI - Automatic prosodic annotation tool written in Java.
  - BashScriptsForPhonetics - (_Fork_ of a dormant project).
  - esv-text-audio-aligner - ESV Text/Audio Aligner to programmatically obtain the timings for each word in the corresponding audio.
  - html5-audio-read-along - HTML5 Audio Read-Along.
  - ipa-chart - International Phonetic Alphabet (IPA) Unicode Chart and Character Picker.
  - kaldi-svn-archive - A read-only archive of the original Kaldi SVN repository (mainly to keep sandboxes available).
  - lex4all - pronunciation LEXicons for Any Low-resource Language (_Fork_ of a student project).
  - Montreal-Forced-Aligner - Python interface for forced text/speech alignment.
  - node-pocketsphinx
  - opensauce - GNU Octave-compatible version of VoiceSauce.
  - pocketsphinx - PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop.
  - pocketsphinx-ios-demo - Simple demo for iOS.
  - pocketsphinx-python - Python module installed with setup.py.
  - pocketsphinx-ruby - Ruby speech recognition with Pocketsphinx.
  - pocketsphinx-wp-demo - Demo to run pocketsphinx on WP8 platform.
  - pocketsphinx.js - Speech recognition in JavaScript.
  - praat-py - From my PhD days: Praat-Py is a custom build of Praat, the computer program used by linguists for doing phonetic analysis on sound files, to allow for scripts to be written in the Python programming language, rather than in Praat's built-in language. (_Fork_ of a dormant project).
  - Praat-Scripts - Mietta's Scripts.
  - PraatTextGridJS - A small library which can parse TextGrid into json and json into TextGrid.
  - PraatontheWeb - Web implementation of Praat. Source code, running demo scripts on web, samples and documentation.
  - prosodicParsing - different kinds of HMMs to use for incorporating prosody into basic parsing.
  - Prosodylab-Aligner - Python interface for forced audio alignment using HTK and SoX.
  - prosodylab.alignertools
  - Recordmp3js - Record MP3 files directly from the browser using JS and HTML.
  - sphinx4 - Pure Java speech recognition library.
  - sphinxbase
  - sphinxtrain
  - TLSphinx - Swift wrapper around Pocketsphinx.
  - MARY TTS - An open-source, multilingual text-to-speech synthesis system written in pure Java http://mary.dfki.de.
  - Elpis - Elpis is software for creating speech recognition models and applying them to the transcription of audio. As of 2022, it gives access to Kaldi and Huggingface Transformers.
  - kaldi - This is now the official location of the Kaldi project.
  - Persephone - Persephone aims to make state-of-the-art phonemic transcription accessible to people involved in language documentation, who have a training corpus of about one to four hours of transcribed speech. As of 2022, Persephone is superseded by Elpis.
  - clld - Cross Linguistic Linked Data python library.
  - LaTeX2HTML5 - LaTeX web components.
  - MultilingualCorporaExtractor - Node io Spider for extracting multilingual corpora (_Fork_ of a student project).
  - SeedLing - Building and Using A Seed Corpus for the Human Language Project (_Fork_ of a student project).
  - experigen - A framework for creating linguistic experiments.
  - GamifyPsycholinguisticsExperiments - A simple node server to gamify linguistics experiments. It runs offline on a laptop for small-scale experiments and online on a server for large-scale experiments. Data is sent to a Google spreadsheet. (_Fork_ of a dormant project).
  - OPrime - Open Source Experimentation Libraries - Online and Offline for Android and HTML5.
  - psychopyMegProsody - Runs MegProsody using PsychoPy.
  - PsychScript - A HTML5/Javascript library for running behavioural experiments online.
  - awesome-anki - A curated list of awesome Anki add-ons, decks and resources.
  - VocabLift - Language-learning tool that uses vocabulary from LIFT-format dictionaries produced by programs such as Fieldworks Language Explorer and WeSay.
  - OpenCCG - OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nezperce, Basque and others.
  - Aikuma - Android software for recording and translation.
  - Android Speech Recognition Trainer - Speech recognition training app for low resource languages which interfaces with FieldDB corpora.
  - android-template - This is a template of an Android word-learning app that may be used as a way to introduce a language. It includes a quiz. For the documentation, go to http://eddersko.github.io/android-template/.
  - AndroidFieldDB - An Android app which lets the user build a custom visual and auditory vocabulary, useful for guided anomia treatment and self designed language lessons by heritage speakers.
  - AndroidFieldDBElicitationRecorder - A general purpose video recording tool.
  - AndroidLanguageLessons - Lets heritage speakers create self designed language lessons.
  - AndroidProductionExperiment - Android App to run perception experiments.
  - Bevara - Android Phone Application designed for Linguistic Fieldwork to help preserve, maintain, and save endangered languages.
  - pocketsphinx-android - pocketsphinx build for Android.
  - pocketsphinx-android-demo
  - babelfrog - Chrome extension to help learn languages as you browse.
  - DictionaryChromeExtension - Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries) [use](https://chrome.google.com/webstore/detail/my-dictionary/jfmpeiicncingobdejgmmcamknndpbbi).
  - FieldDB - An offline/online field database which adapts to its user's terminology and I-Language, has plugins for various data automation routines along the process of primary data collection to cleaning to publication and archival. [use](https://wwwdev.lingsync.org/).
  - Linguistica - Runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed.
  - Word Generator
  - ELAN
  - espeak - eSpeak is a compact open source software speech synthesizer for English and other languages, for Linux and Windows. http://espeak.sourceforge.net.
Somali
- Internationalization and Localization (i18n/l10n)
  - somorph - Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. Up to date version checked in on [Giellatekno's](http://giellatekno.uit.no) repository.
  - qaamuus.net
Text automation
- Software
  - L3XDG - Extensible Dependency Grammar (Debusmann, 2007) for translation.
Text-to-Speech (TTS)
- Software
  - Festival Text to Speech - A general multi-lingual speech synthesis system.
  - Indic-TTS - Open-source text-to-speech models for 13 Indian languages including Assamese, Bengali, Hindi, Kannada, Malayalam, Tamil, and Telugu.
  - Ossian - Ossian is a collection of Python code for building text-to-speech (TTS) systems, with an emphasis on easing research into building TTS systems with minimal expert supervision. **[archived]**
Uralic
- Internationalization and Localization (i18n/l10n)
  - UralicNLP - A Python library for processing Uralic languages (Finnish, Skolt Sami, Erzya, Moksha, Komi-Zyrian and so on). The library provides easy programmatic access to Giellatekno resources such as FST morphology and CG disambiguators. Other functionalities include a UD parser, an API for the [Online Dictionary of Uralic Languages](https://akusanat.com) and an interface to the SemFi and SemUr semantic databases. The library is under active development and new features are added from time to time.
Zulu
- Internationalization and Localization (i18n/l10n)
  - Ukwabelana - An open-source morphological Zulu corpus.

License: CC BY-SA 4.0 © Richard Littauer 2014-2017

Programming Languages

Python 71 JavaScript 58 Java 29 C++ 25 C# 21 HTML 9 Perl 9 C 7 Shell 6 Ruby 6
