https://github.com/richardlitt/gaelic-resources

# Gaelic Resources
A list of computational resources for Gaelic.

This list grew out of https://github.com/RichardLitt/endangered-languages, my list of open source resources for low-resource languages. Going forward, I'm particularly interested in Gaelic.

## Tools

### [Hunspell-gd](https://github.com/kscanne/hunspell-gd)

Kevin Scannell has a repository with data files and scripts for building Scottish Gaelic spell checkers. The work was started through [the Crúbadán project](http://crubadan.org/). GPL-licensed. This [hunspell-gd repo](https://github.com/gooselinux/hunspell-gd) is likely derivative.

## Corpora

### [Annotated Reference Corpus of Scottish Gaelic (ARCOSG)](http://datashare.is.ed.ac.uk/handle/10283/2011)

A representative, tagged corpus of Scottish Gaelic, divided into 8 registers (4 spoken, 4 written) of approximately 10k words each. The corpus is presented as individual txt files.

The corpus was hand-tagged by Lamb, Arbuthnot and Naismith and separately verified by them. It uses the Brown format tag separators ('/': e.g. 'agus/Cc') and an annotation scheme derived from the Irish PAROLE tagset (see Uí Dhonnchadha, E. and van Genabith, J. 2006. A Part-of-Speech tagger for Irish using finite state morphology and constraint grammar disambiguation. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 2241-2244.).
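
As a rough illustration of the Brown-style separator described above, the following Python sketch splits `token/tag` items into pairs. It assumes items on a line are separated by whitespace, which is not confirmed here; check the annotation guidelines shipped with the data before relying on it. `parse_brown_line` is a hypothetical helper name, not part of the corpus distribution.

```python
def parse_brown_line(line):
    """Split a line of 'token/TAG' items into (token, tag) pairs.

    Assumption: items are whitespace-separated (not confirmed by the corpus docs).
    """
    pairs = []
    for item in line.split():
        # Split on the last '/' so a token that itself contains '/' stays intact.
        token, sep, tag = item.rpartition("/")
        if not sep:
            # No separator found: keep the raw item and mark the tag as missing.
            token, tag = item, None
        pairs.append((token, tag))
    return pairs


if __name__ == "__main__":
    # 'agus/Cc' is the example given in the corpus description.
    print(parse_brown_line("agus/Cc"))
```

Splitting on the last `/` rather than the first is a design choice for this sketch: it keeps the tag intact even if a token happens to contain a slash.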

The annotation scheme is described in a PDF included with the data: Lamb, W. and Naismith, S (2014) Scottish Gaelic Part-of-Speech Annotation Guidelines.

This work was funded by Bòrd na Gàidhlig and Carnegie Trust for the Universities of Scotland.

### [DASG Corpas na Gàidhlig](http://www.dasg.ac.uk/corpus/)

[Corpas na Gàidhlig](http://www.dasg.ac.uk/corpus/) is a constituent project of DASG. It was founded in 2008 with the following aims:

- to create a comprehensive electronic corpus of Scottish Gaelic texts for students and researchers of Scottish Gaelic language, literature and culture
- to provide the textual basis for the interuniversity project Faclair na Gàidhlig (‘Dictionary of the Scottish Gaelic Language’) upon which the future historical dictionary will be based
- to provide a resource which will facilitate corpus planning and corpus development technology for Gaelic

The first phase of Corpas na Gàidhlig aims to digitise 337 texts from all periods of Gaelic literature and to include a wide variety of genres, including poetry, prose, song, and folklore. These texts have been prioritised in order to provide part of the textual basis for the interuniversity dictionary project, Faclair na Gàidhlig. It is envisaged that, as Corpas na Gàidhlig progresses, a broad range of other texts will be added and that, in time, speech will also be represented by text and sound files. In the long term, the Corpus will be used to update the dictionary.

To date, over 19 million words, mostly Gaelic, have been captured.

The 337 texts to be digitised as part of Phase 1, subject to the appropriate permissions being received, are listed [on the DASG site](http://www.dasg.ac.uk/about/cnag/en).

### [Lancaster Scottish Gaelic corpus](http://www.lancaster.ac.uk/fass/projects/biml/bimls3corpus.htm)

Corpus contents:

- `conversation.txt`: an informal conversation
- `lecture.txt`: a university lecture on philosophy
- `sermon.txt`: a sermon from a Church of Scotland communion service
- `service.txt`: a second sermon
- `talk.txt`: an informal educational/historical/religious talk

All files are encoded in UTF-8.
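
Because the files are UTF-8, it is worth decoding them explicitly rather than relying on the platform default. Here is a minimal Python sketch, assuming the five files have been downloaded into a local directory. The directory path and the `load_corpus` helper are hypothetical; only the file names come from the list above.

```python
from pathlib import Path

# File names as listed above; where they are stored locally is an assumption.
CORPUS_FILES = [
    "conversation.txt",
    "lecture.txt",
    "sermon.txt",
    "service.txt",
    "talk.txt",
]


def load_corpus(corpus_dir):
    """Return {file name: text} for each listed file found in corpus_dir."""
    texts = {}
    for name in CORPUS_FILES:
        path = Path(corpus_dir) / name
        if path.is_file():
            # Decode explicitly as UTF-8 so accented vowels (à, è, ì, ò, ù)
            # are read correctly whatever the platform's default encoding is.
            texts[name] = path.read_text(encoding="utf-8")
    return texts
```

For example, `load_corpus("lancaster-gaelic")` would return a dictionary keyed by file name, silently skipping any file that is missing from that (hypothetical) directory.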

## Contribute

Please add stuff!

## License

[The Unlicense](LICENSE)