https://github.com/alvations/SeedLing

Building and Using A Seed Corpus for the Human Language Project

Host: GitHub
URL: https://github.com/alvations/SeedLing
Owner: alvations
Created: 2016-02-01T18:30:24.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2018-02-09T01:03:16.000Z (over 7 years ago)
Last Synced: 2025-04-30T08:11:26.881Z (about 2 months ago)
Language: Python
Size: 13.8 MB
Stars: 11
Watchers: 6
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

low-resource-languages - SeedLing - Building and Using A Seed Corpus for the Human Language Project. (Software / Utilities)

README

SeedLing
========

Building and using a seed corpus for the *Human Language Project* (Abney and Bird, 2010).

The SeedLing corpus on this repository includes the data from:

*  **ODIN**: Online Database of Interlinear Text 

*  **Omniglot**: Useful foreign phrases from www.omniglot.com

*  **UDHR**: Universal Declaration of Human Rights

The SeedLing API includes scripts to access data/information from:

* **SeedLing**: the data sources that form the SeedLing corpus (`odin.py`, `omniglot.py`, `udhr.py`, `wikipedia.py`)

* **WALS**: Language information from World Atlas of Language Structures (`miniwals.py`)

**FAQs**:

- To use the SeedLing corpus through the Python API, please follow the instructions in the **Usage** section.

- To download the plaintext version of the SeedLing corpus (excluding Wikipedia data), click here: https://goo.gl/qBa4bw

- To download the Wikipedia data, please follow the instructions in the **Getting Wikipedia** section.

***

Usage
=====

To access the SeedLing corpus from its various data sources:

```
>>> from seedling import udhr, omniglot, odin

# Accessing ODIN IGTs (interlinear glossed texts).
>>> for lang, igts in odin.igts():
...   for igt in igts:
...     print lang, igt

# Accessing Omniglot phrases.
>>> for lang, sent, trans in omniglot.phrases():
...   print lang, sent, trans

# Accessing UDHR sentences.
>>> for lang, sent in udhr.sents():
...   print lang, sent
```
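Because each accessor above yields plain tuples, simple corpus statistics can be gathered with ordinary Python. The following is a minimal sketch, assuming `udhr.sents()` yields `(lang, sent)` pairs as in the loop above. `count_sentences` is a hypothetical helper, not part of the SeedLing API, and the `sample` pairs are illustrative stand-ins for real corpus output.

```python
from collections import Counter

def count_sentences(pairs):
    """Count sentences per language code from (lang, sent) pairs."""
    counts = Counter()
    for lang, _sent in pairs:
        counts[lang] += 1
    return counts

# Illustrative stand-in for udhr.sents(); with SeedLing installed,
# pass udhr.sents() instead (assumed to yield the same pair shape).
sample = [
    ("eng", "All human beings are born free and equal in dignity and rights."),
    ("deu", "Jeder hat das Recht auf Leben, Freiheit und Sicherheit der Person."),
    ("eng", "Everyone has the right to life, liberty and security of person."),
]
counts = count_sentences(sample)
```

The same helper should work for any of the accessors once their tuples are reduced to `(lang, text)` pairs.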

To access the SIL and WALS information:

```
>>> from seedling import miniwals

# Accessing WALS information
>>> wals = miniwals.MiniWALS()
>>> print wals['eng']
{u'glottocode': u'stan1293', u'name': u'English', u'family': u'Indo-European', u'longitude': u'0.0', u'sample 200': u'True', u'latitude': u'52.0', u'genus': u'Germanic', u'macroarea': u'Eurasia', u'sample 100': u'True'}
```
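Note that every value in the record above is a string, including the coordinates. As a rough sketch of how such records might be post-processed, the snippet below groups language codes by family and converts coordinates to floats. `group_by_family` and `coordinates` are hypothetical helpers, not part of `miniwals`, and the only record used is the `eng` entry copied from the output above.

```python
from collections import defaultdict

def group_by_family(records):
    """Map each family name to a sorted list of language codes."""
    families = defaultdict(list)
    for code, info in records.items():
        families[info["family"]].append(code)
    return {family: sorted(codes) for family, codes in families.items()}

def coordinates(info):
    """Return (latitude, longitude) as floats; the record stores them as strings."""
    return float(info["latitude"]), float(info["longitude"])

# The 'eng' record is copied from the output above. In real use, build
# `records` from MiniWALS lookups such as wals['eng'].
records = {
    "eng": {
        "glottocode": "stan1293", "name": "English",
        "family": "Indo-European", "genus": "Germanic",
        "macroarea": "Eurasia", "latitude": "52.0", "longitude": "0.0",
        "sample 100": "True", "sample 200": "True",
    },
}
by_family = group_by_family(records)
lat, lon = coordinates(records["eng"])
```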

Detailed usage of the API can also be found in `demo.py`.

***

Getting Wikipedia
====

There are two ways to access the Wikipedia data:

 1. Plant your own Wiki

 2. Access it from our cloud storage

Plant your own Wiki
----

We encourage SeedLing users to build the Wikipedia data for the SeedLing corpus themselves; you will find it a fruitful experience.

Please **ENSURE** that you have sufficient space on your hard disk (~50-70GB). Downloading and cleaning **ALL** languages available in Wikipedia might take up to a week.
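The free-space requirement above can be checked before starting a long download. This is a minimal sketch using the standard library's `shutil.disk_usage` (Python 3.3 or later); `has_enough_space` is a hypothetical helper, and the `"."` path is a placeholder for wherever you plan to store the dumps.

```python
import shutil

def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3

# 70 GB is the upper end of the estimate above; "." is a placeholder path.
if not has_enough_space(".", 70):
    print("Warning: less than 70 GB free; the full Wikipedia build may not fit.")
```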

**For the lazy**: run the script `plant_wiki.py` and it will produce the cleaned plaintext Wikipedia data as presented in the SeedLing publication:

```

$ python plant_wiki.py &

```

For more detailed, step-by-step instructions:

 - First, download the Wikipedia dumps. We used the `wp-download` (https://github.com/babilen/wp-download) tool when building the SeedLing corpus.

 - Then, extract the text from the Wikipedia dumps. We used the `Wikipedia Extractor` (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to convert Wikipedia dumps into text files.

 - Finally, use the cleaning function in `wikipedia.py` to clean the Wikipedia data and assign ISO 639-3 language codes to the text files. The cleaning function can be called as follows:

```
import codecs
import os

from seedling.wikipedia import clean

extracted_wiki_dir = "/home/yourusername/path/to/extracted/wiki/"
cleaned_wiki_dir = "/home/yourusername/path/to/cleaned/wiki/"

# os.listdir() returns bare filenames, so join each one with its directory.
for filename in os.listdir(extracted_wiki_dir):
  infile = os.path.join(extracted_wiki_dir, filename)
  outfile = os.path.join(cleaned_wiki_dir, filename)
  with codecs.open(infile, 'r', 'utf8') as fin, codecs.open(outfile, 'w', 'utf8') as fout:
    fout.write(clean(fin.read()))
```
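The loop above handles a flat directory of extracted files. If your extraction step wrote its output into nested subdirectories (an assumption about your extraction settings, not something the SeedLing scripts guarantee), a recursive variant may be more convenient. The sketch below mirrors the source tree into the destination; `clean_tree` is a hypothetical helper, and the cleaning function is passed in as a parameter so that `seedling.wikipedia.clean` or any other text-to-text function can be used.

```python
import codecs
import os

def clean_tree(src_dir, dst_dir, clean_fn):
    """Apply clean_fn to every file under src_dir, mirroring the layout in dst_dir."""
    written = []
    for dirpath, _dirnames, filenames in os.walk(src_dir):
        # Recreate the current subdirectory under the destination root.
        rel = os.path.relpath(dirpath, src_dir)
        out_dir = os.path.normpath(os.path.join(dst_dir, rel))
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        for filename in filenames:
            infile = os.path.join(dirpath, filename)
            outfile = os.path.join(out_dir, filename)
            with codecs.open(infile, 'r', 'utf8') as fin, \
                 codecs.open(outfile, 'w', 'utf8') as fout:
                fout.write(clean_fn(fin.read()))
            written.append(outfile)
    return written
```

With SeedLing installed, a call such as `clean_tree(extracted_wiki_dir, cleaned_wiki_dir, clean)` would be the intended usage.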

Please feel free to contact the collaborators in the SeedLing project if you encounter problems with getting the Wikipedia data.

Access it from our cloud storage
----

To be updated.

***

Cite
=====

To cite the SeedLing corpus:

Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In *Proceedings of The use of Computational methods in the study of Endangered Languages (ComputEL) Workshop*. Baltimore, USA.

In BibTeX:

```
@InProceedings{seedling2014,
  author    = {Guy Emerson and Liling Tan and Susanne Fertmann and Alexis Palmer and Michaela Regneri},
  title     = {SeedLing: Building and using a seed corpus for the Human Language Project},
  booktitle = {Proceedings of The use of Computational methods in the study of Endangered Languages (ComputEL) Workshop},
  month     = {June},
  year      = {2014},
  address   = {Baltimore, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {},
  url       = {}
}
```

***

References
====

 - Steven Abney and Steven Bird. 2010. The Human Language Project: Building a universal corpus of the world’s languages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 88–97. Association for Computational Linguistics.

 - Simon Ager. Omniglot - writing systems and languages of the world. Retrieved from www.omniglot.com.

 - William D Lewis and Fei Xia. 2010. Developing ODIN: A multilingual repository of annotated language data for hundreds of the world’s languages. Literary and Linguistic Computing, 25(3):303–319.

 - UN General Assembly, Universal Declaration of Human Rights, 10 December 1948, 217 A (III), available at: http://www.refworld.org/docid/3ae6b3712c.html [accessed 26 April 2014]
