{"id":13741222,"url":"https://github.com/alvations/SeedLing","last_synced_at":"2025-05-08T21:33:13.879Z","repository":{"id":74302713,"uuid":"50860971","full_name":"alvations/SeedLing","owner":"alvations","description":"Building and Using A Seed Corpus for the Human Language Project ","archived":false,"fork":false,"pushed_at":"2018-02-09T01:03:16.000Z","size":14449,"stargazers_count":11,"open_issues_count":0,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-30T08:11:26.881Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/alvations.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2016-02-01T18:30:24.000Z","updated_at":"2024-09-08T18:20:21.000Z","dependencies_parsed_at":"2024-01-07T18:10:39.120Z","dependency_job_id":null,"html_url":"https://github.com/alvations/SeedLing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvations%2FSeedLing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvations%2FSeedLing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvations%2FSeedLing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/alvations%2FSeedLing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/alvations","download_url":"https://codeload.github.com/alvations/SeedLing/tar.gz/refs/heads/master","host":{"name":"GitHub","url"
:"https://github.com","kind":"github","repositories_count":253153163,"owners_count":21862318,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T04:00:56.978Z","updated_at":"2025-05-08T21:33:13.823Z","avatar_url":"https://github.com/alvations.png","language":"Python","readme":"SeedLing\n========\n\nBuilding and using a seed corpus for the *Human Language Project* (Abney and Bird, 2010).\n\nThe SeedLing corpus in this repository includes data from:\n*  **ODIN**: Online Database of Interlinear Text \n*  **Omniglot**: Useful foreign phrases from www.omniglot.com\n*  **UDHR**: Universal Declaration of Human Rights\n\nThe SeedLing API includes scripts to access data/information from:\n* **SeedLing**: the different data sources that form the SeedLing corpus (`odin.py`, `omniglot.py`, `udhr.py`, `wikipedia.py`)\n* **WALS**: Language information from the World Atlas of Language Structures (`miniwals.py`)\n\n**FAQs**:\n\n- To use the SeedLing corpus through the Python API, please follow the instructions in the **Usage** section.\n- To download the plaintext version of the SeedLing corpus (excluding Wikipedia data), click here: https://goo.gl/qBa4bw \u003c!--https://db.tt/N7hV3gwW--\u003e\n- To download the Wikipedia data, please follow the **Getting Wikipedia** section.\n\n\n***\nUsage\n=====\n\nTo access the SeedLing corpus from its various data sources:\n\n```\nfrom seedling import udhr, omniglot, odin\n\n# Accessing ODIN IGTs:\n\u003e\u003e\u003e for lang, igts in odin.igts():\n\u003e\u003e\u003e   for igt in igts:\n\u003e\u003e\u003e     print lang, igt\n\n# Accessing 
Omniglot phrases\n\u003e\u003e\u003e for lang, sent, trans in omniglot.phrases():\n\u003e\u003e\u003e   print lang, sent, trans\n\n# Accessing UDHR sentences.\n\u003e\u003e\u003e for lang, sent in udhr.sents():\n\u003e\u003e\u003e   print lang, sent\n```\n\nTo access the SIL and WALS information:\n\n```\nfrom seedling import miniwals\n\n# Accessing WALS information\n\u003e\u003e\u003e wals = miniwals.MiniWALS()\n\u003e\u003e\u003e print wals['eng']\n{u'glottocode': u'stan1293', u'name': u'English', u'family': u'Indo-European', u'longitude': u'0.0', u'sample 200': u'True', u'latitude': u'52.0', u'genus': u'Germanic', u'macroarea': u'Eurasia', u'sample 100': u'True'}\n```\n\nDetailed usage of the API can also be found in `demo.py`.\n\n\n***\nGetting Wikipedia\n====\n\nThere are two ways to access the Wikipedia data:\n 1. Plant your own Wiki\n 2. Access it from our cloud storage\n\n\nPlant your own Wiki\n----\n\nWe encourage SeedLing users to take part in building the Wikipedia data for the SeedLing corpus. A fruitful experience, you will find.\n\nPlease **ENSURE** that you have sufficient space on your hard disk (~50-70GB). Downloading and cleaning the dumps for **ALL** languages available in Wikipedia might take up to a week.\n\n**For the lazy**: run the script `plant_wiki.py` and it will produce the cleaned plaintext Wikipedia data as presented in the SeedLing publication:\n\n```\n$ python plant_wiki.py \u0026\n```\n\n\nFor more detailed, step-by-step instructions:\n\n - First, you have to download the Wikipedia dumps. We used the `wp-download` (https://github.com/babilen/wp-download) tool when building the SeedLing corpus. \n - Then, you have to extract the text from the Wikipedia dumps. 
We used the `Wikipedia Extractor` (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to convert Wikipedia dumps into text files.\n - Finally, you can use the cleaning function in `wikipedia.py` to clean the Wikipedia data and assign ISO 639-3 language codes to the text files. The cleaning function can be called as follows:\n\n```\nimport codecs\nimport os\nfrom seedling.wikipedia import clean\n\nextracted_wiki_dir = \"/home/yourusername/path/to/extracted/wiki/\"\ncleaned_wiki_dir = \"/home/yourusername/path/to/cleaned/wiki/\"\n\n# Clean each extracted file and write the result under the same filename.\nfor filename in os.listdir(extracted_wiki_dir):\n  infile = os.path.join(extracted_wiki_dir, filename)\n  outfile = os.path.join(cleaned_wiki_dir, filename)\n  with codecs.open(infile, 'r', 'utf8') as fin, codecs.open(outfile, 'w', 'utf8') as fout:\n    fout.write(clean(fin.read()))\n```\n\nPlease feel free to contact the collaborators in the SeedLing project if you encounter problems with getting the Wikipedia data.\n\nAccess it from our cloud storage\n----\n\nTo be updated.\n\n***\nCite\n=====\n\nTo cite the SeedLing corpus:\n\nGuy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of\n*The use of Computational methods in the study of Endangered Languages (ComputEL) Workshop*. Baltimore, USA.\n\nin `bibtex`:\n\n```\n@InProceedings{seedling2014,\n  author    = {Guy Emerson and Liling Tan and Susanne Fertmann and Alexis Palmer and Michaela Regneri},\n  title     = {SeedLing: Building and using a seed corpus for the Human Language Project},\n  booktitle = {Proceedings of The use of Computational methods in the study of Endangered Languages (ComputEL) Workshop},\n  month     = {June},\n  year      = {2014},\n  address   = {Baltimore, USA},\n  publisher = {Association for Computational Linguistics},\n  pages     = {},\n  url       = {}\n}\n```\n\n***\nReferences\n====\n\n - Steven Abney and Steven Bird. 2010. The Human Language Project: Building a universal corpus of the world’s languages. 
In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 88–97. Association for Computational Linguistics.\n\n - Simon Ager. Omniglot - writing systems and languages of the world. Retrieved from www.omniglot.com.\n\n - William D Lewis and Fei Xia. 2010. Developing ODIN: A multilingual repository of annotated language data for hundreds of the world’s languages. Literary and Linguistic Computing, 25(3):303–319.\n\n - UN General Assembly, Universal Declaration of Human Rights, 10 December 1948, 217 A (III), available at: http://www.refworld.org/docid/3ae6b3712c.html [accessed 26 April 2014]\n\n","funding_links":[],"categories":["Software"],"sub_categories":["Utilities"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falvations%2FSeedLing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falvations%2FSeedLing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falvations%2FSeedLing/lists"}