{"id":21769975,"url":"https://github.com/mideind/greynircorpus","last_synced_at":"2026-01-03T18:46:28.866Z","repository":{"id":55647357,"uuid":"267017432","full_name":"mideind/GreynirCorpus","owner":"mideind","description":"A large treebank of parsed Icelandic text","archived":false,"fork":false,"pushed_at":"2021-06-30T19:08:52.000Z","size":9440,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-01-26T03:08:15.342Z","etag":null,"topics":["corpus","icelandic","natural-language-processing","nlp","parsing","sentences","treebank"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mideind.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-26T10:51:01.000Z","updated_at":"2024-05-31T17:03:15.000Z","dependencies_parsed_at":"2022-08-15T05:30:47.809Z","dependency_job_id":null,"html_url":"https://github.com/mideind/GreynirCorpus","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirCorpus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirCorpus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirCorpus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mideind%2FGreynirCorpus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mideind","download_url":"https://codeload.github.com/mideind/GreynirCorpus/tar.gz/ref
s/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244745751,"owners_count":20503050,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus","icelandic","natural-language-processing","nlp","parsing","sentences","treebank"],"created_at":"2024-11-26T14:10:51.368Z","updated_at":"2026-01-03T18:46:28.822Z","avatar_url":"https://github.com/mideind.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Join the chat at https://gitter.im/Greynir/Lobby](https://badges.gitter.im/Greynir/Lobby.svg)](https://gitter.im/Greynir/Lobby?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge)\n[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)\n\n\u003cimg src=\"img/greynir-corpus-icon.png\" alt=\"Greynir\" width=\"200\" height=\"200\" align=\"right\" style=\"margin-left:20px; margin-bottom: 30px;\"\u003e\n\n# GreynirCorpus 1.1\n\n\u003cimg src=\"img/is.png\" width=\"16\" height=\"11\" style=\"margin-bottom:20px;\"\u003e Texti\n[á íslensku](#user-content-stór-trjábanki-með-þáttuðum-íslenskum-texta) fyrir neðan.\n\n### A large treebank of parsed Icelandic text\n\n**GreynirCorpus** is a large, parsed treebank of modern Icelandic text.\n\nThe treebank consists of **10 million parsed sentences** containing approximately 140 million words.\nThe sentences were parsed mechanically using the [Greynir](https://github.com/mideind/GreynirPackage)\nrule-based parser. 
The text was extracted from news and government sites on the web in the years\n2015-2021 and parsed into full constituency trees in flat text format. The format is similar to that of\nthe [Penn Treebank](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8216\u0026rep=rep1\u0026type=pdf) and\n[The Icelandic Parsed Historical Corpus (IcePaHC)](https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)).\n\nThe treebank is published under the\n[**Creative Commons CC-BY 4.0 license**](https://creativecommons.org/licenses/by/4.0/)\nand is thus open and free for general use, with attribution.\n\nThe treebank has four parts:\n\n1. A **copper standard** corpus of 10 million mechanically parsed and shuffled sentences.\n   This treebank is contained in ten gzip-compressed files in the [`psd/copper`](psd/copper)\n   directory, each containing one million sentences. Each file is about 200 MB in compressed\n   form and about 1.3 GB uncompressed.\n   Foreign sentences, unparsed sentences, and uncapitalized sentences were excluded from the corpus.\n\n2. A **silver standard** corpus of 600 thousand *unique* mechanically parsed sentences selected\n   based on various grammatical attributes. Found in the [`psd/silver`](psd/silver) directory.\n   Sentences were picked based on their ability to provide enough information for fine-tuning a neural\n   parser. In addition to exclusions in the copper standard, sentences beyond 500,000 were only added\n   if they contained new information after normalization or the parse trees contained rare terminals\n   or non-terminals.\n\n3. A **gold standard** corpus of 5,000 parsed sentences that have been manually corrected\n   and verified. The gold standard is split into a test set, containing 500 sentences, \n   and a development set, containing 4,500 sentences, located in [`testset/psd/`](testset/psd)\n   and [`devset/psd`](devset/psd), respectively. Each text file contains 10 manually\n   annotated sentences. 
The sentences tend to get longer with higher file numbers.\n  \n4. **Extra** corpora, such as headings and short sentences containing fewer than 5 tokens.\n   These are contained in the [`extra/`](extra/) directory.\n   The heading corpus contains 531,855 parsed sentences.\n   The short corpus contains 1,652,938 parsed sentences.\n\nThe mechanically parsed sentences were produced using\n[Greynir v3.1.0](https://github.com/mideind/GreynirPackage/releases/tag/3.1.0) and\n[Tokenizer v3.1.1](https://github.com/mideind/Tokenizer/releases/tag/3.1.0).\n\nAn adapted version of [**Annotald**](https://github.com/mideind/Annotald) can be used to\nparse and work with the files.\n\nA [**test suite**](https://github.com/mideind/ParsingTestPipe) has been developed that uses the gold standard test\nset to measure the performance of the Greynir parser.\n\nThe annotation scheme is described extensively in this 60-page\n[guideline document](https://github.com/mideind/GreynirPackage/blob/master/doc/_static/annotation_instructions.pdf?raw=true)\n(Icelandic-language PDF).\n\n**Please note that [git-lfs](https://git-lfs.github.com/) is required to clone this repository.**\n\nGreynirCorpus is a product of [Miðeind ehf.](https://mideind.is), Reykjavík, Iceland,\nto which it should be attributed.\n\nParts of this project were developed under the auspices of the\nIcelandic Government's 5-year Language Technology Programme for Icelandic,\nmanaged by [Almannarómur](https://almannaromur.is). 
The LT Programme is\ndescribed [here](https://clarin.is/media/uploads/mlt-en.pdf)\n(Icelandic version [here](https://www.stjornarradid.is/lisalib/getfile.aspx?itemid=56f6368e-54f0-11e7-941a-005056bc530c)).\n\nThis project was partially funded by the Icelandic government's\n*Strategic research and development programme for language technology*\n(*Markáætlun í máltækni*), operated by [Rannís](https://rannis.is).\n\n----------\n\n### Stór trjábanki með þáttuðum íslenskum texta\n\n**GreynirCorpus** er stórt safn af fullþáttuðum texta á nútímaíslensku.\n\nTrjábankinn inniheldur **10 milljónir málsgreina**, u.þ.b. 140 milljónir orða.\nTextinn var sóttur á vefsíður fréttamiðla og opinberra aðila á árunum 2015-2021, og\nfullþáttaður í setningatré sem geymd eru í flötu textaformi. Gagnaformið er svipað og í\n[Penn Treebank](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.9.8216\u0026rep=rep1\u0026type=pdf) og\n[The Icelandic Parsed Historical Corpus (IcePaHC)](https://linguist.is/icelandic_treebank/Icelandic_Parsed_Historical_Corpus_(IcePaHC)).\n\nTrjábankinn er gefinn út undir \n[**Creative Commons CC-BY 4.0 leyfi**](https://creativecommons.org/licenses/by/4.0/)\nog er þannig opinn og frjáls til afnota, sé uppruna getið.\n\nTrjábankinn er í fjórum hlutum:\n\n1. **Koparstaðall**, 10 milljón málsgreinar, stokkaðar í handahófskennda röð og vélþáttaðar.\n   Þessi hluti trjábankans er geymdur í tíu skrám í [`psd/copper`](psd/copper) möppunni. Hver skrá er\n   um 200 megabæti í þjöppuðu formi og u.þ.b. 1,3 gígabæti óþjöppuð.\n   Erlendar setningar, óþáttaðar setningar og setningar sem hefjast á lágstaf voru undanskildar.\n\n2. **Silfurstaðall**, 600 þúsund *einstakar*, stokkaðar og vélþáttaðar málsgreinar valdar út frá\n  margvíslegum málfræðilegum eiginleikum. Trjábanka þennan má finna í [`psd/silver`](psd/silver)\n  möppunni. 
Setningar voru valdar sem veittu nægar upplýsingar fyrir fínþjálfun taugaþáttara.\n  Auk takmarkana frá koparstaðlinum voru setningar umfram 500 þúsund aðeins teknar með ef þær\n  innihéldu nýjar upplýsingar eftir textastöðlun eða þáttunartrén innihéldu fátíð lauf eða liði.\n\n3. **Gullstaðall** sem samanstendur af 5.000 málsgreinum og þáttunartrjám þeirra, sem hafa\n   verið handyfirfarin og leiðrétt. Gullstaðallinn skiptist í [prófunarmengi](testset/psd),\n   sem inniheldur 500 setningar, og [þróunarmengi](devset/psd), sem inniheldur 4.500 setningar.\n   Hver textaskrá inniheldur 10 handþáttaðar málsgreinar. Málsgreinarnar eru almennt lengri\n   eftir því sem skrárnúmer hækka.\n\n4. **Aukagögn**, sem eru geymd í [`extra/`](extra/) möppunni.\n   Fyrirsagnasafnið inniheldur 531.855 þáttaðar setningar.\n   Safn stuttra setninga inniheldur 1.652.938 þáttaðar setningar.\n\nVélþáttaðar setningar voru þáttaðar með\n[Greyni útgáfu 3.1.0](https://github.com/mideind/GreynirPackage/releases/tag/3.1.0) og\n[Tokenizer útgáfu 3.1.1](https://github.com/mideind/Tokenizer/releases/tag/3.1.0).\n\nÞáttunarskemanu er ýtarlega lýst í þessu 60 síðna \n[leiðbeiningarskjali (PDF)](https://github.com/mideind/GreynirPackage/blob/master/doc/_static/annotation_instructions.pdf?raw=true).\n\nUppfærð útgáfa af [**Annotald**](https://github.com/mideind/Annotald) er notuð til að vinna\nmeð skjölin.\n\n[**Prófunarsvíta**](https://github.com/mideind/ParsingTestPipe) var þróuð sem notar\ngullprófunarmengið til að mæla árangur Greynisþáttarans.\n\n**Git-afritun á kóðasafninu krefst [git-lfs](https://git-lfs.github.com/).**\n\nGreynirCorpus er gefinn út á vegum [Miðeindar ehf.](https://mideind.is), Reykjavík,\nsem geta skal sem útgefanda þegar gögnin eru notuð samkvæmt CC-BY 4.0 leyfinu.\n\nVerkefnið naut styrks úr *Markáætlun í máltækni* á vegum [Rannís](https://rannis.is).\n\nHlutar safnsins voru þróaðir undir hatti 5 ára máltækniáætlunar ríkisins.\n[Almannarómur](https://almannaromur.is) sér um 
framkvæmd áætlunarinnar. Áætluninni er lýst\n[hér](https://www.stjornarradid.is/lisalib/getfile.aspx?itemid=56f6368e-54f0-11e7-941a-005056bc530c)\n(ensk útgáfa [hér](https://clarin.is/media/uploads/mlt-en.pdf))\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmideind%2Fgreynircorpus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmideind%2Fgreynircorpus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmideind%2Fgreynircorpus/lists"}