{"id":18493795,"url":"https://github.com/neurocode-io/icelandic-language-model","last_synced_at":"2026-05-17T00:05:33.005Z","repository":{"id":56577455,"uuid":"297872691","full_name":"neurocode-io/icelandic-language-model","owner":"neurocode-io","description":"Icelandic language model","archived":false,"fork":false,"pushed_at":"2020-11-10T12:36:41.000Z","size":10637,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-01T16:08:52.846Z","etag":null,"topics":["azure","huggingface-transformers","icelandic-language","neural-networks","nlp","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neurocode-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-09-23T06:20:13.000Z","updated_at":"2023-10-18T19:08:04.000Z","dependencies_parsed_at":"2022-08-15T21:20:25.890Z","dependency_job_id":null,"html_url":"https://github.com/neurocode-io/icelandic-language-model","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/neurocode-io/icelandic-language-model","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurocode-io%2Ficelandic-language-model","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurocode-io%2Ficelandic-language-model/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurocode-io%2Ficelandic-language-model/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurocode-io%2Ficelandic-language-model/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neurocode-io","download_url":"https://codeload.github.com/neurocode-io/icelandic-language-model/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neurocode-io%2Ficelandic-language-model/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279008291,"owners_count":26084431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","huggingface-transformers","icelandic-language","neural-networks","nlp","python"],"created_at":"2024-11-06T13:16:04.306Z","updated_at":"2025-10-11T18:35:38.510Z","avatar_url":"https://github.com/neurocode-io.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IsRoBERTa - Icelandic transformer language model\n\n\n**Natural language processing (NLP)** is one of the fields in AI where software analyzes large amounts of text. This has many applications; among them, we considered: \n- **Masked language modeling**: predict one or more *masked* (unknown) words given the other words in the sentence.
\n- **Named-Entity Recognition (NER)**: locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, etc.\n\nWe trained a model, from scratch, for the Icelandic language using the **Huggingface** library. \n\nIcelandic is the official language of Iceland. As one of the Nordic languages, it belongs to the family of the Germanic languages. With Iceland's population at only 350 thousand, the language is definitely not widespread.\n\n\n\n## 1. Dataset\nThe input in NLP is text. We used the Icelandic portion of the OSCAR corpus from INRIA. The Icelandic portion of the dataset is only 1.5 GB. Thus, as a next step, we will concatenate the portion from OSCAR with the Icelandic sub-corpus of the Leipzig Corpora Collection, which comprises text from diverse sources like news, literature, and Wikipedia. \n\n## 2. Tokenization\nTokenization breaks the raw text into words or subwords, called tokens, which are then converted to ids. The model works on this sequence of tokens to capture context and interpret the meaning of the text. \n\nA great article on tokenizers can be read [here](https://blog.floydhub.com/tokenization-nlp/).\n\n\nWe used the **Tokenizers** library from Huggingface, in particular a **byte-level Byte-pair encoding tokenizer** with the same special tokens as **RoBERTa** (for an overview of byte-level BPE, i.e. byte-level Byte-Pair Encoding, read the paragraphs [Byte-Pair Encoding](https://huggingface.co/transformers/master/tokenizer_summary.html#byte-pair-encoding) and [Byte-level BPE](https://huggingface.co/transformers/master/tokenizer_summary.html#byte-level-bpe) of the tutorial [Tokenizer summary](https://huggingface.co/transformers/master/tokenizer_summary.html)). 
Training a **byte-level Byte-pair encoding tokenizer** with the same special tokens as [**RoBERTa**](https://huggingface.co/transformers/master/model_doc/roberta.html) lets us build a vocabulary from an alphabet of single bytes, so every word can be decomposed into tokens (no more \u003cunk\u003e tokens!). \n\n\n## 3. Training \u0026 Infrastructure\n\nTraining a language model is computationally heavy and very time-consuming. As a first attempt we tried to train the model in Google Colab using GPUs, but could not because of insufficient RAM. We then moved to the cloud, in particular Azure.  \n\nWe created an [NC_6_promo machine](https://docs.microsoft.com/en-us/azure/virtual-machines/nc-series?toc=/azure/virtual-machines/linux/toc.json\u0026bc=/azure/virtual-machines/linux/breadcrumb/toc.json), which comes with an Nvidia K80 GPU.\n\nStill, the training took 3 days!\n\nTo package the code we used Docker. The image lives on [Docker Hub](https://hub.docker.com/r/donchev7/icelandic-model).\n\n\n## 4. Do it yourself\n\nIf you want to run the code, you'll need an Azure account, in particular an Azure storage account. \n\nIf you have access to Azure infrastructure, you can start by creating an **.env** file:\n\n```env\nACCESS_KEY=\u003cAzure Storage Access Key\u003e\nWANDB_API_KEY=\u003cWandb API Key\u003e\nWANDB_PROJECT=\u003cNot mandatory\u003e\nVM_ADMIN_PASSWD=\u003cIf you use our infra scripts\u003e\n```\n\nAfterwards you can create the machine:\n\n```\nmake create_machine\n```\n\nYou'll see the IP address of the machine in your terminal. 
Use the IP to connect to your machine and run the packaged software:\n\n```\nssh azureuser@\u003cmachine_ip\u003e\n\nscreen\n\ncat \u003c\u003c EOF \u003e .env\nACCESS_KEY=\u003cMandatory Azure AccessKey\u003e\nWANDB_API_KEY=\u003cOptional\u003e\nWANDB_PROJECT=\u003cOptional\u003e\nEOF\n\ndocker run --gpus all -it --rm \\\n    --env-file=.env \\\n    --ipc=host \\\n    -v /tmp:/tmp \\\n    donchev7/icelandic-model:vc0c9243 python src/train_xml_roberta_large.py --data_dir=/tmp --run_name=xml_roberta_large_malfong_ner\n\n\n# Press CTRL+a, then d, to detach from your screen session\n\nexit\n```\nCheck back later:\n\n```\nssh azureuser@\u003cmachine_ip\u003e\n\nscreen -r\n```\n\n\nUse the trained model for NER:\n```python\nfrom transformers import AutoTokenizer, pipeline\n\n# Load the IsRoBERTa tokenizer by its model id\ntokenizer = AutoTokenizer.from_pretrained(\"neurocode/IsRoBERTa\")\n\n# Build a NER pipeline from the model saved under ./data\nnlp = pipeline(\"ner\", model=\"./data/isroberta_malfong_ner/results\", tokenizer=tokenizer)\nres = nlp(\"Eftir að henni lýkur er hægt að gerast áskrifandi að efni vefjarins fyrir 1.290 kr. á mánuði.\")\n\n# Join the tokens returned by the pipeline back into a readable string\ntokens = [r[\"word\"] for r in res]\nprint(tokenizer.convert_tokens_to_string(tokens))\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneurocode-io%2Ficelandic-language-model","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneurocode-io%2Ficelandic-language-model","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneurocode-io%2Ficelandic-language-model/lists"}