{"id":15707836,"url":"https://github.com/stefan-it/plur","last_synced_at":"2026-02-12T22:33:07.730Z","repository":{"id":110974101,"uuid":"197845113","full_name":"stefan-it/plur","owner":"stefan-it","description":"Pre-trained Language Models for Under-represented Languages in NLP","archived":false,"fork":false,"pushed_at":"2019-10-05T00:02:17.000Z","size":6,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-09-17T16:57:58.041Z","etag":null,"topics":["elmo","flair","nlp","under-represented"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stefan-it.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-07-19T21:34:58.000Z","updated_at":"2022-06-20T08:37:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"0e444549-086a-431a-94a4-11ca0d48d0d9","html_url":"https://github.com/stefan-it/plur","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/stefan-it/plur","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fplur","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fplur/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fplur/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fplur/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stefan-it","download_url":"https
://codeload.github.com/stefan-it/plur/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stefan-it%2Fplur/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29383945,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T22:07:52.078Z","status":"ssl_error","status_checked_at":"2026-02-12T22:07:49.026Z","response_time":55,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["elmo","flair","nlp","under-represented"],"created_at":"2024-10-03T20:41:30.788Z","updated_at":"2026-02-12T22:33:07.692Z","avatar_url":"https://github.com/stefan-it.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# plur: **P**re-trained **L**anguage Models for **U**nder-**r**epresented Languages\n\nThis repository contains pre-trained language models for under-represented languages in NLP.\n\nLanguage models are available for Flair and ELMo (soon: XLNet). 
All trained language models\nare evaluated on NER and PoS tagging downstream tasks with Flair.\n\n# Basque\n\n## Corpus\n\nFlair Embeddings and ELMo are trained on a recent Wikipedia dump and on various texts\ncollected from OPUS and the Leipzig Corpora Collection.\n\nSome statistics:\n\n* Number of tokens: 57,110,741 (untokenized), 72,683,662 (tokenized)\n* Size: 417M (untokenized), 440M (tokenized)\n\nRemember: Flair Embeddings are trained on raw and untokenized texts, so no tokenization is needed.\nThe underlying language model is character-based. ELMo, in contrast, needs tokenized\ninput. For tokenization we use a very simple method that is adopted from the\nTensor2Tensor repository.\n\n## ELMo\n\nWe use the official implementation from the [`bilm-tf` repository](https://github.com/allenai/bilm-tf).\nDue to limited hardware resources, we limit the vocabulary to 700,000 tokens. We train for 10 epochs\non a GTX 1080.\n\n### Release:\n\n* [ELMo options file](https://schweter.eu/cloud/eu-elmo/options.json)\n* [ELMo weights](https://schweter.eu/cloud/eu-elmo/weights.hdf5)\n\n### Flair import\n\nThe trained ELMo model can easily be used in Flair:\n\n```python\nfrom flair.embeddings import ELMoEmbeddings\n\nembeddings = ELMoEmbeddings(options_file=\"https://schweter.eu/cloud/eu-elmo/options.json\",\n                            weight_file=\"https://schweter.eu/cloud/eu-elmo/weights.hdf5\")\n```\n\n## Flair Embeddings\n\nWe follow the official recommendations for training Flair Embeddings from the\n[Flair documentation](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md).\n\nThe following parameters are used:\n\n| Parameter         | Value\n| ----------------- | ------\n| `hidden_size`     | 2048\n| `dropout`         | 0.1\n| `nlayers`         | 1\n| `sequence_length` | 250\n| `mini_batch_size` | 100\n| `max_epochs`      | 10\n| `learning_rate`   | 20\n\nWe did not decrease the initial 
learning rate during training.\n\n### Release:\n\n* [Forward Flair Embeddings](https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-forward-v0.2.pt)\n* [Backward Flair Embeddings](https://schweter.eu/cloud/flair-lms/lm-eu-opus-large-backward-v0.2.pt)\n\n### Flair import\n\n```python\nfrom flair.embeddings import FlairEmbeddings\n\nembeddings_forward  = FlairEmbeddings(\"lm-eu-opus-large-forward-v0.2.pt\")\nembeddings_backward = FlairEmbeddings(\"lm-eu-opus-large-backward-v0.2.pt\")\n```\n\n**Notice**: Our trained embeddings are included in Flair \u003e= *0.4.3*. So you can easily load them with:\n\n```python\nfrom flair.embeddings import FlairEmbeddings\n\nembeddings_forward  = FlairEmbeddings(\"eu-forward\")\nembeddings_backward = FlairEmbeddings(\"eu-backward\")\n```\n\n## NER\n\nWe use the Basque Named Entities Corpus (EIEC) that can be obtained from [here](http://ixa.eus/node/4486?language=en).\nThis corpus has a total of 2552 training and 842 test sentences. For evaluation, the official\nCoNLL-2003 evaluation script is used. 
We report the F-Score averaged over three runs.\n\n| Language model   | Run 1 | Run 2 | Run 3 | Final F-Score\n| ---------------- | ----- | ----- | ----- | -------------\n| ELMo             | 81.50 | 83.13 | 81.41 | **82.01**\n| Flair Embeddings | 81.62 | 81.56 | 81.51 | 81.56\n\n## UD\n\nWe use the Basque Universal Dependencies in version 1.2 for comparison.\nThe corpus has a total of 5,396 training, 1,798 development and 1,799 test sentences.\nWe report averaged accuracy over three runs.\n\n| Language model   | Run 1 | Run 2 | Run 3 | Final Accuracy\n| ---------------- | ----- | ----- | ----- | --------------\n| ELMo             | 97.35 | 97.33 | 97.38 | 97.35\n| Flair Embeddings | 97.60 | 97.67 | 97.67 | **97.65**\n| mBERT uncased    | 95.06 | 94.62 | 94.70 | 94.79\n| mBERT cased      | 94.26 | 94.43 | 94.33 | 94.35\n\n## WikiANN\n\nExperiments on the WikiANN dataset for Basque are coming soon.\n\n# Tamil\n\n## Corpus\n\nFlair Embeddings and ELMo are trained on a recent Wikipedia dump and on various texts\ncollected from OPUS and the Leipzig Corpora Collection.\n\nSome statistics:\n\n* Number of tokens: 18,365,106 (untokenized), 21,581,878 (tokenized)\n* Size: 423M (untokenized), 426M (tokenized)\n\n## ELMo\n\nWe use the official implementation from the [`bilm-tf` repository](https://github.com/allenai/bilm-tf).\nDue to limited hardware resources, we limit the vocabulary to 700,000 tokens. 
We train for 10 epochs\non a GTX 1080.\n\n### Release:\n\n* [ELMo options file](https://schweter.eu/cloud/ta-elmo/options.json)\n* [ELMo weights](https://schweter.eu/cloud/ta-elmo/weights.hdf5)\n\n### Flair import\n\nThe trained ELMo model can easily be used in Flair:\n\n```python\nfrom flair.embeddings import ELMoEmbeddings\n\nembeddings = ELMoEmbeddings(options_file=\"https://schweter.eu/cloud/ta-elmo/options.json\",\n                            weight_file=\"https://schweter.eu/cloud/ta-elmo/weights.hdf5\")\n```\n\n## Flair Embeddings\n\nWe follow the official recommendations for training Flair Embeddings from the\n[Flair documentation](https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_9_TRAINING_LM_EMBEDDINGS.md).\n\nThe following parameters are used:\n\n| Parameter         | Value\n| ----------------- | ------\n| `hidden_size`     | 2048\n| `dropout`         | 0.1\n| `nlayers`         | 1\n| `sequence_length` | 250\n| `mini_batch_size` | 100\n| `max_epochs`      | 10\n| `learning_rate`   | 20\n\nWe did not decrease the initial learning rate during training.\n\n### Release:\n\n* [Forward Flair Embeddings](https://schweter.eu/cloud/flair-lms/lm-ta-opus-large-forward-v0.1.pt)\n* [Backward Flair Embeddings](https://schweter.eu/cloud/flair-lms/lm-ta-opus-large-backward-v0.1.pt)\n\n### Flair import\n\n```python\nfrom flair.embeddings import FlairEmbeddings\n\nembeddings_forward  = FlairEmbeddings(\"lm-ta-opus-large-forward-v0.1.pt\")\nembeddings_backward = FlairEmbeddings(\"lm-ta-opus-large-backward-v0.1.pt\")\n```\n\n**Notice**: Our trained embeddings are included in Flair \u003e= *0.4.3*. 
So you can easily load them with:\n\n```python\nfrom flair.embeddings import FlairEmbeddings\n\nembeddings_forward  = FlairEmbeddings(\"ta-forward\")\nembeddings_backward = FlairEmbeddings(\"ta-backward\")\n```\n\n## UD\n\nWe use the Tamil Universal Dependencies in version 1.2 for comparison.\nThe corpus has a total of 400 training, 80 development and 120 test sentences.\nWe report averaged accuracy over three runs. We use Subword Embeddings with different\nvocabulary sizes and a fixed dimension of 300 for both Flair and ELMo models.\n\n### Flair\n\n| BPE vocab | Run 1 | Run 2 | Run 3 | Final Accuracy\n| --------- | ----- | ----- | ----- | --------------\n| 200,000   | 92.31 | 91.55 | 92.46 | 92.11\n| 100,000   | 92.06 | 92.51 | 92.51 | 92.36\n| 50,000    | 92.51 | 92.61 | 93.11 | **92.74**\n| 25,000    | 92.61 | 92.06 | 92.81 | 92.49\n| 10,000    | 91.86 | 92.31 | 91.30 | 91.82\n|  5,000    | 92.06 | 92.56 | 92.51 | 92.37\n|  3,000    | 92.31 | 92.86 | 92.76 | 92.64\n|  1,000    | 92.41 | 92.36 | 93.31 | 92.69\n\n### ELMo\n\n| BPE vocab | Run 1 | Run 2 | Run 3 | Final Accuracy\n| --------- | ----- | ----- | ----- | --------------\n| 200,000   | 91.91 | 91.45 | 92.76 | 92.04\n| 100,000   | 91.96 | 92.01 | 92.16 | 92.04\n| 50,000    | 91.96 | 92.46 | 91.75 | **92.06**\n| 25,000    | 92.26 | 90.90 | 92.11 | 91.76\n| 10,000    | 91.91 | 91.50 | 91.65 | 91.69\n|  5,000    | 92.36 | 91.55 | 91.91 | 91.94\n|  3,000    | 92.06 | 91.96 | 92.06 | 92.03\n|  1,000    | 92.06 | 91.80 | 91.70 | 91.85\n\n# ToDo\n\n* [ ] WikiANN experiments\n* [ ] Run NER and PoS tagging experiments on (already) trained XLNet models\n* [ ] Add training scripts\n* [ ] Play around with `allennlp` to add configuration for training NER and PoS tagging 
models\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefan-it%2Fplur","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstefan-it%2Fplur","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstefan-it%2Fplur/lists"}