{"id":13767584,"url":"https://github.com/Joppewouts/belabBERT","last_synced_at":"2025-05-10T23:30:47.814Z","repository":{"id":216687710,"uuid":"274746796","full_name":"Joppewouts/belabBERT","owner":"Joppewouts","description":"🤧belabBERT: Repository for a new Dutch language model based on the RoBERTa architecture","archived":false,"fork":false,"pushed_at":"2020-07-10T15:22:01.000Z","size":10,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-17T03:30:28.981Z","etag":null,"topics":["bert","language-model","nlp","roberta"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Joppewouts.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2020-06-24T18:56:18.000Z","updated_at":"2023-08-19T08:46:09.000Z","dependencies_parsed_at":null,"dependency_job_id":"643bc937-687c-4fe2-860a-37b4db1f6f3a","html_url":"https://github.com/Joppewouts/belabBERT","commit_stats":null,"previous_names":["joppewouts/belabbert"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Joppewouts%2FbelabBERT","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Joppewouts%2FbelabBERT/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Joppewouts%2FbelabBERT/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Joppewouts%2FbelabBERT/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Joppewou
ts","download_url":"https://codeload.github.com/Joppewouts/belabBERT/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253497296,"owners_count":21917683,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","language-model","nlp","roberta"],"created_at":"2024-08-03T16:01:09.979Z","updated_at":"2025-05-10T23:30:47.577Z","avatar_url":"https://github.com/Joppewouts.png","language":null,"funding_links":[],"categories":["Uncategorized"],"sub_categories":["Uncategorized"],"readme":"# belabBERT 🤧\n\n**Note:** the current release of this model is not fully trained yet; the fully trained version will be released later this month.\n\nA new Dutch RoBERTa-based language model, pretrained on the unshuffled Dutch OSCAR corpus using the masked language modeling (MLM) objective.\nThe model is case-sensitive and includes punctuation. 
The Hugging Face 🤗 [transformers](https://github.com/huggingface/transformers) library was used for the pretraining process.\n\n## Model description\n\n### How to use\n\nYou can use this model directly with a pipeline for masked language modeling:\n\n```python\n\u003e\u003e\u003e from transformers import pipeline\n\u003e\u003e\u003e unmasker = pipeline('fill-mask', model='jwouts/belabBERT_115k', tokenizer='jwouts/belabBERT_115k')\n\u003e\u003e\u003e unmasker(\"Hoi ik ben een \u003cmask\u003e model.\")\n\n[{'sequence': '\u003cs\u003eHoi ik ben een dames model.\u003c/s\u003e',\n  'score': 0.05529128015041351,\n  'token': 3079,\n  'token_str': 'Ġdames'},\n {'sequence': '\u003cs\u003eHoi ik ben een kleding model.\u003c/s\u003e',\n  'score': 0.042242035269737244,\n  'token': 3333,\n  'token_str': 'Ġkleding'},\n {'sequence': '\u003cs\u003eHoi ik ben een mode model.\u003c/s\u003e',\n  'score': 0.04132745787501335,\n  'token': 6541,\n  'token_str': 'Ġmode'},\n {'sequence': '\u003cs\u003eHoi ik ben een horloge model.\u003c/s\u003e',\n  'score': 0.029257522895932198,\n  'token': 7196,\n  'token_str': 'Ġhorloge'},\n {'sequence': '\u003cs\u003eHoi ik ben een sportief model.\u003c/s\u003e',\n  'score': 0.028365155681967735,\n  'token': 15357,\n  'token_str': 'Ġsportief'}]\n```\n\nHere is how to use this model to get the features of a given text in PyTorch:\n\n```python\nfrom transformers import RobertaTokenizer, RobertaModel\ntokenizer = RobertaTokenizer.from_pretrained('jwouts/belabBERT_115k')\nmodel = RobertaModel.from_pretrained('jwouts/belabBERT_115k')\ntext = \"Vervang deze tekst.\"\nencoded_input = tokenizer(text, return_tensors='pt')\noutput = model(**encoded_input)\n```\n\nand in TensorFlow:\n\n```python\nfrom transformers import RobertaTokenizer, TFRobertaModel\ntokenizer = RobertaTokenizer.from_pretrained('jwouts/belabBERT_115k')\nmodel = TFRobertaModel.from_pretrained('jwouts/belabBERT_115k')\ntext = \"Vervang deze tekst.\"\nencoded_input = tokenizer(text, 
return_tensors='tf')\noutput = model(encoded_input)\n```\n\n## Release Notes\n- Publication of repo: 24 / 06 / 2020\n- Publication of model at 150M batches: 10 / 07 / 2020  \n- Publication of fully trained model: TBD\n\n## Training data\nbelabBERT was pretrained on the Dutch version of the **unshuffled** [OSCAR](https://oscar-corpus.com/) corpus, whereas the current state-of-the-art Dutch BERT model, [RobBERT](https://github.com/iPieter/RobBERT), was trained on the **shuffled** version of this corpus.\nAfter deduplication, the size of this corpus was 32GB.\n\n## Training procedure\n\n### Preprocessing\n\nThe texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,000. The inputs of\nthe model are pieces of 512 contiguous tokens that may span multiple documents. The tokenizer was trained on Dutch texts. The beginning of a new document is marked\nwith `\u003cs\u003e` and the end of one by `\u003c/s\u003e`.\n\nThe details of the masking procedure for each sentence are the following:\n- 20% of the tokens are masked.\n- In 80% of the cases, the masked tokens are replaced by `\u003cmask\u003e`.\n- In 10% of the cases, the masked tokens are replaced by a random token different from the one they replace.\n- In the remaining 10% of the cases, the masked tokens are left as is.\n\nContrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).\n\n### Pretraining\n\nThe model was trained on 4 Titan RTX GPUs for 115K steps with a batch size of 1.3K and a sequence length of 512. 
The\noptimizer used is Adam with a learning rate of 5e-5, ![image](https://render.githubusercontent.com/render/math?math=%5Cbeta_%7B1%7D%20%3D%200.9), ![image](https://render.githubusercontent.com/render/math?math=%5Cbeta_%7B2%7D%20%3D%200.98) and\n![image](https://render.githubusercontent.com/render/math?math=%5Cepsilon%20%3D%201e%5E%7B-6%7D), a weight decay of 0.01, learning rate warmup for 20000 steps and linear decay of the learning\nrate after.\n\n## Evaluation results\n\nDue to credit limitations on the HPC, I was not able to fine-tune belabBERT on the common evaluation tasks.\n\nHowever, belabBERT is likely to outperform the current state-of-the-art RobBERT, since belabBERT uses a Dutch tokenizer whereas RobBERT is trained with an English tokenizer.\nOn top of that, RobBERT is trained on a corpus shuffled at line level, while belabBERT is trained on the unshuffled version of the same corpus. This makes belabBERT better able to handle long sequences of text.\n\n\n## Acknowledgements\n\nThis work was carried out on the Dutch national e-infrastructure with the support of [SURF Cooperative](http://surfsara.nl/).\n\nThanks to the builders of the [OSCAR](https://oscar-corpus.com/) corpus for giving me permission to use the unshuffled Dutch version.\n\nA major shout-out to the brilliant [@elslooo](https://github.com/elslooo) for the name of this model 🤗\n\nThanks to the [model card](https://github.com/huggingface/transformers/blob/master/model_cards/roberta-base-README.md) of RoBERTa for the README format/text.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJoppewouts%2FbelabBERT","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FJoppewouts%2FbelabBERT","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJoppewouts%2FbelabBERT/lists"}