{"id":17874011,"url":"https://github.com/bminixhofer/gerpt2","last_synced_at":"2025-03-21T22:31:42.022Z","repository":{"id":112619076,"uuid":"311728858","full_name":"bminixhofer/gerpt2","owner":"bminixhofer","description":"German small and large versions of GPT2.","archived":false,"fork":false,"pushed_at":"2022-05-11T09:15:49.000Z","size":62,"stargazers_count":20,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-18T05:43:54.599Z","etag":null,"topics":["common-crawl","german","gpt2","language-model","machine-learning","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bminixhofer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-10T17:03:27.000Z","updated_at":"2024-10-22T19:42:40.000Z","dependencies_parsed_at":"2023-06-09T22:00:26.389Z","dependency_job_id":null,"html_url":"https://github.com/bminixhofer/gerpt2","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bminixhofer%2Fgerpt2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bminixhofer%2Fgerpt2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bminixhofer%2Fgerpt2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bminixhofer%2Fgerpt2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bminixhofer","downloa
d_url":"https://codeload.github.com/bminixhofer/gerpt2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244880283,"owners_count":20525505,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["common-crawl","german","gpt2","language-model","machine-learning","nlp"],"created_at":"2024-10-28T11:07:18.770Z","updated_at":"2025-03-21T22:31:42.014Z","avatar_url":"https://github.com/bminixhofer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GerPT2\n\nGerman large and small versions of GPT2:\n\n- https://huggingface.co/benjamin/gerpt2\n- https://huggingface.co/benjamin/gerpt2-large\n\nSee the [GPT2 model card](https://huggingface.co/gpt2) for considerations on limitations and bias. 
See the [GPT2 documentation](https://huggingface.co/transformers/model_doc/gpt2.html) for details on GPT2.\n\n## Comparison to [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2)\n\nI evaluated GerPT2, GerPT2-large and the other German GPT2, [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2), on the [CC-100](http://data.statmt.org/cc-100/) dataset and on the German Wikipedia:\n\n|                   | CC-100 (PPL) | Wikipedia (PPL) |\n|-------------------|--------------|-----------------|\n| dbmdz/german-gpt2 | 49.47        | 62.92           |\n| GerPT2            | 24.78        | 35.33           |\n| GerPT2-large      | __16.08__    | __23.26__       |\n\nSee the script `evaluate.py` in the [GerPT2 GitHub repository](https://github.com/bminixhofer/gerpt2) for the code.\n\n## Usage\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, pipeline\n\ntokenizer = AutoTokenizer.from_pretrained(\"benjamin/gerpt2-large\")\nmodel = AutoModelForCausalLM.from_pretrained(\"benjamin/gerpt2-large\")\n\nprompt = \"\u003cyour prompt\u003e\"\n\npipe = pipeline(\"text-generation\", model=model, tokenizer=tokenizer)\nprint(pipe(prompt)[0][\"generated_text\"])\n```\n\nAlso, two tricks might improve the generated text:\n\n```python\nimport torch\n\n# example value (an assumption, not from the original run); adjust as needed\nmax_length = 100\n\noutput = model.generate(\n    # during training an EOS token was used to mark the beginning of each text\n    # so it can help to insert it at the start\n    torch.tensor(\n        [tokenizer.eos_token_id] + tokenizer.encode(prompt)\n    ).unsqueeze(0),\n    do_sample=True,\n    # try setting bad_words_ids=[[0]] to disallow generating an EOS token; without this, the model is\n    # prone to ending generation early because a significant number of texts in the training corpus\n    # are quite short\n    bad_words_ids=[[0]],\n    max_length=max_length,\n)[0]\nprint(tokenizer.decode(output))\n```\n\n## Training details\n\nGerPT2-large is trained on the entire German 
data from the [CC-100 Corpus](http://data.statmt.org/cc-100/) and weights were initialized from the [English GPT2 model](https://huggingface.co/gpt2-large). \nGerPT2-large was trained:\n\n- with a batch size of 256\n- with a OneCycle learning rate schedule with a maximum of 5e-3\n- with AdamW with a weight decay of 0.01\n- for 2 epochs\n\nTraining took roughly 12 days on 8 TPUv3 cores.\n\nTo train GerPT2-large, follow these steps. Scripts are located in the [GitHub repository](https://github.com/bminixhofer/gerpt2):\n\n0. Download and unzip training data from http://data.statmt.org/cc-100/.\n1. Train a tokenizer using `prepare/train_tokenizer.py`. As training data for the tokenizer I used a random subset of 5% of the CC-100 data.\n2. (Optional) Generate a German input embedding matrix with `prepare/generate_aligned_wte.py`. This uses a neat trick to semantically map tokens from the English tokenizer to tokens from the German tokenizer using aligned word embeddings. For example:\n\n```\nĠMinde -\u003e Ġleast\nĠjed -\u003e Ġwhatsoever\nflughafen -\u003e Air\nvermittlung -\u003e employment\nteilung -\u003e ignment\nĠInterpretation -\u003e Ġinterpretation\nĠimport -\u003e Ġimported\nhansa -\u003e irl\ngenehmigungen -\u003e exempt\nĠAuflist -\u003e Ġlists\nĠverschwunden -\u003e Ġdisappeared\nĠFlyers -\u003e ĠFlyers\nKanal -\u003e Channel\nĠlehr -\u003e Ġteachers\nĠnahelie -\u003e Ġconvenient\ngener -\u003e Generally\nmitarbeiter -\u003e staff\n```\n\nThis helped a lot in a trial run I did, although I wasn't able to do a full comparison due to budget and time constraints. To use this WTE matrix, pass it to the training script via `wte_path`. Credit to [this blogpost](https://medium.com/@pierre_guillou/faster-than-training-from-scratch-fine-tuning-the-english-gpt-2-in-any-language-with-hugging-f2ec05c98787) for the idea of initializing GPT2 from English weights. \n\n3. Tokenize the corpus using `prepare/tokenize_text.py`. 
This generates files for train and validation tokens in JSON Lines format.\n4. Run the training script `train.py`! `run.sh` shows how this was executed for the full run with config `configs/tpu_large.json`.\n\n## License\n\nGerPT2 is licensed under the MIT License.\n\n## Citing\n\nPlease cite GerPT2 as follows:\n\n```\n@misc{Minixhofer_GerPT2_German_large_2020,\nauthor = {Minixhofer, Benjamin},\ndoi = {10.5281/zenodo.5509984},\nmonth = {12},\ntitle = {{GerPT2: German large and small versions of GPT2}},\nurl = {https://github.com/bminixhofer/gerpt2},\nyear = {2020}\n}\n```\n\n## Acknowledgements\n\nThanks to [Hugging Face](https://huggingface.co) for awesome tools and infrastructure.\nHuge thanks to [Artus Krohn-Grimberghe](https://twitter.com/artuskg) at [LYTiQ](https://www.lytiq.de/) for making this possible by sponsoring the resources used for training.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbminixhofer%2Fgerpt2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbminixhofer%2Fgerpt2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbminixhofer%2Fgerpt2/lists"}