{"id":13688151,"url":"https://github.com/imcaspar/gpt2-ml","last_synced_at":"2025-05-15T18:11:27.720Z","repository":{"id":36023789,"uuid":"219644079","full_name":"imcaspar/gpt2-ml","owner":"imcaspar","description":"GPT2 for Multiple Languages, including pretrained models. GPT2 多语言支持, 15亿参数中文预训练模型","archived":false,"fork":false,"pushed_at":"2023-05-22T22:32:13.000Z","size":1000,"stargazers_count":1712,"open_issues_count":22,"forks_count":334,"subscribers_count":37,"default_branch":"master","last_synced_at":"2025-03-31T22:22:13.519Z","etag":null,"topics":["bert","chinese","colab","gpt-2","nlp","pretrained-models","tensorflow","text-generation","tpu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/imcaspar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-11-05T02:52:26.000Z","updated_at":"2025-03-26T20:00:34.000Z","dependencies_parsed_at":"2023-10-21T11:53:31.606Z","dependency_job_id":null,"html_url":"https://github.com/imcaspar/gpt2-ml","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcaspar%2Fgpt2-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcaspar%2Fgpt2-ml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcaspar%2Fgpt2-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/imcaspar%2Fgpt2-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/imcaspar","download_url":"https://codeload.gi
thub.com/imcaspar/gpt2-ml/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247744335,"owners_count":20988783,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","chinese","colab","gpt-2","nlp","pretrained-models","tensorflow","text-generation","tpu"],"created_at":"2024-08-02T15:01:07.842Z","updated_at":"2025-04-07T23:11:34.102Z","avatar_url":"https://github.com/imcaspar.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python"],"sub_categories":["其他_文本生成_文本对话"],"readme":"\u003cimg src=\"./.github/logo.svg\" width=\"480\"\u003e\n\n# **GPT2** for Multiple Languages\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/imcaspar/gpt2-ml/blob/master/pretrained_model_demo.ipynb)\n[![GitHub](https://img.shields.io/github/license/imcaspar/gpt2-ml)](https://github.com/imcaspar/gpt2-ml)\n[![GitHub All Releases](https://img.shields.io/github/downloads/imcaspar/gpt2-ml/total)](https://github.com/imcaspar/gpt2-ml/releases)\n[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/imcaspar/gpt2-ml/issues)\n[![GitHub stars](https://img.shields.io/github/stars/imcaspar/gpt2-ml?style=social)](https://github.com/imcaspar/gpt2-ml)\n\n[**中文说明**](./README_CN.md) | [**English**](./README.md)\n\n- [x] Simplified GPT2 training scripts (based on Grover, supporting TPUs)\n- [x] Ported BERT tokenizer, compatible with multilingual corpora\n- [x] 1.5B GPT2 pretrained Chinese model ( ~15G 
corpus, 100k steps )\n- [x] Batteries-included Colab demo [#](https://github.com/imcaspar/gpt2-ml#google-colab)\n- [x] 1.5B GPT2 pretrained Chinese model ( ~30G corpus, 220k steps )\n\n\n## Pretrained Model\n| Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |\n| ---- | -------- | ------ | ----- | ----- | ----- | ------ |\n| 1.5B Params | Chinese  | ~30G   | CLUE ( 8021 tokens )  | [**Google Drive**](https://drive.google.com/file/d/1mT_qCQg4AWnAXTwKfsyyRWCRpgPrBJS3) | [**Baidu Pan (ffz6)**](https://pan.baidu.com/s/1yiuTHXUr2DpyBqmFYLJH6A) | e698cc97a7f5f706f84f58bb469d614e\u003cbr/\u003e51d3c0ce5f9ab9bf77e01e3fcb41d482 |\n| 1.5B Params | Chinese  | ~15G   | Bert ( 21128 tokens ) | [**Google Drive**](https://drive.google.com/file/d/1IzWpQ6I2IgfV7CldZvFJnZ9byNDZdO4n) | [**Baidu Pan (q9vr)**](https://pan.baidu.com/s/1TA_3e-u2bXg_hcx_NwVbGw) | 4a6e5124df8db7ac2bdd902e6191b807\u003cbr/\u003ea6983a7f5d09fb10ce011f9a073b183e |\n\nCorpus from [THUCNews](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) and [nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus)\n\nTrained for 220k steps on a [Cloud TPU Pod v3-256](https://cloud.google.com/tpu/docs/types-zones#types)\n\n![loss](./.github/loss.png)\n\n\n## Google Colab\nWith just 2 clicks (not including the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:\n\n[**[Colab Notebook]**](https://colab.research.google.com/github/imcaspar/gpt2-ml/blob/master/pretrained_model_demo.ipynb)\n\n\u003cimg src=\"./.github/demo.png\" width=\"640\"\u003e\n\n## Train\n\n## Disclaimer\nThe contents of this repository are for academic research purposes only, and we do not provide any conclusive remarks.\n\n## Citation\n\n```\n@misc{GPT2-ML,\n  author = {Zhibo Zhang},\n  title = {GPT2-ML: GPT-2 for Multiple Languages},\n  year = {2019},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = 
{\\url{https://github.com/imcaspar/gpt2-ml}},\n}\n```\n\n## Reference\nhttps://github.com/google-research/bert\n\nhttps://github.com/rowanz/grover\n\nResearch supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)\n\n## Press\n[[机器之心] 只需单击三次，让中文GPT-2为你生成定制故事](https://mp.weixin.qq.com/s/FpoSNNKZSQOE2diPvJDHog)\n\n[[科学空间] 现在可以用Keras玩中文GPT2了](https://kexue.fm/archives/7292)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimcaspar%2Fgpt2-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fimcaspar%2Fgpt2-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fimcaspar%2Fgpt2-ml/lists"}