# **GPT2** for Multiple Languages

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/imcaspar/gpt2-ml/blob/master/pretrained_model_demo.ipynb)
[![GitHub](https://img.shields.io/github/license/imcaspar/gpt2-ml)](https://github.com/imcaspar/gpt2-ml)
[![GitHub All Releases](https://img.shields.io/github/downloads/imcaspar/gpt2-ml/total)](https://github.com/imcaspar/gpt2-ml/releases)
[![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/imcaspar/gpt2-ml/issues)
[![GitHub stars](https://img.shields.io/github/stars/imcaspar/gpt2-ml?style=social)](https://github.com/imcaspar/gpt2-ml)

[**中文说明**](./README_CN.md) | [**English**](./README.md)

- [x] Simplified GPT2 training scripts (based on Grover, with TPU support)
- [x] Ported BERT tokenizer, compatible with multilingual corpora (see the sketch after this list)
- [x] 1.5B-parameter GPT2 pretrained Chinese model (~15G corpus, 100k steps)
- [x] Batteries-included Colab demo [#](https://github.com/imcaspar/gpt2-ml#google-colab)
- [x] 1.5B-parameter GPT2 pretrained Chinese model (~30G corpus, 220k steps)
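
Since the tokenizer is ported from BERT, here is a minimal usage sketch. It assumes the port keeps the `FullTokenizer` interface from [google-research/bert](https://github.com/google-research/bert) (listed under Reference below); the vocab file path is an assumption, use the one shipped with the pretrained model you downloaded.

```python
# Minimal sketch, assuming the ported tokenizer keeps BERT's
# FullTokenizer interface from google-research/bert.
# "vocab.txt" is an assumed path; point it at the vocab file
# bundled with the pretrained model you downloaded.
import tokenization  # tokenization.py from google-research/bert

tokenizer = tokenization.FullTokenizer(
    vocab_file="vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("GPT2 supports multiple languages")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```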

## Pretrained Model
| Size | Language | Corpus | Vocab | Link1 | Link2 | SHA256 |
| ---- | -------- | ------ | ----- | ----- | ----- | ------ |
| 1.5B Params | Chinese | ~30G | CLUE (8021 tokens) | [**Google Drive**](https://drive.google.com/file/d/1mT_qCQg4AWnAXTwKfsyyRWCRpgPrBJS3) | [**Baidu Pan (ffz6)**](https://pan.baidu.com/s/1yiuTHXUr2DpyBqmFYLJH6A) | e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482 |
| 1.5B Params | Chinese | ~15G | BERT (21128 tokens) | [**Google Drive**](https://drive.google.com/file/d/1IzWpQ6I2IgfV7CldZvFJnZ9byNDZdO4n) | [**Baidu Pan (q9vr)**](https://pan.baidu.com/s/1TA_3e-u2bXg_hcx_NwVbGw) | 4a6e5124df8db7ac2bdd902e6191b807a6983a7f5d09fb10ce011f9a073b183e |
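
The SHA256 column lets you check a download for corruption before use; a minimal verification sketch in Python (the local file name is hypothetical, substitute whatever you downloaded):

```python
# Compare a downloaded model file against the SHA256 listed in the
# table above. "model.ckpt.zip" is a hypothetical file name;
# substitute the actual file fetched from Google Drive / Baidu Pan.
import hashlib

def sha256sum(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "e698cc97a7f5f706f84f58bb469d614e51d3c0ce5f9ab9bf77e01e3fcb41d482"
print(sha256sum("model.ckpt.zip") == expected)
```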

Corpora from [THUCNews](http://thuctc.thunlp.org/#%E4%B8%AD%E6%96%87%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB%E6%95%B0%E6%8D%AE%E9%9B%86THUCNews) and [nlp_chinese_corpus](https://github.com/brightmart/nlp_chinese_corpus)

Trained for 220k steps on a [Cloud TPU Pod v3-256](https://cloud.google.com/tpu/docs/types-zones#types)

![loss](./.github/loss.png)

## Google Colab
With just two clicks (not counting the Colab auth process), the 1.5B pretrained Chinese model demo is ready to go:

[**[Colab Notebook]**](https://colab.research.google.com/github/imcaspar/gpt2-ml/blob/master/pretrained_model_demo.ipynb)

## Train

## Disclaimer
The contents of this repository are for academic research purposes only; we do not provide any conclusive remarks.

## Citation

```
@misc{GPT2-ML,
  author = {Zhibo Zhang},
  title = {GPT2-ML: GPT-2 for Multiple Languages},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/imcaspar/gpt2-ml}},
}
```

## Reference
https://github.com/google-research/bert

https://github.com/rowanz/grover

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

## Press
[[机器之心] Generate custom stories with Chinese GPT-2 in just three clicks](https://mp.weixin.qq.com/s/FpoSNNKZSQOE2diPvJDHog)

[[科学空间] You can now play with Chinese GPT2 in Keras](https://kexue.fm/archives/7292)