# Pre-training Language Models for Japanese



This is a repository of pretrained Japanese transformer-based models.
BERT, ELECTRA, RoBERTa, DeBERTa, and DeBERTaV2 are available.

Our pre-trained models are available in Transformers by Hugging Face: [https://huggingface.co/izumi-lab](https://huggingface.co/izumi-lab).
BERT-small, BERT-base, ELECTRA-small, ELECTRA-small-paper, and ELECTRA-base models trained on Wikipedia or a financial dataset are available at this URL.
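
For instance, a model hosted there can be loaded directly with the Transformers library. The following is a minimal sketch, assuming `transformers`, `torch`, `fugashi`, and `ipadic` are installed (the model name is taken from the Hub organization above):

```
import torch
from transformers import AutoModel, AutoTokenizer

# Load a pretrained Japanese BERT model published on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("izumi-lab/bert-small-japanese")
model = AutoModel.from_pretrained("izumi-lab/bert-small-japanese")

# Encode a Japanese sentence and obtain contextual embeddings.
inputs = tokenizer("日本語の事前学習済みモデルを読み込みます。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```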

**Issues may also be written in Japanese.**

Table of Contents

1. Usage
2. Pre-trained Models
3. Training Data
4. Roadmap
5. Citation
6. Licenses
7. Related Work
8. Acknowledgements

## Usage

### Train Tokenizer

In our pretrained models, texts are first tokenized by [MeCab](https://taku910.github.io/mecab/) with the [IPAdic](https://pypi.org/project/ipadic/) dictionary and then split into subwords by the WordPiece algorithm.

Since v2.2.0, [jptranstokenizer](https://github.com/retarfi/jptranstokenizer) is required; it enables the use of word tokenizers other than MeCab, such as Juman++, Sudachi, and spaCy LUW.

For subword tokenization, [SentencePiece](https://github.com/google/sentencepiece) is also available as an alternative to WordPiece.

```
$ python train_tokenizer.py \
--word_tokenizer mecab \
--input_file corpus.txt \
--model_dir tokenizer/ \
--intermediate_dir ./data/corpus_split/ \
--mecab_dic ipadic \
--tokenizer_type wordpiece \
--vocab_size 32768 \
--min_frequency 2 \
--limit_alphabet 2900 \
--num_unused_tokens 10
```

You can see all the arguments with `python train_tokenizer.py --help`.
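
As a quick sanity check, the trained WordPiece vocabulary can be loaded with Hugging Face's `BertJapaneseTokenizer`. This is only an illustrative sketch, assuming `transformers`, `fugashi`, and `ipadic` are installed; the vocabulary path follows the example above:

```
from transformers import BertJapaneseTokenizer

# Load the WordPiece vocabulary produced by train_tokenizer.py,
# using MeCab with the IPAdic dictionary for word segmentation.
tokenizer = BertJapaneseTokenizer(
    vocab_file="tokenizer/vocab.txt",
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "ipadic"},
)
print(tokenizer.tokenize("日本語のトークナイザを確認します。"))
```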

### Create Dataset

You can train on any Japanese corpus.
When you train with another dataset, please add your corpus name as an `--input_corpus` choice.
The output directory name is derived from the dataset settings; in the following case, it is `nsp_128_wiki-ja`.
`--tokenizer_name_or_path` should end with `vocab.txt` for WordPiece and with `spiece.model` for SentencePiece.

We show two examples of creating a dataset.

- When you use your trained tokenizer:

```
$ python create_datasets.py \
--input_corpus wiki-ja \
--max_length 512 \
--input_file corpus.txt \
--mask_style bert \
--tokenizer_name_or_path tokenizer/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic ipadic
```

- When you use the tokenizer existing in [HuggingFace Hub](https://huggingface.co/):

```
$ python create_datasets.py \
--input_corpus wiki-ja \
--max_length 512 \
--input_file corpus.txt \
--mask_style roberta-wwm \
--tokenizer_name_or_path izumi-lab/bert-small-japanese \
--load_from_hub
```

### Training

Distributed training is available.
For the launch command, please see the [PyTorch documentation](https://pytorch.org/docs/stable/distributed.html#launch-utility) for details.
In the official PyTorch implementation, using different batch sizes across nodes is not supported.
We modified the PyTorch sampling implementation (`utils/trainer_pt_utils.py`) to support this.

For example, the `bert-base-dist` model is defined in `parameter.json` as follows:

```
"bert-base-dist" : {
"number-of-layers" : 12,
"hidden-size" : 768,
"sequence-length" : 512,
"ffn-inner-hidden-size" : 3072,
"attention-heads" : 12,
"warmup-steps" : 10000,
"learning-rate" : 1e-4,
"batch-size" : {
"0" : 80,
"1" : 80,
"2" : 48,
"3" : 48
},
"train-steps" : 1000000,
"save-steps" : 50000,
"logging-steps" : 5000,
"fp16-type": 0,
"bf16": false
}
```

In this case, nodes 0 and 1 each use a batch size of 80, and nodes 2 and 3 each use a batch size of 48.
If node 0 has 2 GPUs, each GPU uses a batch size of 40.
**A network speed of 10 Gbps or higher** is recommended for multi-node training.
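
To make the batch-size semantics concrete, the following illustrative Python sketch (not part of the repository) reads the `bert-base-dist` entry above and derives the per-GPU batch size for one node:

```
import json

# Read the per-node batch size from parameter.json and split it across
# the GPUs of this node (values follow the "bert-base-dist" example above).
with open("parameter.json", encoding="utf-8") as f:
    params = json.load(f)["bert-base-dist"]

node_rank = 0          # rank of this node
gpus_on_this_node = 2  # e.g. node 0 has 2 GPUs
per_node_batch = params["batch-size"][str(node_rank)]
per_gpu_batch = per_node_batch // gpus_on_this_node
print(per_node_batch, per_gpu_batch)  # 80 40
```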

The `fp16-type` argument specifies which precision mode to use:

- 0: FP32 training
- 1: Mixed Precision
- 2: "Almost FP16" Mixed Precision
- 3: FP16 training

For details, please see the [NVIDIA Apex documentation](https://nvidia.github.io/apex/amp.html).

The `bf16` argument determines whether bfloat16 is enabled.
You cannot use `fp16-type` (1, 2, or 3) and `bf16` (true) simultaneously.
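
A small illustrative check of this constraint (a hypothetical helper, not code from the repository):

```
# Hypothetical validation of the precision settings described above:
# Apex mixed/half precision (fp16-type 1-3) cannot be combined with bf16.
def validate_precision(fp16_type: int, bf16: bool) -> None:
    if fp16_type not in (0, 1, 2, 3):
        raise ValueError("fp16-type must be 0, 1, 2, or 3")
    if fp16_type != 0 and bf16:
        raise ValueError("fp16-type 1, 2, or 3 cannot be used together with bf16")

validate_precision(fp16_type=0, bf16=True)   # valid: bf16 enabled with fp16-type 0
validate_precision(fp16_type=2, bf16=False)  # valid: "almost FP16" mixed precision
```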

The whole word masking option is also available.

```
# Train with 1 node
$ python run_pretraining.py \
--dataset_dir ./datasets/nsp_128_wiki-ja/ \
--model_dir ./model/bert/ \
--parameter_file parameter.json \
--model_type bert-small \
--tokenizer_name_or_path tokenizer/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic ipadic \
(--use_deepspeed \)
(--do_whole_word_mask \)
(--do_continue)

# Train with multi-node and multi-process
$ NCCL_SOCKET_IFNAME=eno1 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" \
--master_port=50916 run_pretraining.py \
--dataset_dir ./datasets/nsp_128_wiki-ja/ \
--model_dir ./model/bert/ \
--parameter_file parameter.json \
--model_type bert-small \
--tokenizer_name_or_path tokenizer/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic ipadic \
(--use_deepspeed \)
(--do_whole_word_mask \)
(--do_continue)
```

#### Additional Pre-training

You can further pre-train an existing pre-trained model.
For example, the `bert-small-additional` model is defined in `parameter.json` as follows:

```
"bert-small-additional" : {
"pretrained_model_name_or_path" : "izumi-lab/bert-small-japanese",
"flozen-layers" : 6,
"warmup-steps" : 10000,
"learning-rate" : 5e-4,
"batch-size" : {
"-1" : 128
},
"train-steps" : 1450000,
"save-steps" : 100000,
"fp16-type": 0,
"bf16": false
}
```

`pretrained_model_name_or_path` specifies a pretrained model on the Hugging Face Hub or the path to a local pretrained model.
`flozen-layers` specifies the number of frozen (not trained) transformer layers.
When it is -1, all layers (including the embedding layer) are trained.
When it is 3, the upper 9 layers (those closest to the output) are trained.
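
The following sketch (not the repository's actual code) illustrates what freezing the lower layers of a BERT model could look like with Transformers; treating the embedding layer as frozen here is an assumption based on the description above:

```
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("izumi-lab/bert-small-japanese")
frozen_layers = 3  # corresponds to "flozen-layers" in parameter.json

# Freeze the embedding layer and the lowest `frozen_layers` encoder layers,
# so only the upper layers (near the output) are updated during training.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:frozen_layers]:
    for param in layer.parameters():
        param.requires_grad = False
```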

When you additionally train an ELECTRA model, you need to specify `pretrained_generator_model_name_or_path` and `discriminator_model_name_or_path` instead of `pretrained_model_name_or_path`.

```
$ python run_pretraining.py \
--tokenizer_name_or_path izumi-lab/bert-small-japanese \
--dataset_dir ./datasets/nsp_128_fin-ja/ \
--model_dir ./model/bert/ \
--parameter_file parameter.json \
--model_type bert-small-additional
```

### For ELECTRA

ELECTRA models generated by `run_pretraining.py` contain both the generator and the discriminator.
For general use, they need to be separated.

```
$ python extract_electra_model.py \
--input_dir ./model/electra/checkpoint-1000000 \
--output_dir ./model/electra/extracted-1000000 \
--parameter_file parameter.json \
--model_type electra-small \
--generator \
--discriminator
```

In this example, the generator model is saved in `./model/electra/extracted-1000000/generator/` and the discriminator model in `./model/electra/extracted-1000000/discriminator/`.
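
The extracted models can then be loaded like any other Transformers checkpoint. A minimal sketch, assuming the extracted directories are in the standard Transformers format (the choice of head classes here is illustrative):

```
from transformers import ElectraForMaskedLM, ElectraForPreTraining

# Load the separated generator (masked-LM head) and discriminator
# (replaced-token-detection head) from the example output paths above.
generator = ElectraForMaskedLM.from_pretrained(
    "./model/electra/extracted-1000000/generator/"
)
discriminator = ElectraForPreTraining.from_pretrained(
    "./model/electra/extracted-1000000/discriminator/"
)
```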

### Training Log

TensorBoard is available for viewing the training log.

## Pre-trained Models

### Model Architecture

The following models are currently available:

- BERT
- ELECTRA

The architectures of the BERT-small, BERT-base, ELECTRA-small-paper, and ELECTRA-base models are the same as those in [the original ELECTRA paper](https://arxiv.org/abs/2003.10555) (ELECTRA-small-paper is described as ELECTRA-small in the paper).
The architecture of ELECTRA-small is the same as that in [the ELECTRA implementation by Google](https://github.com/google-research/electra).

| Parameter | BERT-small | BERT-base | ELECTRA-small | ELECTRA-small-paper | ELECTRA-base |
| :--------------: | :--------: | :-------: | :-----------: | :-----------------: | :----------: |
| Number of layers | 12 | 12 | 12 | 12 | 12 |
| Hidden Size | 256 | 768 | 256 | 256 | 768 |
| Attention Heads | 4 | 12 | 4 | 4 | 12 |
| Embedding Size | 128 | 512 | 128 | 128 | 128 |
| Generator Size | - | - | 1/1 | 1/4 | 1/3 |
| Train Steps | 1.45M | 1M | 1M | 1M | 766k |

Other models such as BERT-large or ELECTRA-large are also available in this implementation.
You can also add your own parameter settings to `parameter.json`.

### Training Data

Training data are aggregated into a single text file, with one sentence per line and a blank line between documents.
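
For illustration only, a corpus in this format could be produced as follows (the document contents are placeholders):

```
# Write a corpus in the expected format:
# one sentence per line, with a blank line between documents.
documents = [
    ["これは一つ目の文書の最初の文です。", "これは二つ目の文です。"],
    ["これは二つ目の文書です。"],
]
with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        for sentence in doc:
            f.write(sentence + "\n")
        f.write("\n")
```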

#### Wikipedia Model

The normal models (not financial models) are trained on the Japanese version of Wikipedia, using the [Wikipedia dump](https://dumps.wikimedia.org/jawiki/) as of June 1, 2021.
The corpus file is 2.9GB, consisting of approximately 20M sentences.

#### Financial Model

The financial models are trained on Wikipedia corpus and financial corpus.
The Wikipedia corpus is the same as described above.
The financial corpus consists of 2 corpora:

- Summaries of financial results from October 9, 2012, to December 31, 2020
- Securities reports from February 8, 2018, to December 31, 2020

The financial corpus file is 5.2GB, consisting of approximately 27M sentences.

## Roadmap

See the [open issues](https://github.com/retarfi/language-pretraining/issues) for a full list of proposed features (and known issues).

## Citation

```
@article{Suzuki-etal-2023-ipm,
  title = {Constructing and analyzing domain-specific language model for financial text mining},
  author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
  journal = {Information Processing \& Management},
  volume = {60},
  number = {2},
  pages = {103194},
  year = {2023},
  doi = {10.1016/j.ipm.2022.103194}
}
```

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/).

The code in this repository is distributed under the MIT License.

## Related Work

- Original BERT model by Google Research Team
  - https://github.com/google-research/bert
- Original ELECTRA model by Google Research Team
  - https://github.com/google-research/electra
- Pretrained Japanese BERT models
  - Author: Tohoku University
  - https://github.com/cl-tohoku/bert-japanese
- ELECTRA training with PyTorch implementation
  - Author: Richard Wang
  - https://github.com/richarddwang/electra_pytorch

## Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP21K12010, JST-Mirai Program Grant Number JPMJMI20B1, and JST PRESTO Grant Number JPMJPR2267, Japan.