{"id":20474585,"url":"https://github.com/retarfi/language-pretraining","last_synced_at":"2025-04-13T11:43:04.130Z","repository":{"id":37822013,"uuid":"383658621","full_name":"retarfi/language-pretraining","owner":"retarfi","description":"Pre-training Language Models for Japanese","archived":false,"fork":false,"pushed_at":"2023-07-02T07:30:42.000Z","size":208,"stargazers_count":49,"open_issues_count":1,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-12T07:32:35.892Z","etag":null,"topics":["bert","electra","implementation","japanese","language-model","language-models","natural-language-processing","nlp","pytorch","transformer","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/retarfi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-07-07T03:07:22.000Z","updated_at":"2024-11-26T01:05:44.000Z","dependencies_parsed_at":"2024-11-15T14:43:44.007Z","dependency_job_id":null,"html_url":"https://github.com/retarfi/language-pretraining","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/retarfi%2Flanguage-pretraining","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/retarfi%2Flanguage-pretraining/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/retarfi%2Flanguage-pretraining/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/retarfi%2Flanguage-pretraining/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/retarfi","download_url":"https://codeload.github.com/retarfi/language-pretraining/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248709859,"owners_count":21149183,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","electra","implementation","japanese","language-model","language-models","natural-language-processing","nlp","pytorch","transformer","transformers"],"created_at":"2024-11-15T14:33:21.847Z","updated_at":"2025-04-13T11:43:04.078Z","avatar_url":"https://github.com/retarfi.png","language":"Python","readme":"\u003cdiv id=\"top\"\u003e\u003c/div\u003e\n\n\u003ch1 align=\"center\"\u003ePre-training Language Models for Japanese\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/retarfi/language-pretraining#licenses\"\u003e\n    \u003cimg alt=\"GitHub\" src=\"https://img.shields.io/badge/license-MIT-brightgreen\"\u003e\n  \u003c/a\u003e\n  \u003ca 
href=\"https://github.com/retarfi/language-pretraining/releases\"\u003e\n    \u003cimg alt=\"GitHub release\" src=\"https://img.shields.io/github/v/release/retarfi/language-pretraining.svg\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\nThis is a repository of pretrained Japanese transformer-based models.\nBERT, ELECTRA, RoBERTa, DeBERTa, and DeBERTaV2 is available.\n\nOur pre-trained models are available in Transformers by Hugging Face: [https://huggingface.co/izumi-lab](https://huggingface.co/izumi-lab).\nBERT-small, BERT-base, ELECTRA-small, ELECTRA-small-paper, and ELECTRA-base models trained by Wikipedia or financial dataset is available in this URL.\n\n**issue は日本語でも大丈夫です。**\n\n\u003c!-- TABLE OF CONTENTS --\u003e\n\u003cdetails\u003e\n  \u003csummary\u003eTable of Contents\u003c/summary\u003e\n  \u003col\u003e\n    \u003cli\u003e\n      \u003ca href=\"#usage\"\u003eUsage\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#train-tokenizer\"\u003eTrain Tokenizer\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#create-dataset\"\u003eCreate Dataset\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\n          \u003ca href=\"#training\"\u003eTraining\u003c/a\u003e\n          \u003cul\u003e\n            \u003cli\u003e\u003ca href=\"#additional-pre-training\"\u003eAdditional Pre-training\u003c/a\u003e\u003c/li\u003e\n            \u003cli\u003e\u003ca href=\"#for-electra\"\u003eFor ELECTRA\u003c/a\u003e\u003c/li\u003e\n            \u003cli\u003e\u003ca href=\"#training-log\"\u003eTraining Log\u003c/a\u003e\u003c/li\u003e\n          \u003c/ul\u003e\n        \u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\n      \u003ca href=\"#pre-trained-models\"\u003ePre-trained Models\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#model-architecture\"\u003eModel Architecture\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\n      \u003ca href=\"#training-data\"\u003eTraining Data\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#wikipedia-model\"\u003eWikipedia Model\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#financial-model\"\u003eFinancial Model\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#roadmap\"\u003eRoadmap\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\n      \u003ca href=\"#citation\"\u003eCitation\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#pre-trained-model\"\u003ePre-trained Model\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#this-implementation\"\u003eThis Implementation\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#licenses\"\u003eLicenses\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#related-work\"\u003eRelated Work\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#acknowledgements\"\u003eAcknowledgements\u003c/a\u003e\u003c/li\u003e\n  \u003c/ol\u003e\n\u003c/details\u003e\n\n\n## Usage\n\n### Train Tokenizer\n\nIn our pretrained models, the texts are first tokenized by [MeCab](https://taku910.github.io/mecab/) with [IPAdic](https://pypi.org/project/ipadic/) dictionary and then split into subwords by the WordPiece algorithm.\n\nFrom v2.2.0, [jptranstokenizer](https://github.com/retarfi/jptranstokenizer) is required, which enables to use word tokenizers other than MeCab, such as Juman++, 
### Create Dataset

You can train on any Japanese corpus.  
When you train with another dataset, please add your corpus name accordingly.  
The output directory name is `<dataset_type>_<max_length>_<input_corpus>`.  
In the following case, the output directory name is `nsp_128_wiki-ja`.  
`tokenizer_name_or_path` should end with vocab.txt for WordPiece and with spiece.model for SentencePiece.

We show two examples of creating a dataset.

- When you use your trained tokenizer:

```
$ python create_datasets.py \
--input_corpus wiki-ja \
--max_length 512 \
--input_file corpus.txt \
--mask_style bert \
--tokenizer_name_or_path tokenizer/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic ipadic
```

- When you use a tokenizer available on the [HuggingFace Hub](https://huggingface.co/):

```
$ python create_datasets.py \
--input_corpus wiki-ja \
--max_length 512 \
--input_file corpus.txt \
--mask_style roberta-wwm \
--tokenizer_name_or_path izumi-lab/bert-small-japanese \
--load_from_hub
```

### Training

Distributed training is available.
For the launch command, please see the [PyTorch documentation](https://pytorch.org/docs/stable/distributed.html#launch-utility) for details.
In the official PyTorch implementation, different batch sizes across nodes are not supported, so we improved the PyTorch sampling implementation (utils/trainer_pt_utils.py).

For example, the `bert-base-dist` model is defined in parameter.json as follows:

```
"bert-base-dist" : {
    "number-of-layers" : 12,
    "hidden-size" : 768,
    "sequence-length" : 512,
    "ffn-inner-hidden-size" : 3072,
    "attention-heads" : 12,
    "warmup-steps" : 10000,
    "learning-rate" : 1e-4,
    "batch-size" : {
        "0" : 80,
        "1" : 80,
        "2" : 48,
        "3" : 48
    },
    "train-steps" : 1000000,
    "save-steps" : 50000,
    "logging-steps" : 5000,
    "fp16-type": 0,
    "bf16": false
}
```

In this case, nodes 0 and 1 use a batch size of 80, while nodes 2 and 3 use 48.
If node 0 has 2 GPUs, each of its GPUs uses a batch size of 40, as in the sketch below.
A **10G or faster network** is recommended for multi-node training.
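As an illustration of this arithmetic (hypothetical variable names, not code from this repository):

```
# Illustration of the per-GPU batch size described above (hypothetical names).
batch_size_per_node = {"0": 80, "1": 80, "2": 48, "3": 48}  # "batch-size" in parameter.json
node_rank = 0      # value passed to --node_rank
gpus_on_node = 2   # number of visible GPUs on this node

per_gpu_batch_size = batch_size_per_node[str(node_rank)] // gpus_on_node
print(per_gpu_batch_size)  # 80 // 2 == 40
```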
The `fp16-type` argument specifies which precision mode to use:

- 0: FP32 training
- 1: Mixed Precision
- 2: "Almost FP16" Mixed Precision
- 3: FP16 training

For details, please see the [NVIDIA Apex documentation](https://nvidia.github.io/apex/amp.html).

The `bf16` argument determines whether bfloat16 is enabled.
You cannot use `fp16-type` (1, 2, or 3) and `bf16` (true) simultaneously.

The whole word masking option is also available.

```
# Train with 1 node
$ python run_pretraining.py \
--dataset_dir ./datasets/nsp_128_wiki-ja/ \
--model_dir ./model/bert/ \
--parameter_file parameter.json \
--model_type bert-small \
--tokenizer_name_or_path tokenizer/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic ipadic \
(--use_deepspeed \)
(--do_whole_word_mask \)
(--do_continue)

# Train with multi-node and multi-process
$ NCCL_SOCKET_IFNAME=eno1 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="10.0.0.1" \
--master_port=50916 run_pretraining.py \
--dataset_dir ./datasets/nsp_128_wiki-ja/ \
--model_dir ./model/bert/ \
--parameter_file parameter.json \
--model_type bert-small \
--tokenizer_name_or_path tokenizer/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic ipadic \
(--use_deepspeed \)
(--do_whole_word_mask \)
(--do_continue)
```

#### Additional Pre-training

You can further pre-train an existing pre-trained model.  
For example, the `bert-small-additional` model is defined in parameter.json as follows:

```
"bert-small-additional" : {
    "pretrained_model_name_or_path" : "izumi-lab/bert-small-japanese",
    "flozen-layers" : 6,
    "warmup-steps" : 10000,
    "learning-rate" : 5e-4,
    "batch-size" : {
        "-1" : 128
    },
    "train-steps" : 1450000,
    "save-steps" : 100000,
    "fp16-type": 0,
    "bf16": false
}
```

`pretrained_model_name_or_path` specifies a pretrained model on the HuggingFace Hub or the path to a pretrained model.  
`flozen-layers` specifies the number of frozen (not trained) transformer layers.  
When it is -1, all layers (including the embedding layer) are trained.  
When it is 3, only the upper 9 layers (near the output layer) are trained.
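As a rough illustration of what freezing the lower layers amounts to (a sketch using Hugging Face Transformers, not this repository's exact training code; the model name is just the example used elsewhere in this README):

```
from transformers import BertForMaskedLM

# Hypothetical sketch of "flozen-layers": freeze the embeddings and the lowest
# N encoder layers so that only the upper layers are updated during training.
flozen_layers = 3
model = BertForMaskedLM.from_pretrained("izumi-lab/bert-small-japanese")

if flozen_layers >= 0:
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False
    for layer in model.bert.encoder.layer[:flozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False
```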
When you additionally train an ELECTRA model, you need to specify `pretrained_generator_model_name_or_path` and `discriminator_model_name_or_path` instead of `pretrained_model_name_or_path`.

```
$ python run_pretraining.py \
--tokenizer_name_or_path izumi-lab/bert-small-japanese \
--dataset_dir ./datasets/nsp_128_fin-ja/ \
--model_dir ./model/bert/ \
--parameter_file parameter.json \
--model_type bert-small-additional
```

### For ELECTRA

ELECTRA models generated by run_pretraining.py contain both the generator and the discriminator.
For general use, they need to be separated.

```
$ python extract_electra_model.py \
--input_dir ./model/electra/checkpoint-1000000 \
--output_dir ./model/electra/extracted-1000000 \
--parameter_file parameter.json \
--model_type electra-small \
--generator \
--discriminator
```

In this example, the generator model is saved in `./model/electra/extracted-1000000/generator/` and the discriminator model is saved in `./model/electra/extracted-1000000/discriminator/`.
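Assuming the extracted directories are standard Transformers checkpoints, they can then presumably be loaded as usual; a minimal sketch:

```
from transformers import ElectraForMaskedLM, ElectraForPreTraining

# Hypothetical usage: load the separated models for downstream use.
generator = ElectraForMaskedLM.from_pretrained(
    "./model/electra/extracted-1000000/generator"
)
discriminator = ElectraForPreTraining.from_pretrained(
    "./model/electra/extracted-1000000/discriminator"
)
```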
### Training Log

TensorBoard is available for viewing the training log.

## Pre-trained Models

### Model Architecture

The following models are available now:

- BERT
- ELECTRA

The architectures of the BERT-small, BERT-base, ELECTRA-small-paper, and ELECTRA-base models are the same as those in [the original ELECTRA paper](https://arxiv.org/abs/2003.10555) (ELECTRA-small-paper is described as ELECTRA-small in the paper).
The architecture of ELECTRA-small is the same as that in [the ELECTRA implementation by Google](https://github.com/google-research/electra).

|    Parameter     | BERT-small | BERT-base | ELECTRA-small | ELECTRA-small-paper | ELECTRA-base |
| :--------------: | :--------: | :-------: | :-----------: | :-----------------: | :----------: |
| Number of layers |     12     |    12     |      12       |         12          |      12      |
|   Hidden Size    |    256     |    768    |      256      |         256         |     768      |
| Attention Heads  |     4      |    12     |       4       |          4          |      12      |
|  Embedding Size  |    128     |    512    |      128      |         128         |     128      |
|  Generator Size  |     -      |     -     |      1/1      |         1/4         |     1/3      |
|   Train Steps    |   1.45M    |    1M     |      1M       |         1M          |     766k     |

Other models such as BERT-large or ELECTRA-large are also available in this implementation.
You can also add your original parameters in parameter.json.

### Training Data

Training data are aggregated into a single text file.
Each sentence is on one line, and a blank line is inserted between documents.

#### Wikipedia Model

The normal models (not financial models) are trained on the Japanese version of Wikipedia, using the [Wikipedia dump](https://dumps.wikimedia.org/jawiki/) file as of June 1, 2021.
The corpus file is 2.9GB, consisting of approximately 20M sentences.

#### Financial Model

The financial models are trained on the Wikipedia corpus and a financial corpus.
The Wikipedia corpus is the same as described above.
The financial corpus consists of two corpora:

- Summaries of financial results from October 9, 2012, to December 31, 2020
- Securities reports from February 8, 2018, to December 31, 2020

The financial corpus file is 5.2GB, consisting of approximately 27M sentences.


## Roadmap

See the [open issues](https://github.com/retarfi/language-pretraining/issues) for a full list of proposed features (and known issues).

## Citation

```
@article{Suzuki-etal-2023-ipm,
  title = {Constructing and analyzing domain-specific language model for financial text mining},
  author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
  journal = {Information Processing \& Management},
  volume = {60},
  number = {2},
  pages = {103194},
  year = {2023},
  doi = {10.1016/j.ipm.2022.103194}
}
```

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.

The code in this repository is distributed under the MIT License.

## Related Work

- Original BERT model by Google Research Team
  - https://github.com/google-research/bert
- Original ELECTRA model by Google Research Team
  - https://github.com/google-research/electra
- Pretrained Japanese BERT models
  - Author: Tohoku University
  - https://github.com/cl-tohoku/bert-japanese
- ELECTRA training with PyTorch implementation
  - Author: Richard Wang
  - https://github.com/richarddwang/electra_pytorch

## Acknowledgements

This work was supported by JSPS KAKENHI Grant Number JP21K12010, JST-Mirai Program Grant Number JPMJMI20B1, and JST PRESTO Grant Number JPMJPR2267, Japan.