{"id":41168679,"url":"https://github.com/cosmoquester/transformers-bart-pretrain","last_synced_at":"2026-01-22T19:37:55.024Z","repository":{"id":42016454,"uuid":"378100671","full_name":"cosmoquester/transformers-bart-pretrain","owner":"cosmoquester","description":"Script to pre-train hugginface transformers BART with Tensorflow 2","archived":false,"fork":false,"pushed_at":"2023-04-13T05:14:44.000Z","size":1497,"stargazers_count":33,"open_issues_count":1,"forks_count":6,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-05T01:42:06.833Z","etag":null,"topics":["bart","gpu","huggingface-transformers","pretraining","tensorflow","tpu"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cosmoquester.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-06-18T09:31:37.000Z","updated_at":"2024-11-14T09:14:03.000Z","dependencies_parsed_at":"2025-04-13T05:58:43.467Z","dependency_job_id":"c17dd47b-c2dc-49dd-b2c8-dea6c3a8044c","html_url":"https://github.com/cosmoquester/transformers-bart-pretrain","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":"cosmoquester/tf2-keras-template","purl":"pkg:github/cosmoquester/transformers-bart-pretrain","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Ftransformers-bart-pretrain","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Ftransformers-bart-pretrain/tags","releases_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Ftransformers-bart-pretrain/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Ftransformers-bart-pretrain/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cosmoquester","download_url":"https://codeload.github.com/cosmoquester/transformers-bart-pretrain/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cosmoquester%2Ftransformers-bart-pretrain/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28669392,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T19:36:09.361Z","status":"ssl_error","status_checked_at":"2026-01-22T19:36:05.567Z","response_time":144,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bart","gpu","huggingface-transformers","pretraining","tensorflow","tpu"],"created_at":"2026-01-22T19:37:54.357Z","updated_at":"2026-01-22T19:37:54.989Z","avatar_url":"https://github.com/cosmoquester.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# transformers TF BART pre-training\n\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Imports: 
isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/)\n[![cosmoquester](https://circleci.com/gh/cosmoquester/transformers-bart-pretrain.svg?style=svg)](https://app.circleci.com/pipelines/github/cosmoquester/transformers-bart-pretrain)\n[![codecov](https://codecov.io/gh/cosmoquester/transformers-bart-pretrain/branch/master/graph/badge.svg?token=FT7NreB8Ku)](https://codecov.io/gh/cosmoquester/transformers-bart-pretrain)\n\n- Script to pre-train Hugging Face transformers BART\n- Training [BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension](https://arxiv.org/abs/1910.13461)\n- The `Text infilling` and `Sentence Permutation` noising functions are currently available\n\n# Train\n\nYou can train a Hugging Face transformers model as in the example below.\n(The example runs as-is with the bundled sample data.)\n\n```sh\n$ CUDA_VISIBLE_DEVICES=1 python -m scripts.train \\\n    --model-config-path configs/base.json \\\n    --train-dataset-path tests/data/sample1.txt \\\n    --dev-dataset-path tests/data/sample1.txt \\\n    --sp-model-path sp_model/sp_model_unigram_8K.model \\\n    --device GPU \\\n    --auto-encoding \\\n    --batch-size 2 \\\n    --steps-per-epoch 100 \\\n    --mask-token \"[MASK]\" \\\n    --mixed-precision\n```\n\n## Arguments\n\n```sh\nFile Paths:\n  --model-config-path MODEL_CONFIG_PATH\n                        model config file\n  --train-dataset-path TRAIN_DATASET_PATH\n                        training dataset, a text file or multiple files ex)\n                        *.txt\n  --dev-dataset-path DEV_DATASET_PATH\n                        dev dataset, a text file or multiple files ex) *.txt\n  --pretrained-checkpoint PRETRAINED_CHECKPOINT\n                        pretrained checkpoint path\n  --output-path OUTPUT_PATH\n                        output directory to save log and model checkpoints\n  --sp-model-path SP_MODEL_PATH\n 
                       sentencepiece model path to tokenizer\n\nTraining Parameters:\n  --mask-token MASK_TOKEN\n                        mask token ex) [MASK]\n  --mask-token-id MASK_TOKEN_ID\n                        mask token id of vocab\n  --epochs EPOCHS\n  --steps-per-epoch STEPS_PER_EPOCH\n  --learning-rate LEARNING_RATE\n  --min-learning-rate MIN_LEARNING_RATE\n  --warmup-steps WARMUP_STEPS\n  --warmup-rate WARMUP_RATE\n  --batch-size BATCH_SIZE\n                        total training batch size of all devices\n  --dev-batch-size DEV_BATCH_SIZE\n  --num-total-dataset NUM_TOTAL_DATASET\n  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE\n  --prefetch-buffer-size PREFETCH_BUFFER_SIZE\n  --max-sequence-length MAX_SEQUENCE_LENGTH\n  --weight-decay WEIGHT_DECAY\n                        use weight decay\n  --clipnorm CLIPNORM   clips gradients to a maximum norm.\n  --disable-text-infilling\n                        disable input noising\n  --disable-sentence-permutation\n                        disable input noising\n  --masking-rate MASKING_RATE\n                        text infilling masking rate\n  --permutation-segment-token-id PERMUTATION_SEGMENT_TOKEN_ID\n                        segment token id for sentence permutation\n\nOther settings:\n  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ\n                        log losses and metrics every after this value step\n  --mixed-precision     Use mixed precision FP16\n  --auto-encoding       train by auto encoding with text lines dataset\n  --use-tfrecord        train using tfrecord dataset\n  --repeat-each-file    repeat each dataset and uniform sample for train\n                        example\n  --debug-nan-loss      Trainin with this flag, print the number of Nan loss\n                        (not supported on TPU)\n  --seed SEED           random seed\n  --skip-epochs SKIP_EPOCHS\n                        skip this number of epochs\n  --device {CPU,GPU,TPU}\n                        device to train model\n  
--max-over-sequence-policy {filter,slice}\n                        Policy for sequences of which length is over the max\n```\n- `model-config-path` is the path to the Hugging Face BART model config file.\n- `pretrained-checkpoint` is the path to a trained model checkpoint.\n- `sp-model-path` is the path to the SentencePiece tokenizer model.\n- With the `repeat-each-file` flag, each dataset file is repeated indefinitely, even after one of the files is exhausted.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosmoquester%2Ftransformers-bart-pretrain","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcosmoquester%2Ftransformers-bart-pretrain","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcosmoquester%2Ftransformers-bart-pretrain/lists"}