# PEGASUS TPU Trainer
Model Card: [pegasus-indonesian-base_finetune](https://huggingface.co/thonyyy/pegasus-indonesian-base_finetune)

Report (in Bahasa Indonesia): [Indonesian News Abstractive Summarization using PEGASUS](https://github.com/user-attachments/files/18071465/Draft.Final.Buku.Tugas.Akhir.-.Anthony.10119038.1.pdf)

Reference Paper: [“PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization”](https://arxiv.org/abs/1912.08777)

In this project, I implemented pretraining and finetuning of a Transformer encoder-decoder model (PEGASUS) using TensorFlow + TFRecords on TPU. The final model weights can be used for abstractive summarization of Indonesian news.
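
As background, PEGASUS pretraining uses Gap Sentence Generation (GSG): selected sentences are removed from the input, replaced by a mask token, and the model is trained to generate them as a pseudo-summary. The toy sketch below illustrates the idea with random sentence selection; the repository's actual selection strategy and mask token live in `model/utils/gap_sentence_generation.py` and may differ (the function name here is hypothetical).
```
import random

def make_gsg_example(sentences, gsg_rate=0.3, mask_token="[MASK1]"):
    """Toy GSG: mask a fraction of sentences and use them as the generation target."""
    n_mask = max(1, int(len(sentences) * gsg_rate))
    masked = set(random.sample(range(len(sentences)), n_mask))
    source = " ".join(mask_token if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in masked)
    return source, target

src, tgt = make_gsg_example(
    ["Kalimat pertama.", "Kalimat kedua.", "Kalimat ketiga."], gsg_rate=0.34
)
```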

## Sample summarization
![image](https://github.com/user-attachments/assets/7c603a58-d5a7-4539-a0de-dff0574a66f9)

## Datasets
### Pretraining
1. [kaggle id news 2017](https://www.kaggle.com/datasets/aashari/indonesian-news-articles-published-at-2017)
2. [CC_news_id](https://github.com/Wikidepia/indonesian_datasets/tree/master/dump/cc-news)
3. [OSCAR_2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201/viewer/id/train)

### Finetuning
1. [Indosum](https://paperswithcode.com/dataset/indosum)
2. [Liputan6](https://paperswithcode.com/dataset/liputan6)
3. [XLSum](https://huggingface.co/datasets/csebuetnlp/xlsum)

## Performance

| Dataset  | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore |
| -------- | ------- | ------- | ------- | --------- |
| Indosum  | 52.43   | 41.23   | 48.18   | 80.68     |
| Liputan6 | 38.27   | 20.22   | 31.26   | 76.31     |
| XLSum    | 26.97   | 9.99    | 21.70   | 73.62     |

## ⚡️ Getting Started
### Clone Repository
To start working on this project, clone the `pegasus-tpu-trainer` repository.
```
git clone https://github.com/nicholaswilven/pegasus-tpu-trainer.git
```
## Structure of this Repository
The structure of this project can be seen in the tree diagram below.
```
.
├── LICENSE
├── README.md
├── app.py
├── model
│   ├── evaluate.py
│   ├── generate_demo.py
│   ├── generate_iter.py
│   ├── trainer.py
│   └── utils
│       ├── cleaning.py
│       ├── convert_to_records.py
│       ├── gap_sentence_generation.py
│       ├── model_config.py
│       ├── parse_records.py
│       ├── process_xlsum.py
│       └── sentencepiece_tokenizer.py
├── notebook
│   ├── demo_pegasus.ipynb
│   └── preprocessing.ipynb
├── requirements.txt
├── script.sh
├── setup.py
├── tpu-test.py
└── train_tokenizer.py
```

### Environment Variables
This project uses several environment variables. Credentials are not stored in this repository but are expected to be supplied at runtime. Create a `.env` file containing the variables below (a sketch of how they might be loaded follows the block):
```
MODEL_MAX_LENGTH=
MAX_SUMMARY_LENGTH=
MIN_SUMMARY_LENGTH=
GCS_BUCKET_NAME=
PATH_TO_TOKENIZER=
TOKENIZER_TYPE=
SAMPLE_PER_FILE=

GSG_RATE=
RETURN_MASK_RATE=

LOAD_CKPT_PATH=
VOCAB_SIZE=

REPO_NAME=
```
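
A minimal sketch of how these variables might be read at runtime, assuming the `python-dotenv` package is used (the repository's actual loading code may differ):
```
import os
from dotenv import load_dotenv  # assumes the python-dotenv package is installed

load_dotenv()  # reads key=value pairs from .env into the process environment

MODEL_MAX_LENGTH = int(os.environ["MODEL_MAX_LENGTH"])
GCS_BUCKET_NAME = os.environ["GCS_BUCKET_NAME"]
GSG_RATE = float(os.getenv("GSG_RATE", "0.3"))  # the 0.3 default is an assumption
```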

## 📑 Usage Documentation
### Infrastructure setup
1. Create a TPU VM on GCP (TensorFlow version 2.12.0, preferably v3-8; free access is available through the [TRC](https://sites.research.google/trc/about/) program)
2. Create a GCS bucket on GCP

### First time setup
1. `pip install -r requirements.txt`
2. `python setup.py` (downloads NLTK data)
3. `python tpu-test.py` (checks TPU access; see the sketch below)
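
For orientation, a TPU connectivity check on a Cloud TPU VM typically looks like the sketch below; the actual `tpu-test.py` may differ.
```
import tensorflow as tf

# On a Cloud TPU VM, tpu="local" resolves the TPU attached to this machine.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("TPU devices:", tf.config.list_logical_devices("TPU"))  # expect 8 cores on v3-8
```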

### Prepare training dataset
1. Preprocess all datasets in the preprocessing notebook (except OSCAR and XLSum)
2. Upload the preprocessed data to the GCS bucket as Parquet files
3. Dump all text into one `.txt` file
4. Train a SentencePiece tokenizer using `train_tokenizer.py`
5. Convert all training data to TFRecords using `convert_to_records.py` (see the sketch after this list)
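
As a rough sketch of steps 4 and 5 (the real logic lives in `train_tokenizer.py` and `model/utils/convert_to_records.py`; the file names, vocabulary size, and token IDs below are illustrative assumptions):
```
import sentencepiece as spm
import tensorflow as tf

# Step 4: train a SentencePiece tokenizer on the dumped text file.
spm.SentencePieceTrainer.train(
    input="all_text.txt",          # assumed name of the dumped .txt file
    model_prefix="pegasus_id_sp",  # assumed output prefix
    vocab_size=32000,              # should match VOCAB_SIZE in .env
    model_type="unigram",
)

# Step 5: serialize tokenized (input, target) pairs into a TFRecord file on GCS.
def to_example(input_ids, labels):
    feature = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.io.TFRecordWriter("gs://<GCS_BUCKET_NAME>/records/train-00000.tfrecord") as writer:
    writer.write(to_example([8, 15, 23], [42, 7]).SerializeToString())  # dummy token IDs
```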

### Training model
1. Specify model hyperparameters in `model_config.py`, the `trainer.py` arguments, and `.env`
2. Run `trainer.py` (a condensed sketch of the TPU training flow follows)
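
A condensed sketch of what TPU training from TFRecords generally looks like (sequence lengths, batch size, and feature names are placeholders; the actual flow is in `model/trainer.py`):
```
import tensorflow as tf

# Connect to the TPU and build a distribution strategy.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# TFRecords are read straight from GCS so the TPU can stream them.
feature_spec = {
    "input_ids": tf.io.FixedLenFeature([512], tf.int64),  # placeholder lengths
    "labels": tf.io.FixedLenFeature([256], tf.int64),
}

def parse_record(raw):
    example = tf.io.parse_single_example(raw, feature_spec)
    return example["input_ids"], example["labels"]

files = tf.io.gfile.glob("gs://<GCS_BUCKET_NAME>/records/train-*.tfrecord")
dataset = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .map(parse_record, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(10_000)
    .batch(64, drop_remainder=True)  # static shapes are required on TPU
    .prefetch(tf.data.AUTOTUNE)
)

# Build and compile the PEGASUS model inside strategy.scope() so its variables
# are placed on the TPU, then call model.fit(dataset, ...); see model/trainer.py.
```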

### Deploy mini showcase using FastAPI
1. Load the model in `app.py` (use a checkpoint or the Hugging Face repo)
2. Start the server with `uvicorn app:app` (see the sketch below)
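
A minimal sketch of such a showcase, assuming the published Hugging Face checkpoint is loaded through the `transformers` summarization pipeline (the repository's `app.py` may load the model differently):
```
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Loads the finetuned checkpoint from the Hugging Face Hub (TensorFlow weights).
summarizer = pipeline(
    "summarization", model="thonyyy/pegasus-indonesian-base_finetune", framework="tf"
)

class Article(BaseModel):
    text: str

@app.post("/summarize")
def summarize(article: Article):
    result = summarizer(article.text, max_length=128, min_length=32)
    return {"summary": result[0]["summary_text"]}
```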

## Special Thanks
Research supported with Cloud TPUs from Google’s TPU Research Cloud (TRC)