https://github.com/iwasakiyuuki/bert-abstractive-text-summarization
Japanese Sentence Summarization with BERT
https://github.com/iwasakiyuuki/bert-abstractive-text-summarization
bert deep-learning japanese-language pytorch summarization
Last synced: 10 months ago
JSON representation
Japanese Sentence Summarization with BERT
- Host: GitHub
- URL: https://github.com/iwasakiyuuki/bert-abstractive-text-summarization
- Owner: IwasakiYuuki
- Created: 2019-10-03T07:26:17.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-05-09T02:46:02.000Z (about 3 years ago)
- Last Synced: 2025-08-02T16:50:18.670Z (11 months ago)
- Topics: bert, deep-learning, japanese-language, pytorch, summarization
- Language: Python
- Homepage:
- Size: 75.2 KB
- Stars: 49
- Watchers: 1
- Forks: 15
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Abstractive text summarization using BERT
This is the models using BERT (refer the paper [Pretraining-Based Natural Language Generation for Text Summarization
](https://arxiv.org/abs/1902.09243) ) for one of the NLP(Natural Language Processing) task, abstractive text summarization.
## Requirements
- Python 3.6.5+
- Pytorch 0.4.1+
- Tensorflow
- Pandas
- tqdm
- Numpy
- MeCab
- Tensorboard X and others...
All packages used here can be installed by pip as follow:
~~~
pip install -r requirement.txt
~~~
## Docker
If you train the model with GPU, it is easy to use [Pytorch docker images](https://hub.docker.com/r/pytorch/pytorch) in DockerHub.
In this study, pytorch/pytorch:0.4.1-cuda9-cudnn7-devel(2.62GB) has been used.
## Before using
When you use this, please follow the steps below.
1. Make a repository named "/data/checkpoint" under root.
And put bert_model, vocabulary file and config file for bert.
These files can be download [here](http://nlp.ist.i.kyoto-u.ac.jp/index.php?BERT%E6%97%A5%E6%9C%AC%E8%AA%9EPretrained%E3%83%A2%E3%83%87%E3%83%AB).
2. Put data file for training and validate under /workspace/data/. The format is as follow:
```preprocess.py
data = {
'settings': opt,
'dict': {
'src': text2token,
'tgt': text2token},
'train': {
'src': content[:100000],
'tgt': summary[:100000]},
'valid': {
'src': content[100000:],
'tgt': summary[100000:]}}
torch.save(data, opt.save_data)
```
overall directory structure is as follow:
```
`-- data # under workspace
|-- checkpoint
| |-- bert_config.json # BERT config file
| |-- pytorch_model.bin # BERT model file
| `-- vocab.txt # vocabulary file
`-- preprocessed_data.data # train and valid data file
```
## Setting
|Name |Value |
|---|---|
|Encoder |BERT |
|Decoder |Transformer (Only Decoder) |
|Embed dimension |768 |
|Hidden dimension |3072 |
|Encoder layers |12 |
|Decoder layers |8 |
|Optimizer |Adam |
|Learning rate |init=0.0001 |
|Wormup step |4000 |
|Input max length |512 |
|Batch size |4 |
## Usage
### Train the model
```
python train.py -data data/preprocessed_data.data -bert_path data/checkpoint/ -proj_share_weight -label_smoothing -batch_size 4 -epoch 10 -save_model trained -save_mode best
```
### Generate summarization with trained model
```
python summarize.py -model data/checkpoint/trained/trained.chkpt -src data/preprocessed_data.data -vocab data/checkpoint/vocab.txt -output pred.txt
```
## Resut
### Tensorboard X image

## TODO
- Eval the model with score such as ROUGE-N
- Make some examples
## Acknowledge
- This repository structure and many codes are borrowed from [jadore801120/attention-is-all-you-need-pytorch](https://github.com/jadore801120/attention-is-all-you-need-pytorch).