https://github.com/26hzhang/bert_classification

Token and Sentence Level Classification with Google's BERT (TensorFlow)
https://github.com/26hzhang/bert_classification
bert-bilstm-crf bert-embeddings bert-model tensorflow
Last synced: 9 months ago
JSON representation
Token and Sentence Level Classification with Google's BERT (TensorFlow)
Host: GitHub
URL: https://github.com/26hzhang/bert_classification
Owner: 26hzhang
Created: 2019-04-29T09:51:50.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2019-07-11T01:28:43.000Z (almost 7 years ago)
Last Synced: 2025-03-29T09:21:53.935Z (over 1 year ago)
Topics: bert-bilstm-crf, bert-embeddings, bert-model, tensorflow
Language: Python
Homepage:
Size: 26.5 MB
Stars: 10
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # BERT Classification

Use google BERT (tensorflow-based) to do token-level and sentence-level classification.

## Requirements

- tensorflow>=1.11.0 (or tensorflow-gpu>=1.11.0)

- numpy>=1.14.4

- official tensorflow based bert code, get the code [`https://github.com/google-research/bert.git`](

https://github.com/google-research/bert.git) and place it under this repository.

- pre-trained bert models (according to the tasks), download and place to the `checkpoint/` directory.

```

bert_classification/

    |____ bert/

    |____ bert_ckpt/

    |____ checkpoint/

    |____ datasets/

    |____ .gitignore

    |____ conlleval.pl

    |____ data_cls_helper.py

    |____ data_seq_helper.py

    |____ README.md

    |____ run_sequence_tagger.py

    |____ run_text_classifier.py

```

## Dataset Overview

**Token level classification datasets (POS, Chunk and NER)**:

Dataset | Language | Classes | Training tokens | Dev tokens | Test tokens

:---: | :---: | :---: | :---: | :---: | :---:

CoNLL2000 Chunk (en) | English | 23 | 211,727 | _N.A._ | 47,377

CoNLL2002 NER (es) | Spanish | 9 | 207,484 (18,797) | 51,645 (4,351) | 52098 (3,558)

CoNLL2002 NER (nl) | Dutch | 9 | 20,2931 (13,344) | 37,761 (2,616) | 68,994 (3,941)

CoNLL2003 NER (en) | English | 9 | 20,4567 (23,499) | 51,578 (5,942) | 46,666 (5,648)

CoNLL2003 NER (de 1) | German | 9 |  208,836 (16,839) | 51,444 (6,588) | 51,943 (5,171)

GermEval2014 NER (de 2) | German | 25 | 452,853 (42,089) | 41,653 (3,960) | 96,499 (8,969)

Chinese NER 1 (zh 1) | Chinese | 21 | 1,044,967 (311,637) | 86,454 (24,444) | 119,467 (38,854)

Chinese NER 2 (Zh 2) | Chinese | 7 | 979,180 (110,093) | 109,870 (12,059) | 219,197 (25,012)

> All the lines in those datasets are convert to `(word, label)` pairs with `\t` as separator and drop all the lines

start with `-DOCSTART-` and other undesired lines, while the label is in BIO2 format (Begin, Inside, Others).

**Sentence level classification datasets**:

Dataset | Classes | Average sentence length | Train size | Dev size | Test size

:---: | :---: | :---: | :---: | :---: | :---:

CR | 2 | 19 | 3,395 | _N.A._ | 377

MR | 2 | 20 | 9,595 | _N.A._ | 1,066

SST2 | 2 | 11 | 67,349 | 872 | 1,821

SST5 | 5 | 18 | 8,544 | 1,101 | 2,210

SUBJ | 2 | 23 | 9,000 | _N.A._ | 1,000

TREC | 6 | 10 | 5,452 | _N.A._ | 500

> All the datasets are converted to `utf-8` format via `iconv -f  -t utf-8 filename -o save_name`. For the 

_SUBJ_, _MR_ and _CR_ datasets, `90%` for train, `10%` for test, while the dev dataset is the duplicate of test dataset. 

For _TREC_ dataset, the dev dataset is the duplicate of test dataset.

**Natural language inference (sentence pair classification) datasets**:

Dataset | Classes | Train size | Dev size | Test size

:---: | :---: | :---: | :---: | :---:

MRPC | 2 | 4,077 | 1,726 | 1,726

SICK | 3 | 4,501 | 501 | 4,928

SNLI | 3 | 549,367 | 9,842 | 9,824

CoLA | 2 | 8,551 | 527 | 516

> [_MNLI_](https://www.nyu.edu/projects/bowman/multinli/) and [_XNLI_](

https://www.nyu.edu/projects/bowman/xnli/) datasets are implemented by the official BERT already, see 

`run_classifier.py` in [[google-research/bert]](https://github.com/google-research/bert).

## Usage

For token-level classification, run:

```bash

python3 run_sequence_tagger.py --task_name ner  \  # task name

                               --data_dir datasets/CoNLL2003_en  \  # dataset folder

                               --output_dir checkpoint/conll2003_en  \  # path to save outputs and trained params

                               --bert_config_file bert_ckpt/cased_L-12_H-768_A-12/bert_config.json  \  # pre-trained BERT configs

                               --init_checkpoint bert_ckpt/cased_L-12_H-768_A-12/bert_model.ckpt  \  # pre-trained BERT params

                               --vocab_file bert_ckpt/cased_L-12_H-768_A-12/vocab.txt  \  # BERT vocab file

                               --do_lower_case False  \  # whether lowercase the input tokens

                               --max_seq_length 128  \  # maximal sequence allowed

                               --do_train True  \  # if training

                               --do_eval True  \  # if evaluation

                               --do_predict True  \  # if prediction

                               --batch_size 32  \  # batch_size, change to `16` if OOM happens

                               --num_train_epochs 6  \  # number of epochs

                               --use_crf True  # if use CRF for decoding

```

The token-level classification model contains two modules, one is using CRF for decode while another use a classifier 

directly. The output sequence of bert model is first fed into a dense layer and then decode by CRF/classifier, no 

intermediate RNN layers are used.

For sentence-level classification, run:

```bash

python3 run_text_classifier.py --task_name mrpc  \  # task name

                               --data_dir datasets/MRPC  \  # dataset folder

                               --output_dir checkpoint/mrpc  \  # path to save outputs and trained params

                               --bert_config_file bert_ckpt/uncased_L-12_H-768_A-12/bert_config.json  \  # pre-trained BERT configs

                               --init_checkpoint bert_ckpt/uncased_L-12_H-768_A-12/bert_model.ckpt  \  # pre-trained BERT params

                               --vocab_file bert_ckpt/uncased_L-12_H-768_A-12/vocab.txt  \  # BERT vocab file

                               --do_lower_case True  \  # whether lowercase the input tokens

                               --max_seq_length 128  \  # maximal sequence allowed

                               --do_train True  \  # if training

                               --do_eval True  \  # if evaluation

                               --do_predict True  \  # if prediction

                               --batch_size 32  \  # batch_size, change to `16` if OOM happens

                               --num_train_epochs 6  # number of epochs

```

The sentence-level classification directly take the pooled output of bert model and feed it into a classifier for 

decode.

## Experiment Results

> All the experiments are running on `1` GeForce GTX 1080 Ti GPU.

**Token level classification datasets**

Dataset | en Chunk | es NER | nl NER | en NER | de NER 1 | de NER 2 | zh NER 1 | zh NER 2

:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:

Precision (%) | 96.8 | 89.0 | 89.8 | 92.0 | 82.0 | 86.2 | 77.9 | 95.7

Recall (%) | 96.4 | 88.6 | 90.0 | 90.8 | 86.4 | 85.4 | 73.1 | 95.7

F1 (%) | 96.6 | 88.8 | 89.9 | 91.4 | 84.2 | 85.8 | 75.5 | 95.7

> CoNLL2002 Spanish/Dutch, CoNLL2003 German NER and GermEval2014 German NER use [`multi_cased_L-12_H-768_A-12.zip`](

https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) pre-trained model (base, 

multilingual, cased)

> CoNLL2000 Chunk and CoNLL2003 NER utilize [`cased_L-12_H-768_A-12.zip`](

https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip) pre-trained model (base, English, 

cased)

> Chinese NER uses [`chinese_L-12_H-768_A-12.zip`](

https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) pre-trained model (base, Chinese).

The testing results on CoNLL-2003 English NER are lower than the reported score of the [paper](

https://arxiv.org/pdf/1810.04805.pdf) (`91.4%` v.s. `92.4%`). As the paper says a `0.2%` difference is reasonable, 

however, I got `1.0%` error. I think maybe some tricks are missing, for example, the parameters setting in 

output classifier or data pre-processing strategies.

**Sentence level classification datasets**

Dataset | CR | MR | SST2 | SST5 | SUBJ | TREC

:---: | :---: | :---: | :---: | :---: | :---: | :---:

Dev Accuracy (%) | _N.A._ | _N.A._ | 91.3 | 50.1 | _N.A._ | _N.A._

Test Accuracy (%) | 89.2 | 85.4 | 93.5 | 53.3 | 97.3 | 96.6

> All the tasks use [`uncased_L-12_H-768_A-12.zip`](

https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) pre-trained model (base, English, 

uncased).

**Natural language inference datasets**

Dataset | MRPC | SICK | SNLI | CoLA

:---: | :---: | :---: | :---: | :---:

Dev Accuracy (%) | _N.A._ | 86.4 | 91.1 | 83.1

Test Accuracy (%) | 84.7 | 87.0 | 90.7 | 78.9

> All the tasks use [`uncased_L-12_H-768_A-12.zip`](

https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) pre-trained model (base, English, 

uncased). 

The results may differ from the reported results, since I do not use the _GLUE version_ datasets.

## Reference

- [[google-research/bert]](https://github.com/google-research/bert).

- [[macanv/BERT-BiLSTM-CRF-NER]](https://github.com/macanv/BERT-BiLSTM-CRF-NER).

- [[Kyubyong/bert_ner]](https://github.com/Kyubyong/bert_ner).

- [[kyzhouhzau/BERT-NER]](https://github.com/kyzhouhzau/BERT-NER).

- Natural language inference datasets are obtained from [[nyu-mll/GLUE-baselines]](

https://github.com/nyu-mll/GLUE-baselines), _MRPC_ data can be download [[here]](

https://github.com/jaisong87/prDetect/tree/master/Preprocess).

- Sentence level classification datasets are obtained from [[facebookresearch/SentEval]](

https://github.com/facebookresearch/SentEval).

- Chinese NER 1 is obtained from [[lancopku/Chinese-Literature-NER-RE-Dataset]](

https://github.com/lancopku/Chinese-Literature-NER-RE-Dataset).

- Chinese NER 2 is obtained from[[zjy-ucas/ChineseNER]](https://github.com/zjy-ucas/ChineseNER).

- CoNLL-2003 German NER dataset is obtained from [[MaviccPRP/ger_ner_evals]](https://github.com/MaviccPRP/ger_ner_evals).

- [GermEval 2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data).

- CoNLL-2000 Chunk and CoNLL-2002 NER datasets are obtained from [[teropa/nlp]](https://github.com/teropa/nlp).

- CoNLL-2003 English NER dartaset is obtained from [[synalp/NER/corpus/CoNLL-2003]](

https://github.com/synalp/NER/tree/master/corpus/CoNLL-2003).
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/26hzhang/bert_classification

Awesome Lists containing this project

README