# Berserker
Berserker (BERt chineSE woRd toKenizER) is a Chinese tokenizer built on top of Google's [BERT](https://github.com/google-research/bert) model.

## Installation
```shell
pip install basaka
```

## Usage
```python
import berserker

berserker.load_model() # A one-off installation of the pretrained model
berserker.tokenize('姑姑想過過過兒過過的生活。') # ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']
```

## Benchmark
The table below shows that Berserker achieves state-of-the-art F1 scores on the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip).
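
The SIGHAN bake-off scores segmenters with word-level precision, recall, and F1: a predicted word counts as correct only when its character span exactly matches a word in the gold segmentation. Below is a minimal sketch of that metric for illustration; it is not the official scoring script distributed with the bake-off data:

```python
def to_spans(tokens):
    """Convert a token list into (start, end) character spans."""
    spans, pos = set(), 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans

def segmentation_f1(gold_tokens, pred_tokens):
    """Word-level F1: a predicted word is correct only if its character
    span exactly matches a gold word's span."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with the sentence from the Usage section, against a
# hypothetical prediction that wrongly merges '過 過' into '過過'.
gold = ['姑姑', '想', '過', '過', '過兒', '過過', '的', '生活', '。']
pred = ['姑姑', '想', '過過', '過兒', '過過', '的', '生活', '。']
print(segmentation_f1(gold, pred))  # 7 matches -> P=7/8, R=7/9, F1≈0.82
```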

The results below were obtained by training for 15 epochs on each dataset with a batch size of 64.

| Model | PKU | CITYU | MSR | AS |
|--------------------|----------|----------|----------|----------|
| Liu et al. (2016) | **96.8** | -- | 97.3 | -- |
| Yang et al. (2017) | 96.3 | 96.9 | 97.5 | 95.7 |
| Zhou et al. (2017) | 96.0 | -- | 97.8 | -- |
| Cai et al. (2017) | 95.8 | 95.6 | 97.1 | -- |
| Chen et al. (2017) | 94.3 | 95.6 | 96.0 | 94.6 |
| Wang and Xu (2017) | 96.5 | -- | 98.0 | -- |
| Ma et al. (2018) | 96.1 | **97.2** | 98.1 | 96.2 |
| Berserker | 96.6 | 97.1 | **98.4** | **96.5** |

Reference: [Ji Ma, Kuzman Ganchev, David Weiss - State-of-the-art Chinese Word Segmentation with Bi-LSTMs](https://arxiv.org/pdf/1808.06511.pdf)

## Limitation
Since Berserker ~~is muscular~~ is based on BERT, it has a large model size (~300MB) and runs slowly on CPU. Berserker is just a proof of concept of what can be achieved with BERT.

Currently the default model provided is trained with the [SIGHAN 2005](http://sighan.cs.uchicago.edu/bakeoff2005/) [PKU dataset](http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip). We plan to release more pretrained models in the future.

## Architecture
Berserker is fine-tuned on TPU from the [pretrained Chinese BERT model](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip). The BERT encoder is connected to a single dense layer that is applied to every token, producing a sequence of outputs in [0, 1], where 1 denotes a split.
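
As a rough illustration, the sketch below mirrors that setup in Keras. Here `bert_encoder` is a placeholder for the pretrained Chinese BERT encoder, and the code is a conceptual outline rather than Berserker's actual training graph:

```python
import tensorflow as tf

def build_segmenter(bert_encoder, seq_len=128):
    """Conceptual sketch: BERT encoder followed by a per-token dense layer."""
    input_ids = tf.keras.Input(shape=(seq_len,), dtype=tf.int32)
    # `bert_encoder` stands in for the pretrained Chinese BERT model and is
    # assumed to return hidden states of shape (batch, seq_len, hidden_size).
    hidden = bert_encoder(input_ids)
    # A single dense layer applied to every token position, producing a
    # per-character probability in [0, 1]; 1 means "split after this character".
    split_prob = tf.keras.layers.Dense(1, activation='sigmoid')(hidden)
    return tf.keras.Model(input_ids, split_prob)

def probs_to_tokens(text, probs, threshold=0.5):
    """Group characters into words using the predicted split points."""
    tokens, current = [], ''
    for char, p in zip(text, probs):
        current += char
        if p >= threshold:  # a split is predicted after this character
            tokens.append(current)
            current = ''
    if current:
        tokens.append(current)
    return tokens
```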

## Training
We provide the source code for training under the `trainer` subdirectory. Feel free to contact me if you need any help reproducing the results.

## Bonus Video
[Yachae!! BERSERKER!!](https://www.youtube.com/watch?v=H_xmyvABZnE)