https://github.com/hankcs/id-cnn-cws

Source codes and corpora of paper "Iterated Dilated Convolutions for Chinese Word Segmentation"
https://github.com/hankcs/id-cnn-cws

bilstm cnn crf cws nlp tensorflow

Host: GitHub
URL: https://github.com/hankcs/id-cnn-cws
Owner: hankcs
License: gpl-3.0
Created: 2017-10-20T16:14:58.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2021-04-15T20:44:41.000Z (almost 4 years ago)
Last Synced: 2024-10-14T12:18:57.797Z (3 months ago)
Topics: bilstm, cnn, crf, cws, nlp, tensorflow
Language: Python
Size: 27.2 MB
Stars: 136
Watchers: 10
Forks: 40
Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE

# ID-CNN-CWS
Source code and corpora for the paper "[Iterated Dilated Convolutions for Chinese Word Segmentation](http://www.nnw.cz/doi/2020/NNW.2020.30.022.pdf)", published in the NNW journal.

![2017-10-20_13-23-31](http://wx3.sinaimg.cn/large/006Fmjmcly1fkpa3q8maej30dh0c2jup.jpg)

It implements the following four models for Chinese word segmentation (CWS):

- Bi-LSTM
- Bi-LSTM-CRF
- ID-CNN
- ID-CNN-CRF
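
To give a feel for what the "dilated" in ID-CNN buys, here is a minimal plain-Python sketch (not the repository's TensorFlow code). It stacks width-3 convolutions with dilations 1, 2 and 4 and shows how quickly the context seen by one output position grows. The kernel width, the dilation rates and the helper names `dilated_conv1d` and `receptive_field` are assumptions chosen for illustration.

```python
def dilated_conv1d(xs, kernel, dilation):
    """Width-preserving 1D dilated convolution with zero padding."""
    half = len(kernel) // 2
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(kernel):
            # Kernel taps are spaced `dilation` positions apart around t.
            i = t + (k - half) * dilation
            if 0 <= i < len(xs):
                acc += w * xs[i]
        out.append(acc)
    return out


def receptive_field(kernel_width, dilations):
    """Number of input positions that can influence one output position."""
    return 1 + sum((kernel_width - 1) * d for d in dilations)


# Hypothetical block: three width-3 layers with dilations 1, 2, 4.
DILATIONS = [1, 2, 4]

# Push a unit impulse through the stack to see how far it spreads.
signal = [0.0] * 21
signal[10] = 1.0
for d in DILATIONS:
    signal = dilated_conv1d(signal, [1.0, 1.0, 1.0], d)

touched = [i for i, v in enumerate(signal) if v != 0.0]
print(receptive_field(3, DILATIONS), touched[0], touched[-1])  # 15 3 17
```

Three undilated width-3 layers would cover only 7 positions; with these dilations the same depth covers 15, which is the usual motivation for dilated stacks in sequence labeling.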

## Dependencies

- Python >= 3.6
- TensorFlow >= 1.2

Both CPU and GPU are supported. Training on a GPU is about 10 times faster than on a CPU.

## Preparation

Run the following script to convert the corpora into TensorFlow datasets.

```
$ ./scripts/make.sh
```
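
What `make.sh` does internally is not documented here. As a rough illustration of what such a conversion typically involves for CWS, the sketch below turns a whitespace-segmented line into per-character tags. The BMES tag scheme (Begin, Middle, End, Single) and the helper names `words_to_tags` and `tags_to_words` are assumptions, not taken from the repository.

```python
def words_to_tags(words):
    """Map a list of words to per-character (char, tag) pairs using BMES."""
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))  # single-character word
        else:
            pairs.append((word[0], "B"))  # first character
            pairs.extend((ch, "M") for ch in word[1:-1])  # inner characters
            pairs.append((word[-1], "E"))  # last character
    return pairs


def tags_to_words(chars, tags):
    """Inverse mapping: rebuild words from characters and BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends here
            words.append(current)
            current = ""
    if current:  # tolerate a dangling B/M at the end of the input
        words.append(current)
    return words


line = "我 喜欢 自然语言处理"  # one segmented sentence, words split by spaces
pairs = words_to_tags(line.split())
tag_string = "".join(tag for _, tag in pairs)
print(tag_string)  # SBEBMMMME
```

Under this framing, a segmenter only has to predict one tag per character, and `tags_to_words` recovers the segmentation.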

## Train and Test

### Quick Start

```
$ ./scripts/run.sh $dataset $model
```

- `$dataset` can be `pku`, `msr`, `asSC` or `cityuSC`.
- `$model` can be `cnn` or `bilstm`.

For example:

```
$ ./scripts/run.sh pku cnn
```

This trains a `cnn` model on the `pku` dataset and then evaluates it on the test set.
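
To run every dataset and model combination in one go, a small driver like the following could be used. It is a sketch, not part of the repository: the helper names `build_commands` and `run_all` are hypothetical, and it assumes the script is invoked from the repository root. It defaults to a dry run that only prints the commands.

```python
import itertools
import subprocess

DATASETS = ["pku", "msr", "asSC", "cityuSC"]
MODELS = ["cnn", "bilstm"]


def build_commands(datasets=DATASETS, models=MODELS, extra_args=()):
    """Return one argv list per (dataset, model) pair."""
    return [
        ["./scripts/run.sh", dataset, model, *extra_args]
        for dataset, model in itertools.product(datasets, models)
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sweep at the first failing run.
            subprocess.run(cmd, check=True)


commands = build_commands()
run_all(commands)  # dry run: prints 8 commands, executes nothing
```

Call `run_all(commands, dry_run=False)` to actually launch the runs one after another.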

### CRF Layer

To enable the CRF layer, append `--viterbi` to the command, e.g.

```
$ ./scripts/run.sh pku cnn --viterbi
```
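
The flag name suggests that predictions are then decoded with the Viterbi algorithm rather than by taking the best tag at each position independently. The sketch below is a plain-Python illustration of that idea, not the repository's implementation. The BMES tag set and all scores are made-up assumptions; in a trained CRF the transition scores would be learned.

```python
NEG_INF = float("-inf")
TAGS = ["B", "M", "E", "S"]  # assumed tag set

# Hypothetical transition scores: 0 for allowed moves, -inf for forbidden ones.
TRANSITIONS = [
    [NEG_INF, 0.0, 0.0, NEG_INF],  # B -> M or E
    [NEG_INF, 0.0, 0.0, NEG_INF],  # M -> M or E
    [0.0, NEG_INF, NEG_INF, 0.0],  # E -> B or S
    [0.0, NEG_INF, NEG_INF, 0.0],  # S -> B or S
]


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index path and its score."""
    n_tags = len(transitions)
    scores = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_scores, ptrs = [], []
        for j in range(n_tags):
            # Best previous tag for landing on tag j at this position.
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)
        scores = new_scores
        backptrs.append(ptrs)
    last = max(range(n_tags), key=lambda j: scores[j])
    path = [last]
    for ptrs in reversed(backptrs):  # follow back-pointers to the start
        path.append(ptrs[path[-1]])
    path.reverse()
    return path, scores[last]


# Made-up per-character scores for a 3-character input (columns: B, M, E, S).
emissions = [
    [2.0, 0.0, 0.0, 1.5],
    [2.0, 1.0, 0.5, 1.8],
    [0.0, 0.0, 2.0, 1.0],
]

greedy = [max(range(4), key=lambda j: row[j]) for row in emissions]
path, score = viterbi_decode(emissions, TRANSITIONS)
print([TAGS[i] for i in greedy])  # ['B', 'B', 'E'], contains an invalid B -> B step
print([TAGS[i] for i in path], score)  # ['S', 'B', 'E'] 5.5
```

The greedy choice produces a tag sequence that cannot correspond to any segmentation, while the Viterbi path respects the transition constraints.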

## Accuracy

![2017-10-20_13-25-11](http://wx1.sinaimg.cn/large/006Fmjmcly1fkpa3in2haj30dq0h9q8u.jpg)

## Speed

![2017-10-20_11-44-42](http://wx3.sinaimg.cn/large/006Fmjmcly1fkp6wafcngj30d407l0th.jpg)

## Acknowledgments

- Corpora are from SIGHAN05, converted to Simplified Chinese via [HanLP](https://github.com/hankcs/HanLP). Note that the SIGHAN datasets should only be used for research purposes.
- Model implementations are adapted from https://github.com/iesl/dilated-cnn-ner by [Emma Strubell](https://cs.umass.edu/~strubell).