
# ChineseEHRBert
A Chinese Electronic Health Record BERT pretrained model.

[中文版](./README_zh.md)

# cleaner
The cleaner is responsible for cleaning the .txt files used to train a Chinese BERT model. It splits each original line into smaller lines, where each small line is a complete sentence ending with a punctuation mark. This is required by the next sentence prediction pretraining task.

## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: write the output as a single file
- --log: logging frequency
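
For intuition, the splitting rule amounts to cutting each raw line at sentence-final Chinese punctuation so that every output line is one full sentence. A minimal sketch of that idea (the function name and the exact punctuation set are illustrative, not taken from parser.py):
```
import re

# Sentence-final punctuation marks for Chinese text (illustrative set).
SENTENCE_END = "。！？；"

def split_sentences(line):
    """Split a raw line into complete sentences, each keeping its
    trailing punctuation mark, as next sentence prediction requires."""
    # Match runs of text up to and including one sentence-final mark;
    # trailing text with no final mark is dropped in this sketch.
    parts = re.findall(f"[^{SENTENCE_END}]+[{SENTENCE_END}]", line)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("患者既往体健。无药物过敏史，出院后无不适！"))
# -> ['患者既往体健。', '无药物过敏史，出院后无不适！']
```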

# train
Pre-train a BERT model on the cleaned text. We first generate .tfrecord files and then pre-train with Google's BERT code. Note that the cleaned files may be too large to load into RAM, so our script splits them and generates multiple .tfrecord files.

## usage
Split file and convert to .tfrecord
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
[-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
[-b BERT_BASE_DIR]
```
- -f: cleaned file path
- -s: number of lines per split file, default 500000
- -p: split file save path
- -o: .tfrecord save path
- -l: max length
- -b: bert base dir
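
Conceptually, the splitting step chunks one large cleaned file into pieces of at most SPLIT_LINE lines each, so that every piece fits in RAM before being converted to a .tfrecord. A rough sketch of that chunking (names and output layout are illustrative; the actual make_pretrain_bert.py also performs the .tfrecord conversion):
```
import os

def split_file(file_path, split_path, split_line=500000):
    """Split a large cleaned text file into chunks of `split_line` lines
    so that each chunk can be loaded into RAM independently."""
    os.makedirs(split_path, exist_ok=True)
    chunk, idx = [], 0
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= split_line:
                _write_chunk(chunk, split_path, idx)
                chunk, idx = [], idx + 1
    if chunk:  # flush the final, possibly smaller chunk
        _write_chunk(chunk, split_path, idx)

def _write_chunk(lines, split_path, idx):
    with open(os.path.join(split_path, f"split_{idx}.txt"), "w",
              encoding="utf-8") as out:
        out.writelines(lines)
```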

Adjust the parameters in **pretrain128.sh** and **pretrain512.sh** to your specific requirements.
```
sh pretrain128.sh
sh pretrain512.sh
```
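
For reference, these scripts presumably wrap Google's run_pretraining.py; a typical invocation looks like the following (the paths and hyperparameter values here are illustrative, not the scripts' actual settings):
```
python run_pretraining.py \
  --input_file=./tfrecord/*.tfrecord \
  --output_dir=./pretrain_output_128 \
  --do_train=True \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=100000 \
  --learning_rate=2e-5
```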

# test
Test BERT on Chinese medical NLP tasks in one line: two NER tasks, one QA task, one RE task, and one sentence similarity task.
```
cd ./test/
sh run_test.sh
```
Tasks include [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), [ncov2019\_sim](https://tianchi.aliyun.com/competition/entrance/231776/introduction).
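
If you want to use the pretrained checkpoint outside these scripts, a TF1 BERT checkpoint like this one can usually be converted and loaded with Hugging Face transformers. This is a minimal sketch under that assumption, not the repository's own method; the ./chinese_ehr_bert directory name is hypothetical:
```
# One-time conversion of the TF1 checkpoint (assumes the standard layout of
# Google's BERT releases: bert_model.ckpt*, bert_config.json, vocab.txt):
#   transformers-cli convert --model_type bert \
#       --tf_checkpoint ./chinese_ehr_bert/bert_model.ckpt \
#       --config ./chinese_ehr_bert/bert_config.json \
#       --pytorch_dump_output ./chinese_ehr_bert/pytorch_model.bin
# Then rename bert_config.json to config.json so transformers can find it.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("./chinese_ehr_bert")  # reads vocab.txt
model = BertModel.from_pretrained("./chinese_ehr_bert")

inputs = tokenizer("患者主诉头痛三天。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```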

# Results
Results comparing the original BERT and ChineseEHRBert are in preparation.

# Citation

# Author
- [Zheng Yuan](https://github.com/GanjinZero)
- [Peng Zhao](https://github.com/zp9763)
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)