Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/GanjinZero/ChineseEHRBert
A Chinese EHR Bert Pretrained Model.
- Host: GitHub
- URL: https://github.com/GanjinZero/ChineseEHRBert
- Owner: GanjinZero
- Created: 2019-10-14T03:33:41.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2021-07-14T02:40:37.000Z (almost 3 years ago)
- Last Synced: 2024-01-17T01:05:35.956Z (5 months ago)
- Language: Python
- Size: 746 KB
- Stars: 244
- Watchers: 12
- Forks: 44
- Open Issues: 5
- Metadata Files:
- Readme: README.md
Lists
- awesome-stars-copy - GanjinZero/ChineseEHRBert - A Chinese EHR Bert Pretrained Model. (Python)
README
# ChineseEHRBert
A Chinese Electronic Health Record (EHR) BERT pretrained model. [Chinese version](./README_zh.md)
# cleaner
The cleaner is responsible for cleaning the raw text files used to pre-train a Chinese BERT model. It splits each original line into shorter lines, where every short line is a complete sentence ending with a punctuation mark. This is required for the next sentence prediction (NSP) pretraining task.
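As an illustration only (a minimal sketch with hypothetical helper names, not the repository's actual parser.py), splitting raw lines into one complete sentence per line could look like this:
```
import re

# Sentence-final punctuation for Chinese clinical text, plus ASCII variants.
SENTENCE_END = re.compile(r'[^。！？!?]*[。！？!?]')

def split_sentences(line):
    """Yield complete sentences (with their punctuation) from one raw line."""
    for match in SENTENCE_END.finditer(line.strip()):
        sentence = match.group(0).strip()
        if sentence:
            yield sentence

if __name__ == "__main__":
    raw = "患者主诉头痛三天。无发热！既往体健。"
    for s in split_sentences(raw):
        print(s)  # one complete sentence per output line
```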
## usage
```
cd ./cleaner/
python parser.py [-h] [--input INPUT] [--output OUTPUT] [-s] [--log LOG]
```
- --input: input directory
- --output: output directory
- -s: output is one single file
- --log: log frequency

# train
Pre-train a BERT model with the cleaned text: first generate .tfrecord files, then pre-train with Google's BERT code. Note that a cleaned file may be too large to fit in RAM, so our script splits such files and generates multiple .tfrecord files.
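As a rough sketch of that splitting step (a hypothetical stand-in, not the actual make_pretrain_bert.py code; the default of 500000 matches the documented -s default below), writing a large corpus out in chunks could look like this:
```
import os

def split_file(file_path, split_path, split_line=500000):
    """Split file_path into numbered chunks of at most split_line lines each."""
    os.makedirs(split_path, exist_ok=True)
    chunk_id, buffer = 0, []
    with open(file_path, encoding="utf-8") as src:
        for line in src:
            buffer.append(line)
            if len(buffer) >= split_line:
                _write_chunk(split_path, chunk_id, buffer)
                chunk_id, buffer = chunk_id + 1, []
    if buffer:  # final, possibly smaller chunk
        _write_chunk(split_path, chunk_id, buffer)

def _write_chunk(split_path, chunk_id, lines):
    out = os.path.join(split_path, "chunk_%d.txt" % chunk_id)
    with open(out, "w", encoding="utf-8") as dst:
        dst.writelines(lines)
```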
## usage
Split file and convert to .tfrecord:
```
cd ./train/
python make_pretrain_bert.py [-h] [-f FILE_PATH] [-s SPLIT_LINE]
[-p SPLIT_PATH] [-o OUTPUT_PATH] [-l MAX_LENGTH]
[-b BERT_BASE_DIR]
```
- -f: cleaned file path
- -s: split line count, default=500000
- -p: split file save path
- -o: .tfrecord save path
- -l: max length
- -b: BERT base directory

One should change the parameters in **pretrain128.sh** and **pretrain512.sh** to fit your specific requirements, then run:
```
sh pretrain128.sh
sh pretrain512.sh
```

# test
Test Chinese medical NLP tasks with BERT in one line! Two NER tasks, one QA task, one RE task, and one sentence similarity task.
```
cd ./test/
sh run_test.sh
```
Tasks include [CCKS2019NER](https://www.biendata.com/competition/CCKS2019_1/), [cMedQA2](https://github.com/zhangsheng93/cMedQA2), [Tianchi\_NER](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), [Tianchi\_RE](https://tianchi.aliyun.com/dataset/dataDetail?spm=5176.12281978.0.0.75926bacsx0LyL&dataId=22288), and [ncov2019\_sim](https://tianchi.aliyun.com/competition/entrance/231776/introduction).
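To use the pretrained checkpoint outside these test scripts, and assuming the release is a standard TensorFlow 1.x BERT checkpoint, it can be loaded with HuggingFace transformers roughly as follows (all local paths here are hypothetical placeholders):
```
# A minimal sketch, assuming a standard TF 1.x BERT checkpoint layout;
# the directory and file names are hypothetical placeholders.
from transformers import BertConfig, BertForPreTraining, BertTokenizer

model_dir = "./ChineseEHRBert"  # hypothetical local checkpoint directory
config = BertConfig.from_json_file(model_dir + "/bert_config.json")
tokenizer = BertTokenizer(vocab_file=model_dir + "/vocab.txt")

# from_tf=True lets transformers read the TensorFlow checkpoint directly.
model = BertForPreTraining.from_pretrained(
    model_dir + "/bert_model.ckpt.index", from_tf=True, config=config
)

inputs = tokenizer("患者主诉头痛三天。", return_tensors="pt")
outputs = model(**inputs)
```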
# Results
Results comparing the original Chinese BERT with ChineseEHRBert are in preparation.

# Citation
# Authors
- [Zheng Yuan](https://github.com/GanjinZero)
- [Peng Zhao](https://github.com/zp9763)
- Chen Yu
- [Sheng Yu](http://www.stat.tsinghua.edu.cn/teambuilder/faculty/yusheng/)