https://github.com/ou-medinfo/medbertjp
Trials of pre-trained BERT models for the medical domain in Japanese.
- Host: GitHub
- URL: https://github.com/ou-medinfo/medbertjp
- Owner: ou-medinfo
- License: other
- Created: 2020-10-22T18:07:13.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2020-11-21T23:37:28.000Z (about 4 years ago)
- Last Synced: 2024-07-28T15:33:34.704Z (6 months ago)
- Language: Jupyter Notebook
- Homepage: https://arxiv.org/abs/2005.07202
- Size: 272 KB
- Stars: 12
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-japanese-llm - medBERTjp - NC-SA 4.0 | △ | (Models mainly used for processing input text / Domain-specific)
README
# Trials of pre-trained BERT models for the medical domain in Japanese
The models in this repository are pre-trained BERT models designed to be adapted to the Japanese medical domain.
The medical corpora were scraped for academic use from [Today's diagnosis and treatment: premium](https://www.igaku-shoin.co.jp/bookDetail.do?book=89056), which consists of 15 digital references for clinicians in Japanese published by [IGAKU-SHOIN Ltd.](https://www.igaku-shoin.co.jp/).
The general corpora were extracted from a Wikipedia dump file (jawiki-20190901) on https://dumps.wikimedia.org/jawiki/.

## Our demonstration models
* [medBERTjp - MeCab-IPAdic](https://github.com/ou-medinfo/medbertjp/releases/tag/v0.1-mi)
- pre-trained model following [MeCab-IPAdic-tokenized Japanese BERT model](https://github.com/cl-tohoku/bert-japanese/).
- Japanese tokenizer: [MeCab](https://taku910.github.io/mecab/) + Byte Pair Encoding (BPE)
- [ipadic-py](https://github.com/polm/ipadic-py) or a manual installation of IPAdic is required.
- max_seq_length=128
* [medBERTjp - Unidic-2.3.0](https://github.com/ou-medinfo/medbertjp/releases/tag/v0.1-mu)
- Japanese tokenizer: [MeCab](https://taku910.github.io/mecab/) + BPE
- Unidic v2.3.0+2020-10-08 via [unidic-py](https://github.com/polm/unidic-py) is required.
- max_seq_length=128
* [medBERTjp - MeCab-IPAdic-NEologd-JMeDic](https://github.com/ou-medinfo/medbertjp/releases/tag/v0.1-minj)
- Japanese tokenizer: [MeCab](https://taku910.github.io/mecab/) + BPE
- Installation of both [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd/) and [J-MeDic (MANBYO_201907_Dic-utf8.dic)](https://sociocom.naist.jp/manbyo-dic-en/) is required.
- max_seq_length=128
* [medBERTjp - SentencePiece](https://github.com/ou-medinfo/medbertjp/releases/tag/v0.2-sp)
(*Old: [v0.1-sp](https://github.com/ou-medinfo/medbertjp/releases/tag/v0.1-sp)*)
- Japanese tokenizer: [SentencePiece](https://github.com/google/sentencepiece/) following [Sentencepiece Japanese BERT model](https://github.com/yoheikikuta/bert-japanese)
- uses SentencePiece tokenization customized for the medical domain
- max_seq_length=128

## Requirements
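Before loading a model, it can help to confirm that the Python packages named in this section are importable. The following is a minimal sketch and not part of the repository: the variant labels, `PYTHON_PACKAGES`, `missing_modules`, and `report` are hypothetical names introduced here, and the grouping of packages per variant is an assumption based on the model descriptions in this README.

```python
from importlib.util import find_spec

# Import names of the Python packages named in this README, grouped by model
# variant. The variant keys are hypothetical labels chosen for this sketch.
# mecab-ipadic-NEologd and J-MeDic are dictionary files rather than Python
# packages, so they cannot be detected this way and must be checked by hand.
# IPAdic may also be installed manually instead of through the ipadic package.
PYTHON_PACKAGES = {
    "mecab-ipadic": ["transformers", "fugashi", "ipadic"],
    "mecab-unidic": ["transformers", "fugashi", "unidic"],
    "mecab-ipadic-neologd-jmedic": ["transformers", "fugashi"],
    "sentencepiece": ["transformers", "sentencepiece"],
}


def missing_modules(modules):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if find_spec(name) is None]


def report(variant):
    """Return a one-line, human-readable status for one variant."""
    missing = missing_modules(PYTHON_PACKAGES[variant])
    if missing:
        return f"{variant}: missing {', '.join(missing)}"
    return f"{variant}: all Python packages found"


if __name__ == "__main__":
    for variant in PYTHON_PACKAGES:
        print(report(variant))
```

The check only looks for importable modules. It does not verify package versions or the presence of MeCab dictionary files.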
For just using the models:
+ [Transformers](https://github.com/huggingface/transformers/) (>=2.11.0)
+ [fugashi](https://github.com/polm/fugashi), a Cython wrapper for [MeCab](https://taku910.github.io/mecab/)
- [ipadic](https://github.com/polm/ipadic-py), [unidic-py](https://github.com/polm/unidic-py), [mecab-ipadic-NEologd](https://github.com/neologd/mecab-ipadic-neologd/), and [J-MeDic](https://sociocom.naist.jp/manbyo-dic-en/), as required by the chosen model.
+ [SentencePiece](https://github.com/google/sentencepiece/) would be automatically installed with [Transformers](https://github.com/huggingface/transformers/).

## Usage
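As a rough illustration, loading one of the MeCab-based models with Transformers might look like the sketch below. It is not taken from the repository notebooks and rests on several assumptions: that a release archive has been unpacked into a local directory holding Transformers-compatible config, vocabulary, and PyTorch weight files; that `BertJapaneseTokenizer` with MeCab word splitting matches the tokenization used in pre-training; and that a recent Transformers release with callable tokenizers and PyTorch are installed. The helper names `encoding_options`, `load_medbertjp`, and `embed` are hypothetical.

```python
MAX_SEQ_LENGTH = 128  # value stated for every release in this README


def encoding_options(max_seq_length=MAX_SEQ_LENGTH):
    """Tokenizer call options that pad or truncate to a fixed length."""
    return {
        "max_length": max_seq_length,
        "truncation": True,
        "padding": "max_length",
        "return_tensors": "pt",  # PyTorch tensors; requires torch
    }


def load_medbertjp(model_dir):
    """Load a tokenizer and encoder from a local directory (assumed layout)."""
    # Imported lazily so encoding_options() works without Transformers.
    from transformers import BertJapaneseTokenizer, BertModel

    # Assumption: MeCab word splitting followed by the tokenizer's default
    # subword step is compatible with the released vocabulary.
    tokenizer = BertJapaneseTokenizer.from_pretrained(
        model_dir, word_tokenizer_type="mecab"
    )
    model = BertModel.from_pretrained(model_dir)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(text, tokenizer, model):
    """Return the last hidden states for a single sentence."""
    import torch

    inputs = tokenizer(text, **encoding_options())
    with torch.no_grad():
        outputs = model(**inputs)
    # First element is the last hidden state:
    # shape (1, max_seq_length, hidden_size).
    return outputs[0]
```

Under those assumptions, `tokenizer, model = load_medbertjp("path/to/unpacked/release")` (a hypothetical path) followed by `embed(text, tokenizer, model)` yields one vector per token position. The SentencePiece model and the NEologd/J-MeDic model need different tokenizer settings that this sketch does not cover.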
Please see the code examples in [`tokenization_example.ipynb`](./tokenization_example.ipynb), or try [`example_google_colab.ipynb`](./example_google_colab.ipynb) on [Google Colab](https://colab.research.google.com/).

## Funding
This work was supported by the Council for Science, Technology and Innovation (CSTI), cross-ministerial Strategic Innovation Promotion Program (SIP), "Innovative AI Hospital System" (Funding Agency: National Institute of Biomedical Innovation, Health and Nutrition (NIBIOHN)).

## Licenses
The pretrained models are distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).
They are freely available for academic purposes or individual research, but commercial use is restricted.

The code in this repository is licensed under the Apache License, Version 2.0.