Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/naver/biobert-pretrained

BioBERT: a pre-trained biomedical language representation model for biomedical text mining
https://github.com/naver/biobert-pretrained

Last synced: 2 months ago
JSON representation

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Host: GitHub
URL: https://github.com/naver/biobert-pretrained
Owner: naver
Created: 2019-01-28T02:05:23.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-06-02T03:28:07.000Z (about 4 years ago)
Last Synced: 2024-03-16T05:36:00.077Z (3 months ago)
Homepage:
Size: 22.5 KB
Stars: 627
Watchers: 26
Forks: 85
Open Issues: 7
Metadata Files:
- Readme: README.md

Lists

awesome-bioie - BioBERT - [paper](https://arxiv.org/abs/1901.08746) - [code](https://github.com/dmis-lab/biobert) - A PubMed and PubMed Central-trained version of the [BERT language model](https://arxiv.org/abs/1810.04805). (Techniques / Language Models)
awesome-medical-nlp - biobert

README

        # BioBERT Pre-trained Weights

This repository provides pre-trained weights of BioBERT, a language representation model for biomedical domain, especially designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, question answering, etc.

Please refer to our paper [BioBERT: a pre-trained biomedical language representation model for biomedical text mining](https://arxiv.org/abs/1901.08746) for more details.

## Downloading pre-trained weights

Go to [releases](https://github.com/naver/biobert-pretrained/releases) section of this repository or click links below to download pre-trained weights of BioBERT.

We provide three combinations of pre-trained weights: BioBERT (+ PubMed), BioBERT (+ PMC), and BioBERT (+ PubMed + PMC).

Pre-training was based on the [original BERT code](https://github.com/google-research/bert) provided by Google, and training details are described in our paper. Currently available versions of pre-trained weights are as follows:

* **[BioBERT-Base v1.1 (+ PubMed 1M)](https://drive.google.com/file/d/1R84voFKHfWV9xjzeLzWBbmY1uOMYpnyD/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)

* **[BioBERT-Large v1.1 (+ PubMed 1M)](https://drive.google.com/file/d/1GJpGjQj6aZPV-EfbiQELpBkvlGtoKiyA/view?usp=sharing)** - based on BERT-large-Cased (custom 30k vocabulary), [NER/QA Results](https://github.com/dmis-lab/biobert/wiki/BioBERT-Large-Results)

* **[BioBERT-Base v1.0 (+ PubMed 200K)](https://drive.google.com/file/d/17j6pSKZt5TtJ8oQCDNIwlSZ0q5w7NNBg/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)

* **[BioBERT-Base v1.0 (+ PMC 270K)](https://drive.google.com/file/d/1LiAJklso-DCAJmBekRTVEvqUOfm0a9fX/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)

* **[BioBERT-Base v1.0 (+ PubMed 200K + PMC 270K)](https://drive.google.com/file/d/1jGUu2dWB1RaeXmezeJmdiPKQp3ZCmNb7/view?usp=sharing)** - based on BERT-base-Cased (same vocabulary)

Make sure to specify the versions of pre-trained weights used in your works.

If you have difficulty choosing which one to use, we recommend using **BioBERT-Base v1.1 (+ PubMed 1M)** or **BioBERT-Large v1.1 (+ PubMed 1M)** depending on your GPU resources.

Note that for BioBERT-Base, we are using WordPiece vocabulary (`vocab.txt`) provided by Google as any new words in biomedical corpus can be represented with subwords (for instance, Leukemia => Leu + ##ke + ##mia).

More details are in the closed [issue #1](https://github.com/naver/biobert-pretrained/issues/1).

## Pre-training corpus

We do not provide pre-processed version of each corpus. However, each pre-training corpus could be found in the following links:

*   **`PubMed Abstracts1`**: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/

*   **`PubMed Abstracts2`**: ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/

*   **`PubMed Central Full Texts`**: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_bulk/

Estimated size of each corpus is 4.5 billion words for **`PubMed Abstracts1`** + **`PubMed Abstracts2`**, and 13.5 billion words for **`PubMed Central Full Texts`**.

## Fine-tuning BioBERT

To fine-tunine BioBERT on biomedical text mining tasks using provided pre-trained weights, refer to the [DMIS GitHub repository for BioBERT](https://github.com/dmis-lab/biobert).

## Citation

```

@article{10.1093/bioinformatics/btz682,

    author = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},

    title = "{BioBERT: a pre-trained biomedical language representation model for biomedical text mining}",

    journal = {Bioinformatics},

    year = {2019},

    month = {09},

    issn = {1367-4803},

    doi = {10.1093/bioinformatics/btz682},

    url = {https://doi.org/10.1093/bioinformatics/btz682},

}

```

## Contact information

For help or issues using pre-trained weights of BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee

(`[email protected]`), or Sungdong Kim (`[email protected]`) for communication related to pre-trained weights of BioBERT.