https://github.com/wglab/bioformer

Bioformer: an efficient BERT model for biomedical text mining
https://github.com/wglab/bioformer

bert-model biomedical-nlp natural-language-processing natural-language-understanding nlp

Last synced: 6 months ago
JSON representation

Bioformer: an efficient BERT model for biomedical text mining

Host: GitHub
URL: https://github.com/wglab/bioformer
Owner: WGLab
License: apache-2.0
Created: 2021-04-20T03:56:49.000Z (about 5 years ago)
Default Branch: main
Last Pushed: 2023-02-07T04:09:44.000Z (over 3 years ago)
Last Synced: 2025-12-09T09:04:29.478Z (7 months ago)
Topics: bert-model, biomedical-nlp, natural-language-processing, natural-language-understanding, nlp
Homepage:
Size: 29.3 KB
Stars: 55
Watchers: 6
Forks: 6
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

# Bioformer: an efficient BERT model for biomedical text mining
Bioformer is a lightweight BERT model pretrained from biomedical Literature. We pretrained two Bioformer models, `Bioformer-8L` and `Bioformer-16L`. Both models were pretrained on all PubMed abstracts (as of Jan 2021) and 1 million subsampled PubMed Central full-text articles. We used the original implementation of [BERT](https://github.com/google-research/bert) to train the model. Bioformer models have the following features:

- **Accurate**. Bioformer achieves comparable or even better performance than BioBERT/PubMedBERT on downstream NLP tasks. A detailed evaluation is [here](https://arxiv.org/abs/2302.01588).
- **Smaller model size**. `Bioformer-8L` and `Bioformer-16L` reduced the model size by 60% compared with `BERT-Base`/`BioBERT-Base`/`PubMedBERT`.
- **Fast and memory efficient**. `Bioformer-8L` is 3X as fast as `PubMedBERT`, and `Bioformer-16L` is 2X as fast as `PubMedBERT`.
- **Biomedical vocabulary**. Bioformer uses a biomedical vocabulary of 32768 tokens, which was trained from PubMed abstracts and PubMed Central full-text articles. Bioformer is able to encode some special unicode symbols that are not in the original BERT vocabulary.

## Download

#### Pytorch checkpoint

Pretrained model weights of `Bioformer-8L` and `Bioformer-16L` are available on HuggingFace ([Bioformer-8L](https://huggingface.co/bioformers/bioformer-8L), and [Bioformer-16L](https://huggingface.co/bioformers/bioformer-16L))

You can easily use Bioformer with the [transformers](https://github.com/huggingface/transformers) library.

## Acknowledgment

Pretraining of Bioformer is partly supported by the Google TPU Research Cloud (TRC) program.

## Citation

Fang L, Chen Q, Wei C-H, Lu Z, Wang K: Bioformer: an efficient transformer language model for biomedical text mining. arXiv preprint arXiv:2302.01588 (2023). DOI: https://doi.org/10.48550/arXiv.2302.01588

```
@ARTICLE{fangli2023bioformer,
author = {{Fang}, Li and {Chen}, Qingyu and {Wei}, Chih-Hsuan and {Lu}, Zhiyong and {Wang}, Kai},
title = "{Bioformer: an efficient transformer language model for biomedical text mining}",
journal = {arXiv preprint arXiv:2302.01588},
year = {2023}
}
```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/wglab/bioformer

Awesome Lists containing this project

README