https://github.com/allenai/scibert
A BERT model for scientific text.
- Host: GitHub
- URL: https://github.com/allenai/scibert
- Owner: allenai
- License: apache-2.0
- Created: 2019-01-30T19:00:00.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2022-02-22T19:57:07.000Z (almost 4 years ago)
- Last Synced: 2024-04-14T07:50:02.922Z (almost 2 years ago)
- Topics: bert, nlp, scientific-papers
- Language: Python
- Homepage: https://arxiv.org/abs/1903.10676
- Size: 51.8 MB
- Stars: 1,424
- Watchers: 50
- Forks: 211
- Open Issues: 61
- Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-bert - allenai/scibert
- StarryDivineSky - allenai/scibert
- awesome-scholarly-data-analysis - SciBERT - Bert LM for Biomedical and CS papers
- awesome-bioie - SciBERT - [paper](https://arxiv.org/abs/1903.10676) - A BERT model trained on >1M papers from the Semantic Scholar database. (Techniques and Models / BERT models)
- awesome-nlp-resource - SciBERT
- Awesome-Scientific-Datasets-and-LLMs - SciBERT - training | Raw text | 2019.09 | EN | Academic and research resources | Automated | N/A | Data generation | Crawlers, text processing tools | 3.3B (tokens) | (Scientific Pretraining, SFT, Reasoning, and Agent Datasets / General Science)
README
_[Papers with Code leaderboard badges: named entity recognition (BC5CDR, NCBI-disease, SciERC, JNLPBA), relation extraction (ChemProt, SciERC), PICO extraction, citation intent classification (SciCite, ACL-ARC), sentence classification (Paper Field, SciCite, ACL-ARC, PubMed 20k RCT), and dependency parsing (GENIA LAS/UAS).]_
# SciBERT
`SciBERT` is a `BERT` model trained on scientific text.
* `SciBERT` is trained on papers from the corpus of [semanticscholar.org](https://semanticscholar.org). Corpus size is 1.14M papers, 3.1B tokens. We use the full text of the papers in training, not just abstracts.
* `SciBERT` has its own vocabulary (`scivocab`) that is built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (`basevocab`) for comparison (see the tokenization sketch after this list).
* `SciBERT` achieves state-of-the-art performance on a wide range of scientific-domain NLP tasks. The details of the evaluation are in the [paper](https://arxiv.org/abs/1903.10676). Evaluation code and data are included in this repo.
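As a rough illustration of the vocabulary difference, the following sketch (not part of the repo; the example word and the use of `bert-base-uncased` as a stand-in for `basevocab` are assumptions) compares how the two tokenizers split a scientific term:
```
from transformers import AutoTokenizer

# SciBERT's scivocab tokenizer vs. the original BERT vocabulary.
sci_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
base_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in for basevocab

term = "immunohistochemistry"  # arbitrary scientific term chosen for illustration
print("scivocab :", sci_tok.tokenize(term))   # typically fewer wordpieces
print("basevocab:", base_tok.tokenize(term))  # typically split into more wordpieces
```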
### Downloading Trained Models
Update! SciBERT models are now installable directly within the Hugging Face `transformers` framework under the `allenai` organization:
```
from transformers import AutoTokenizer, AutoModel

# Uncased model with the SciBERT vocabulary (recommended)
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')

# Cased variant
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')
```
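For reference, a minimal usage sketch (the example sentence is arbitrary and not from the repo) that extracts contextual embeddings with the uncased model:
```
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')
model.eval()

# Tokenize a sentence and run it through SciBERT to get contextual embeddings.
inputs = tokenizer("Aspirin irreversibly inhibits cyclooxygenase.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state  # shape: (1, num_wordpieces, 768)
print(token_embeddings.shape)
```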
------
We release both the TensorFlow and the PyTorch versions of the trained models. The TensorFlow version is compatible with code that works with the model from [Google Research](https://github.com/google-research/bert). The PyTorch version is created using the [Hugging Face](https://github.com/huggingface/pytorch-pretrained-BERT) library, and this repo shows how to use it in AllenNLP. All combinations of `scivocab` and `basevocab`, `cased` and `uncased` models are available below. Our evaluation shows that `scivocab-uncased` usually gives the best results.
#### Tensorflow Models
* __[`scibert-scivocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_scivocab_uncased.tar.gz) (Recommended)__
* [`scibert-scivocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_scivocab_cased.tar.gz)
* [`scibert-basevocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_basevocab_uncased.tar.gz)
* [`scibert-basevocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_basevocab_cased.tar.gz)
#### PyTorch AllenNLP Models
* __[`scibert-scivocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar) (Recommended)__
* [`scibert-scivocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_cased.tar)
* [`scibert-basevocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_basevocab_uncased.tar)
* [`scibert-basevocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_basevocab_cased.tar)
#### PyTorch HuggingFace Models
* __[`scibert-scivocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_uncased.tar) (Recommended)__
* [`scibert-scivocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_cased.tar)
* [`scibert-basevocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_basevocab_uncased.tar)
* [`scibert-basevocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_basevocab_cased.tar)
### Using SciBERT in your own model
SciBERT models include all the files needed to plug them into your own model, and they are in the same format as BERT.
If you are using TensorFlow, refer to Google's [BERT repo](https://github.com/google-research/bert); if you use PyTorch, refer to [Hugging Face's repo](https://github.com/huggingface/pytorch-pretrained-BERT), where detailed instructions on using BERT models are provided.
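If you are working from the downloaded HuggingFace PyTorch tarballs rather than the Hub, a loading sketch might look like the following (hedged: the local path is a placeholder, and the extracted directory is assumed to contain the standard BERT files such as the config, weights, and vocab):
```
from transformers import BertTokenizer, BertModel

# Placeholder path: wherever you extracted scibert_scivocab_uncased.tar
model_dir = "path-to/scibert_scivocab_uncased"

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir)
```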
### Training new models using AllenNLP
To run experiments on different tasks and reproduce our results in the [paper](https://arxiv.org/abs/1903.10676), you first need to set up the Python 3.6 environment:
```pip install -r requirements.txt```
which will install dependencies like [AllenNLP](https://github.com/allenai/allennlp/).
Use the `scibert/scripts/train_allennlp_local.sh` script as an example of how to run an experiment (you'll need to modify paths and variable names like `TASK` and `DATASET`).
We include a broad set of scientific NLP datasets under the `data/` directory, organized by task; each task has a sub-directory of available datasets.
```
├── ner
│   ├── JNLPBA
│   ├── NCBI-disease
│   ├── bc5cdr
│   └── sciie
├── parsing
│   └── genia
├── pico
│   └── ebmnlp
└── text_classification
    ├── chemprot
    ├── citation_intent
    ├── mag
    ├── rct-20k
    ├── sci-cite
    └── sciie-relation-extraction
```
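As a small convenience sketch (assuming the repo is checked out and the script is run from its root), the `(TASK, DATASET)` pairs used by the training script can be enumerated directly from this layout:
```
from pathlib import Path

# Walk data/<task>/<dataset> and print the combinations available for training.
data_root = Path("data")
for task_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
    for dataset_dir in sorted(p for p in task_dir.iterdir() if p.is_dir()):
        print(f"TASK={task_dir.name}  DATASET={dataset_dir.name}")
```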
For example, to run the model on the Named Entity Recognition (`NER`) task and on the `BC5CDR` dataset (BioCreative V CDR), modify the `scibert/scripts/train_allennlp_local.sh` script as follows:
```
DATASET='bc5cdr'
TASK='ner'
...
```
Decompress the PyTorch model that you downloaded using
`tar -xvf scibert_scivocab_uncased.tar`
The result is a `scibert_scivocab_uncased` directory containing two files: a vocabulary file (`vocab.txt`) and a weights file (`weights.tar.gz`).
Copy the files to your desired location and then set correct paths for `BERT_WEIGHTS` and `BERT_VOCAB` in the script:
```
export BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab
export BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz
```
Finally, run the script:
```
./scibert/scripts/train_allennlp_local.sh [serialization-directory]
```
Here `[serialization-directory]` is the path to an output directory where the model files will be stored.
### Citing
If you use `SciBERT` in your research, please cite [SciBERT: Pretrained Language Model for Scientific Text](https://arxiv.org/abs/1903.10676).
```
@inproceedings{Beltagy2019SciBERT,
title={SciBERT: Pretrained Language Model for Scientific Text},
author={Iz Beltagy and Kyle Lo and Arman Cohan},
year={2019},
booktitle={EMNLP},
Eprint={arXiv:1903.10676}
}
```
`SciBERT` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.