{"id":13535030,"url":"https://github.com/allenai/scibert","last_synced_at":"2025-10-13T15:59:43.673Z","repository":{"id":43641422,"uuid":"168397780","full_name":"allenai/scibert","owner":"allenai","description":"A BERT model for scientific text.","archived":false,"fork":false,"pushed_at":"2022-02-22T19:57:07.000Z","size":54349,"stargazers_count":1424,"open_issues_count":61,"forks_count":211,"subscribers_count":50,"default_branch":"master","last_synced_at":"2024-04-14T07:50:02.922Z","etag":null,"topics":["bert","nlp","scientific-papers"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1903.10676","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/allenai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-30T19:00:00.000Z","updated_at":"2024-04-07T15:54:06.000Z","dependencies_parsed_at":"2022-08-12T10:30:57.596Z","dependency_job_id":null,"html_url":"https://github.com/allenai/scibert","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/allenai/scibert","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fscibert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fscibert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fscibert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fscibert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/allenai","download_url":"https://codeload.github.com/allenai/scibert/tar.gz/refs/heads/master","sbom_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/allenai%2Fscibert/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279015953,"owners_count":26085777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-13T02:00:06.723Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","nlp","scientific-papers"],"created_at":"2024-08-01T08:00:48.766Z","updated_at":"2025-10-13T15:59:43.668Z","avatar_url":"https://github.com/allenai.png","language":"Python","readme":"[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/named-entity-recognition-bc5cdr)](https://paperswithcode.com/sota/named-entity-recognition-bc5cdr?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/relation-extraction-chemprot)](https://paperswithcode.com/sota/relation-extraction-chemprot?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/participant-intervention-comparison-outcome)](https://paperswithcode.com/sota/participant-intervention-comparison-outcome?p=scibert-pretrained-contextualized-embeddings)  
\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/named-entity-recognition-ncbi-disease)](https://paperswithcode.com/sota/named-entity-recognition-ncbi-disease?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/sentence-classification-paper-field)](https://paperswithcode.com/sota/sentence-classification-paper-field?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/citation-intent-classification-scicite)](https://paperswithcode.com/sota/citation-intent-classification-scicite?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/sentence-classification-sciencecite)](https://paperswithcode.com/sota/sentence-classification-sciencecite?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/relation-extraction-scierc)](https://paperswithcode.com/sota/relation-extraction-scierc?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/named-entity-recognition-scierc)](https://paperswithcode.com/sota/named-entity-recognition-scierc?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/citation-intent-classification-acl-arc)](https://paperswithcode.com/sota/citation-intent-classification-acl-arc?p=scibert-pretrained-contextualized-embeddings)  
\n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/sentence-classification-acl-arc)](https://paperswithcode.com/sota/sentence-classification-acl-arc?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/dependency-parsing-genia-las)](https://paperswithcode.com/sota/dependency-parsing-genia-las?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/dependency-parsing-genia-uas)](https://paperswithcode.com/sota/dependency-parsing-genia-uas?p=scibert-pretrained-contextualized-embeddings)    \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/named-entity-recognition-jnlpba)](https://paperswithcode.com/sota/named-entity-recognition-jnlpba?p=scibert-pretrained-contextualized-embeddings)   \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/sentence-classification-pubmed-20k-rct)](https://paperswithcode.com/sota/sentence-classification-pubmed-20k-rct?p=scibert-pretrained-contextualized-embeddings)  \n[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/scibert-pretrained-contextualized-embeddings/sentence-classification-scicite)](https://paperswithcode.com/sota/sentence-classification-scicite?p=scibert-pretrained-contextualized-embeddings)\n\n\n# \u003cp align=center\u003e`SciBERT`\u003c/p\u003e\n`SciBERT` is a `BERT` model trained on scientific text.\n\n* `SciBERT` is trained on papers from the corpus of [semanticscholar.org](https://semanticscholar.org). Corpus size is 1.14M papers, 3.1B tokens. 
We use the full text of the papers in training, not just abstracts.\n\n* `SciBERT` has its own vocabulary (`scivocab`) built to best match the training corpus. We trained cased and uncased versions. We also include models trained on the original BERT vocabulary (`basevocab`) for comparison.\n\n* `SciBERT` achieves state-of-the-art performance on a wide range of scientific-domain NLP tasks. The details of the evaluation are in the [paper](https://arxiv.org/abs/1903.10676). Evaluation code and data are included in this repo.\n\n### Downloading Trained Models\nUpdate! SciBERT models are now installable directly through the Hugging Face Transformers library under the `allenai` org:\n```\nfrom transformers import AutoTokenizer, AutoModel\n\ntokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')\nmodel = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')\n\ntokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_cased')\nmodel = AutoModel.from_pretrained('allenai/scibert_scivocab_cased')\n```\n\n------\n\nWe release TensorFlow and PyTorch versions of the trained models. The TensorFlow version is compatible with code that works with the model from [Google Research](https://github.com/google-research/bert). The PyTorch version is created using the [Hugging Face](https://github.com/huggingface/pytorch-pretrained-BERT) library, and this repo shows how to use it in AllenNLP. All combinations of `scivocab` and `basevocab`, `cased` and `uncased` models are available below. 
Our evaluation shows that `scivocab-uncased` usually gives the best results.\n\n#### Tensorflow Models\n* __[`scibert-scivocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_scivocab_uncased.tar.gz) (Recommended)__\n* [`scibert-scivocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_scivocab_cased.tar.gz)\n* [`scibert-basevocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_basevocab_uncased.tar.gz)\n* [`scibert-basevocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/tensorflow_models/scibert_basevocab_cased.tar.gz)\n\n#### PyTorch AllenNLP Models\n* __[`scibert-scivocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_uncased.tar) (Recommended)__\n* [`scibert-scivocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_scivocab_cased.tar)\n* [`scibert-basevocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_basevocab_uncased.tar)\n* [`scibert-basevocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/pytorch_models/scibert_basevocab_cased.tar)\n\n#### PyTorch HuggingFace Models\n* __[`scibert-scivocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_uncased.tar) (Recommended)__\n* [`scibert-scivocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_scivocab_cased.tar)\n* [`scibert-basevocab-uncased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_basevocab_uncased.tar)\n* [`scibert-basevocab-cased`](https://s3-us-west-2.amazonaws.com/ai2-s2-research/scibert/huggingface_pytorch/scibert_basevocab_cased.tar)\n\n### Using SciBERT in your own model\n\nSciBERT models include all necessary files to be plugged in your own 
model and are in the same format as BERT.\nIf you are using TensorFlow, refer to Google's [BERT repo](https://github.com/google-research/bert); if you use PyTorch, refer to [Hugging Face's repo](https://github.com/huggingface/pytorch-pretrained-BERT), where detailed instructions on using BERT models are provided.\n\n### Training new models using AllenNLP\n\nTo run experiments on different tasks and reproduce our results in the [paper](https://arxiv.org/abs/1903.10676), you first need to set up the Python 3.6 environment:\n\n```pip install -r requirements.txt```\n\nwhich will install dependencies such as [AllenNLP](https://github.com/allenai/allennlp/).\n\nUse the `scibert/scripts/train_allennlp_local.sh` script as an example of how to run an experiment (you'll need to modify paths and variable names like `TASK` and `DATASET`).\n\nWe include a broad set of scientific NLP datasets under the `data/` directory across the following tasks. Each task has a sub-directory of available datasets.\n```\n├── ner\n│   ├── JNLPBA\n│   ├── NCBI-disease\n│   ├── bc5cdr\n│   └── sciie\n├── parsing\n│   └── genia\n├── pico\n│   └── ebmnlp\n└── text_classification\n    ├── chemprot\n    ├── citation_intent\n    ├── mag\n    ├── rct-20k\n    ├── sci-cite\n    └── sciie-relation-extraction\n```\n\nFor example, to run the model on the Named Entity Recognition (`NER`) task with the `BC5CDR` dataset (BioCreative V CDR), modify the `scibert/scripts/train_allennlp_local.sh` script according to:\n```\nDATASET='bc5cdr'\nTASK='ner'\n...\n```\n\nDecompress the PyTorch model that you downloaded using  \n`tar -xvf scibert_scivocab_uncased.tar`  \nThe result is a `scibert_scivocab_uncased` directory containing two files: a vocabulary file (`vocab.txt`) and a weights file (`weights.tar.gz`).\nCopy the files to your desired location and then set the correct paths for `BERT_WEIGHTS` and `BERT_VOCAB` in the script:\n```\nexport BERT_VOCAB=path-to/scibert_scivocab_uncased.vocab\nexport 
BERT_WEIGHTS=path-to/scibert_scivocab_uncased.tar.gz\n```\n\nFinally, run the script:\n\n```\n./scibert/scripts/train_allennlp_local.sh [serialization-directory]\n```\n\nHere, `[serialization-directory]` is the path to an output directory where the model files will be stored.\n\n### Citing\n\nIf you use `SciBERT` in your research, please cite [SciBERT: Pretrained Language Model for Scientific Text](https://arxiv.org/abs/1903.10676).\n```\n@inproceedings{Beltagy2019SciBERT,\n  title={SciBERT: Pretrained Language Model for Scientific Text},\n  author={Iz Beltagy and Kyle Lo and Arman Cohan},\n  year={2019},\n  booktitle={EMNLP},\n  eprint={arXiv:1903.10676}\n}\n```\n\n`SciBERT` is an open-source project developed by [the Allen Institute for Artificial Intelligence (AI2)](http://www.allenai.org).\nAI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.\n","funding_links":[],"categories":["domain specific BERT:","Python","Uncategorized","BERT优化","Language Processing and Information Extraction","Techniques and Models","🧪 Scientific Pretraining, SFT, Reasoning, and Agent Datasets"],"sub_categories":["Uncategorized","BERT models","🔭 General Science"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallenai%2Fscibert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fallenai%2Fscibert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fallenai%2Fscibert/lists"}