{"id":13535296,"url":"https://github.com/charles9n/bert-sklearn","last_synced_at":"2025-04-05T08:03:45.501Z","repository":{"id":42183994,"uuid":"171210574","full_name":"charles9n/bert-sklearn","owner":"charles9n","description":"a sklearn wrapper for Google's BERT model","archived":false,"fork":false,"pushed_at":"2022-10-26T10:35:20.000Z","size":522,"stargazers_count":300,"open_issues_count":13,"forks_count":70,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-03-29T07:01:51.455Z","etag":null,"topics":["bert","conll-2003","language-model","named-entity-recognition","natural-language-processing","ner","nlp","pytorch","scikit-learn","transfer-learning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/charles9n.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-18T03:48:10.000Z","updated_at":"2025-03-14T06:36:36.000Z","dependencies_parsed_at":"2022-08-04T02:00:22.515Z","dependency_job_id":null,"html_url":"https://github.com/charles9n/bert-sklearn","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/charles9n%2Fbert-sklearn","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/charles9n%2Fbert-sklearn/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/charles9n%2Fbert-sklearn/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/charles9n%2Fbert-sklearn/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owner
s/charles9n","download_url":"https://codeload.github.com/charles9n/bert-sklearn/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247305930,"owners_count":20917207,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","conll-2003","language-model","named-entity-recognition","natural-language-processing","ner","nlp","pytorch","scikit-learn","transfer-learning"],"created_at":"2024-08-01T08:00:52.941Z","updated_at":"2025-04-05T08:03:45.481Z","avatar_url":"https://github.com/charles9n.png","language":"Jupyter Notebook","readme":"# scikit-learn wrapper to finetune BERT\n\nA scikit-learn wrapper to finetune [Google's BERT](https://github.com/google-research/bert) model for text and token sequence tasks, based on the [huggingface pytorch](https://github.com/huggingface/pytorch-pretrained-BERT) port.\n\n* Includes a configurable MLP as the final classifier/regressor for text and text pair tasks\n* Includes a token sequence classifier for NER, PoS, and chunking tasks\n* Includes [**`SciBERT`**](https://github.com/allenai/scibert) and [**`BioBERT`**](https://github.com/dmis-lab/biobert) pretrained models for the scientific and biomedical domains\n\nTry it in [Google Colab](https://colab.research.google.com/drive/1-wTNA-qYmOBdSYG7sRhIdOrxcgPpcl6L)!\n\n## installation\n\nRequires Python \u003e= 3.5 and PyTorch \u003e= 0.4.1.\n\n```bash\ngit clone -b master https://github.com/charles9n/bert-sklearn\ncd bert-sklearn\npip install .\n```\n\n## basic operation\n\n**`model.fit(X, y)`** i.e. finetune **`BERT`**\n\n* **`X`**: list, 
pandas dataframe, or numpy array of text, text pairs, or token lists\n\n* **`y`**: list, pandas dataframe, or numpy array of labels/targets\n\n```python3\nfrom bert_sklearn import BertClassifier\nfrom bert_sklearn import BertRegressor\nfrom bert_sklearn import BertTokenClassifier\nfrom bert_sklearn import load_model\n\n# define model\nmodel = BertClassifier()         # text/text pair classification\n# model = BertRegressor()        # text/text pair regression\n# model = BertTokenClassifier()  # token sequence classification\n\n# finetune model\nmodel.fit(X_train, y_train)\n\n# make predictions\ny_pred = model.predict(X_test)\n\n# make probability predictions\ny_prob = model.predict_proba(X_test)\n\n# score model on test data\nmodel.score(X_test, y_test)\n\n# save model to disk\nsavefile = '/data/mymodel.bin'\nmodel.save(savefile)\n\n# load model from disk\nnew_model = load_model(savefile)\n\n# do stuff with new model\nnew_model.score(X_test, y_test)\n```\nSee the [demo](https://github.com/charles9n/bert-sklearn/blob/master/demo.ipynb) notebook.\n\n## model options\n\n```python3\n# try different options...\nmodel.bert_model = 'bert-large-uncased'\nmodel.num_mlp_layers = 3\nmodel.max_seq_length = 196\nmodel.epochs = 4\nmodel.learning_rate = 4e-5\nmodel.gradient_accumulation_steps = 4\n\n# finetune\nmodel.fit(X_train, y_train)\n\n# do stuff...\nmodel.score(X_test, y_test)\n```\nSee [options](https://github.com/charles9n/bert-sklearn/blob/master/Options.md).\n\n## hyperparameter tuning\n\n```python3\nfrom sklearn.model_selection import GridSearchCV\n\nparams = {'epochs': [3, 4], 'learning_rate': [2e-5, 3e-5, 5e-5]}\n\n# wrap classifier in GridSearchCV\nclf = GridSearchCV(BertClassifier(validation_fraction=0),\n                   params,\n                   scoring='accuracy',\n                   verbose=True)\n\n# fit gridsearch\nclf.fit(X_train, y_train)\n```\nSee the [demo_tuning_hyperparameters](https://github.com/charles9n/bert-sklearn/blob/master/demo_tuning_hyperparams.ipynb) notebook.\n\n## GLUE datasets\n\nThe 
train and dev data sets from the [GLUE (General Language Understanding Evaluation)](https://github.com/nyu-mll/GLUE-baselines) benchmark were used with the `bert-base-uncased` model and compared against the results reported in the Google paper and on the [GLUE leaderboard](https://gluebenchmark.com/leaderboard).\n\n|    | MNLI (m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE |\n| - | - | - | - | - | - | - | - | - |\n| BERT base (leaderboard) | 84.6/83.4 | 89.2 | 90.1 | 93.5 | 52.1 | 87.1 | 84.8 | 66.4 |\n| bert-sklearn | 83.7/83.9 | 90.2 | 88.6 | 92.32 | 58.1 | 89.7 | 86.8 | 64.6 |\n\nIndividual runs can be found [here](https://github.com/charles9n/bert-sklearn/tree/master/glue_examples).\n\n## CoNLL-2003 Named Entity Recognition (NER)\n\nNER results for the [**`CoNLL-2003`**](https://www.clips.uantwerpen.be/conll2003/ner/) shared task:\n\n|    | dev f1 | test f1 |\n| - | - | - |\n| BERT paper | 96.4 | 92.4 |\n| bert-sklearn | 96.04 | 91.97 |\n\nSpan level stats on test:\n```bash\nprocessed 46666 tokens with 5648 phrases; found: 5740 phrases; correct: 5173.\naccuracy:  98.15%; precision:  90.12%; recall:  91.59%; FB1:  90.85\n              LOC: precision:  92.24%; recall:  92.69%; FB1:  92.46  1676\n             MISC: precision:  78.07%; recall:  81.62%; FB1:  79.81  734\n              ORG: precision:  87.64%; recall:  90.07%; FB1:  88.84  1707\n              PER: precision:  96.00%; recall:  96.35%; FB1:  96.17  1623\n```\nSee the [ner_english notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_english.ipynb) for a demo using the `'bert-base-cased'` model.\n\n## NCBI Biomedical NER\n\nNER results using bert-sklearn with **`SciBERT`** and **`BioBERT`** on the [**`NCBI Disease Corpus`**](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3951655/) disease name recognition task.\n\nThe previous [SOTA](https://arxiv.org/pdf/1711.07908.pdf) for this task is an f1 of **87.34** on the test set.\n\n|    | test f1 (bert-sklearn) | test f1 (from papers) |\n| - 
| - | - |\n| BERT base cased | 85.09 | 85.49 |\n| SciBERT basevocab cased | 88.29 | 86.91 |\n| SciBERT scivocab cased | 87.73 | 86.45 |\n| BioBERT pubmed_v1.0 | 87.86 | 87.38 |\n| BioBERT pubmed_pmc_v1.0 | 88.26 | 89.36 |\n| BioBERT pubmed_v1.1 | 87.26 | NA |\n\nSee the [ner_NCBI_disease_BioBERT_SciBERT notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_NCBI_disease_BioBERT_SciBERT.ipynb) for a demo using the **`SciBERT`** and **`BioBERT`** models.\n\nSee the [SciBERT paper](https://arxiv.org/pdf/1903.10676.pdf) and [BioBERT paper](https://arxiv.org/pdf/1901.08746.pdf) for more info on the respective models.\n\n## Other examples\n\n* See the [IMDb notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/IMDb.ipynb) for a text classification demo on the Internet Movie Database review sentiment task.\n\n* See the [chunking_english notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/chunker_english.ipynb) for a demo on syntactic chunking using the [**`CoNLL-2000`**](https://www.clips.uantwerpen.be/conll2000/chunking/) chunking task data.\n\n* See the [ner_chinese notebook](https://github.com/charles9n/bert-sklearn/blob/master/other_examples/ner_chinese.ipynb) for a demo using `'bert-base-chinese'` for Chinese NER.\n\n## tests\n\nRun the tests with pytest:\n```bash\npython -m pytest -sv tests/\n```\n\n## references\n\n* [Google `BERT` github](https://github.com/google-research/bert) and [paper: \"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\" (10/2018) by J. Devlin, M. Chang, K. Lee, and K. Toutanova](https://arxiv.org/abs/1810.04805)\n\n* [huggingface `pytorch-pretrained-BERT` github](https://github.com/huggingface/pytorch-pretrained-BERT)\n\n* [`SciBERT` github](https://github.com/allenai/scibert) and [paper: \"SCIBERT: Pretrained Contextualized Embeddings for Scientific Text\" (3/2019) by I. Beltagy, A. Cohan, and K. 
Lo](https://arxiv.org/pdf/1903.10676.pdf)\n\n* [`BioBERT` github](https://github.com/dmis-lab/biobert) and [paper: \"BioBERT: a pre-trained biomedical language representation model for biomedical text mining\" (2/2019) by J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.H. So, and J. Kang](https://arxiv.org/pdf/1901.08746.pdf)\n","funding_links":[],"categories":["BERT language model and embedding:","Uncategorized"],"sub_categories":["Uncategorized"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcharles9n%2Fbert-sklearn","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcharles9n%2Fbert-sklearn","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcharles9n%2Fbert-sklearn/lists"}