{"id":21923116,"url":"https://github.com/26hzhang/bert_classification","last_synced_at":"2025-10-12T13:43:32.832Z","repository":{"id":92547379,"uuid":"184044829","full_name":"26hzhang/bert_classification","owner":"26hzhang","description":"Token and Sentence Level Classification with Google's BERT (TensorFlow)","archived":false,"fork":false,"pushed_at":"2019-07-11T01:28:43.000Z","size":27736,"stargazers_count":10,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-29T09:21:53.935Z","etag":null,"topics":["bert-bilstm-crf","bert-embeddings","bert-model","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/26hzhang.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-04-29T09:51:50.000Z","updated_at":"2025-01-03T03:56:50.000Z","dependencies_parsed_at":"2023-03-07T02:15:36.839Z","dependency_job_id":null,"html_url":"https://github.com/26hzhang/bert_classification","commit_stats":null,"previous_names":["26hzhang/bert_classification"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/26hzhang%2Fbert_classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/26hzhang%2Fbert_classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/26hzhang%2Fbert_classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/26hzhang%2Fbert_classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/o
wners/26hzhang","download_url":"https://codeload.github.com/26hzhang/bert_classification/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249725131,"owners_count":21316123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert-bilstm-crf","bert-embeddings","bert-model","tensorflow"],"created_at":"2024-11-28T21:09:12.094Z","updated_at":"2025-10-12T13:43:27.795Z","avatar_url":"https://github.com/26hzhang.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BERT Classification\n\nUse google BERT (tensorflow-based) to do token-level and sentence-level classification.\n\n## Requirements\n- tensorflow\u003e=1.11.0 (or tensorflow-gpu\u003e=1.11.0)\n- numpy\u003e=1.14.4\n- official tensorflow based bert code, get the code [`https://github.com/google-research/bert.git`](\nhttps://github.com/google-research/bert.git) and place it under this repository.\n- pre-trained bert models (according to the tasks), download and place to the `checkpoint/` directory.\n\n```\nbert_classification/\n    |____ bert/\n    |____ bert_ckpt/\n    |____ checkpoint/\n    |____ datasets/\n    |____ .gitignore\n    |____ conlleval.pl\n    |____ data_cls_helper.py\n    |____ data_seq_helper.py\n    |____ README.md\n    |____ run_sequence_tagger.py\n    |____ run_text_classifier.py\n```\n\n## Dataset Overview\n\n**Token level classification datasets (POS, Chunk and NER)**:\n\nDataset | Language | Classes | Training tokens | Dev tokens | Test tokens\n:---: | :---: | :---: | :---: | :---: | :---:\nCoNLL2000 
Chunk (en) | English | 23 | 211,727 | _N.A._ | 47,377\nCoNLL2002 NER (es) | Spanish | 9 | 207,484 (18,797) | 51,645 (4,351) | 52,098 (3,558)\nCoNLL2002 NER (nl) | Dutch | 9 | 202,931 (13,344) | 37,761 (2,616) | 68,994 (3,941)\nCoNLL2003 NER (en) | English | 9 | 204,567 (23,499) | 51,578 (5,942) | 46,666 (5,648)\nCoNLL2003 NER (de 1) | German | 9 | 208,836 (16,839) | 51,444 (6,588) | 51,943 (5,171)\nGermEval2014 NER (de 2) | German | 25 | 452,853 (42,089) | 41,653 (3,960) | 96,499 (8,969)\nChinese NER 1 (zh 1) | Chinese | 21 | 1,044,967 (311,637) | 86,454 (24,444) | 119,467 (38,854)\nChinese NER 2 (zh 2) | Chinese | 7 | 979,180 (110,093) | 109,870 (12,059) | 219,197 (25,012)\n\n\u003e All lines in these datasets are converted to `(word, label)` pairs with `\\t` as the separator; lines that\nstart with `-DOCSTART-` and other undesired lines are dropped. Labels are in BIO2 format (Begin, Inside, Outside).\n\n**Sentence level classification datasets**:\n\nDataset | Classes | Average sentence length | Train size | Dev size | Test size\n:---: | :---: | :---: | :---: | :---: | :---:\nCR | 2 | 19 | 3,395 | _N.A._ | 377\nMR | 2 | 20 | 9,595 | _N.A._ | 1,066\nSST2 | 2 | 11 | 67,349 | 872 | 1,821\nSST5 | 5 | 18 | 8,544 | 1,101 | 2,210\nSUBJ | 2 | 23 | 9,000 | _N.A._ | 1,000\nTREC | 6 | 10 | 5,452 | _N.A._ | 500\n\n\u003e All the datasets are converted to `utf-8` format via `iconv -f \u003csrc format\u003e -t utf-8 filename -o save_name`. For the \n_SUBJ_, _MR_ and _CR_ datasets, `90%` is used for training and `10%` for testing, and the dev set is a duplicate of the test set. 
\nFor the _TREC_ dataset, the dev set is a duplicate of the test set.\n\n**Natural language inference (sentence pair classification) datasets**:\n\nDataset | Classes | Train size | Dev size | Test size\n:---: | :---: | :---: | :---: | :---:\nMRPC | 2 | 4,077 | 1,726 | 1,726\nSICK | 3 | 4,501 | 501 | 4,928\nSNLI | 3 | 549,367 | 9,842 | 9,824\nCoLA | 2 | 8,551 | 527 | 516\n\n\u003e The [_MNLI_](https://www.nyu.edu/projects/bowman/multinli/) and [_XNLI_](\nhttps://www.nyu.edu/projects/bowman/xnli/) datasets are already supported by the official BERT code; see \n`run_classifier.py` in [[google-research/bert]](https://github.com/google-research/bert).\n\n## Usage\nFor token-level classification, run:\n```bash\n# Flags are kept in an array so that each one can carry an inline comment.\nargs=(\n    --task_name ner  # task name\n    --data_dir datasets/CoNLL2003_en  # dataset folder\n    --output_dir checkpoint/conll2003_en  # path to save outputs and trained params\n    --bert_config_file bert_ckpt/cased_L-12_H-768_A-12/bert_config.json  # pre-trained BERT configs\n    --init_checkpoint bert_ckpt/cased_L-12_H-768_A-12/bert_model.ckpt  # pre-trained BERT params\n    --vocab_file bert_ckpt/cased_L-12_H-768_A-12/vocab.txt  # BERT vocab file\n    --do_lower_case False  # whether to lowercase the input tokens\n    --max_seq_length 128  # maximum sequence length allowed\n    --do_train True  # whether to train\n    --do_eval True  # whether to evaluate\n    --do_predict True  # whether to predict\n    --batch_size 32  # batch size, change to `16` if OOM happens\n    --num_train_epochs 6  # number of epochs\n    --use_crf True  # whether to use CRF for decoding\n)\npython3 run_sequence_tagger.py \"${args[@]}\"\n```\n\nThe 
token-level classification model has two variants: one decodes with a CRF, while the other uses a classifier \ndirectly. The output sequence of the BERT model is first fed into a dense layer and then decoded by the CRF or classifier; no \nintermediate RNN layers are used.\n\nFor sentence-level classification, run:\n```bash\n# Flags are kept in an array so that each one can carry an inline comment.\nargs=(\n    --task_name mrpc  # task name\n    --data_dir datasets/MRPC  # dataset folder\n    --output_dir checkpoint/mrpc  # path to save outputs and trained params\n    --bert_config_file bert_ckpt/uncased_L-12_H-768_A-12/bert_config.json  # pre-trained BERT configs\n    --init_checkpoint bert_ckpt/uncased_L-12_H-768_A-12/bert_model.ckpt  # pre-trained BERT params\n    --vocab_file bert_ckpt/uncased_L-12_H-768_A-12/vocab.txt  # BERT vocab file\n    --do_lower_case True  # whether to lowercase the input tokens\n    --max_seq_length 128  # maximum sequence length allowed\n    --do_train True  # whether to train\n    --do_eval True  # whether to evaluate\n    --do_predict True  # whether to predict\n    --batch_size 32  # batch size, change to `16` if OOM happens\n    --num_train_epochs 6  # number of epochs\n)\npython3 run_text_classifier.py \"${args[@]}\"\n```\n\nThe sentence-level classification model takes the pooled output of the BERT model and feeds it directly into a \nclassifier.\n\n## Experiment Results\n\n\u003e All experiments were run on `1` GeForce GTX 1080 Ti GPU.\n\n**Token level classification datasets**\n\nDataset | en Chunk | es NER | nl NER | en NER | de NER 1 | de NER 2 | zh NER 1 | zh NER 2\n:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:\nPrecision (%) | 96.8 | 89.0 | 89.8 | 92.0 | 82.0 | 86.2 
| 77.9 | 95.7\nRecall (%) | 96.4 | 88.6 | 90.0 | 90.8 | 86.4 | 85.4 | 73.1 | 95.7\nF1 (%) | 96.6 | 88.8 | 89.9 | 91.4 | 84.2 | 85.8 | 75.5 | 95.7\n\n\u003e CoNLL2002 Spanish/Dutch, CoNLL2003 German NER and GermEval2014 German NER use the [`multi_cased_L-12_H-768_A-12.zip`](\nhttps://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip) pre-trained model (base, \nmultilingual, cased).\n\n\u003e CoNLL2000 Chunk and CoNLL2003 English NER use the [`cased_L-12_H-768_A-12.zip`](\nhttps://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip) pre-trained model (base, English, \ncased).\n\n\u003e Chinese NER uses the [`chinese_L-12_H-768_A-12.zip`](\nhttps://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) pre-trained model (base, Chinese).\n\nThe test results on CoNLL-2003 English NER are lower than the score reported in the [paper](\nhttps://arxiv.org/pdf/1810.04805.pdf) (`91.4%` vs. `92.4%`). The paper says a `0.2%` difference is reasonable, \nbut I got a `1.0%` gap. 
I suspect some tricks are missing, for example the parameter settings of the \noutput classifier or the data pre-processing strategies.\n\n**Sentence level classification datasets**\n\nDataset | CR | MR | SST2 | SST5 | SUBJ | TREC\n:---: | :---: | :---: | :---: | :---: | :---: | :---:\nDev Accuracy (%) | _N.A._ | _N.A._ | 91.3 | 50.1 | _N.A._ | _N.A._\nTest Accuracy (%) | 89.2 | 85.4 | 93.5 | 53.3 | 97.3 | 96.6\n\n\u003e All the tasks use the [`uncased_L-12_H-768_A-12.zip`](\nhttps://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) pre-trained model (base, English, \nuncased).\n\n**Natural language inference datasets**\n\nDataset | MRPC | SICK | SNLI | CoLA\n:---: | :---: | :---: | :---: | :---:\nDev Accuracy (%) | _N.A._ | 86.4 | 91.1 | 83.1\nTest Accuracy (%) | 84.7 | 87.0 | 90.7 | 78.9\n\n\u003e All the tasks use the [`uncased_L-12_H-768_A-12.zip`](\nhttps://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip) pre-trained model (base, English, \nuncased). 
\n\nThe results may differ from the reported results, since I do not use the _GLUE version_ of the datasets.\n\n## Reference\n- [[google-research/bert]](https://github.com/google-research/bert).\n- [[macanv/BERT-BiLSTM-CRF-NER]](https://github.com/macanv/BERT-BiLSTM-CRF-NER).\n- [[Kyubyong/bert_ner]](https://github.com/Kyubyong/bert_ner).\n- [[kyzhouhzau/BERT-NER]](https://github.com/kyzhouhzau/BERT-NER).\n- Natural language inference datasets are obtained from [[nyu-mll/GLUE-baselines]](\nhttps://github.com/nyu-mll/GLUE-baselines); _MRPC_ data can be downloaded [[here]](\nhttps://github.com/jaisong87/prDetect/tree/master/Preprocess).\n- Sentence level classification datasets are obtained from [[facebookresearch/SentEval]](\nhttps://github.com/facebookresearch/SentEval).\n- Chinese NER 1 is obtained from [[lancopku/Chinese-Literature-NER-RE-Dataset]](\nhttps://github.com/lancopku/Chinese-Literature-NER-RE-Dataset).\n- Chinese NER 2 is obtained from [[zjy-ucas/ChineseNER]](https://github.com/zjy-ucas/ChineseNER).\n- CoNLL-2003 German NER dataset is obtained from [[MaviccPRP/ger_ner_evals]](https://github.com/MaviccPRP/ger_ner_evals).\n- [GermEval 2014 Named Entity Recognition Shared Task](https://sites.google.com/site/germeval2014ner/data).\n- CoNLL-2000 Chunk and CoNLL-2002 NER datasets are obtained from [[teropa/nlp]](https://github.com/teropa/nlp).\n- CoNLL-2003 English NER dataset is obtained from [[synalp/NER/corpus/CoNLL-2003]](\nhttps://github.com/synalp/NER/tree/master/corpus/CoNLL-2003).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F26hzhang%2Fbert_classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2F26hzhang%2Fbert_classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2F26hzhang%2Fbert_classification/lists"}