{"id":18533659,"url":"https://github.com/lonepatient/bert-multi-label-text-classification","last_synced_at":"2025-05-16T18:06:45.798Z","repository":{"id":37359549,"uuid":"169991845","full_name":"lonePatient/Bert-Multi-Label-Text-Classification","owner":"lonePatient","description":"This repo contains a PyTorch implementation of a pretrained BERT model for multi-label text classification.","archived":false,"fork":false,"pushed_at":"2023-04-18T10:47:17.000Z","size":191,"stargazers_count":907,"open_issues_count":42,"forks_count":208,"subscribers_count":20,"default_branch":"master","last_synced_at":"2025-04-12T16:54:51.945Z","etag":null,"topics":["albert","bert","fine-tuning","multi-label-classification","nlp","pytorch","pytorch-implmention","text-classification","transformers","xlnet"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lonePatient.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-02-10T15:19:42.000Z","updated_at":"2025-04-11T12:48:23.000Z","dependencies_parsed_at":"2024-03-25T00:42:57.232Z","dependency_job_id":null,"html_url":"https://github.com/lonePatient/Bert-Multi-Label-Text-Classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lonePatient%2FBert-Multi-Label-Text-Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lonePatient%2FBert-Multi-Label-Text-Classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/repositories/lonePatient%2FBert-Multi-Label-Text-Classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lonePatient%2FBert-Multi-Label-Text-Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lonePatient","download_url":"https://codeload.github.com/lonePatient/Bert-Multi-Label-Text-Classification/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254582905,"owners_count":22095518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["albert","bert","fine-tuning","multi-label-classification","nlp","pytorch","pytorch-implmention","text-classification","transformers","xlnet"],"created_at":"2024-11-06T19:12:33.498Z","updated_at":"2025-05-16T18:06:45.774Z","avatar_url":"https://github.com/lonePatient.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Bert multi-label text classification by PyTorch\r\n\r\nThis repo contains a PyTorch implementation of the pretrained BERT and XLNET model for multi-label text classification.\r\n\r\n###  Structure of the code\r\n\r\nAt the root of the project, you will see:\r\n\r\n```text\r\n├── pybert\r\n|  └── callback\r\n|  |  └── lrscheduler.py　　\r\n|  |  └── trainingmonitor.py　\r\n|  |  └── ...\r\n|  └── config\r\n|  |  └── basic_config.py #a configuration file for storing model parameters\r\n|  └── dataset　　　\r\n|  └── io　　　　\r\n|  |  └── dataset.py　　\r\n|  |  └── data_transformer.py　　\r\n|  └── model\r\n|  |  └── nn　\r\n|  |  └── pretrain　\r\n|  └── 
output #save the output of the model\r\n|  └── preprocessing #text preprocessing \r\n|  └── train #used for training a model\r\n|  |  └── trainer.py \r\n|  |  └── ...\r\n|  └── common # a set of utility functions\r\n├── run_bert.py\r\n├── run_xlnet.py\r\n```\r\n### Dependencies\r\n\r\n- csv\r\n- tqdm\r\n- numpy\r\n- pickle\r\n- scikit-learn\r\n- PyTorch 1.1+\r\n- matplotlib\r\n- pandas\r\n- transformers=2.5.1\r\n\r\n### How to use the code\r\n\r\nYou need to download the pretrained BERT and XLNet models.\r\n\r\n\u003cdiv class=\"note info\"\u003e\u003cp\u003e BERT:  bert-base-uncased\u003c/p\u003e\u003c/div\u003e\r\n\r\n\u003cdiv class=\"note info\"\u003e\u003cp\u003e XLNET:  xlnet-base-cased\u003c/p\u003e\u003c/div\u003e\r\n\r\n1. Download the Bert pretrained model from [s3](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin) \r\n2. Download the Bert config file from [s3](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json) \r\n3. Download the Bert vocab file from [s3](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt) \r\n4. Rename:\r\n\r\n    - `bert-base-uncased-pytorch_model.bin` to `pytorch_model.bin`\r\n    - `bert-base-uncased-config.json` to `config.json`\r\n    - `bert-base-uncased-vocab.txt` to `bert_vocab.txt`\r\n5. Place the `model`, `config` and `vocab` files into the `/pybert/pretrain/bert/base-uncased` directory.\r\n6. `pip install pytorch-transformers` from [github](https://github.com/huggingface/pytorch-transformers).\r\n7. Download the [kaggle data](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data) and place it in `pybert/dataset`.\r\n    -  You can modify `io.task_data.py` to adapt it to your own data.\r\n8. Modify the configuration in `pybert/configs/basic_config.py` (the path of data,...).\r\n9. Run `python run_bert.py --do_data` to preprocess the data.\r\n10. 
Run `python run_bert.py --do_train --save_best --do_lower_case` to fine-tune the BERT model.\r\n11. Run `python run_bert.py --do_test --do_lower_case` to predict on new data.\r\n\r\n### Training\r\n\r\n```text\r\n[training] 8511/8511 [\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e\u003e] -0.8s/step- loss: 0.0640\r\ntraining result:\r\n[2019-01-14 04:01:05]: bert-multi-label trainer.py[line:176] INFO  \r\nEpoch: 2 - loss: 0.0338 - val_loss: 0.0373 - val_auc: 0.9922\r\n```\r\n### Training figure\r\n\r\n![]( https://lonepatient-1257945978.cos.ap-chengdu.myqcloud.com/20190214210111.png)\r\n\r\n### Result\r\n\r\n```text\r\n---- train report every label -----\r\nLabel: toxic - auc: 0.9903\r\nLabel: severe_toxic - auc: 0.9913\r\nLabel: obscene - auc: 0.9951\r\nLabel: threat - auc: 0.9898\r\nLabel: insult - auc: 0.9911\r\nLabel: identity_hate - auc: 0.9910\r\n---- valid report every label -----\r\nLabel: toxic - auc: 0.9892\r\nLabel: severe_toxic - auc: 0.9911\r\nLabel: obscene - auc: 0.9945\r\nLabel: threat - auc: 0.9955\r\nLabel: insult - auc: 0.9903\r\nLabel: identity_hate - auc: 0.9927\r\n```\r\n\r\n## Tips\r\n\r\n- When converting the TensorFlow checkpoint to PyTorch, choose \"bert_model.ckpt\", not \"bert_model.ckpt.index\", as the input file. Otherwise, the model learns nothing and gives almost the same random outputs for any input. 
This means, in fact, that the real checkpoint weights were never loaded into your model.\r\n- When using multiple GPUs, non-tensor calculations such as accuracy and f1_score are not supported by a `DataParallel` instance.\r\n- As recommended by Jacob Devlin et al. in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf), for fine-tuning tasks the hyperparameters should be set as follows: **Batch_size**: 16 or 32, **learning_rate**: 5e-5, 3e-5 or 2e-5, **num_train_epoch**: 3 or 4.\r\n- The pretrained model limits the input length to at most 512 tokens, the maximum number of position embeddings. The data flows into the model as: Raw_data -\u003e WordPieces -\u003e Model. Note that the WordPiece sequence is generally longer than the raw data, so a safe max length for raw_data is roughly 128 to 256.\r\n- Upon testing, we found that fine-tuning all layers gives much better results than fine-tuning only the last classifier layer. The latter is effectively a feature-based approach.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flonepatient%2Fbert-multi-label-text-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flonepatient%2Fbert-multi-label-text-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flonepatient%2Fbert-multi-label-text-classification/lists"}