{"id":13753213,"url":"https://github.com/ShannonAI/Neural-Semi-Supervised-Learning-for-Text-Classification","last_synced_at":"2025-05-09T20:35:00.432Z","repository":{"id":103223436,"uuid":"313168896","full_name":"ShannonAI/Neural-Semi-Supervised-Learning-for-Text-Classification","owner":"ShannonAI","description":"Semi-supervised Learning for Sentiment Analysis","archived":false,"fork":false,"pushed_at":"2020-11-18T07:07:24.000Z","size":50,"stargazers_count":53,"open_issues_count":3,"forks_count":11,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-11-16T05:32:36.103Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ShannonAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-16T02:17:23.000Z","updated_at":"2024-01-16T05:25:52.000Z","dependencies_parsed_at":"2024-01-20T22:01:40.957Z","dependency_job_id":null,"html_url":"https://github.com/ShannonAI/Neural-Semi-Supervised-Learning-for-Text-Classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FNeural-Semi-Supervised-Learning-for-Text-Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FNeural-Semi-Supervised-Learning-for-Text-Classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FNeural-Semi-Supervised-Learning-for-Text-Classification
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ShannonAI%2FNeural-Semi-Supervised-Learning-for-Text-Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ShannonAI","download_url":"https://codeload.github.com/ShannonAI/Neural-Semi-Supervised-Learning-for-Text-Classification/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253321837,"owners_count":21890476,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:18.419Z","updated_at":"2025-05-09T20:34:55.856Z","avatar_url":"https://github.com/ShannonAI.png","language":"Python","funding_links":[],"categories":["文本分类"],"sub_categories":[],"readme":"# Neural-Semi-supervised-Learning-for-Text-Classification-Under-Large-Scale-Pretraining\nCode, models, and datasets for [《Neural Semi-supervised Learning for Text Classification Under Large-Scale Pretraining》](https://arxiv.org/pdf/2011.08626.pdf).\n\n## Download Models and Dataset\nThe datasets and models are listed below.\n\n- Download 3.4M IMDB movie reviews. Save the data at `[REVIEWS_PATH]`.\nYou can download the dataset [HERE](https://drive.google.com/drive/folders/1YX-CzocJe32DK8j2RBVyhYhxrbgE1l1S?usp=sharing).  \n- Download the vanilla RoBERTa-large model released by HuggingFace. Save the model at `[VANILLA_ROBERTA_LARGE_PATH]`. \nYou can download the model [HERE](https://huggingface.co/roberta-large).  \n- Download the in-domain pretrained models from the paper and save them at `[PRETRAIN_MODELS]`. 
We provide the following three models.\nYou can download them [HERE](https://drive.google.com/drive/folders/1rBjtxVWGlrdEg2XJwBjbPb1Vf2d3Csb9?usp=sharing).\n    - `init-roberta-base`: RoBERTa-base model (U) trained over 3.4M movie reviews from scratch.\n    - `semi-roberta-base`: RoBERTa-base model (Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model [RoBERTa-base model](https://huggingface.co/roberta-base).\n    - `semi-roberta-large`: RoBERTa-large model (Large U + U) trained over 3.4M movie reviews from the open-domain pretrained model [RoBERTa-large model](https://huggingface.co/roberta-large).\n- Download the 1M (D\\` + D) training dataset for the student model. Save the data at `[STUDENT_DATA_PATH]`.\nYou can download it [HERE](https://drive.google.com/drive/folders/1wu76V3LgJIZjNtpfscLVYTvcAJ2RuqJX?usp=sharing).\n    - `student_data_base`: student training data generated by the roberta-base teacher model \n    - `student_data_large`: student training data generated by the roberta-large teacher model \n- Download the IMDB dataset from Andrew Maas' paper. Save the data at `[IMDB_DATA_PATH]`. For IMDB,\nthe training data and test data are saved in two separate files; each line in a file corresponds to one IMDB sample.\nYou can download it [HERE](https://drive.google.com/drive/folders/1zShIK9n3HCZRjfE6311MhZ2Z3Jf1C6x2?usp=sharing).\n- Download shannon_preprocssor.whl to install a binarization tool. Save the .whl file at `[SHANNON_PREPROCESS_WHL_PATH]`.\nYou can download it [HERE](https://drive.google.com/file/d/1wjH7hdSRL_QQj0OouBsN_O8Ng6m8bQiN/view?usp=sharing)\n- Download the teacher model and student model that we trained. 
Save them at `[CHECKPOINTS]`.\nYou can download them [HERE](https://drive.google.com/drive/folders/1eiwS-0620S4H3yZUlrjvNAeze8JWWVu6?usp=sharing)\n    - `roberta-base`: teacher and student model checkpoints for roberta-base \n    - `roberta-large`: teacher and student model checkpoints for roberta-large \n\n## Installation\n`pip install -r requirements.txt`  \n`pip install [SHANNON_PREPROCESS_WHL_PATH]` \n\n## Quick Tour\n\n### train the roberta-large teacher model\nUse the RoBERTa model we pretrained on the 3.4M reviews to train the teacher model.  \nOur teacher model reached an accuracy of 96.2% on the test set.\n```bash\ncd sstc/tasks/semi-roberta\npython trainer.py \\\n--mode train_teacher \\\n--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \\\n--imdb_data_path [IMDB_DATA_PATH]/bin \\\n--gpus=0,1,2,3 \\\n--save_path [ROOT_SAVE_PATH] \\\n--precision 16 \\\n--batch_size 10 \\\n--min_epochs 10 \\\n--patience 3 \\\n--lr 3e-5  \n```\n\n### train the roberta-large student model\nUse the RoBERTa model we pretrained on the 3.4M reviews to train the student model.  
\nOur student model reached an accuracy of 96.8% on the test set.\n```bash\ncd sstc/tasks/semi-roberta\npython trainer.py \\\n--mode train_student \\\n--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \\\n--imdb_data_path [IMDB_DATA_PATH]/bin \\\n--student_data_path [STUDENT_DATA_PATH]/student_data_large/bin \\\n--save_path [ROOT_SAVE_PATH] \\\n--batch_size=10 \\\n--precision 16 \\\n--lr=2e-5 \\\n--warmup_steps 40000 \\\n--gpus=0,1,2,3,4,5,6,7 \\\n--accumulate_grad_batches=50\n```\n\n### evaluate the student model on the test set\nLoad the student model checkpoint and evaluate it on the test set to reproduce our result.\n```bash\ncd sstc/tasks/semi-roberta\npython evaluate.py \\\n--checkpoint_path [CHECKPOINTS]/roberta-large/train_student_checkpoint/***.ckpt \\\n--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \\\n--imdb_data_path [IMDB_DATA_PATH]/bin \\\n--batch_size=10 \\\n--gpus=0,\n```\n\n## Reproduce paper results step by step\n### 1. Train in-domain LM based on RoBERTa\n#### 1.1 binarize 3.4M reviews data  \nModify the shell script to match your paths. 
The resulting binarized data will be saved in `[REVIEWS_PATH]/bin`\n```bash\ncd sstc/tasks/roberta_lm\nbash binarize.sh\n```\n#### 1.2 train RoBERTa-large (or small, as you wish) over 3.4M reviews data\n```bash\ncd sstc/tasks/roberta_lm\npython trainer.py \\\n--roberta_path [VANILLA_ROBERTA_LARGE_PATH] \\\n--data_dir [REVIEWS_PATH]/bin \\\n--gpus=0,1,2,3 \\\n--save_path [PRETRAIN_ROBERTA_CK_PATH] \\\n--val_check_interval 0.1 \\\n--precision 16 \\\n--batch_size 10 \\\n--distributed_backend=ddp \\\n--accumulate_grad_batches=50 \\\n--adam_epsilon 1e-6 \\\n--weight_decay 0.01 \\\n--warmup_steps 10000 \\\n--workers 8 \\\n--lr 2e-5\n```\nTraining checkpoints will be saved in `[PRETRAIN_ROBERTA_CK_PATH]`. \nFind the best checkpoint and convert it to HuggingFace bin format; \nthe relevant code can be found in `sstc/tasks/roberta_lm/trainer.py`.\nSave the pretrained bin model at `[PRETRAIN_MODELS]/semi-roberta-large`, \nor just download the model we trained.\n\n### 2. train the teacher model\n#### 2.1 binarize IMDB dataset.\n```bash\ncd sstc/tasks/semi_roberta/scripts\nbash binarize_imdb.sh\n```\nYou can run the above code to binarize the IMDB data, or you can just use the files we binarized in `[IMDB_DATA_PATH]/bin`  \n#### 2.2 train the teacher model\n```bash\ncd sstc/tasks/semi_roberta\npython trainer.py \\\n--mode train_teacher \\\n--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \\\n--imdb_data_path [IMDB_DATA_PATH]/bin \\\n--gpus=0,1,2,3 \\\n--save_path [ROOT_SAVE_PATH] \\\n--precision 16 \\\n--batch_size 10 \\\n--min_epochs 10 \\\n--patience 3 \\\n--lr 3e-5  \n```\nAfter training, the teacher model checkpoint will be saved in `[ROOT_SAVE_PATH]/train_teacher_checkpoint`. 
\nThe teacher model we trained reached an accuracy of 96.2% on the test set.\nThe download link for the teacher model checkpoint can be found in the Quick Tour section.\n\n### 3. label the unlabeled in-domain data U\n#### 3.1 label 3.4M data\nUse the teacher model that you trained in the previous step to label the 3.4M reviews. \nNote that `[ROOT_SAVE_PATH]` should be the same as in the previous step.\nThe labeled data will be saved in `[ROOT_SAVE_PATH]/predictions`.\n```bash\ncd sstc/tasks/roberta_lm\npython trainer.py \\\n--mode train_teacher \\\n--roberta_path [PRETRAIN_ROBERTA_PATH] \\\n--reviews_data_path [REVIEWS_PATH]/bin \\\n--best_teacher_checkpoint_path [CHECKPOINTS]/roberta-large/train_teacher_checkpoint/***.ckpt \\\n--gpus=0,1,2,3 \\\n--save_path [ROOT_SAVE_PATH] \n```\n#### 3.2 select the top-K data points \nFirst, we randomly sample 3M reviews from the 3.4M reviews as U'. \nThen we select the 1M reviews from U' with the highest scores as D'.\nFinally, we concatenate the IMDB train data (D) and D' to form the training data for the student model.\nThe student training data will be saved in `[ROOT_SAVE_PATH]/student_data/train.txt`,\nor you can use the data we provide in `[STUDENT_DATA_PATH]/student_data_large`\n```bash\ncd sstc/tasks/roberta_lm\npython data_selector.py \\\n--imdb_data_path [IMDB_DATA_PATH] \\\n--save_path [ROOT_SAVE_PATH] \n```\n\n### 4. train the student model\n#### 4.1 binarize the dataset\nYou can use the same script in 3.1 to binarize the student training data in `[ROOT_SAVE_PATH]/student_data/train.txt`\n\n#### 4.2 train the student model \nYou can use the training data we provide in `[STUDENT_DATA_PATH]/student_data_large/bin` or use your own training data in\n`[ROOT_SAVE_PATH]/student_data/bin`. Make sure you set the right `student_data_path`.\n```bash\ncd sstc/tasks/semi-roberta\npython trainer.py \\\n--mode train_student \\\n--roberta_path [PRETRAIN_MODELS]/semi-roberta-large \\\n--imdb_data_path [IMDB_DATA_PATH]/bin \\\n--student_data_path 
[STUDENT_DATA_PATH]/student_data_large/bin \\\n--save_path [ROOT_SAVE_PATH] \\\n--batch_size=10 \\\n--precision 16 \\\n--lr=2e-5 \\\n--warmup_steps 40000 \\\n--gpus=0,1,2,3,4,5,6,7 \\\n--accumulate_grad_batches=50\n```\nAfter training, the student model checkpoint will be saved in `[ROOT_SAVE_PATH]/train_student_checkpoint`. \nThe student model we trained reached an accuracy of 96.6% on the test set.\nThe download link for the student model checkpoint can be found in the Quick Tour section.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShannonAI%2FNeural-Semi-Supervised-Learning-for-Text-Classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FShannonAI%2FNeural-Semi-Supervised-Learning-for-Text-Classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FShannonAI%2FNeural-Semi-Supervised-Learning-for-Text-Classification/lists"}