{"id":13679406,"url":"https://github.com/yumeng5/LOTClass","last_synced_at":"2025-04-29T19:31:15.973Z","repository":{"id":44651283,"uuid":"304092513","full_name":"yumeng5/LOTClass","owner":"yumeng5","description":"[EMNLP 2020] Text Classification Using Label Names Only: A Language Model Self-Training Approach","archived":false,"fork":false,"pushed_at":"2022-02-02T03:11:08.000Z","size":31,"stargazers_count":296,"open_issues_count":3,"forks_count":62,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-11-11T22:35:24.299Z","etag":null,"topics":["language-model","text-classification","weakly-supervised-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yumeng5.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-10-14T17:47:49.000Z","updated_at":"2024-10-17T03:58:06.000Z","dependencies_parsed_at":"2022-07-14T17:00:37.653Z","dependency_job_id":null,"html_url":"https://github.com/yumeng5/LOTClass","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yumeng5%2FLOTClass","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yumeng5%2FLOTClass/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yumeng5%2FLOTClass/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yumeng5%2FLOTClass/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yumeng5","download_url":"https://codeload.github.com/yumeng5/LOTClass/tar.gz/refs/hea
ds/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251569548,"owners_count":21610575,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-model","text-classification","weakly-supervised-learning"],"created_at":"2024-08-02T13:01:05.202Z","updated_at":"2025-04-29T19:31:15.646Z","avatar_url":"https://github.com/yumeng5.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# LOTClass\n\nThe source code used for [**Text Classification Using Label Names Only: A Language Model Self-Training Approach**](https://arxiv.org/abs/2010.07245), published in EMNLP 2020.\n\n## Requirements\n\nAt least one GPU is required to run the code.\n\nBefore running the code, first install the required packages with the following command:\n\n```\n$ pip3 install -r requirements.txt\n```\n\nYou also need to download the stopwords from the NLTK library:\n\n```\nimport nltk\nnltk.download('stopwords')\n```\n\nPython 3.6 or above is strongly recommended; older Python versions might lead to package incompatibility issues.\n\n## Reproducing the Results\n\nWe provide four ```get_data.sh``` scripts under ```datasets``` for downloading the datasets used in the paper, and four training bash scripts, [```agnews.sh```](agnews.sh), [```dbpedia.sh```](dbpedia.sh), [```imdb.sh```](imdb.sh) and [```amazon.sh```](amazon.sh), for running the model on the four datasets.\n\n**Note: Our model does not use training labels; we provide the training/test set ground truth labels only for completeness and evaluation.**\n\nThe 
training bash scripts assume you have two 10GB GPUs. If you have a different number of GPUs, or GPUs of different memory sizes, refer to [the next section](#command-line-arguments) for how to change the following command line arguments appropriately (while keeping other arguments unchanged): ```train_batch_size```, ```accum_steps```, ```eval_batch_size``` and ```gpus```.\n\n## Command Line Arguments\n\nThe meanings of the command line arguments will be displayed upon typing\n```\npython src/train.py -h\n```\nThe following arguments directly affect the performance of the model and need to be set carefully:\n\n* ```train_batch_size```, ```accum_steps```, ```gpus```: These three arguments should be set together. You need to make sure that the **effective training batch size**, calculated as ```train_batch_size * accum_steps * gpus```, is around **128**. For example, if you have 4 GPUs, then you can set ```train_batch_size = 32, accum_steps = 1, gpus = 4```; if you have 1 GPU, then you can set ```train_batch_size = 32, accum_steps = 4, gpus = 1```. If your GPUs have different memory sizes, you might need to change ```train_batch_size``` while adjusting ```accum_steps``` and ```gpus``` at the same time to keep the **effective training batch size** around **128**.\n* ```eval_batch_size```: This argument only affects the speed of the algorithm; use as large an evaluation batch size as your GPUs can hold.\n* ```max_len```: This argument controls the maximum length of documents fed into the model (longer documents will be truncated). Ideally, ```max_len``` should be set to the length of the longest document (```max_len``` cannot be larger than ```512``` under the BERT architecture), but using a larger ```max_len``` also consumes more GPU memory, resulting in a smaller batch size and longer training time. 
Therefore, you can trade model accuracy for faster training by reducing ```max_len```.\n* ```mcp_epochs```, ```self_train_epochs```: They control how many epochs to train the model on the masked category prediction task and the self-training task, respectively. Setting ```mcp_epochs = 3, self_train_epochs = 1``` will be a good starting point for most datasets, but you may increase them if your dataset is small (fewer than ```100,000``` documents).\n\nOther arguments can be kept at their default values.\n\n## Running on New Datasets\n\nTo execute the code on a new dataset, you need to:\n\n1. Create a directory named ```your_dataset``` under ```datasets```.\n2. Prepare a text corpus ```train.txt``` (one document per line) under ```your_dataset``` for training the classification model (no document labels are needed).\n3. Prepare a label name file ```label_names.txt``` under ```your_dataset``` (each line contains the label name of one category; if multiple words are used as the label name of a category, put them on the same line and separate them with whitespace characters).\n4. (Optional) You can choose to provide a test corpus ```test.txt``` (one document per line) with ground truth labels ```test_labels.txt``` (each line contains an integer denoting the category index of the corresponding document; indices start from ```0``` and the order must be consistent with the category order in ```label_names.txt```). If the test corpus is provided, the code will write classification results to ```out.txt``` under ```your_dataset``` once the training is complete. If the ground truth labels of the test corpus are provided, test accuracy will be displayed during self-training, which is useful for hyperparameter tuning and model cherry-picking using a small test set.\n5. Run the code with appropriate command line arguments (we recommend creating a new bash script by referring to the four example scripts).\n6. 
The final trained classification model will be saved as ```final_model.pt``` under ```your_dataset```.\n\n**Note: The code will cache intermediate data and model checkpoints as .pt files under your dataset directory for continued training. If you change your training corpus or label names and re-run the code, you will need to first delete all .pt files to prevent the code from loading old results.**\n\nYou can always refer to the example datasets when preparing your own datasets.\n\n## Citations\n\nPlease cite the following paper if you find the code helpful for your research.\n```\n@inproceedings{meng2020text,\n  title={Text Classification Using Label Names Only: A Language Model Self-Training Approach},\n  author={Meng, Yu and Zhang, Yunyi and Huang, Jiaxin and Xiong, Chenyan and Ji, Heng and Zhang, Chao and Han, Jiawei},\n  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},\n  year={2020},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyumeng5%2FLOTClass","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyumeng5%2FLOTClass","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyumeng5%2FLOTClass/lists"}