{"id":22066389,"url":"https://github.com/jaketae/auto-tagger","last_synced_at":"2025-10-11T16:31:33.558Z","repository":{"id":100150930,"uuid":"319031606","full_name":"jaketae/auto-tagger","owner":"jaketae","description":"Fine-tuning and zero-shot learning with transformers to automatically tag my study blog posts","archived":false,"fork":false,"pushed_at":"2021-01-29T10:25:04.000Z","size":2405,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-04T19:09:02.992Z","etag":null,"topics":["bart","bert","multilabel-classification","nli","nlp","pytorch","roberta","tagging","transformer","zero-shot-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaketae.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-06T12:48:19.000Z","updated_at":"2024-11-05T08:37:31.000Z","dependencies_parsed_at":"2023-04-07T20:46:57.383Z","dependency_job_id":null,"html_url":"https://github.com/jaketae/auto-tagger","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jaketae/auto-tagger","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fauto-tagger","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fauto-tagger/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fauto-tagger/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/jaketae%2Fauto-tagger/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaketae","download_url":"https://codeload.github.com/jaketae/auto-tagger/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaketae%2Fauto-tagger/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279007834,"owners_count":26084368,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bart","bert","multilabel-classification","nli","nlp","pytorch","roberta","tagging","transformer","zero-shot-learning"],"created_at":"2024-11-30T19:27:56.641Z","updated_at":"2025-10-11T16:31:33.222Z","avatar_url":"https://github.com/jaketae.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Auto Tagger\n\nAuto Tagger is a project containing a collection of transformer models that can automatically generate tags for posts on my [study blog](http://jaketae.github.io/).\n\n## Motivation\n\nWhile maintaining my study blog, I realized that tag attribution was a multi-label classification task that could potentially be automated. 
For instance, the standard YAML format for tagging in Jekyll looks as follows:\n\n```\ntags:\n  - deep_learning\n  - pytorch\n```\n\nSince the blog was already maintained through a semi-automated publication pipeline in which posts were converted from Jupyter notebooks to Markdown, this project was conceived as a useful addition to that preexisting workflow, with the goal of automatic blog tag attribution through the use of BERT and other transformer variants.\n\n## Requirements\n\nThe project can be subdivided into two segments. The first segment concerns data collection and preprocessing. This process requires the following dependencies.\n\n```\nbeautifulsoup4==4.9.1\npandas==1.1.3\nrequests==2.24.0\nscikit-learn==0.23.2\ntqdm==4.49.0\n```\n\nThe model experimentation and training portion of the project requires the following:\n\n```\npytorch==1.6.0\ntransformers==3.5.1\n```\n\nAll dependencies are specified in `requirements.txt`.\n\n## Directory\n\nRaw labeled datasets scraped from the website reside in the `./data/` directory. The script also expects a `./checkpoints/` directory to be able to save and load model weights. Below is a directory tree that shows a sample structure.\n\n```\n.\n├── checkpoints\n│   ├── roberta-unfreeze.json\n│   └── roberta-unfreeze.pt\n├── data\n│   ├── all_tags.json\n│   ├── train.csv\n│   ├── train.csv\n│   └── val.csv\n├── dataset.py\n├── eda.ipynb\n├── logs\n├── model.py\n├── requirements.txt\n├── scrape.py\n├── test.py\n├── train.py\n├── zero_shot.py\n└── utils.py\n```\n\n## Methodologies\n\nThe project implements two different approaches to multi-label text classification: fine-tuning pretrained models and [zero-shot learning](https://en.wikipedia.org/wiki/Zero-shot_learning).\n\n### Fine-Tuning\n\nThe repository comes with convenience scripts to allow for fine-tuning, saving, and testing different transformer models. 
\n\n#### Training\n\nThe example below demonstrates how to train a RoBERTa model with minimal custom configurations.\n\n```\npython train.py --model_name=\"roberta-base\" --save_title=\"roberta-unfreeze\" --unfreeze_bert --num_epochs=20 --batch_size=32\n```\n\nThe full list of training arguments is provided below.\n\n```\nusage: train.py [-h]\n                [--model_name {bert-base,distilbert-base,roberta-base,distilroberta-base,allenai/longformer-base-4096}]\n                [--save_title SAVE_TITLE] [--load_title LOAD_TITLE]\n                [--num_epochs NUM_EPOCHS] [--log_interval LOG_INTERVAL]\n                [--batch_size BATCH_SIZE] [--patience PATIENCE]\n                [--max_len MAX_LEN] [--min_len MIN_LEN] [--freeze_bert]\n                [--unfreeze_bert]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --model_name {bert-base,distilbert-base,roberta-base,distilroberta-base,allenai/longformer-base-4096}\n  --save_title SAVE_TITLE\n  --load_title LOAD_TITLE\n  --num_epochs NUM_EPOCHS\n  --log_interval LOG_INTERVAL\n  --batch_size BATCH_SIZE\n  --patience PATIENCE\n  --max_len MAX_LEN     maximum length of each text\n  --min_len MIN_LEN     minimum length of each text\n  --freeze_bert\n  --unfreeze_bert\n```\n\n#### Testing\n\nThe example below demonstrates how to test a RoBERTa model whose weights were saved as ``\"roberta-unfreeze\"``.\n\n```\npython test.py --model_name=\"roberta-base\" --save_title=\"roberta-unfreeze\" --batch_size=32 \n```\n\nThe full list of testing arguments is provided below.\n\n```\nusage: test.py [-h]\n               [--model_name {bert-base,distilbert-base,roberta-base,distilroberta-base,allenai/longformer-base-4096}]\n               [--max_len MAX_LEN] [--min_len MIN_LEN]\n               [--save_title SAVE_TITLE] [--batch_size BATCH_SIZE]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --model_name 
{bert-base,distilbert-base,roberta-base,distilroberta-base,allenai/longformer-base-4096}\n  --max_len MAX_LEN     maximum length of each text\n  --min_len MIN_LEN     minimum length of each text\n  --save_title SAVE_TITLE\n                        title of saved file\n  --batch_size BATCH_SIZE\n```\n\n### Zero-shot Learning\n\nWhile fine-tuning works well, it has a number of clear disadvantages:\n\n* Difficulty of adding new, unseen tags\n* Possibility of catastrophic forgetting during retraining\n* In a multi-class, multi-label setting, too many labels can lead to adverse results\n\nIn short, fine-tuning a model in a supervised context necessarily means that it is difficult to dynamically add or remove dataset labels once the model has been fully trained.\n\nOn the other hand, a zero-shot learner is able to predict labels, even those it has not seen before in training; therefore, labels can be modified dynamically without constraints. Specifically, we utilize the fact that models trained on [NLI tasks](https://microsoft.github.io/nlp-recipes/examples/entailment/) are good at identifying relationships between text pairs composed of a hypothesis and a premise. [Yin et al.](https://arxiv.org/abs/1909.00161) demonstrated that pretrained MNLI models can act as performant out-of-the-box text classifiers. 
We use `transformers.pipeline`, which includes an implementation of this idea.\n\nSince this approach requires no additional training or fine-tuning, inference can be performed off the shelf simply by supplying a `--text` flag to the script.\n\n```\npython zero_shot.py --text \"This is some dummy text\"\n```\n\nThe full list of script arguments is provided below.\n\n```\nusage: zero_shot.py [-h]\n                    [--model_name {facebook/bart-large-mnli,roberta-large-mnli}]\n                    [--text TEXT]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --model_name {facebook/bart-large-mnli,roberta-large-mnli}\n  --text TEXT\n```\n\n## License\n\nReleased under the [MIT License](https://github.com/jaketae/auto-tagger/blob/master/LICENSE).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaketae%2Fauto-tagger","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaketae%2Fauto-tagger","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaketae%2Fauto-tagger/lists"}