{"id":13754385,"url":"https://github.com/clovaai/ssmix","last_synced_at":"2025-10-06T15:30:32.233Z","repository":{"id":63124571,"uuid":"377059164","full_name":"clovaai/ssmix","owner":"clovaai","description":"Official PyTorch Implementation of SSMix (Findings of ACL 2021)","archived":false,"fork":false,"pushed_at":"2021-06-16T01:11:46.000Z","size":137,"stargazers_count":62,"open_issues_count":2,"forks_count":6,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-15T03:54:19.613Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clovaai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-15T06:27:43.000Z","updated_at":"2024-12-11T07:52:34.000Z","dependencies_parsed_at":"2022-11-13T11:01:02.709Z","dependency_job_id":null,"html_url":"https://github.com/clovaai/ssmix","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clovaai%2Fssmix","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clovaai%2Fssmix/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clovaai%2Fssmix/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clovaai%2Fssmix/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clovaai","download_url":"https://codeload.github.com/clovaai/ssmix/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235534197,"owners_count":1
9005468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T09:01:57.811Z","updated_at":"2025-10-06T15:30:26.846Z","avatar_url":"https://github.com/clovaai.png","language":"Python","funding_links":[],"categories":["其他_NLP自然语言处理"],"sub_categories":["其他_文本生成、文本对话"],"readme":"# SSMix: Saliency-based Span Mixup for Text Classification (Findings of ACL 2021)\n\nOfficial PyTorch Implementation of SSMix | [Paper](https://arxiv.org/abs/2106.08062) \n* * *\n\n![SSMix](illustration.png)\n\n## Abstract\nData augmentation with mixup has shown to be effective on various computer vision tasks. \nDespite its great success, there has been a hurdle to apply mixup to NLP tasks since text consists of discrete tokens with variable length. \nIn this work, we propose *SSMix*, a novel mixup method where the operation is performed on input text rather than on hidden vectors like previous approaches. \n*SSMix* synthesizes a sentence while preserving the locality of two original texts by span-based mixing and keeping more tokens related to the prediction relying on saliency information. \nWith extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on the wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification. \n\n\n## Code Structure\n```\n|__ augmentation/ --\u003e augmentation methods by method type\n    |__ __init__.py --\u003e wrapper for all augmentation methods. 
Contains metric used for single \u0026 paired sentence tasks\n    |__ saliency.py --\u003e Calculates saliency by L2 norm gradient backpropagation\n    |__ ssmix.py --\u003e Output ssmix sentence with options such as no span and no saliency given two input sentences with additional information\n    |__ unk.py --\u003e Output randomly replaced unk sentence \n|__ read_data/ --\u003e Module used for loading data\n    |__ __init__.py --\u003e wrapper function for getting data split by train and valid depending on dataset type\n    |__ dataset.py --\u003e Class to get NLU dataset\n    |__ preprocess.py --\u003e preprocessor that makes input, label, and accuracy metric depending on dataset type\n|__ trainer.py --\u003e Code that does actual training \n|__ run_train.py --\u003e Load hyperparameters and initiate the training pipeline\n|__ classifiation_model.py --\u003e Augmented from huggingface modeling_bert.py. Define BERT architectures that can handle multiple inputs for Tmix\n```\nPart of the code is modified from the [MixText](https://github.com/GT-SALT/MixText) implementation.\n\n\n## Getting Started\n```\npip install -r requirements.txt\n```\n\nThe code runs on both CPU and GPU, but we highly recommend running it on a GPU.\nTo execute our code without errors, we recommend strictly following the versions specified in the `requirements.txt` file.\n\n\n## Model Training\n\n```\npython run_train.py --batch_size ${BSZ} --seed ${SEED} --dataset ${DATASET} --optimizer_lr ${LR} ${MODE}\n```\n\nFor all our experiments, we use 32 as the batch size (`BSZ`), and perform five different runs by changing the seed (`SEED`) from 0 to 4.\nWe experiment on a wide range of text classification datasets (`DATASET`): 'sst2', 'qqp', 'mnli', 'qnli', 'rte', 'mrpc', 'trec-coarse', 'trec-fine', 'anli'.\nFor the ANLI dataset, set the `--anli_round` argument to one of 1, 2, 3.\n\nOnce you run the code, trained checkpoints are created under the `checkpoints` directory.\nTo train a model without 
mixup, set `MODE` to 'normal'.\nTo run with mixup approaches including our SSMix, set `MODE` to the name of the mixup method ('ssmix', 'tmix', 'embedmix', 'unk').\nWe load the checkpoint trained without mixup before training with mixup.\nFor the learning rate (`LR`), we use 5e-5 for the normal mode and 1e-5 for mixup methods.\n\nYou can modify the argument values (e.g., `embed_alpha`, `hidden_alpha`) to suit your training setup.\nFor an ablation study of SSMix, you can exclude the saliency constraint (`--ss_no_saliency`) or the span constraint (`--ss_no_span`).\nType `python run_train.py --help` or check `run_train.py` to see the full list of available hyperparameters.\nFor debugging or analysis, you can turn on verbose options (`--verbose` and `--verbose_show_augment_example`).\n\n\n## License\n\n```\nCopyright 2021-present NAVER Corp.\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclovaai%2Fssmix","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclovaai%2Fssmix","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclovaai%2Fssmix/lists"}