{"id":13564686,"url":"https://github.com/google-research/uda","last_synced_at":"2025-05-15T15:07:07.019Z","repository":{"id":46331060,"uuid":"192635589","full_name":"google-research/uda","owner":"google-research","description":"Unsupervised Data Augmentation (UDA)","archived":false,"fork":false,"pushed_at":"2021-08-28T07:16:56.000Z","size":342,"stargazers_count":2188,"open_issues_count":70,"forks_count":313,"subscribers_count":44,"default_branch":"master","last_synced_at":"2025-04-07T20:11:14.587Z","etag":null,"topics":["computer-vision","cv","natural-language-processing","nlp","semi-supervised-learning","tensorflow"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/1904.12848","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google-research.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-19T01:21:45.000Z","updated_at":"2025-03-25T14:56:02.000Z","dependencies_parsed_at":"2022-08-12T12:50:43.081Z","dependency_job_id":null,"html_url":"https://github.com/google-research/uda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fuda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fuda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fuda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google-research%2Fuda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google-research","download_
url":"https://codeload.github.com/google-research/uda/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254364270,"owners_count":22058878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computer-vision","cv","natural-language-processing","nlp","semi-supervised-learning","tensorflow"],"created_at":"2024-08-01T13:01:34.418Z","updated_at":"2025-05-15T15:07:02.004Z","avatar_url":"https://github.com/google-research.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Unsupervised Data Augmentation\n\n## Overview\n\nUnsupervised Data Augmentation or UDA is a semi-supervised learning method which\nachieves state-of-the-art results on a wide variety of language and vision\ntasks.\n\nWith only 20 labeled examples, UDA outperforms the previous state-of-the-art on\nIMDb trained on 25,000 labeled examples.\n\nModel                  | Number of labeled examples | Error rate\n---------------------- | :------------------------: | :--------:\nMixed VAT (Prev. SOTA) | 25,000                     | 4.32\nBERT                   | 25,000                     | 4.51\nUDA                    | **20**                     | **4.20**\n\nIt reduces more than 30% of the error rate of state-of-the-art methods on\nCIFAR-10 with 4,000 labeled examples and SVHN with 1,000 labeled examples:\n\nModel            | CIFAR-10     | SVHN\n---------------- | :----------: | :----------:\nICT (Prev. 
SOTA) | 7.66±.17     | 3.53±.07\nUDA              | **4.31±.08** | **2.28±.10**\n\nIt leads to significant improvements on ImageNet with 10% labeled data.\n\nModel     | top-1 accuracy | top-5 accuracy\n--------- | :------------: | :------------:\nResNet-50 | 55.09          | 77.26\nUDA       | **68.78**      | **88.80**\n\n## How it works\n\nUDA is a method of *semi-supervised learning*, that reduces the need for labeled\nexamples and better utilizes unlabeled ones.\n\n## What we are releasing\n\nWe are releasing the following:\n\n*   Code for text classifications based on BERT.\n*   Code for image classifications on CIFAR-10 and SVHN.\n*   Code and checkpoints for our back translation augmentation system.\n\nAll of the code in this repository works out-of-the-box with GPU and Google\nCloud TPU.\n\n## Requirements\n\nThe code is tested on Python 2.7 and Tensorflow 1.13. After installing\nTensorflow, run the following command to install dependencies:\n```shell\npip install --user absl-py\n```\n\n## Image classification\n\n### Preprocessing\n\nWe generate 100 augmented examples for every original example. To download all\nthe augmented data, go to the *image* directory and run\n\n```shell\nAUG_COPY=100\nbash scripts/download_cifar10.sh ${AUG_COPY}\n```\n\nNote that you need 120G disk space for all the augmented data. 
To save space,\nyou can set AUG_COPY to a smaller number such as 30.\n\nAlternatively, you can generate the augmented examples yourself by running\n\n```shell\nAUG_COPY=100\nbash scripts/preprocess.sh --aug_copy=${AUG_COPY}\n```\n\n### CIFAR-10 with 250, 500, 1000, 2000, 4000 examples on GPUs\n\nGPU command:\n\n```shell\n# UDA accuracy:\n# 4000: 95.68 +- 0.08\n# 2000: 95.27 +- 0.14\n# 1000: 95.25 +- 0.10\n# 500: 95.20 +- 0.09\n# 250: 94.57 +- 0.96\nbash scripts/run_cifar10_gpu.sh --aug_copy=${AUG_COPY}\n```\n\n### SVHN with 250, 500, 1000, 2000, 4000 examples on GPUs\n\n\n```shell\n# UDA accuracy:\n# 4000: 97.72 +- 0.10\n# 2000: 97.80 +- 0.06\n# 1000: 97.77 +- 0.07\n# 500: 97.73 +- 0.09\n# 250: 97.28 +- 0.40\n\nbash scripts/run_svhn_gpu.sh --aug_copy=${AUG_COPY}\n```\n\n## Text classification\n\n### Run on GPUs\n\n#### Memory issues\n\nThe movie review texts in IMDb are longer than those in many classification\ntasks, so using a longer sequence length leads to better performance. The sequence\nlengths are limited by the TPU/GPU memory when using BERT (see the\n[Out-of-memory issues of BERT](https://github.com/google-research/bert#out-of-memory-issues)).\nAs such, we provide scripts to run with shorter sequence lengths and smaller\nbatch sizes.\n\n#### Instructions\n\nIf you want to run UDA with BERT base on a GPU with 11 GB memory, go to the\n*text* directory and run the following commands:\n\n```shell\n# Set a larger max_seq_length if your GPU has more than 11GB of memory\nMAX_SEQ_LENGTH=128\n\n# Download data and pretrained BERT checkpoints\nbash scripts/download.sh\n\n# Preprocessing\nbash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}\n\n# Baseline accuracy: around 68%\nbash scripts/run_base.sh --max_seq_length=${MAX_SEQ_LENGTH}\n\n# UDA accuracy: around 90%\n# Set a larger train_batch_size to achieve better performance if your GPU has more memory.\nbash scripts/run_base_uda.sh --train_batch_size=8 --max_seq_length=${MAX_SEQ_LENGTH}\n\n```\n\n### Run on Cloud 
TPU v3-32 Pod to achieve SOTA performance\n\nThe best performance in the paper is achieved by using a max_seq_length of 512\nand initializing with BERT large finetuned on in-domain unsupervised data. If\nyou have access to Google Cloud TPU v3-32 Pod, try:\n\n```shell\nMAX_SEQ_LENGTH=512\n\n# Download data and pretrained BERT checkpoints\nbash scripts/download.sh\n\n# Preprocessing\nbash scripts/prepro.sh --max_seq_length=${MAX_SEQ_LENGTH}\n\n# UDA accuracy: 95.3% - 95.9%\nbash train_large_ft_uda_tpu.sh\n```\n\n## Run back translation data augmentation for your dataset\n\nFirst, install the following dependencies:\n\n```shell\npip install --user nltk\npython -c \"import nltk; nltk.download('punkt')\"\npip install --user tensor2tensor==1.13.4\n```\n\nThe following command translates the provided example file. It automatically\nsplits paragraphs into sentences, translates English sentences to French and\nthen translates them back into English. Finally, it composes the paraphrased\nsentences into paragraphs. Go to the *back_translate* directory and run:\n\n```shell\nbash download.sh\nbash run.sh\n```\n\n### Guidelines for hyperparameters:\n\nThere is a variable *sampling_temp* in the bash file. It is used to control the\ndiversity and quality of the paraphrases. Increasing sampling_temp will lead to\nincreased diversity but worse quality. Surprisingly, diversity is more important\nthan quality for many tasks we tried.\n\nWe suggest trying sampling_temp values of 0.7, 0.8 and 0.9. If your task is very\nrobust to noise, sampling_temp=0.9 or 0.8 should lead to improved performance.\nIf your task is not robust to noise, setting sampling_temp to 0.7 or 0.6 should\nbe better.\n\nIf you want to run back translation on a large file, you can change the replicas\nand worker_id arguments in run.sh. 
For example, when replicas=3, we divide the\ndata into three parts, and each run.sh will only process one part according to\nthe worker_id.\n\n\n## General guidelines for setting hyperparameters:\n\nUDA works out-of-the-box and does not require extensive hyperparameter tuning, but\nto really push the performance, here are suggestions about hyperparameters:\n\n*   It works well to set the weight on the unsupervised objective *'unsup_coeff'*\n    to 1.\n*   Use a lower learning rate than pure supervised learning because there are\n    two loss terms computed on labeled data and unlabeled data respectively.\n*   If you have an extremely small amount of data, try to tweak\n    'uda_softmax_temp' and 'uda_confidence_thresh' a bit. For more details about\n    these two hyperparameters, search for \"Confidence-based masking\" and\n    \"Softmax temperature control\" in the paper.\n*   Effective augmentation for supervised learning usually works well for UDA.\n*   For some tasks, we observed that increasing the batch size for the\n    unsupervised objective leads to better performance. For other tasks, small\n    batch sizes also work well. 
For example, when we run UDA with GPU on\n    CIFAR-10, the best batch size for the unsupervised objective is 160.\n\n## Acknowledgement\n\nA large portion of the code is taken from\n[BERT](https://github.com/google-research/bert) and\n[RandAugment](https://github.com/tensorflow/models/tree/master/research/autoaugment).\nThanks!\n\n## Citation\n\nPlease cite this paper if you use UDA.\n\n```\n@article{xie2019unsupervised,\n  title={Unsupervised Data Augmentation for Consistency Training},\n  author={Xie, Qizhe and Dai, Zihang and Hovy, Eduard and Luong, Minh-Thang and Le, Quoc V},\n  journal={arXiv preprint arXiv:1904.12848},\n  year={2019}\n}\n```\n\nPlease also cite this paper if you use UDA for images.\n\n```\n@article{cubuk2019randaugment,\n  title={RandAugment: Practical data augmentation with no separate search},\n  author={Cubuk, Ekin D and Zoph, Barret and Shlens, Jonathon and Le, Quoc V},\n  journal={arXiv preprint arXiv:1909.13719},\n  year={2019}\n}\n```\n\n## Disclaimer\n\nThis is not an officially supported Google product.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fuda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle-research%2Fuda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle-research%2Fuda/lists"}