{"id":19771980,"url":"https://github.com/danieldacosta/finetunedbert-data-augmentation","last_synced_at":"2026-07-16T07:32:07.607Z","repository":{"id":207879157,"uuid":"720328799","full_name":"DanielDaCosta/FineTunedBERT-Data-Augmentation","owner":"DanielDaCosta","description":"Enhance model performance in out-of-distribution contexts using fine-tuned BERT and data augmentation. Comparative analysis showcases efficiency gains with a 40% smaller model.","archived":false,"fork":false,"pushed_at":"2024-10-14T14:55:36.000Z","size":3441,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-25T06:37:41.310Z","etag":null,"topics":["bert","data-augmentation","distilbert","gpt-3","nlp"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DanielDaCosta.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-18T06:15:56.000Z","updated_at":"2024-10-14T14:56:26.000Z","dependencies_parsed_at":"2025-01-11T01:09:27.919Z","dependency_job_id":"c8a8180c-dc09-462d-bb76-ce439c0d66d0","html_url":"https://github.com/DanielDaCosta/FineTunedBERT-Data-Augmentation","commit_stats":{"total_commits":6,"total_committers":1,"mean_commits":6.0,"dds":0.0,"last_synced_commit":"65a249874266a67379bbfdd72b871139236143ae"},"previous_names":["danieldacosta/finetunedbert-data-augmentation"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DanielDaCosta/FineTunedBERT-Data-Augmentation","repository_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FFineTunedBERT-Data-Augmentation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FFineTunedBERT-Data-Augmentation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FFineTunedBERT-Data-Augmentation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FFineTunedBERT-Data-Augmentation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DanielDaCosta","download_url":"https://codeload.github.com/DanielDaCosta/FineTunedBERT-Data-Augmentation/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DanielDaCosta%2FFineTunedBERT-Data-Augmentation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35535880,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-16T02:00:06.687Z","response_time":83,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","data-augmentation","distilbert","gpt-3","nlp"],"created_at":"2024-11-12T05:04:59.099Z","updated_at":"2026-07-16T07:32:07.585Z","avatar_url":"https://github.com/DanielDaCosta.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Optimizing Language Models through Enhanced Fine-Tuning with Data 
Augmentation Techniques\n\nPaper: [Optimizing Language Models through Enhanced Fine-Tuning with Data Augmentation Techniques.pdf](https://github.com/DanielDaCosta/FineTunedBERT-Data-Augmentation/blob/main/Optimizing%20Language%20Models%20through%20Enhanced%20Fine-Tuning%20with%20Data%20Augmentation%20Techniques.pdf)\n\n# Abstract\nText classification, one of the core tasks of Natural Language Processing (NLP), becomes challenging when models are evaluated in out-of-distribution (OOD) contexts. Addressing these challenges requires specialized techniques to enhance model performance. This paper analyzes the efficacy of a fine-tuned BERT model on a custom OOD dataset and uses data augmentation to bolster its performance. Through a comparative analysis with DistilBERT and GPT-3.5, the paper demonstrates that comparable results can be achieved with a 40% smaller model, emphasizing the potential for efficiency gains without sacrificing performance.\n\n# Introduction\nFine-tuning a pre-trained model on a downstream task is a common procedure in NLP, as it achieves higher performance with minimal effort. However, in real-world scenarios, test data often deviates from the training data distribution. As a result, it is crucial to ensure that the model performs robustly on datasets with both similar and divergent distributions.\n\nIn this paper, we fine-tune a BERT model on binary classification tasks, test its performance on a specifically crafted out-of-distribution dataset, and discuss the reasons behind the observed decline in the model's effectiveness under these circumstances. 
Furthermore, the paper applies a data augmentation technique that expands the training set with out-of-distribution data, followed by a second round of fine-tuning.\n\nWe further extend our investigation by applying the same procedure to DistilBERT, a model that is 40% smaller, highlighting the trade-off between efficiency and performance. To validate the models' accuracy, we use GPT-3.5 in a zero-shot setting as a baseline on a small subset of the dataset.\n\nThe results show improved performance on the out-of-distribution (OOD) dataset after the integration of data augmentation. However, this improvement is accompanied by a comparatively modest decrease in performance on the original dataset. Moreover, the study emphasizes that employing DistilBERT, a smaller model that can be trained 50% faster, preserves the model's performance in a similar setting.\n\n# Getting Started\n\n## Dataset\nIMDB Dataset: Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. 
There is additional unlabeled data for use as well.\n\nhttps://huggingface.co/datasets/stanfordnlp/imdb\n\n## Installation\nPython 3.11.5:\n- torch==2.1.0\n- datasets==2.14.6\n- tqdm==4.66.1\n- transformers==4.35.0\n- evaluate==0.4.1\n- gensim==4.3.2\n- nltk==3.8.1\n\nOr you can install them by running:\n\n```\npip install -r requirements.txt\n```\n\n## Files\n- `main.py`: script for fine-tuning and evaluating BERT on the original or transformed dataset.\n- `main_distilBERT.py`: script for fine-tuning and evaluating DistilBERT on the original or transformed dataset.\n- `utils.py`: support script that contains all of the transformations used to create the out-of-distribution dataset\n- `word2vec_model.bin`: word2vec embeddings used for synonym replacement\n- `main_GPT.ipynb`: Jupyter Notebook for running GPT-3.5 evaluations of the Original (Sample) and Transformed (Sample) datasets, as well as BERT and DistilBERT\n\n**Prediction files**\nFiles within `./CARC_output` folder\n\nBERT:\n- `out_original.txt`: Fine-tuned BERT on original dataset\n- `out_original_transformed.txt`: Fine-tuned BERT on transformed dataset\n- `out_augmented_original.txt`: Fine-tuned augmented BERT on original dataset\n- `out_augmented_transformed`: Fine-tuned augmented BERT on transformed dataset\n- `out_100_original.txt`: Fine-tuned BERT predictions on the first 100 rows of the original dataset\n- `out_augmented_100_transformed.txt`: Fine-tuned augmented BERT predictions on the first 100 rows of the transformed dataset\n\nDistilBERT:\n- `out_distilbert_original.txt`: Fine-tuned DistilBERT on original dataset\n- `out_distilbert_original_transformed.txt`: Fine-tuned DistilBERT on transformed dataset\n- `out_distilbert_augmented_original.txt`: Fine-tuned augmented DistilBERT on original dataset\n- `out_distilbert_augmented_transformed.txt`: Fine-tuned augmented DistilBERT on transformed dataset\n- `out_distilbert_100_original.txt`: Fine-tuned DistilBERT predictions on the first 100 rows of the original 
dataset\n- `out_distilbert_augmented_100_transformed.txt`: Fine-tuned augmented DistilBERT predictions on the first 100 rows of the transformed dataset\n\nGPT-3.5 (zero-shot):\n- `gpt_out_original.txt`: predictions on the first 100 rows of the original dataset\n- `gpt_out_transformed.txt`: predictions on the first 100 rows of the transformed dataset\n\n\n**CARC Output Files**\n`./CARC_output/`: contains all CARC outputs for each training and evaluation run that was executed\n\n# Usage\n\n## Fine-Tuning and Evaluating on Original Dataset\n```bash\npython3 main.py --train --eval\n```\nOutputs:\n- out/: model tensors\n- out_original.txt: predictions\n\n```bash\npython3 main_distilBERT.py --train --eval\n```\nOutputs:\n- out_distilbert/: model tensors\n- out_distilbert_original.txt: predictions\n\n## Fine-Tuning and Evaluating on Transformed Dataset\n```bash\npython3 main.py --train_augmented --eval_augmented\n```\nOutputs:\n- out_augmented/: model tensors\n- out_augmented_original.txt: predictions\n\n\n```bash\npython3 main_distilBERT.py --train_augmented --eval_augmented\n```\nOutputs:\n- out_distilbert_augmented/: model tensors\n- out_distilbert_augmented_original.txt: predictions\n\n## Evaluations\n```bash\n# Evaluate the original BERT model on transformed data\npython3 main.py --eval_augmented --model_dir ./out\n\n# Evaluate the augmented BERT model on original data\npython3 main.py --eval_augmented --model_dir ./out_augmented\n```\n\n```bash\n# Evaluate the original DistilBERT model on transformed data\npython3 main_distilBERT.py --eval_augmented --model_dir ./out_distilbert\n\n# Evaluate the augmented DistilBERT model on original data\npython3 main_distilBERT.py --eval_augmented --model_dir 
./out_distilbert_augmented\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Ffinetunedbert-data-augmentation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdanieldacosta%2Ffinetunedbert-data-augmentation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdanieldacosta%2Ffinetunedbert-data-augmentation/lists"}