{"id":20630814,"url":"https://github.com/cluebbers/nlp_deeplearning_spring2023","last_synced_at":"2026-05-07T14:40:52.847Z","repository":{"id":228339935,"uuid":"745222342","full_name":"cluebbers/NLP_DeepLearning_Spring2023","owner":"cluebbers","description":"Implementing and fine-tuning BERT for sentiment analysis, paraphrase detection, and semantic textual similarity tasks. Includes code, data, and detailed results.","archived":false,"fork":false,"pushed_at":"2025-05-26T07:41:40.000Z","size":62681,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-10T04:15:38.808Z","etag":null,"topics":["adamw-optimizer","bert","deep-learning","natural-language-processing","paraphrase-detection","python","pytorch","semantic-similarity","sentiment-analysis","sophia","tensorflow","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cluebbers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-18T21:58:14.000Z","updated_at":"2025-05-26T07:40:43.000Z","dependencies_parsed_at":"2024-03-18T10:39:33.784Z","dependency_job_id":null,"html_url":"https://github.com/cluebbers/NLP_DeepLearning_Spring2023","commit_stats":null,"previous_names":["cluebbers/deepnlp-ss23","cluebbers/nlp_deeplearning_spring2023"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cluebbers/NLP_DeepLearning_Spring2023","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cluebbers%2FNLP_DeepLearning_Spring2023"
,"tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cluebbers%2FNLP_DeepLearning_Spring2023/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cluebbers%2FNLP_DeepLearning_Spring2023/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cluebbers%2FNLP_DeepLearning_Spring2023/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cluebbers","download_url":"https://codeload.github.com/cluebbers/NLP_DeepLearning_Spring2023/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cluebbers%2FNLP_DeepLearning_Spring2023/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281826831,"owners_count":26568347,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-30T02:00:06.501Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adamw-optimizer","bert","deep-learning","natural-language-processing","paraphrase-detection","python","pytorch","semantic-similarity","sentiment-analysis","sophia","tensorflow","transformers"],"created_at":"2024-11-16T14:09:38.177Z","updated_at":"2025-10-30T15:05:34.080Z","avatar_url":"https://github.com/cluebbers.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# G05 Language Ninjas\n\nThis repository is the Project for the 
Module M.Inf.2202 Deep Learning for Natural Language Processing of Group G05 Language Ninjas. \nThe course description can be found [here](https://gipplab.org/deep-learning-for-natural-language-processing/). \nThe project description can be found in SS23_DNLP_ProjectDescription.pdf.\n\nThe goal for Part 1 is to implement a base BERT version including the AdamW optimizer and train it for sentiment analysis on the Stanford Sentiment Treebank (SST). \nThe goal for Part 2 is to implement multitask training for sentiment analysis on the Stanford Sentiment Treebank (SST), paraphrase detection on the Quora Question Pairs Dataset (QQP) and semantic textual similarity on the SemEval STS Benchmark (STS).\n\n## Methodology\n### Part 1\nWe followed the instructions in the project description.\n\n### sBERT\nWe first implemented sBERT and focused on improving the accuracy for the three tasks.\n\nTo create a baseline, we used the provided template and implemented a very basic model for all tasks. \nEach task is trained separately. \nWe achieved a training accuracy of nearly 100 %, \nbut the dev accuracy stopped improving early, \nso generalization is a problem.\n\nBetter generalization is typically achieved by regularization. \nThe first easy things to try are dropout and weight decay. \nAll tasks in the baseline share a common dropout layer. \nSince paraphrase detection and textual similarity are both about similarity, we tried to let them share an additional dropout layer for the second embeddings. \n\nAnother approach to regularization is additional data. \nThe provided datasets are imbalanced in the sense that paraphrase is by far the largest one and has the best dev accuracy in the baseline. \nSimilarity and paraphrase are similar tasks, so we computed a cosine similarity and also used this layer for paraphrase detection. \nThis way the similarity layer gets updated when training for paraphrase detection.\n\nThe training order in the baseline is sts -\u003e sst -\u003e qqp. 
\nSince paraphrase has the largest dataset and performs best, we changed the training order to train on paraphrase first: qqp -\u003e sts -\u003e sst.\n\nSMART is an approach for regularization and uses adversarial learning. \nIt adds noise to the original embeddings, computes logits for the perturbed input and an adversarial loss against the unperturbed logits. \nThis adversarial loss is added to the original training loss. \nThe parameters of the added noise, and therefore the adversarial loss, are optimized during training.\n\nSophia is a new optimizer challenging the dominance of Adam. \nWe tried it and compared it to AdamW.\n\nAnother possibility is to combine losses instead of training separately. \nThis can be as simple as adding them together. \nSince gradients for different tasks can point in different directions, we also tried gradient surgery (see the Gradient Surgery section).\n\nWe used Optuna for hyperparameter tuning. We recorded regular training runs in Tensorboard. \n```\ntensorboard --logdir ./minbert-default-final-project/runs\n```\n## Experiments\n\n### Part 1\n\n```\npython classifier.py --use_gpu --batch_size 10 --lr 1e-5 --epochs 10 --option finetune\n```\nTensorboard: Jul19_21-50-55_Part1\n| Model name         | SST accuracy |\n| ------------------ |---------------- | \n| BERT Base |     51.41 %         |    \n\n\n### Part 2 sBERT\n\nWe started with sBERT. \nTo create the baseline, we simply trained the BERT model implemented in Part 1 on all datasets, using the AdamW optimizer from Part 1 with the standard hyperparameters ($lr = 1e-05$, $(\\beta_{1},\\beta_{2}) = (0.9, 0.999)$). \nIn each epoch we trained first on the whole Quora trainset, then on the whole SemEval trainset and finally on the whole SST trainset. 
\nWe used Cross-Entropy loss on the Quora and SST trainsets, and on the SemEval set we used MSE loss applied to the cosine similarity of the BERT embeddings of the two input sentences.\nTo perform the paraphrasing and sentiment analysis tasks, a simple linear classifier layer was added on top of the BERT embeddings.\nThe code has changed since then, so reproducing this run requires a commit from before 2023-07-24.\n```\npython multitask_classifier.py --use_gpu --batch_size 20 --lr 1e-5 --epochs 30 --option finetune --optimizer adamw\n```\nTensorboard: Jul23_21-38-22_Part2_baseline\n\nAfter 5 epochs there are no significant improvements in the dev metrics. Train accuracy is nearly 100 % for every task.\nWe conclude that the model is overfitting.\nWe did another run to record the dev loss.\nPlease take care to use a commit from 2023-08-25 to reproduce the results.\n\n```\npython -u multitask_classifier.py --use_gpu --option finetune --lr 1e-5 --batch_size 64 --comment \"baseline\" --epochs 30\n```\nTensorboard: Aug25_10-01-58_ggpu136baseline\n\nThe dev metrics are a bit different this time. \nThe dev loss is going up after 5 epochs. This confirms overfitting.\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| sBERT-Baseline_1 |     51.14 %         |      85.23 %       | 52.15 % |\n| sBERT-Baseline_2  |     51.41 %         |      77.32 %       | 43.35 %  |
\n\n### Sophia Optimizer\n\n#### Implementation\n\n[Paper](https://arxiv.org/abs/2305.14342) and [code](https://github.com/Liuhong99/Sophia)\n\nThe code for Sophia can be found in `optimizer.py`.\nWe did one run with standard Sophia parameters and the same learning rate as AdamW.\n\n```\npython -u multitask_classifier.py --use_gpu --option finetune --lr 1e-5 --optimizer \"sophiag\" --epochs 20 --comment \"sophia\" --batch_size 64\n```\nTensorboard: Aug25_10-50-25_ggpu115sophia\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| Sophia Baseline |     36.69 %         |      80.81 %       | 44.67 % |\n| Sophia Baseline (Finn) |     45 %          |      77.8 %        | 32 % |\n\nTraining behaves very differently across the tasks.\n- STS: the metrics and curves are similar to the baselines\n- SST: training loss is similar to baseline. Other training metrics are worse.\n- QQP: training metrics are similar to our first baseline. Dev metrics are more similar to the second baseline.\n\nTwo conclusions:\n1. all tasks behave differently and should therefore be trained with different parameters\n2. AdamW and Sophia need different parameters\n\n#### Comparison to AdamW\n\nTo compare both optimizers, we did an Optuna study.\nEach of the 100 trials trains for three epochs, with pruning. \nThe study compares Adam (learning rate, weight decay) and Sophia (learning rate, weight decay, rho, k) and their parameters.\n```\npython optuna_optimizer.py --use_gpu\n```\nOptuna: `./optuna/optimizer-*`\n\n\u003cimg src=\"minbert-default-final-project/optuna/sBERT/optimizer-slice.png\" alt=\"alt text\" width=\"900\" height=\"300\"\u003e\n\nThe slice plot shows that learning rate and weight decay should be larger for Sophia.\n\n#### Tuning of Sophia\n\nTo find better Sophia hyperparameters, we ran another Optuna study. 
We used the Bayesian hyperparameter optimization of the Optuna library.\nIn the Optuna study we used only a tiny fraction of the para dataset. Otherwise, the study would have taken several days to complete. \nEach of the 100 trials trains for three epochs, with pruning. \nWe used a separate optimizer for every task and tuned learning rate, rho and weight decay.\n```\npython -u optuna_sophia.py --use_gpu --batch_size 64 --objective all\npython -u optuna_sophia.py --use_gpu --batch_size 64 --objective para\npython -u optuna_sophia.py --use_gpu --batch_size 64 --objective sst\npython -u optuna_sophia.py --use_gpu --batch_size 64 --objective sts\n``` \nOptuna: `./optuna/Sophia-*`\n| Task               | learning rate  | weight decay  | rho  |\n| ------------------ |---------------- | -------------- | -------------- | \n| SST |     2.59e-5       |      0.2302     | 0.0449 |\n| QQP |     3.45e-5       |      0.1267     | 0.0417  |\n| STS |     4.22e-4       |      0.1384     | 0.0315 |\n\nTraining with the parameters:\n```\npython -u multitask_classifier.py --use_gpu --option finetune  --epochs 20 --comment \"_sophia-opt\" --batch_size 64 --optimizer \"sophiag\" --weight_decay_para 0.1267 --weight_decay_sst 0.2302 --weight_decay_sts 0.1384 --rho_para 0.0417 --rho_sst 0.0449 --rho_sts 0.0315 --lr_para 3.45e-5 --lr_sst 2.5877e-5 --lr_sts 0.0004\n```\nTensorboard: Sep01_22-58-01_ggpu135sophia\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| Sophia Tuned |     26.25 %         |      62.74 %       | 3.061 % |\n\nThis did not work as expected: the model did not learn. Manual experimentation showed that the learning rate was likely too high and that the default learning rate of 1e-5 works fairly well. 
Resetting the learning rates but keeping the other hyperparameters from above improves the performance on all three tasks compared to the Sophia baseline:\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| Sophia Tuned standard lr |     47.6 %         |      78.8 %       | 36.7 % |\n\n#### Adding Dropout Layers\nSince the overfitting problem remained after the hyperparameter tuning, we added an individual dropout layer for every task to reduce the overfitting. That is, dropout is applied to the BERT embeddings before they are passed to the linear classifier layer of a task. The dropout probability can be chosen differently for the different tasks. We tuned the dropout probabilities together with the learning rate and weight decay in another Optuna study. We obtained the following dropout probabilities:\n| Para Dropout       | SST Dropout | STS Dropout |\n| ------------------ |---------------- | -------------- |  \n|  15 %  |     5.2 %         |      22 %       |\n\nWe obtained the following results:\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| Sophia dropout  |     38.1 %         |      73.1 %       | 28.8 %  |\n\nTo reproduce this result run: \n```\npython -u multitask_classifier.py --use_gpu --option finetune  --optimizer \"sophiag\" --epochs 10 --hidden_dropout_prob_para 0.15 --hidden_dropout_prob_sst 0.052 --hidden_dropout_prob_sts 0.22 \\\n--lr_para 1.8e-05 --lr_sst 5.6e-06 --lr_sts 1.1e-05 --weight_decay_para 0.038 --weight_decay_sst 0.17 --weight_decay_sts 0.22 \\\n--comment individual_dropout\n```\nThe dropout layers actually made the performance on all three tasks worse. We also tested different dropout values with the base optimizer parameters ($lr = 1e-05$, weight decay $= 0$), but in that case the performance was even worse. 
So, we decided not to investigate this approach further.\n\n\n#### Separate QQP training and weighted loss\nWe observed two problems with the data:\n\n1. The QQP dataset is much larger than the other two datasets. Thus, we might overfit on the SemEval and SST datasets before the model is fully trained on the QQP dataset. \n2. The distribution of the different classes in the QQP and SST datasets is not equal (for example, class one contains more than twice as many samples as class zero).  As we see in the confusion matrix of the Sophia base model, many datapoints from class zero are falsely predicted to be in class one (same problem with classes five and four). \n\n\u003cimg src=\"confusion_matrix_sst.png\" alt=\"alt text\" width=\"300\" height=\"300\"\u003e\n\nTo tackle the first problem, we train the first 5 epochs only on the QQP dataset. The last epochs are trained on all datasets, but we only train on a randomly sampled tiny fraction of the QQP dataset, which has the same size as the other two datasets. \n\nTo balance the QQP and SST trainsets, we add weights to our Cross-Entropy loss function such that a training sample from a small class is assigned a higher weight.\n\nFor training we kept the model parameters from the Tuning of Sophia section, with the standard learning rate.\nThese two adjustments worked out and improved the performance on all three datasets. 
In particular, the performance on the QQP dataset improved a lot. \nThe following results were obtained:\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| Sophia Tuned standard lr |     47.6 %         |      78.8 %       | 36.7 % |\n| Sophia balanced data  |     47.8 %         |      81.8 %       | 45.5 %  |\n\nUse the same command as in the Tuning of Sophia section (with standard learning rate and no dropout) and add the arguments ```--para_sep True --weights True``` to reproduce the results.\n\n### AdamW\n\n#### Additional layers\nAnother problem we observed earlier was that the tasks conflict with each other, e.g. with separate QQP training the paraphrase accuracy increased but the other two accuracies decreased. We tried to resolve these conflicts by using a simple neural network with one hidden layer as the classifier for each task instead of a single linear layer. The idea is that each task gets more parameters of its own, which are not influenced by the other tasks. As the activation function between the hidden layer and the output we tested ReLU and tanh; ReLU performed better.  Furthermore, we tried freezing the BERT parameters in the last training epochs and training only the classifier parameters. 
This improved the performance, especially on the SST dataset.\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| Adam new base |     50.3 %         |      86.4 %       | 84.7 % |\n| Adam additional layer|     50 %          |      88.4 %        | 84.4 % |\n| Adam extra classifier training|     51.6 %          |      88.5 %        | 84.3 % |\n\nRun the following command for the Adam baseline:\n```\npython -u multitask_classifier.py --use_gpu --option finetune  --optimizer \"adamw\" --epochs 4 --one_embed True --freeze_bert True --add_layers True --filepath final_freeze\n```\nTo use the non-linear classifier with ReLU activation, add the argument ```--add_layers```, and to freeze the BERT parameters in the last epochs, add the argument ```--freeze_bert```. \n\nWe also tested some dropout and weight decay values, but those did not improve the performance. Furthermore, the weighted loss function, which improved the model's performance with the Sophia optimizer, did not help here.\n### SMART\n\n#### Implementation\n\n[Paper](https://aclanthology.org/2020.acl-main.197/) and [code](https://github.com/namisan/mt-dnn)\n\nThe perturbation code is in `smart_perturbation.py` with additional utilities in `smart_utils.py`. Training with standard parameters:\n```\npython -u multitask_classifier.py --use_gpu --option finetune --lr 1e-5 --optimizer \"adamw\" --epochs 20 --comment \"smart\" --batch_size 32 --smart\n```\nTensorboard: Aug25_11-01-31_ggpu136smart\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | -------------- | \n| sBERT-SMART Baseline |     50.41 %         |      79.64 %       | 52.60 % |\n\nThe training metrics are similar to the baselines. The dev metrics are a bit better than the second baseline. 
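\n\nFor orientation, the SMART objective can be sketched as follows. This follows the notation of the SMART paper linked above rather than our code; the symbols $\\lambda_s$ (regularization weight) and $\\ell_s$ (divergence between the two outputs) are assumptions taken from the paper and do not appear elsewhere in this README:\n\n$$\\min_{\\theta} \\; \\mathcal{L}(\\theta) + \\lambda_s \\frac{1}{n}\\sum_{i=1}^{n} \\max_{\\|\\tilde{x}_i - x_i\\|_p \\le \\epsilon} \\ell_s\\left(f(\\tilde{x}_i; \\theta), f(x_i; \\theta)\\right)$$\n\nHere $x_i$ are the input embeddings, $\\tilde{x}_i$ their perturbed version and $f$ the model output. In the paper, $\\ell_s$ is a symmetric KL divergence for classification and a squared loss for regression. The inner maximization is approximated by a few gradient ascent steps starting from random noise, which is presumably where the `epsilon`, `step_size`, `noise_var` and `norm_p` parameters of the tuning runs come in. 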
\n\n#### Tuning \n\nWe tuned the SMART parameters (epsilon, step_size, noise_var, norm_p) with Optuna.\nEach of the 100 trials trains for three epochs, with pruning. \n\n```\npython -u optuna_smart.py --use_gpu --batch_size 50 --objective all\npython -u optuna_smart.py --use_gpu --batch_size 50 --objective para\npython -u optuna_smart.py --use_gpu --batch_size 50 --objective sst\npython -u optuna_smart.py --use_gpu --batch_size 50 --objective sts\n```\nOptuna: `./optuna/smart-*`\n\n| Model name         | accuracy | epsilon  | step size  | noise_var  | norm_p |\n| ------------------ | -------------- |---------------- | -------------- | ---------------- |---------------- |\n| sBERT-SST |     51.31 | 3.93e-6       |      0.0001    | 4.21e-6 | inf |\n| sBERT-QQP |    79.34 |  1.88e-7      |      0.0012     | 1.31e-5 | L2 |\n| sBERT-STS |     49.64 | 4.38e-7      |      0.0024    | 1.67e-5 | L2 |\n\nTraining with these parameters:\n```\npython -u multitask_classifier.py --use_gpu --option finetune --lr 1e-5 --optimizer \"adamw\" --epochs 20 --comment \"_smart\" --batch_size 32 --smart --multi_smart True\n```\nTensorboard: Sep01_22-53-32_ggpu135smart\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---------------- |\n| sBERT-SMART Baseline |     50.41 %         |      79.64 %       | 52.60 % |\n| sBERT-SMART Tuned |     51.41 %         |      80.58 %       | 48.46 % |\n\n### Regularization\n\n```\npython -u optuna_regularization.py --use_gpu --batch_size 80\n```\nOptuna: `./optuna/regularization-*`\n\nTODO regularization with separate dropout and weight_decays for each task\n\n### Shared similarity layer\nOne layer of cosine similarity is used for both paraphrase detection and sentence similarity.\n\n```\npython -u multitask_classifier.py --use_gpu --option finetune --lr 1e-5 --shared --optimizer \"adamw\" --epochs 20 --comment \"shared\" --batch_size 64\n```\nTensorboard: Aug25_09-53-27_ggpu137shared\n\n| 
Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---------------- |\n| sBERT-Shared similarity |     50.14 %         |      71.08 %       | 47.68 % |\n\n### Custom Attention\n\nWe tried changing the standard attention formula:\n\n1) Generalize $QK^T$ with a symmetric linear combination of both $Q, K$ and learn the combination:\n\n$$attention(Q, K, V) = softmax\\left(\\frac{(\\alpha_1 * Q + \\alpha_2 * K + \\alpha_3I)(\\beta_1 * Q + \\beta_2 * K + \\beta_3I)^T}{\\sqrt{d_k}}\\right)V$$\n\n2) Replace softmax with sparsemax (see https://arxiv.org/abs/1602.02068v2):\n\n$$attention(Q, K, V) = sparsemax\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V$$\n\n3) Add an additional learnable center matrix in between:\n\n$$attention(Q, K, V) = softmax\\left(\\frac{QWK^T}{\\sqrt{d_k}}\\right)V$$\n\nFor ideas 1 and 3, the original self-attention is recovered for specific parameter values. The second idea is taken from the paper linked above. The goal was for the model to start from the original parameters while having more freedom to adapt them through a few extra parameters inside every BERT layer. 
We later realized that all 3 ideas could be combined resulting in 8 different models (1 baseline + 7 extra):\n\n| Model name                 | SST accuracy | QQP accuracy | STS correlation |\n| -------------------------- | ------------ | ------------ | --------------- |\n| sBERT-BertSelfAttention (baseline)                 | 44.6% | 77.2% | 48.3% |\n| sBERT-LinearSelfAttention                          | 40.5% | 75.6% | 37.8% |\n| sBERT-NoBiasLinearSelfAttention                    | 40.5% | 75.6% | 37.8% |\n| sBERT-SparsemaxSelfAttention                       | 39.0% | 70.7% | 56.8% |\n| sBERT-CenterMatrixSelfAttention                    | 39.1% | 76.4% | 43.4% |\n| sBERT-LinearSelfAttentionWithSparsemax             | 40.1% | 75.3% | 40.8% |\n| sBERT-CenterMatrixSelfAttentionWithSparsemax       | 39.1% | 75.6% | 40.4% |\n| sBERT-CenterMatrixLinearSelfAttention              | 42.4% | 76.2% | 42.4% |\n| sBERT-CenterMatrixLinearSelfAttentionWithSparsemax | 39.7% | 76.4% | 39.2% |\n\nOur baseline was different because we used other starting parameters (greater batch size, fewer parameters). We did this to reduce the training time for this experiment, see also ``submit_custom_attention.sh``:\n\n```\npython -B multitask_classifier.py --use_gpu --epochs=10 --lr=1e-5 --custom_attention=$CUSTOM_ATTENTION\n```\n\nExcept for the SparsemaxSelfAttention STS correlation, all values declined. The problem is most likely overfitting. Making the model even more complex makes overfitting worse, so we get worse performance.\n\n### Split and reordered batches\n[Split and reordered batches](https://gitlab.gwdg.de/lukas.niegsch/language-ninjas/-/milestones/12#tab-issues)\n\nThe para dataset is much larger than the other two. Originally, we trained para last and then evaluated all 3 tasks independently of each other. This has the effect that the model is optimized towards para, but forgets information from sst and sts. 
We moved para first and then did the other two last.\n\nFurthermore, all 3 datasets are learned one after another. This means that the gradients may point in 3 different directions which we follow one after another. However, our goal is to move in the general direction for all 3 tasks together. We tried splitting the datasets into 6 different chunks (large para), (tiny sst, tiny para), (sts_size sts, sts_size para, sts_size sst). The important point is that the last 3 batches are the same size. Thus we can train all tasks without having para dominate the others.\n\nLastly, we tried training the batches for the last 3 steps in a round robin way (sts, para, sst, sts, para, sst, ...).\n\n| Model name                 | SST accuracy | QQP accuracy | STS correlation |\n| -------------------------- | ------------ | ------------ | --------------- |\n| sBERT-BertSelfAttention (baseline)                 | 44.6% | 77.2% | 48.3% |\n| sBERT-ReorderedTraining (BertSelfAttention)        | 45.9% | 79.3% | 49.8% |\n| sBERT-RoundRobinTraining (BertSelfAttention)       | 45.5% | 77.5% | 50.3% |\n\nWe used the same script as for the custom attention, but only used the original self attention. The reordered training is enabled by default because it gave the best performance. The round robin training can be enabled using the ``--cyclic_finetuning`` flag.\n\n```\npython -B multitask_classifier.py --use_gpu --epochs=10 --lr=1e-5 --cyclic_finetuning=True\n```\n\nThe reordering improved the performance, most likely just because para comes first. The round robin did not improve it further; maybe switching after each batch is too much.\n\n### Combined Loss\n\nThis could work as a kind of regularization, because it does not train on a single task and overfit, but uses all losses to optimize. \nSo no single task is trained as well as it could be.\nLoss for every task is calculated. 
All losses are summed up and optimized.\n```\npython multitask_combined_loss.py --use_gpu\n```\nTensorboard: Aug23_17-45-56_combined_loss\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---------------- |\n| sBERT-Combined Loss |     38.33 %         |      81.12 %       | 44.68 % |\n\nThe tasks seem to be too different to work well in this setup. \nThe loss is going down as it should, but the predicted values are not good, as seen in the dev_loss and dev_acc. \nWe suspect that the large training set for paraphrase detection dominates the learning process.\n\n### Gradient Surgery\nImplementation from [Paper](https://arxiv.org/pdf/2001.06782.pdf) and [code](https://github.com/WeiChengTseng/Pytorch-PCGrad)\n\n```\npython -u multitask_combined_loss.py --use_gpu --batch_size 10 --pcgrad --epochs 15 --comment \"pcgrad\" --lr 1e-5 --optim \"adamw\" --batch_size 40\n```\nIt fails because some logits are NA.\n\n## Part 2 BERT\n\nSince we were not particularly successful with our sBERT, we also did some regular Base BERT training.\nSimilarity is now calculated by combining the input and then getting BERT embeddings.\nThen we use a linear classifier to output logits.\nThe logits are multiplied by 0.2 to get a similarity score between 0 and 5.\n### dataloader\nWe noticed that the dataloader for the sts dataset converts the labels to integers. 
We fixed it by setting the option isRegression to True in `datasets.py`:\n```\nsts_train_data = SentencePairDataset(sts_train_data, args, isRegression = True)\nsts_dev_data = SentencePairDataset(sts_dev_data, args, isRegression = True)\n```\nThis improves training by a few percent.\n\n### Baseline\n\nFor the baseline with AdamW and one embedding:\n```\nsubmit_multi_adamw_one_embed.sh\n```\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| BERT Baseline |     50.3 %         |      86.4 %       | 84.7 % |\n\n### non-linear classifier\n\nTo use the non-linear classifier, run:\n```\nsubmit_multi_adamw_add_layers.sh\n```\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| BERT additional layer|     50 %          |      88.4 %        | 84.4 % |\n\n\n### freeze\nTo first train everything (BERT + non-linear classifier) for four epochs and then train only the non-linear classifier for 10 epochs, run:\n```\npython -u multitask_classifier.py --use_gpu --option finetune  --optimizer \"adamw\" --epochs 4 --one_embed True --freeze_bert True --add_layers True \n```\nThis improves the result slightly again (third row of the table in the AdamW section). Apparently, a single epoch of training only the non-linear classifier is enough to reach the best result, since in this case the classifier has already been trained together with BERT beforehand.\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---\n| BERT extra classifier training|     51.6 %          |      88.5 %        | 84.3 % |\n\n### SMART\n\nUsing standard SMART parameters:\n```\npython -u multitask_classifier.py --use_gpu --option finetune  --optimizer \"adamw\" --epochs 10 --one_embed True  --add_layers True --comment adam_add_layers_one_embed_smart --smart --batch_size 64 --lr 
1e-5\n```\nTensorboard: Sep03_11-23-24_bert_smart\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---------------- |\n| BERT-SMART |     51.6 %         |      88.8 %       | 43.8 % |\n\nThe poor STS correlation arises because SMART uses MSE loss to compute its adversarial loss. \nWe did not change it **yet**.\n\n### Tuning SMART\nWe did another Optuna SMART run for base BERT.\nIt currently only works on branch 47.\n\n```\npython -u optuna_smart.py --use_gpu --batch_size 50 --objective sst --one_embed True --add_layers --n_trials 50 --epochs 3\npython -u optuna_smart.py --use_gpu --batch_size 50 --objective sts --one_embed True --add_layers --n_trials 50 --epochs 3\npython -u optuna_smart.py --use_gpu --batch_size 50 --objective para --one_embed True --add_layers --n_trials 50 --epochs 3\n```\n\n| Model name         | accuracy | epsilon  | step size  | noise_var  | norm_p |\n| ------------------ | -------------- |---------------- | -------------- | ---------------- |---------------- |\n| sBERT-SST |     51.31 | 3.93e-6       |      0.0001    | 4.21e-6 | inf |\n| BERT-SST |     49.59 | 2.95e-7       |      0.0067    | 1.41e-6 | L1 |\n| sBERT-QQP |    79.34 |  1.88e-7      |      0.0012     | 1.31e-5 | L2 |\n| BERT-QQP |     67.00 | 1.83e-7      |      0.0014     | 2.32e-6 | L2 |\n| sBERT-STS |     49.64 | 4.38e-7      |      0.0024    | 1.67e-5 | L2 |\n| BERT-STS |     27.29 | 6.65e-6     |      0.0002    | 7.84e-6 | L1 |\n\nThe poor STS correlation arises because SMART uses MSE loss to compute its adversarial loss. \nWe did not change it **yet**.\n\n### Final model\nWe combined some of our results in the final model. 
\n\n```\npython -u multitask_classifier.py --use_gpu --option finetune  --optimizer \"adamw\" --epochs 30 --one_embed True  --add_layers True --comment adam_add_layers_one_embed --batch_size 64 --lr 1e-5\n```\nTensorboard: Sep03_21-15-31_bert_final_30\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | ---------------- |\n| BERT-Final |     51.3 %         |      88.9 %       | 85.1% |\n\n## Requirements\n\nYou can use `setup.sh` or `setup_gwdg.sh` to create an environment and install the needed packages. Added to standard project ones:\n\n```\npip install tensorboard\npip install torch-tb-profiler\npip install optuna\n```\n\n## Training\n- `multitask_classifier.py` is baseline training with seperate training for every task: sts -\u003e sst -\u003e qqp\n- `multitask_combined_loss.py` combines losses by summing them up\n- `multitask_order.py` trains paraphrase detection first: qqp -\u003e sts -\u003e sst\n- `models.py`\n    - `models.MultitaskBERT` class with basic layers for three tasks\n    - `models.SharedMultitaskBERT` class where the similarity layer of the similarity task is also used for paraphrase detection\n    - `models.SmartMultitaskBERT` class with basic multitask model modified to work with SMART\n\n## Evaluation\n- `evaluation.model_eval_multitask()`\n- `evaluation.smart_eval()` function for evaluation modified to work with SMART\n- `evaluation.optuna_eval()` function for basic evaluation to work with Optuna\n- `evaluation.test_model_multitask()` and `evaluation. 
model_eval_test_multitask()` functions for submitting final results\n\n## Pre-trained Models\n\nYou can download pretrained models from the original [project repository](https://github.com/truas/minbert-default-final-project) \n\n## Results\n\nOur models achieve the following performance:\n\n| Model name         | SST accuracy | QQP accuracy | STS correlation |\n| ------------------ |---------------- | -------------- | -------------- |\n| State-of-the-Art                             | 59.8% | 90.7% |   93% |\n| sBERT-Baseline_1  |     51.14 %         |      85.23 %       | 52.15 % |\n| sBERT-Baseline_2 |     51.41 %         |      77.32 %       | 43.35 %  |\n| sBERT-Sophia Baseline|     36.69 %         |      80.81 %       | 44.67 % |\n| sBERT-Sophia Tuned |     26.25 %         |      62.74 %       | 3.061 % |\n| sBERT-SMART Baseline |     50.41 %         |      79.64 %       | 52.60 % |\n| sBERT-SMART Tuned |     51.41 %         |      80.58 %       | 48.46 % |\n| sBERT-Shared Similarity |     50.14 %         |      71.08 %       | 47.68 % |\n| sBERT-Combined Loss |     38.33 %         |      81.12 %       | 44.68 % |\n| sBERT-BertSelfAttention (no augmentation)          | 44.6% | 77.2% | 48.3% |\n| sBERT-ReorderedTraining (BertSelfAttention)        | 45.9% | 79.3% | 49.8% |\n| sBERT-RoundRobinTraining (BertSelfAttention)       | 45.5% | 77.5% | 50.3% |\n| sBERT-LinearSelfAttention                          | 40.5% | 75.6% | 37.8% |\n| sBERT-NoBiasLinearSelfAttention                    | 40.5% | 75.6% | 37.8% |\n| sBERT-SparsemaxSelfAttention                       | 39.0% | 70.7% | 56.8% |\n| sBERT-CenterMatrixSelfAttention                    | 39.1% | 76.4% | 43.4% |\n| sBERT-LinearSelfAttentionWithSparsemax             | 40.1% | 75.3% | 40.8% |\n| sBERT-CenterMatrixSelfAttentionWithSparsemax       | 39.1% | 75.6% | 40.4% |\n| sBERT-CenterMatrixLinearSelfAttention              | 42.4% | 76.2% | 42.4% |\n| sBERT-CenterMatrixLinearSelfAttentionWithSparsemax | 39.7% | 
76.4% | 39.2% |\n| BERT Baseline |     50.3 %         |      86.4 %       | 84.7 % |\n| BERT-SMART |     51.6 %         |      88.8 %       | 43.8 % |\n| BERT additional layer|     50%          |      88.4%        | 84.4 % |\n| BERT extra classifier training|     51.6%          |      88.5%        | 84.3 % |\n| BERT-Final |     51.3 %         |      88.9 %       | 85.1% |\n\n[Leaderboard](https://docs.google.com/spreadsheets/d/1Bq21J3AnxyHJ9Wb9Ik9OXvtX6O4L2UdVX9Y9sBg7v8M/edit#gid=0)\n\n[State-of-the-Art](https://paperswithcode.com/sota/sentiment-analysis-on-sst-5-fine-grained)\n\n## Future work\n- The para dataset is much larger than the sst and sts datasets, which leads to overfitting, so enlarging the sst and sts datasets should reduce the risk of overfitting. This could be achieved by generating more (true) data from the sst and sts datasets, which is possible by adding an additional task; see issue #60 for more information. \n- give other losses different weights. \n- with or without combined losses. 
\n- maybe based on dev_acc performance in previous epoch.\n- implement SMART for BERT-STS\n- Dropout and weight decay tuning for BERT (AdamW and Sophia)\n- CAPTUM implementation for deeper error analysis\n- low confidence prediction analysis\n- length vs metric score\n\n## Member Contributions\nDawor, Moataz: Generalisations on Custom Attention, Split and reordered batches, analysis_dataset\n\nLübbers, Christopher L.: Part 1 complete; Part 2: sBERT, Tensorboard (metrics + profiler), sBERT-Baseline, SOPHIA, SMART, Optuna, sBERT-Optuna for Optimizer, Optuna for sBERT and BERT-SMART, Optuna for sBERT-regularization, sBERT with combined losses, sBERT with gradient surgery, README-Experiments for those tasks, README-Methodology, final model, AI usage card\n\nNiegsch, Lukas*: Generalisations on Custom Attention, Split and reordered batches, repository maintenance (merging, lfs, some code refactoring)\n\nSchmidt, Finn Paul: sBert multi_task training, Sophia dropout layers, Sophia separated paraphrasing training, Sophia weighted loss, Optuna study on the dropout and hyperparameters, BERT baseline adam, BERT additional layers, error_analysis\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcluebbers%2Fnlp_deeplearning_spring2023","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcluebbers%2Fnlp_deeplearning_spring2023","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcluebbers%2Fnlp_deeplearning_spring2023/lists"}