{"id":24822257,"url":"https://github.com/lazerlambda/team09applieddl","last_synced_at":"2025-03-25T21:23:47.807Z","repository":{"id":73797774,"uuid":"495441964","full_name":"LazerLambda/Team09AppliedDL","owner":"LazerLambda","description":"Model Distillation for Unlabeled and Imbalanced Data for Amino-Acid-Strings","archived":false,"fork":false,"pushed_at":"2023-08-15T14:52:45.000Z","size":1751,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-30T18:46:00.956Z","etag":null,"topics":["data-science","deep-learning","distillation","imbalanced-data","lmu","munich","python","statistics","transformer","unlabeled-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LazerLambda.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-23T14:18:57.000Z","updated_at":"2022-10-06T13:57:35.000Z","dependencies_parsed_at":"2025-01-30T18:40:03.089Z","dependency_job_id":"b4eeadd8-1928-4c7a-9e5e-b2c6c2db1a02","html_url":"https://github.com/LazerLambda/Team09AppliedDL","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LazerLambda%2FTeam09AppliedDL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LazerLambda%2FTeam09AppliedDL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LazerLambda%2FTeam09AppliedDL/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LazerLambda%2FTeam09AppliedDL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LazerLambda","download_url":"https://codeload.github.com/LazerLambda/Team09AppliedDL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245544216,"owners_count":20632780,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","deep-learning","distillation","imbalanced-data","lmu","munich","python","statistics","transformer","unlabeled-data"],"created_at":"2025-01-30T18:39:58.518Z","updated_at":"2025-03-25T21:23:47.800Z","avatar_url":"https://github.com/LazerLambda.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Group09\n==============================\n\n# Sequence-level Knowledge Distillation\n\n---\n\n## Introduction\n\nThis Git repository implements a new approach to knowledge distillation for data with sequence-level input and binary output on an imbalanced dataset. This work was inspired by the papers [1] and [2].\n\n## Concept \n\nThe idea behind this particular distillation approach is for the student model to learn the distribution of the teacher model while also learning from the original data itself. This is done using an online learning approach (teacher and student are trained simultaneously).\n\n\n## Methodology\n\nIn order for the student model to learn from the teacher model, a combined loss was used for training. 
The first part of the loss is computed on the original data using the loss for unlabeled positive data (implemented in the loss class; the algorithm is described in [3]). The second part is a cross-entropy loss on data labeled by the teacher model. These two losses are combined in a convex combination with the hyperparameter $\\alpha \\in (0,1)$. \u003cbr\u003e\n\nThe student model is trained with the loss described above. The teacher model is trained using the loss for unlabeled positive data. \n\n## Algorithm\n\n**Input:**  \u003cbr\u003e\n           training data (subset of original data); \u003cbr\u003e\n           hyperparameters for loss: $\\alpha \\in (0,1)$, $\\beta \\in (0,1)$; \u003cbr\u003e\n           hyperparameters for epochs: meta_epoch, teacher_epoch, student_epoch; \u003cbr\u003e\n           models: student model **S**, teacher model **T** (both untrained) \u003cbr\u003e\n           \n**Output:**  \u003cbr\u003e\n           trained models **S** and **T**\n\n1. **For** each meta_epoch **do**:\n2. \u003e **For** each teacher_epoch **do**: \n3. \u003e\u003e Train teacher model with training data\n4. \u003e **For** each student_epoch **do**:\n5. \u003e\u003e Shuffle data and take a batch for training iteration.\n6. \u003e\u003e Split batch into two disjoint data sets $data_s$ and $data_t$ with $n_{data_s} = \\beta * n_{data}$ and $n_{data_t} = (1-\\beta) * n_{data}$\n7. \u003e\u003e Make predictions with **T** for $data_t$\n8. \u003e\u003e Train **S** with both data sets (use predictions from **T** for $data_t$ and true labels for $data_s$) using a combined loss weighted with $\\alpha$ for $L_t$ and $(1- \\alpha)$ for $L_s$\n9. Save **S** and **T**\n\n\n## Code Structure\n\nIn the folder src one can find the folders distillation, loss, models and visualization. \u003cbr\u003e\nAll losses are implemented in the folder loss. \u003cbr\u003e\nThe training and the distillation algorithm are located in the folder distillation. 
\u003cbr\u003e\nThe models used for this work can be found in the folder models. \u003cbr\u003e\nThe config file (hyperparameters.yml) can be found in the folder config. \u003cbr\u003e\nResults and graphics can be found in the Wiki part of this GitHub repository.\n\n## Reproduce our results\n\nIn order to reproduce our results, adjust the file run_file_google_colab.py in the folder 'notebooks' with your own data path and git key, and run it on Google Colab. If you are not using Google Colab, please execute:\n1. \u003e `!python3 -m pip install -r requirements.txt`\n2. \u003e \\# Adjust your path to data (e.g. connect to Google Drive)\n3. \u003e `os.chdir('./src')`\n4. \u003e `from main import main`\n5. \u003e `\n        main({\n            'config_path' :'/Team09AppliedDL/config/hyperparameters.yml',\n            'data_path' : 'path to data',\n            'wandb' : True})\n     `\n\n## References\n[1] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, 2015. *Distilling the Knowledge in a Neural Network*. https://arxiv.org/abs/1503.02531 \u003cbr\u003e\n[2] Jianping Gou, Baosheng Yu, Stephen J. Maybank, Dacheng Tao, 2021. *Knowledge Distillation: A Survey*. https://arxiv.org/abs/2006.05525v7 \u003cbr\u003e\n[3] Guangxin Su, Weitong Chen, Miao Xu, 2021. *Positive-Unlabeled Learning from Imbalanced Data*. 
https://www.ijcai.org/proceedings/2021/412 \u003cbr\u003e\n\nProject Organization\n------------\n\n    ├── config \n    │   ├── hyperparameters.yml  \u003c- YML-File for hyperparameters and model specification.\n    │    \n    ├── notebooks          \u003c- Jupyter notebooks.\n    │   ├── run_file_google_colab.ipynb  \u003c- Notebook for running the code on GoogleColab\n    │\n    ├── reports            \u003c- folder with images containing reported final results           \n    │   ├── figures        \u003c- folder with figures of the results with seed 123\n    │   │   ├── auc_student_test.png\n    │   │   ├── auc_student_train.png\n    │   │   ├── auc_teacher_test.png\n    │   │   ├── auc_teacher_test.png\n    │   │\n    │   ├── tables         \u003c- folder with results of seed=123\n    │   │   ├── adl_seed_123.txt\n    │\n    ├── src                \u003c- Source code to use in this project\n    │   │\n    │   ├── data           \u003c- Scripts to preprocess data\n    │   │   ├── Dataset.py        \u003c- Script for data preparation (read in and one hot encoding of the original dataset) + preparation of data for DNABert\n    │   │   ├── make_dataset.py   \u003c- Script for generating random test data\n    │   │\n    │   ├── distillation          \u003c- Script for the distillation class with evaluation and train loop\n    │   │   ├── Distillation.py\n    │   │   ├── Train.py\n    │\n    │   ├── loss  \n    │   │   ├── DistillationLoss.py  \u003c- Script for our distillation loss\n    │   │   ├── ImbalancedLoss.py    \u003c- Script for the imbalanced loss\n    │\n    │   ├── models             \u003c- Scripts for teacher and student models\n    │   │   ├── Students.py    \u003c- transformer and mlp's\n    │   │   ├── Teachers.py    \u003c- mlp's\n    │\n    │   ├── ConfigReader.py    \u003c- Script to read configurations\n    │   ├── Logger.py          \u003c- Script for including mlflow and wandb\n    │   ├── main.py            \u003c- Launch script\n    │\n    
├── tests\n    │   ├── test_ImbalancedLoss.py   \u003c- testing the imbalanced loss for correct parameters, properties, type and shape of outputs and for correct behaviour\n    │   \n    ├── LICENSE\n    ├── Makefile               \u003c- Makefile with commands like `make data` or `make train`\n    ├── README.md              \u003c- The top-level README for developers using this project.\n    ├── requirements.txt       \u003c- The requirements file for reproducing the analysis environment and installing all required packages\n    ├── setup.py               \u003c- makes project pip installable (pip install -e .) so src can be imported\n    ├── test_environment.py    \u003c- Test for correct Python version\n    ├── tox.ini                \u003c- Tox file. Run for tests and linting\n   \n \n\n--------\n\n\u003cp\u003e\u003csmall\u003eProject based on the \u003ca target=\"_blank\" href=\"https://drivendata.github.io/cookiecutter-data-science/\"\u003ecookiecutter data science project template\u003c/a\u003e. #cookiecutterdatascience\u003c/small\u003e\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flazerlambda%2Fteam09applieddl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flazerlambda%2Fteam09applieddl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flazerlambda%2Fteam09applieddl/lists"}