{"id":28156115,"url":"https://github.com/igopalakrishna/dyt-nonorm-llms-rewild","last_synced_at":"2025-07-03T06:04:59.766Z","repository":{"id":292467595,"uuid":"981001395","full_name":"igopalakrishna/DyT-NoNorm-LLMs-REWILD","owner":"igopalakrishna","description":"Replacing LayerNorm with Dynamic Tanh (DyT) in DistilGPT2 + LoRA, evaluated on RE-WILD, Alpaca, and ShareGPT.","archived":false,"fork":false,"pushed_at":"2025-05-10T06:26:06.000Z","size":0,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-10T06:28:59.673Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/igopalakrishna.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-10T05:48:32.000Z","updated_at":"2025-05-10T06:26:09.000Z","dependencies_parsed_at":"2025-05-10T06:39:07.202Z","dependency_job_id":null,"html_url":"https://github.com/igopalakrishna/DyT-NoNorm-LLMs-REWILD","commit_stats":null,"previous_names":["igopalakrishna/dyt-nonorm-llms-rewild"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/igopalakrishna/DyT-NoNorm-LLMs-REWILD","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igopalakrishna%2FDyT-NoNorm-LLMs-REWILD","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igopalakrishna%2FDyT-NoNorm-LLMs-REWILD/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igopalakrishna%2F
DyT-NoNorm-LLMs-REWILD/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igopalakrishna%2FDyT-NoNorm-LLMs-REWILD/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/igopalakrishna","download_url":"https://codeload.github.com/igopalakrishna/DyT-NoNorm-LLMs-REWILD/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igopalakrishna%2FDyT-NoNorm-LLMs-REWILD/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263271502,"owners_count":23440396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-15T07:15:35.447Z","updated_at":"2025-07-03T06:04:59.743Z","avatar_url":"https://github.com/igopalakrishna.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Fine-Tuning LLMs Without Normalization Layers: A DyT-Based Approach Using RE-WILD\n\nThis repository contains the codebase, results, and plots for our final project in **ECE-GY 9143: High-Performance Machine Learning (HPML)** at NYU.\n\n**Team:**\n\n* Richard Zhong ([rhz2020@nyu.edu](mailto:rhz2020@nyu.edu))\n* Gopala Krishna Abba ([ga2664@nyu.edu](mailto:ga2664@nyu.edu))\n\n---\n\n## Problem Overview\n\nPost-training large LLMs is computationally expensive, and normalization layers like LayerNorm add complexity to training and inference. 
We investigate whether these layers can be replaced with a simpler alternative, **Dynamic Tanh (DyT)**, while maintaining performance.\n\n---\n\n## Motivation\n\n- **Challenge**: Fine-tuning large LLMs is expensive, and normalization layers like LayerNorm add architectural and runtime complexity.\n- **Goal**: Explore whether **DyT (Dynamic Tanh)** can replace LayerNorm and still allow effective post-training.\n- **Setup**: DistilGPT2 + PEFT (LoRA), trained across the Alpaca, ShareGPT, and RE-WILD datasets.\n\n---\n\n## Key Contributions\n\n* Replaced all `LayerNorm` layers in DistilGPT2 and Pythia with a learnable **Dynamic Tanh (DyT)** activation, `DyT(x) = tanh(\\alpha x)`, where `\\alpha` is learnable\n* Integrated **LoRA (Low-Rank Adaptation)** via HuggingFace PEFT to enable parameter-efficient fine-tuning\n* Explored:\n\n  * Fully frozen DyT\n  * **Selective unfreezing** of DyT layers\n  * **Full supervised fine-tuning (SFT)**\n* Fine-tuned and evaluated across the **Alpaca**, **ShareGPT**, and **RE-WILD** datasets\n\n---\n\n## Experimental Setup\n\n**Models:**\n\n* DistilGPT2 (80M)\n* Pythia 410M (limited runs due to memory constraints)\n\n**Frameworks and Hardware:**\n\n* HuggingFace Transformers\n* PEFT (LoRA)\n* Colab Pro and NYU HPC (A100)\n\n**Datasets:**\n\n* Alpaca: Small-scale instruction tuning (\\~52k examples)\n* ShareGPT: Medium-scale real dialogue (\\~90k examples)\n* RE-WILD: Open-ended QA (\\~35k examples used due to constraints)\n\n**Logged:**\n\n* Training and validation loss every 500 steps\n* Model responses to prompts\n* Inference time (Vanilla vs DyT)\n\n---\n\n## Key Results\n\n| Dataset  | DyT Val Loss | Vanilla Val Loss | Loss Gap |\n| -------- | ------------ | ---------------- | -------- |\n| Alpaca   | \\~8.3        | \\~1.5            | 🔺6.8    |\n| ShareGPT | \\~8.3        | \\~2.3            | 🔺6.0    |\n| RE-WILD  | \\~8.3        | \\~0.9            | 🔺7.4    |\n\n* **Inference Time**: DyT = 77.05 s, Vanilla = 77.46 s → \\~0.5% speedup\n* **Prompt Quality**: DyT generates literal, unstructured completions; vanilla preserves 
instruction-following and formatting better\n\n---\n\n## Repository Structure\n\n```\n├── data_utils/               # Dataset preprocessing, e.g. ShareGPT JSON\n├── notebooks/                # Training notebooks for all setups\n├── scripts/                  # Executable training scripts (.py)\n├── results/                  # Saved checkpoints\n├── plots/                    # Visualizations and graphs\n├── report/Presentation.pdf   # Final submitted report\n└── README.md                 # You're here\n```\n\n---\n\n## Workflow\n\n![Workflow Diagram](plots/workflow.png)\n\n---\n\n## Experimental Results\n\n### 1. RE-WILD (Selective DyT Unfreezing)\n\n![RE-WILD](plots/DistilGPT2%20%2B%20LoRA%20on%20RE-WILD%20DyT%20(Selective%20Unfreeze)%20vs%20Vanilla.png)\n\n\u003e With selective unfreezing, DyT's validation loss stagnated at ~8.3 while vanilla continued to converge. This suggests that DyT struggles under LoRA on high-entropy datasets.\n\n---\n\n### 2. ShareGPT\n\n![ShareGPT](plots/DistilGPT2%20Fine-Tuning%20on%20ShareGPT%20DyT%20vs%20Vanilla.png)\n\n\u003e DyT (blue/orange) converges more slowly and to a higher loss than vanilla. Simulated vanilla training reaches ~2.0 loss with stable gradients, demonstrating the benefits of LayerNorm.\n\n---\n\n### 3. Alpaca\n\n![Alpaca](plots/Loss%20Comparison%20%20DyT%20vs%20Vanilla%20DistilGPT2.png)\n\n\u003e On a smaller instruction corpus, DyT retains basic convergence but exhibits noisier gradients and a wider generalization gap than vanilla.\n\n---\n\n### 4. MT-Bench Inference Comparison\n\n![Inference Time](plots/Inference%20times.png)\n\n\u003e DyT showed **~0.5% faster inference**, but its outputs were drastically less preferred when judged on MT-Bench.\n\n---\n\n### 5. Pythia 410M: Train Loss\n\n![Pythia Loss](plots/train%20loss.png)\n\n\u003e Larger models benefit more from DyT. The loss gap between DyT and vanilla shrinks as model scale increases.\n\n---\n\n### 6. 
Gradient Norm (Pythia)\n\n![Gradient Norm](plots/train_grad_norm.png)\n\n\u003e DyT yields smoother gradients than noisy LayerNorm-free baselines, but requires more careful tuning of α.\n\n---\n\n### 7. Token Accuracy\n\n![Token Accuracy](plots/trainmean_token_accuracy.png)\n\n\u003e Vanilla maintains higher accuracy throughout training, but DyT still improves token-level predictions, especially in larger models.\n\n---\n\n## Repository Structure\n\n```\nDyT-NoNorm-LLMs-REWILD/\n├── notebooks/               # Jupyter notebooks for each experiment\n├── scripts/                 # Training scripts (vanilla, DyT, selective unfreeze)\n├── data_utils/              # Tokenizer, formatting, and dataset cleaning\n├── results/                 # Raw loss logs and saved metrics\n├── plots/                   # All graphs used in our report \u0026 slides\n├── report/                  # Presentation slides (HPML_Presentation.pdf)\n└── README.md\n```\n\n---\n\n## How to Run This Project\n\n### Step 1: Install Requirements\n\nInstall the necessary Python packages:\n\n```bash\npip install -r requirements.txt\n```\n\n### Step 2: Run the Notebooks\n\nNavigate to the `notebooks/` folder and run the following Jupyter notebooks in the recommended order:\n\n1. **Benchmarks.ipynb**\n   ⤷ Overview and comparison plots between DyT and LayerNorm across datasets\n\n2. **modReWILDcreate.ipynb**\n   ⤷ Prepares and reformats the RE-WILD dataset from HuggingFace JSON\n\n3. **pythia17m.ipynb**\n   ⤷ Fine-tunes the DyT-modified Pythia-17M model\n\n4. **pythia410m.ipynb**\n   ⤷ Fine-tunes the DyT-modified Pythia-410M model\n\n5. **train\\_alpaca\\_distillgpt2.ipynb**\n   ⤷ Fine-tunes DyT-based DistilGPT2 on the Alpaca dataset\n\n6. **train\\_alpaca\\_distillgpt2\\_vanilla.ipynb**\n   ⤷ Fine-tunes baseline DistilGPT2 (LayerNorm) on the Alpaca dataset\n\n7. **train\\_sharegpt.ipynb**\n   ⤷ Trains DyT vs. vanilla on ShareGPT conversational data\n\n8. 
**train\\_selective\\_unfreeze\\_rewild.ipynb**\n   ⤷ Fine-tunes DyT with selective unfreezing on RE-WILD\n\nEach notebook includes inline comments and cell outputs for reproducibility.\nIf you're running on Colab or an HPC cluster, select an appropriate GPU runtime (A100 recommended).\n\nFor best results, execute all training notebooks sequentially and compare metrics in `Benchmarks.ipynb`.\n\nThese notebooks can be run using JupyterLab, VS Code, or Google Colab.\n\n---\n\n## Dependencies\n\n- `transformers`\n- `datasets`\n- `peft`\n- `torch`\n- `scipy`, `matplotlib`, `numpy`\n\n---\n\n## Observations\n\n* DyT struggles to generalize without normalization layers, especially on larger, diverse corpora like RE-WILD\n* Selective unfreezing helps, but the performance gap remains significant\n* Vanilla DistilGPT2 shows clean convergence; DyT plateaus at a high loss\n* Full SFT improves DyT, but undermines the advantages of PEFT\n\n---\n\n## Slides \u0026 Report\n\n* [HPML Final Slides (PDF)](./report/Presentation.pdf)\n\n---\n\n## Future Work\n\n* Try DyT with **LLaMA 3.2B** using larger batch sizes\n* Compare DyT against alternative norm-replacement functions\n* Integrate DyT into **quantized** or **sparsely activated** LLMs\n\n---\n\n## Acknowledgements\n\n- HuggingFace Transformers \u0026 Datasets\n- Colab Pro for GPU access\n- HPML course instructors for project guidance\n\n---\n\n## License\n\nThis project is part of academic coursework at NYU and is released for research and educational use only.\n\n---\n\n## Contact\n\nFor questions or collaborations, reach out to:\n\n* Richard Zhong: [rhz2020@nyu.edu](mailto:rhz2020@nyu.edu)\n* Gopala Krishna Abba: 
[ga2664@nyu.edu](mailto:ga2664@nyu.edu)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Figopalakrishna%2Fdyt-nonorm-llms-rewild","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Figopalakrishna%2Fdyt-nonorm-llms-rewild","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Figopalakrishna%2Fdyt-nonorm-llms-rewild/lists"}