https://github.com/igopalakrishna/dyt-nonorm-llms-rewild
Replacing LayerNorm with Dynamic Tanh (DyT) in DistilGPT2 + LoRA, evaluated on RE-WILD, Alpaca, and ShareGPT.
- Host: GitHub
- URL: https://github.com/igopalakrishna/dyt-nonorm-llms-rewild
- Owner: igopalakrishna
- License: mit
- Created: 2025-05-10T05:48:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-10T06:26:06.000Z (about 1 year ago)
- Last Synced: 2025-05-10T06:28:59.673Z (about 1 year ago)
- Language: Jupyter Notebook
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Fine-Tuning LLMs Without Normalization Layers: A DyT-Based Approach Using RE-WILD
This repository contains the codebase, results, and plots for our final project in **ECE-GY 9143: High-Performance Machine Learning (HPML)** at NYU.
**Team:**
* Richard Zhong ([rhz2020@nyu.edu](mailto:rhz2020@nyu.edu))
* Gopala Krishna Abba ([ga2664@nyu.edu](mailto:ga2664@nyu.edu))
---
## Problem Overview
Post-training LLMs is computationally expensive, and normalization layers such as LayerNorm add complexity to both training and inference. We investigate whether these layers can be replaced with a simpler alternative, **Dynamic Tanh (DyT)**, while maintaining performance.
---
## Motivation
- **Challenge**: Fine-tuning LLMs is expensive, and normalization layers such as LayerNorm add architectural and runtime complexity.
- **Goal**: Explore whether **DyT (Dynamic Tanh)** can replace LayerNorm and still allow effective post-training.
- **Setup**: DistilGPT2 + PEFT (LoRA), trained across Alpaca, ShareGPT, and RE-WILD datasets.
---
## Key Contributions
* Replaced all `LayerNorm` layers in DistilGPT2 and Pythia with a learnable **Dynamic Tanh (DyT)** activation, `DyT(x) = tanh(αx)`, where `α` is a learnable scalar
* Integrated **LoRA (Low-Rank Adaptation)** via HuggingFace PEFT to enable parameter-efficient fine-tuning
* Explored:
  * Fully frozen DyT
  * **Selective unfreezing** of DyT layers
  * **Full supervised fine-tuning (SFT)**
* Fine-tuned and evaluated across **Alpaca**, **ShareGPT**, and **RE-WILD** datasets
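The layer swap and the frozen / selectively unfrozen variants listed above can be sketched in PyTorch as follows. This is a minimal illustration that assumes the scalar form `tanh(αx)` given in this README, with one `α` per replaced layer. The names `DyT`, `replace_layernorm_with_dyt`, and `set_dyt_trainable`, the initial value `0.5`, and the toy model are assumptions made for this example and are not copied from the notebooks.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Element-wise tanh(alpha * x) with a single learnable scalar alpha."""

    def __init__(self, init_alpha: float = 0.5):  # 0.5 is an assumed starting value
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(init_alpha)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.alpha * x)


def replace_layernorm_with_dyt(module: nn.Module, init_alpha: float = 0.5) -> int:
    """Recursively swap every nn.LayerNorm child for a DyT; return the swap count."""
    swapped = 0
    for name, child in list(module.named_children()):
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(init_alpha))
            swapped += 1
        else:
            swapped += replace_layernorm_with_dyt(child, init_alpha)
    return swapped


def set_dyt_trainable(module: nn.Module, trainable: bool,
                      name_filter=lambda name: True) -> None:
    """Freeze or unfreeze alpha on the DyT layers whose module name passes the filter."""
    for name, m in module.named_modules():
        if isinstance(m, DyT) and name_filter(name):
            m.alpha.requires_grad_(trainable)


# Toy stand-in for a transformer (hypothetical, so the sketch runs without downloads).
toy = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 8), nn.LayerNorm(8))
n_swapped = replace_layernorm_with_dyt(toy)
out = toy(torch.randn(2, 8))

# "Fully frozen DyT", then selective unfreezing of the last DyT layer only.
set_dyt_trainable(toy, False)
set_dyt_trainable(toy, True, name_filter=lambda name: name == "3")
trainable_alphas = [n for n, p in toy.named_parameters()
                    if n.endswith("alpha") and p.requires_grad]
print(n_swapped, trainable_alphas)
```

The same `replace_layernorm_with_dyt` call should apply to any `nn.Module` that uses `nn.LayerNorm`, but which layers to unfreeze is an experiment-specific choice that this sketch does not prescribe.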
---
## Experimental Setup
**Models:**
* DistilGPT2 (\~82M parameters)
* Pythia 410M (limited due to memory)
**Frameworks:**
* HuggingFace Transformers
* PEFT (LoRA)
* Colab Pro and NYU HPC (A100)
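To give a sense of why LoRA counts as parameter-efficient here, the arithmetic below counts the extra trainable weights LoRA would add to DistilGPT2's fused attention projection. It is a back-of-the-envelope sketch: the rank `8` and the choice of `c_attn` as the only adapted layer are assumptions for illustration, not the settings used in the notebooks. The model dimensions (6 blocks, hidden size 768, `c_attn` mapping 768 to 2304) are those of the public DistilGPT2 checkpoint.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two low-rank factors, A (rank x d_in) and B (d_out x rank),
    next to a frozen d_in x d_out weight, i.e. rank * (d_in + d_out) new weights."""
    return rank * (d_in + d_out)


N_LAYERS = 6    # transformer blocks in DistilGPT2
D_MODEL = 768   # hidden size
RANK = 8        # assumed LoRA rank, for illustration only

# The fused query/key/value projection `c_attn` maps 768 -> 3 * 768 = 2304.
per_layer = lora_extra_params(D_MODEL, 3 * D_MODEL, RANK)
total = N_LAYERS * per_layer
frozen_qkv = N_LAYERS * D_MODEL * 3 * D_MODEL  # weights LoRA leaves untouched

print(f"LoRA weights per block: {per_layer}")
print(f"LoRA weights in total:  {total}")
print(f"Fraction of frozen QKV weights: {total / frozen_qkv:.4f}")
```

Under these assumptions the adapters amount to roughly 1.4% of the projection weights they sit beside, which is the trade-off full SFT gives up.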
**Datasets:**
* Alpaca: Small-scale instruction tuning (\~52k)
* ShareGPT: Medium-scale real dialogue (\~90k)
* RE-WILD: Open-ended QA (\~35k used due to constraints)
**Logged:**
* Training and validation loss every 500 steps
* Prompt response outputs
* Inference time (Vanilla vs DyT)
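A wall-clock comparison like the one logged above can be taken with a small timing helper. The sketch below uses only the standard library. `time_calls` and the stand-in workload are hypothetical names for this example, and the notebooks may measure inference time differently.

```python
import time
from statistics import mean
from typing import Callable


def time_calls(fn: Callable[[], object], n_runs: int = 5, warmup: int = 1) -> float:
    """Return the mean wall-clock seconds per call of `fn` over `n_runs` runs."""
    for _ in range(warmup):  # warm-up calls are excluded from the average
        fn()
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return mean(durations)


def stand_in_workload() -> int:
    """Placeholder for a model.generate(...) call on a fixed prompt set."""
    return sum(i * i for i in range(50_000))


mean_seconds = time_calls(stand_in_workload, n_runs=3)
print(f"mean seconds per call: {mean_seconds:.4f}")
```

When timing on a GPU, the callable should wait for queued kernels to finish (for example by calling `torch.cuda.synchronize()` before returning), otherwise the measured time can understate the real cost.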
---
## Key Results
| Dataset | DyT Val Loss | Vanilla Val Loss | Loss Gap |
| -------- | ------------ | ---------------- | -------- |
| Alpaca | \~8.3 | \~1.5 | 🔺6.8 |
| ShareGPT | \~8.3 | \~2.3 | 🔺6.0 |
| RE-WILD | \~8.3 | \~0.9 | 🔺7.4 |
* **Inference Time**: DyT = 77.05s, Vanilla = 77.46s, a \~0.5% speedup
* **Prompt Quality**: DyT generates literal, unstructured completions; vanilla preserves instruction-following and formatting better
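The loss gaps and the speedup figure follow directly from the numbers reported above. The short check below recomputes them; the variable names are illustrative, and the inputs are the approximate values from the table, not raw logs.

```python
# (DyT validation loss, vanilla validation loss), as reported in the table above.
val_loss = {
    "Alpaca": (8.3, 1.5),
    "ShareGPT": (8.3, 2.3),
    "RE-WILD": (8.3, 0.9),
}

# Loss gap = DyT loss minus vanilla loss, rounded to the table's precision.
gaps = {name: round(dyt - vanilla, 1) for name, (dyt, vanilla) in val_loss.items()}

# Relative speedup of DyT over vanilla, from the reported inference times in seconds.
dyt_seconds, vanilla_seconds = 77.05, 77.46
speedup_pct = (vanilla_seconds - dyt_seconds) / vanilla_seconds * 100

for name, gap in gaps.items():
    print(f"{name}: gap = {gap}")
print(f"speedup = {speedup_pct:.2f}%")
```

The 0.41 s difference is small relative to the 77 s total, so repeated runs would be needed before treating it as a reliable speedup.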
---
## Repository Structure
```bash
├── data_utils/                  # Dataset preprocessing, e.g. ShareGPT JSON
├── notebooks/                   # Training notebooks for all setups
├── scripts/                     # Executable training scripts (.py)
├── results/                     # Saved checkpoints
├── plots/                       # Visualizations and graphs
├── report/Presentation.pdf      # Final submitted report
└── README.md                    # You're here
```
---
## Workflow

---
## Experimental Results
### 1. RE-WILD (Selective DyT Unfreezing)
> DyT with selective unfreezing stagnated at a validation loss of ~8.3, while vanilla continued to converge. This suggests that DyT struggles under LoRA on high-entropy datasets.
---
### 2. ShareGPT

> DyT (blue/orange) converges more slowly and to a higher loss than vanilla. Simulated vanilla training reaches ~2.0 loss with stable gradients, demonstrating the benefits of LayerNorm.
---
### 3. Alpaca

> On a smaller instruction corpus, DyT retains basic convergence but exhibits noisier gradients and a wider generalization gap than vanilla.
---
### 4. MT-Bench Inference Comparison

> DyT showed **\~0.5% faster inference** but was preferred far less often on MT-Bench judged outputs.
---
### 5. Pythia 410M: Train Loss

> Larger models benefit more from DyT. Loss offset between DyT and vanilla reduces with model scale.
---
### 6. Gradient Norm (Pythia)

> DyT introduces smoother gradients compared to noisy LayerNorm-free baselines, but requires tighter α tuning.
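
One way to see why α needs care, assuming the scalar form `DyT(x) = tanh(αx)` used in this README (a derivation sketch, not a measurement from our runs), is to look at the two gradients that flow through a DyT layer:

$$
\frac{\partial}{\partial x}\tanh(\alpha x) = \alpha\,\bigl(1 - \tanh^2(\alpha x)\bigr),
\qquad
\frac{\partial}{\partial \alpha}\tanh(\alpha x) = x\,\bigl(1 - \tanh^2(\alpha x)\bigr)
$$

Both share the factor $1 - \tanh^2(\alpha x)$, which tends to 0 as $|\alpha x|$ grows. If α is too large for the activation scale, the layer saturates and both gradients shrink. If α is too small, the layer behaves like the near-linear map $\alpha x$ and does little to bound large activations.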
---
### 7. Token Accuracy

> Vanilla maintains higher accuracy over training, but DyT still improves token-level predictions, especially in larger models.
---
## Repository Structure
```
DyT-NoNorm-LLMs-REWILD/
├── notebooks/      # Jupyter notebooks for each experiment
├── scripts/        # Training scripts (vanilla, DyT, selective unfreeze)
├── data_utils/     # Tokenizer, formatting, and dataset cleaning
├── results/        # Raw loss logs and saved metrics
├── plots/          # All graphs used in our report & slides
├── report/         # Presentation slides (HPML_Presentation.pdf)
└── README.md
```
---
### How to Run This Project
#### Step 1: Install Requirements
Install the necessary Python packages:
```bash
pip install -r requirements.txt
```
#### Step 2: Run the Notebooks
Navigate to the `notebooks/` folder and run the following Jupyter notebooks in the recommended order:
1. **Benchmarks.ipynb**
   ⤷ Overview and comparison plots between DyT and LayerNorm across datasets
2. **modReWILDcreate.ipynb**
   ⤷ Prepares and reformats RE-WILD dataset from HuggingFace JSON
3. **pythia17m.ipynb**
   ⤷ Fine-tuning DyT-modified Pythia-17M model
4. **pythia410m.ipynb**
   ⤷ Fine-tuning DyT-modified Pythia-410M model
5. **train\_alpaca\_distillgpt2.ipynb**
   ⤷ Fine-tunes DyT-based DistilGPT2 on the Alpaca dataset
6. **train\_alpaca\_distillgpt2\_vanilla.ipynb**
   ⤷ Fine-tunes baseline DistilGPT2 (LayerNorm) on Alpaca
7. **train\_sharegpt.ipynb**
   ⤷ Trains DyT vs. vanilla on ShareGPT conversational data
8. **train\_selective\_unfreeze\_rewild.ipynb**
   ⤷ Selective unfreezing DyT fine-tuning on RE-WILD
Each notebook includes inline comments and cell outputs for reproducibility.
If you're running on Colab or an HPC cluster, select an appropriate GPU runtime (A100 recommended).
For best results, execute all training notebooks sequentially and compare metrics in `Benchmarks.ipynb`.
These notebooks can be run using JupyterLab, VS Code, or Google Colab.
---
## Dependencies
- `transformers`
- `datasets`
- `peft`
- `torch`
- `scipy`, `matplotlib`, `numpy`
---
## Observations
* DyT struggles to generalize without normalization layers, especially on larger, diverse corpora like RE-WILD
* Selective unfreezing helps, but performance gap remains significant
* Vanilla DistilGPT2 shows clean convergence; DyT plateaus at high loss
* Full SFT improves DyT, but undermines PEFT advantages
---
## Slides & Report
* [HPML Final Slides (PDF)](./report/Presentation.pdf)
---
## Future Work
* Try DyT with **LLaMA 3.2B** using larger batch sizes
* Evaluate DyT with alternative norm-replacement functions
* Integrate DyT into **quantized** or **sparsely activated** LLMs
---
## Acknowledgements
- HuggingFace Transformers & Datasets
- Colab Pro for GPU access
- HPML course instructors for project guidance
---
## License
This project is part of academic coursework at NYU and released for research and educational use only.
---
## Contact
For questions or collaborations, reach out to:
* Richard Zhong: [rhz2020@nyu.edu](mailto:rhz2020@nyu.edu)
* Gopala Krishna Abba: [ga2664@nyu.edu](mailto:ga2664@nyu.edu)