https://github.com/igopalakrishna/dyt-nonorm-llms-rewild

Replacing LayerNorm with Dynamic Tanh (DyT) in DistilGPT2 + LoRA, evaluated on RE-WILD, Alpaca, and ShareGPT.

# Fine-Tuning LLMs Without Normalization Layers: A DyT-Based Approach Using RE-WILD

This repository contains the codebase, results, and plots for our final project in **ECE-GY 9143: High-Performance Machine Learning (HPML)** at NYU.

**Team:**

* Richard Zhong ([rhz2020@nyu.edu](mailto:rhz2020@nyu.edu))
* Gopala Krishna Abba ([ga2664@nyu.edu](mailto:ga2664@nyu.edu))

---

## Problem Overview

Post-training LLMs is computationally expensive, and normalization layers like LayerNorm add complexity to training and inference. We investigate whether these layers can be replaced with a simpler alternative, **Dynamic Tanh (DyT)**, while maintaining performance.

---
## Motivation
- **Challenge**: Fine-tuning LLMs is expensive, and normalization layers like LayerNorm add architectural and runtime complexity.
- **Goal**: Explore whether **DyT (Dynamic Tanh)** can replace LayerNorm and still allow effective post-training.
- **Setup**: DistilGPT2 + PEFT (LoRA), trained across Alpaca, ShareGPT, and RE-WILD datasets.

---
## Key Contributions

* Replaced all `LayerNorm` layers in DistilGPT2 and Pythia with a learnable **Dynamic Tanh (DyT)** activation: `DyT(x) = tanh(αx)`
* Integrated **LoRA (Low-Rank Adaptation)** via HuggingFace PEFT to enable parameter-efficient fine-tuning
* Explored:
  * Fully frozen DyT
  * **Selective unfreezing** of DyT layers
  * **Full supervised fine-tuning (SFT)**
* Fine-tuned and evaluated across **Alpaca**, **ShareGPT**, and **RE-WILD** datasets
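
A minimal PyTorch sketch of the LayerNorm-to-DyT swap described above. It follows the `tanh(αx)` form given in this README, with one learnable scalar `α` per layer. The class name `DyT`, the helper `replace_layernorm_with_dyt`, the initial value `α = 0.5`, and the `ToyBlock` stand-in model are illustrative assumptions, not the repository's actual code. `ToyBlock` borrows the `ln_1`/`ln_2` attribute names of the Hugging Face GPT-2 block only so that the example runs without downloading any weights.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: element-wise tanh(alpha * x) with a learnable scalar alpha."""

    def __init__(self, init_alpha: float = 0.5):  # 0.5 is an assumed starting value
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(init_alpha)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.alpha * x)


def replace_layernorm_with_dyt(module: nn.Module, init_alpha: float = 0.5) -> int:
    """Recursively swap every nn.LayerNorm child for a DyT layer; return the swap count."""
    replaced = 0
    # Copy the children into a list so the module can be modified while looping.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(init_alpha))
            replaced += 1
        else:
            replaced += replace_layernorm_with_dyt(child, init_alpha)
    return replaced


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block (not DistilGPT2 itself)."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.ln_1 = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)
        self.ln_2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln_2(self.mlp(self.ln_1(x)))


model = ToyBlock()
num_replaced = replace_layernorm_with_dyt(model)
output = model(torch.randn(2, 8))
print(num_replaced, tuple(output.shape))
```

Because the helper only checks for `nn.LayerNorm` instances, the same call should also apply to a Hugging Face GPT-2 style model. Note that in this sketch the swap discards the pretrained LayerNorm scale and bias, since the `tanh(αx)` form has no counterpart for them.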

---

## Experimental Setup

**Models:**

* DistilGPT2 (80M)
* Pythia 410M (limited due to memory)

**Frameworks:**

* HuggingFace Transformers
* PEFT (LoRA)
* Colab Pro and NYU HPC (A100)

**Datasets:**

* Alpaca: Small-scale instruction tuning (\~52k)
* ShareGPT: Medium-scale real dialogue (\~90k)
* RE-WILD: Open-ended QA (\~35k used due to constraints)

**Logged:**

* Training and validation loss every 500 steps
* Prompt response outputs
* Inference time (Vanilla vs DyT)

---

## Key Results

| Dataset | DyT Val Loss | Vanilla Val Loss | Loss Gap |
| -------- | ------------ | ---------------- | -------- |
| Alpaca | \~8.3 | \~1.5 | 🔺6.8 |
| ShareGPT | \~8.3 | \~2.3 | 🔺6.0 |
| RE-WILD | \~8.3 | \~0.9 | 🔺7.4 |

* **Inference Time**: DyT = 77.05s, Vanilla = 77.46s → \~0.5% speedup
* **Prompt Quality**: DyT generates literal, unstructured completions; vanilla preserves instruction-following and formatting better
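
The derived figures in this section can be re-checked with a few lines of Python. The sketch below recomputes the loss gaps and the relative speedup from the rounded values reported above. The perplexity conversion additionally assumes the losses are mean per-token cross-entropy in nats, which this README does not state explicitly.

```python
import math

# Approximate validation losses copied from the table: (DyT, vanilla).
val_loss = {
    "Alpaca": (8.3, 1.5),
    "ShareGPT": (8.3, 2.3),
    "RE-WILD": (8.3, 0.9),
}

# Loss gap per dataset, rounded to the precision used in the table.
gaps = {name: round(dyt - vanilla, 1) for name, (dyt, vanilla) in val_loss.items()}

for name, (dyt, vanilla) in val_loss.items():
    # exp(loss) equals perplexity only if the loss is mean cross-entropy in nats (assumption).
    print(f"{name}: gap={gaps[name]:.1f}, "
          f"perplexity DyT~{math.exp(dyt):.0f}, vanilla~{math.exp(vanilla):.1f}")

# Wall-clock inference times in seconds, as reported above.
dyt_seconds, vanilla_seconds = 77.05, 77.46
speedup_pct = (vanilla_seconds - dyt_seconds) / vanilla_seconds * 100
print(f"DyT relative speedup: {speedup_pct:.2f}%")
```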
---

## Repository Structure
```bash
├── data_utils/                  # Dataset preprocessing, e.g. ShareGPT JSON
├── notebooks/                   # Training notebooks for all setups
├── scripts/                     # Executable training scripts (.py)
├── results/                     # Saved checkpoints
├── plots/                       # Visualizations and graphs
├── report/Presentation.pdf      # Final submitted report
└── README.md                    # You're here
```

---
## Workflow
![Workflow Diagram](plots/workflow.png)

---
## Experimental Results

### 1. RE-WILD (Selective DyT Unfreezing)

![RE-WILD](plots/DistilGPT2%20%2B%20LoRA%20on%20RE-WILD%20DyT%20(Selective%20Unfreeze)%20vs%20Vanilla.png)

> DyT with selective unfreezing stagnated at a validation loss of ~8.3, while vanilla continued to converge. This suggests DyT struggles under LoRA on high-entropy datasets.

---

### 2. ShareGPT

![ShareGPT](plots/DistilGPT2%20Fine-Tuning%20on%20ShareGPT%20DyT%20vs%20Vanilla.png)

> DyT (blue/orange) converges more slowly and reaches a higher loss than vanilla. Simulated vanilla training reaches a loss of ~2.0 with stable gradients, demonstrating the benefits of LayerNorm.

---

### 3. Alpaca

![Alpaca](plots/Loss%20Comparison%20%20DyT%20vs%20Vanilla%20DistilGPT2.png)

> On a smaller instruction corpus, DyT retains basic convergence but exhibits noisy gradients and a wider generalization gap than vanilla.

---

### 4. MT-Bench Inference Comparison

![Inference Time](plots/Inference%20times.png)

> DyT showed **~0.5% faster inference** but was preferred far less often on MT-Bench-judged outputs.

---

### 5. Pythia 410M: Train Loss

![Pythia Loss](plots/train%20loss.png)

> Larger models benefit more from DyT. The loss offset between DyT and vanilla shrinks as model scale increases.

---

### 6. Gradient Norm (Pythia)

![Gradient Norm](plots/train_grad_norm.png)

> DyT introduces smoother gradients compared to noisy LayerNorm-free baselines, but requires tighter α tuning.

---

### 7. Token Accuracy

![Token Accuracy](plots/trainmean_token_accuracy.png)

> Vanilla maintains higher accuracy over training, but DyT still improves token-level predictions, especially in larger models.
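
For reference, token-level accuracy for a causal language model is typically the fraction of non-masked target positions where the highest-scoring token equals the label. The sketch below illustrates that definition in plain Python. The function name `mean_token_accuracy` and the ignore index of `-100` (the usual convention for masked labels in PyTorch and Hugging Face training code) are assumptions, and the exact metric behind the plot above may differ.

```python
from typing import Sequence

IGNORE_INDEX = -100  # assumed label value for positions excluded from the metric


def mean_token_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of non-ignored positions whose argmax prediction matches the label.

    `logits[t]` holds the vocabulary scores used to predict `labels[t]`, i.e. the
    one-position shift of causal language modeling is assumed to be applied already.
    """
    correct = 0
    counted = 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # masked position, e.g. prompt or padding tokens
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
        counted += 1
    return correct / counted if counted else 0.0


# Toy example with a vocabulary of three tokens; the last position is masked.
toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 3.0], [0.9, 0.8, 0.7]]
toy_labels = [1, 2, 2, IGNORE_INDEX]
accuracy = mean_token_accuracy(toy_logits, toy_labels)
print(f"{accuracy:.3f}")
```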
---

## Repository Structure

```
DyT-NoNorm-LLMs-REWILD/
├── notebooks/      # Jupyter notebooks for each experiment
├── scripts/        # Training scripts (vanilla, DyT, selective unfreeze)
├── data_utils/     # Tokenizer, formatting, and dataset cleaning
├── results/        # Raw loss logs and saved metrics
├── plots/          # All graphs used in our report & slides
├── report/         # Presentation slides (HPML_Presentation.pdf)
└── README.md
```

---

## How to Run This Project

### Step 1: Install Requirements

Install the necessary Python packages:

```bash
pip install -r requirements.txt
```

### Step 2: Run the Notebooks

Navigate to the `notebooks/` folder and run the following Jupyter notebooks in the recommended order:

1. **Benchmarks.ipynb**
   ⤷ Overview and comparison plots between DyT and LayerNorm across datasets

2. **modReWILDcreate.ipynb**
   ⤷ Prepares and reformats RE-WILD dataset from HuggingFace JSON

3. **pythia17m.ipynb**
   ⤷ Fine-tuning DyT-modified Pythia-17M model

4. **pythia410m.ipynb**
   ⤷ Fine-tuning DyT-modified Pythia-410M model

5. **train\_alpaca\_distillgpt2.ipynb**
   ⤷ Fine-tunes DyT-based DistilGPT2 on the Alpaca dataset

6. **train\_alpaca\_distillgpt2\_vanilla.ipynb**
   ⤷ Fine-tunes baseline DistilGPT2 (LayerNorm) on Alpaca

7. **train\_sharegpt.ipynb**
   ⤷ Trains DyT vs. vanilla on ShareGPT conversational data

8. **train\_selective\_unfreeze\_rewild.ipynb**
   ⤷ Selective unfreezing DyT fine-tuning on RE-WILD

Each notebook includes inline comments and cell outputs for reproducibility.
If you're running on Colab or an HPC cluster, select an appropriate GPU runtime (A100 recommended).

For best results, execute all training notebooks sequentially and compare metrics in `Benchmarks.ipynb`.

These notebooks can be run using JupyterLab, VS Code, or Google Colab.

---

## Dependencies
- `transformers`
- `datasets`
- `peft`
- `torch`
- `scipy`, `matplotlib`, `numpy`
---

## Observations

* DyT struggles to generalize without normalization layers, especially on larger, diverse corpora like RE-WILD
* Selective unfreezing helps, but the performance gap remains significant
* Vanilla DistilGPT2 shows clean convergence; DyT plateaus at high loss
* Full SFT improves DyT, but undermines PEFT advantages
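
The trade-off in the last two observations can be made concrete by counting trainable parameters under each regime. The sketch below uses a tiny hypothetical model with a hand-written LoRA layer (not the PEFT library) and the `tanh(αx)` form of DyT. `ToyModel`, `LoRALinear`, `set_regime`, the rank of 2, and the regime names are all illustrative assumptions; only the `lora_A`/`lora_B` parameter naming mirrors PEFT's convention.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """tanh(alpha * x) with a learnable scalar alpha (0.5 is an assumed initial value)."""

    def __init__(self, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.alpha * x)


class LoRALinear(nn.Module):
    """Linear layer plus a low-rank update B @ A (hand-rolled for illustration)."""

    def __init__(self, dim: int, rank: int = 2):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.lora_A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(dim, rank))  # zero init: no change at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T


class ToyModel(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.norm_1 = DyT()
        self.proj = LoRALinear(dim)
        self.norm_2 = DyT()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm_2(self.proj(self.norm_1(x)))


def set_regime(model: nn.Module, regime: str) -> int:
    """Set requires_grad for the given regime; return the trainable parameter count."""
    if regime not in ("frozen_dyt", "selective_unfreeze", "full_sft"):
        raise ValueError(f"unknown regime: {regime}")
    for name, param in model.named_parameters():
        is_lora = "lora_" in name
        is_dyt = name.endswith("alpha")
        if regime == "frozen_dyt":            # LoRA adapters only
            param.requires_grad = is_lora
        elif regime == "selective_unfreeze":  # LoRA adapters plus DyT alphas
            param.requires_grad = is_lora or is_dyt
        else:                                 # full SFT: every weight trains
            param.requires_grad = True
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


model = ToyModel()
counts = {r: set_regime(model, r) for r in ("frozen_dyt", "selective_unfreeze", "full_sft")}
print(counts)
```

In this toy setup the first two regimes train only a small fraction of the weights, while full SFT trains all of them, which is the sense in which full SFT gives up the parameter-efficiency benefit of LoRA.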

---

## Slides & Report

* [HPML Final Slides (PDF)](./report/Presentation.pdf)

---

## Future Work

* Try DyT with **LLaMA 3.2B** using larger batch sizes
* Evaluate DyT with alternative norm-replacement functions
* Integrate DyT into **quantized** or **sparsely activated** LLMs

---
## Acknowledgements
- HuggingFace Transformers & Datasets
- Colab Pro for GPU access
- HPML course instructors for project guidance

---

## License
This project is part of academic coursework at NYU and is released for research and educational use only.

---

## Contact

For questions or collaborations, reach out to:

* Richard Zhong: [rhz2020@nyu.edu](mailto:rhz2020@nyu.edu)
* Gopala Krishna Abba: [ga2664@nyu.edu](mailto:ga2664@nyu.edu)