An open API service indexing awesome lists of open source software.

https://github.com/naidezhujimo/yinghub-v2-a-sparse-moe-language-model

YingHub-v2 is an advanced language model built upon the Sparse Mixture of Experts (MoE) architecture. It leverages dynamic routing mechanisms, expert load balancing.incorporating state-of-the-art training and optimization strategies.
https://github.com/naidezhujimo/yinghub-v2-a-sparse-moe-language-model

llm moe nlp pytorch

Last synced: about 2 months ago
JSON representation

YingHub-v2 is an advanced language model built upon the Sparse Mixture of Experts (MoE) architecture. It leverages dynamic routing mechanisms, expert load balancing.incorporating state-of-the-art training and optimization strategies.

Awesome Lists containing this project

README

        

# YingHub-v2: A Sparse MoE Language Model

## Introduction

YingHub-v2 is an advanced language model built upon the Sparse Mixture of Experts (MoE) architecture. It leverages dynamic routing mechanisms, expert load balancing, and reinforcement learning techniques to achieve high performance in text generation tasks. This model is designed to be both efficient and effective, incorporating state-of-the-art training and optimization strategies.

## Model Architecture

### Mixture of Experts (MoE)

YingHub-v2 employs a sparse MoE structure, integrating a multi-head attention layer and an MoE layer within each Transformer block. The key features include:

- **Dynamic Routing Mechanism**: Utilizes a `NoisyTopkRouter` to dynamically select the top-k experts. This mechanism supports dynamic adjustment of the `top_k` value, allowing the model to adaptively choose the number of experts based on the input.
- **Expert Networks**: The model includes both simple (shallow FFN) and complex (deep FFN) experts. The experts are designed to handle different levels of complexity in the input data. Additionally, the model incorporates load balancing constraints and sliding window statistics to ensure efficient resource utilization.
- **Expert Load Balancing**: Implements a sliding window approach to calculate expert utilization combined with an expert load balancing loss (KL divergence). This ensures that the workload is evenly distributed among the experts, preventing any single expert from being overloaded.

### Transformer Core

The core of YingHub-v2 is based on the Transformer architecture, featuring:

- **Multi-Head Self-Attention**: Supports causal masking and Dropout to prevent information leakage and overfitting.
- **Layer Normalization**: Applied after each sub-layer to stabilize training.
- **Residual Connections**: Enhances the flow of gradients and information through the network.
- **Stochastic Depth**: Randomly drops entire layers during training to improve generalization.

## Training Optimization

### PPO Reinforcement Learning Fine-Tuning

YingHub-v2 incorporates Proximal Policy Optimization (PPO) for fine-tuning, leveraging:

- **Curriculum Learning**: Dynamically adjusts the length of text generated by the model during fine-tuning.
- **Loss Functions**: Combines clipped surrogate loss, value loss, entropy regularization, and KL divergence penalty to optimize the model's performance.
- **Generalized Advantage Estimation (GAE)**: Provides a more accurate estimation of the advantage function, leading to more stable and efficient training.
- **Dynamic KL Coefficient Adjustment**: Adjusts the KL coefficient dynamically to balance exploration and exploitation during training.

### Dynamic Adjustment Strategies

- **Learning Rate Scheduling**: Implements cosine annealing with warm restarts, supporting stage-wise gradient clipping.
- **Early Stopping**: Monitors the validation loss and stops training early if the model's performance does not improve, preventing overfitting.
- **Expert Utilization-Driven `top_k` Adjustment**: Dynamically adjusts the `top_k` value based on expert utilization to optimize resource allocation.
- **Gradient Clipping**: Applies different clipping norms to different modules (e.g., stricter clipping for routing layers) to prevent gradient explosion.

### Mixed Precision Training

Utilizes `GradScaler` and `autocast` to accelerate training while maintaining numerical stability.

## Text Generation

### Diverse Sampling Strategies

- **Top-p (Nucleus) Sampling**: Selects the next token from the top-p most probable tokens, ensuring diversity in generated text.
- **Temperature Annealing**: Applies cosine annealing to the temperature parameter, controlling the randomness of the sampling process.
- **Diversity Reward**: Implements a penalty for repeated tokens based on the history of generated tokens, promoting diversity in the output.

### Style Reward Function

The style reward function evaluates the generated text based on:

- **Grammar Score (Perplexity)**: Measures the grammatical coherence of the text.
- **Rhyme Pattern Detection (ABAB/AABB)**: Checks for specific rhyme patterns in the generated text.
- **Iambic Pentameter Detection**: Ensures the text adheres to the iambic pentameter structure.
- **Keyword Matching**: Rewards the presence of specific keywords related to the desired style (e.g., Shakespearean vocabulary).
- **Modern Vocabulary Penalty**: Penalizes the use of modern vocabulary that does not fit the desired style.
- **Sentence Length Reward**: Encourages the generation of sentences with an optimal length.

## Infrastructure and Tools

### MLflow Integration

YingHub-v2 integrates with MLflow to:

- Record hyperparameters, training and validation losses, gradient statistics, and expert heatmaps.
- Provide a comprehensive overview of the training process and model performance.

### Pre-trained Model Integration

Leverages pre-trained models from Hugging Face (e.g., GPT-2) for initialization and coherence scoring, enhancing the model's performance and stability.

### Data Processing

- Utilizes `tiktoken` (GPT-2 tokenizer) for efficient tokenization.
- Implements dynamic token masking by randomly replacing tokens with a special token, adding noise to the input data and improving the model's robustness.

## Other Key Technologies

### Network Parameter Freezing

During fine-tuning, freezes the parameters of the pre-trained network to reduce computational resource consumption, prevent overfitting, and enhance model stability.

### Efficient Parallel Computation

- Parallelizes the computation of expert outputs using `torch.stack` and weighted summation, improving computational efficiency.
- Supports efficient parallel processing on modern hardware.

### Model Checkpointing and Visualization

- Saves and loads model checkpoints to facilitate training resumption and model deployment.
- Generates heatmaps using Seaborn and loss curves using Matplotlib to visualize the training process and model performance.

## Conclusion

YingHub-v2 is a cutting-edge language model that combines the strengths of sparse MoE architecture, dynamic routing, expert load balancing, and advanced training techniques. It is designed to deliver high-quality text generation while optimizing resource utilization and training efficiency. Whether you are a researcher or a developer, YingHub-v2 offers a powerful tool for your natural language processing tasks.

## Getting Started

To get started with YingHub-v2, follow these steps:

1. Clone the repository:
```bash
git clone https://github.com/yourusername/YingHub-v2.git
cd YingHub-v2
```

2. Install the required dependencies:
```bash
pip install -r requirements.txt
```

3. Train the model:
```bash
python MoE.py --train
```

4. Generate text using the trained model:
```bash
python MoE.py --generate
```

5. Fine-tune the model using PPO:
```bash
python MoE.py --rlhf
```

6. Generate text using the fine-tuned model:
```bash
python MoE.py --ftgenerate
```

For more detailed instructions and information, please refer to the [documentation](https://github.com/yourusername/YingHub-v2/docs).

## Contributing

We welcome contributions from the community! If you have any ideas, suggestions, or bug reports, please feel free to open an issue or submit a pull request. Together, we can make YingHub-v2 even better!