https://github.com/naidezhujimo/yinghub-v3

# YingHub-v3

**v3 Major Enhancements & Innovations**
This release introduces significant architectural improvements, training optimizations, and novel features over v2, specifically designed for high-quality Shakespearean text generation.

---

## 🚀 Key Advancements (v3 vs v2)

### 1. **Data Loading & Preprocessing Optimizations**
- **Sliding Window Pre-computation**
Implemented memory-efficient `unfold` + circular-buffer strategies to handle variable-length sequences.

- **Dynamic Mask Augmentation**
10% random token masking with a dedicated mask token during batch generation improves robustness.

- **Streaming Dataset Iterator**
Memory-mapped data loading with zero-copy tensor conversion (4x faster than v2's disk I/O); a minimal loader sketch follows below.
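
A minimal sketch of how these three pieces could fit together, assuming a pre-tokenized binary corpus (`shakespeare_tokens.bin`), a dedicated mask-token id, and a `get_batch` helper — all illustrative names, not necessarily what `MoE.py` actually uses:

```python
import numpy as np
import torch

BLOCK_SIZE = 96    # matches --block_size in the training example below
MASK_PROB = 0.10   # 10% dynamic token masking
MASK_ID = 0        # assumed id of the mask token

# Memory-mapped token ids: the corpus is never fully loaded into RAM.
tokens = np.memmap("shakespeare_tokens.bin", dtype=np.int32, mode="r")
data = torch.from_numpy(np.asarray(tokens))              # zero-copy view of the memmap
windows = data.unfold(0, BLOCK_SIZE + 1, 1)              # all overlapping windows, as views

def get_batch(batch_size: int, device: str = "cpu"):
    """Sample pre-computed sliding windows and apply 10% mask augmentation."""
    idx = torch.randint(0, windows.size(0), (batch_size,))
    batch = windows[idx].long()                          # copy only the sampled windows
    x, y = batch[:, :-1].clone(), batch[:, 1:].clone()   # inputs / next-token targets
    corrupt = torch.rand(x.shape) < MASK_PROB            # dynamic mask augmentation
    x[corrupt] = MASK_ID
    return x.to(device), y.to(device)
```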

### 2. **Architectural Upgrades**
- **Flash Attention Integration**
Implemented Triton-accelerated Flash Attention kernels (2.1x faster than standard PyTorch attention).
- **Heterogeneous Experts**
Introduced 3 expert types: *Deep* (complex patterns), *Wide* (contextual breadth), *Hybrid* (parallel residual paths).
- **Dynamic Top-K Routing**
Adaptive token-to-expert allocation with capacity-aware load balancing (15% better expert utilization); see the routing sketch after this list.
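
The routing step can be sketched roughly as below. This is not the `MoE.py` implementation; the function name, default hyperparameters, and the Switch-Transformer-style auxiliary loss are assumptions chosen to illustrate capacity-aware top-k routing:

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weights, num_experts=8, top_k=2, capacity_factor=1.25):
    """Capacity-aware top-k routing sketch.

    hidden:         (num_tokens, d_model) flattened token representations
    router_weights: (d_model, num_experts) learned routing projection
    """
    probs = F.softmax(hidden @ router_weights, dim=-1)    # (num_tokens, num_experts)
    gate_vals, expert_idx = probs.topk(top_k, dim=-1)     # per-token top-k experts

    # Each expert accepts at most `capacity` tokens per batch; overflow is dropped.
    capacity = int(capacity_factor * hidden.size(0) * top_k / num_experts)

    dispatch, frac = [], torch.zeros(num_experts)
    for e in range(num_experts):
        full_ids, slot = (expert_idx == e).nonzero(as_tuple=True)
        frac[e] = full_ids.numel() / max(hidden.size(0) * top_k, 1)
        token_ids, slot = full_ids[:capacity], slot[:capacity]      # drop overflow
        dispatch.append((token_ids, gate_vals[token_ids, slot]))

    # Switch-Transformer-style load-balancing loss: pushes the fraction of tokens
    # routed to each expert toward that expert's mean routing probability.
    balance_loss = num_experts * (frac * probs.mean(dim=0)).sum()
    return dispatch, balance_loss
```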

### 3. **Training Optimizations**
- **Factorized Embeddings**
Low-rank embeddings + projection layers reduce memory usage by 40% with <1% accuracy drop (see the sketch after this list).
- **Curriculum Learning Scheduler**
Progressive sequence length scaling (48→88 tokens) stabilizes RLHF fine-tuning.
- **Structured Dropout**
Block-wise dropout (20%), structural embedding dropout (20%), and attention dropout (30%), combined with gradient clipping (norm = 1.2), prevent overfitting.
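
Two of these pieces are easy to illustrate. Below is a minimal sketch of a factorized (low-rank) embedding and a linear 48→88 curriculum ramp; the class/function names and the rank value are illustrative assumptions, not the exact code in `MoE.py`:

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Token embedding factorized as vocab -> rank -> d_model.

    The parameter count drops from V*d to V*r + r*d, which is where the
    memory savings come from when r << d.
    """
    def __init__(self, vocab_size: int, d_model: int, rank: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, rank)        # low-rank lookup table
        self.proj = nn.Linear(rank, d_model, bias=False)   # up-projection to model width

    def forward(self, token_ids):
        return self.proj(self.embed(token_ids))


def curriculum_block_size(step: int, total_steps: int, start: int = 48, end: int = 88) -> int:
    """Linear sequence-length ramp for curriculum learning (48 -> 88 tokens)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(start + frac * (end - start))
```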

### 4. **Controlled Generation**
- **Dramatic Structure Enforcement**
State-machine tracking enforces consistency of the structural markup tags (e.g. speaker and act/scene markers).
- **Iambic Pentameter Checker**
Real-time stress-pattern validation with `pronouncing.py` integration (see the sketch after this list).
- **Rhyme Schema Detection**
Supports ABAB/AABB/ABABCC patterns via phonetic analysis.
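
A sketch of how the meter and rhyme checks could work, assuming `pronouncing.py` wraps the widely used `pronouncing` package (a CMU Pronouncing Dictionary interface). The function names are illustrative, and the strict alternating-stress test is an approximation — real verse and monosyllabic words often bend the pattern:

```python
import re
import pronouncing   # CMU Pronouncing Dictionary wrapper

def stress_pattern(line: str) -> str:
    """Concatenate CMUdict stress digits (0/1/2) per word, first pronunciation only."""
    pattern = ""
    for word in re.findall(r"[a-zA-Z']+", line.lower()):
        phones = pronouncing.phones_for_word(word)
        if not phones:            # out-of-vocabulary word: skip the check for this line
            return ""
        pattern += pronouncing.stresses(phones[0])
    return pattern

def is_iambic_pentameter(line: str) -> bool:
    """10 syllables with off-beats unstressed and on-beats stressed (approximate)."""
    s = stress_pattern(line)
    return (len(s) == 10
            and all(c in "02" for c in s[0::2])    # syllables 1,3,5,7,9: unstressed
            and all(c in "12" for c in s[1::2]))   # syllables 2,4,6,8,10: stressed

def rhyme_key(line: str) -> str:
    """Phonetic rhyming part of a line's final word, used for scheme detection."""
    words = re.findall(r"[a-zA-Z']+", line.lower())
    phones = pronouncing.phones_for_word(words[-1]) if words else []
    return pronouncing.rhyming_part(phones[0]) if phones else ""

# Scheme check example: a quatrain is ABAB if rhyme_key(l1) == rhyme_key(l3)
# and rhyme_key(l2) == rhyme_key(l4), with the A and B keys distinct.
```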

### 5. **Data Pipeline**
- **Enhanced Shakespeare Cleaning**
Specialized regex patterns (illustrative versions are sketched after this list) for:
- Speaker turn management (`...`)
- Stage direction isolation (`[...]`)
- Act/scene boundary detection (headings normalized to Roman numerals, e.g. `III`)
- **Gutenberg Corpus Blending**
10% non-Shakespearean text injection improves linguistic diversity.
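
Illustrative versions of the cleaning patterns (the exact regexes and tag formats in the repo may differ):

```python
import re

SPEAKER_RE = re.compile(r"^([A-Z][A-Z '\-]+)\.\s*")   # e.g. "HAMLET. To be, or not to be"
STAGE_DIR_RE = re.compile(r"\[[^\]]*\]")              # stage directions in [square brackets]
ACT_SCENE_RE = re.compile(r"^(ACT|SCENE)\s+([IVXLC]+)\b", re.IGNORECASE)

def clean_line(raw: str):
    """Split a raw corpus line into (speaker, act_or_scene_numeral, dialogue_text)."""
    m = ACT_SCENE_RE.match(raw)
    if m:                                             # boundary line, e.g. "ACT III"
        return None, m.group(2).upper(), ""           # keep only the Roman numeral, "III"
    m = SPEAKER_RE.match(raw)
    speaker = m.group(1).strip() if m else None
    text = raw[m.end():] if m else raw
    text = STAGE_DIR_RE.sub("", text).strip()         # drop stage directions
    return speaker, None, text
```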

---

## 🛠 Usage Examples

### Training
```bash
python MoE.py --train --batch_size 32 --block_size 96
```

### Generation
```bash
# Base model
python MoE.py --generate --temperature 0.7 --top_p 0.85

# RLHF-tuned model
python MoE.py --ftgenerate --temperature 0.6 --top_p 0.9
```

### RLHF Fine-tuning
```bash
python MoE.py --rlhf --checkpoint model_checkpoint.pth
```

## 📊 Data Pipeline Performance

| Metric | v2 | v3 |
|-----------------------|---------|---------|
| Batch Preparation Time | 420ms | **85ms**|
| Memory Footprint | 8.2GB | **3.1GB**|
| Effective Data Reuse | 68% | **92%** |
| Augmentation Variety | 3 types | **7 types** |

## 📊 Model Performance

| Metric | v2 | v3 |
|-----------------------|---------|---------|
| Validation Loss | 5.8 | **5.1**|
| Expert Utilization | 73% | **88%** |
| PPL (Shakespeare) | 18.9 | **14.2**|
| Training Speed (tok/s)| 1,420 | **2,310**|