https://github.com/naidezhujimo/yinghub-v3
- Host: GitHub
- URL: https://github.com/naidezhujimo/yinghub-v3
- Owner: naidezhujimo
- Created: 2025-04-02T09:16:15.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-04-02T09:22:21.000Z (about 2 months ago)
- Last Synced: 2025-04-02T10:28:09.628Z (about 2 months ago)
- Language: Python
- Size: 8.06 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# YingHub-v3
**v3 Major Enhancements & Innovations**
This release introduces significant architectural improvements, training optimizations, and novel features over v2, specifically designed for high-quality Shakespearean text generation.

---
## 🚀 Key Advancements (v3 vs v2)
### 1. **Data Loading & Preprocessing Optimizations**
- **Sliding Window Pre-computation**
Memory-efficient `unfold` + circular-buffer strategies handle variable-length sequences (see the sketch after this list).
- **Dynamic Mask Augmentation**
10% of input tokens are randomly replaced with a mask token during batch generation, improving robustness.
- **Streaming Dataset Iterator**
Memory-mapped data loading with zero-copy tensor conversion (4x faster than v2's disk I/O).
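The loader itself lives in `MoE.py` and is not reproduced here. Below is a minimal sketch of the `unfold` windowing, memory-mapped loading, and dynamic masking under assumed names (`corpus.bin`, `BLOCK_SIZE`, `MASK_ID` are illustrative, not the repo's identifiers); the circular-buffer handling of variable-length sequences is omitted:

```python
import numpy as np
import torch

BLOCK_SIZE = 96   # context window (assumed; matches --block_size below)
MASK_PROB = 0.10  # dynamic mask augmentation rate from the README
MASK_ID = 0       # placeholder mask-token id (assumption)

# Memory-mapped corpus: the file stays on disk, slices are read lazily.
data = np.memmap("corpus.bin", dtype=np.uint16, mode="r")

# Sliding-window pre-computation with unfold: a strided view, no copy.
tokens = torch.from_numpy(np.asarray(data[:100_000], dtype=np.int64))
windows = tokens.unfold(0, BLOCK_SIZE, 1)  # (N - BLOCK_SIZE + 1, BLOCK_SIZE)

def get_batch(batch_size: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample random windows and apply dynamic mask augmentation."""
    idx = torch.randint(0, windows.shape[0] - 1, (batch_size,))
    x = windows[idx].clone()   # own storage before in-place masking
    y = windows[idx + 1]       # targets: the same windows shifted by one token
    mask = torch.rand(x.shape) < MASK_PROB
    x[mask] = MASK_ID          # corrupt 10% of input tokens
    return x, y
```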
### 2. **Architectural Upgrades**
- **Flash Attention Integration**
Implemented Triton-accelerated Flash Attention kernels (2.1x faster than standard PyTorch attention).
- **Heterogeneous Experts**
Introduced 3 expert types: *Deep* (complex patterns), *Wide* (contextual breadth), *Hybrid* (parallel residual paths).
- **Dynamic Top-K Routing**
Adaptive token-to-expert allocation with capacity-aware load balancing (15% better expert utilization); see the sketch after this list.
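The router's actual implementation is in `MoE.py`; the following is a minimal, self-contained sketch of capacity-aware top-k routing in the Switch-Transformer style (all names and the `capacity_factor` value are assumptions, not the repo's code):

```python
import torch
import torch.nn.functional as F

def topk_route(h: torch.Tensor, w_gate: torch.Tensor,
               k: int = 2, capacity_factor: float = 1.25):
    """Assign each token to its top-k experts, dropping overflow tokens.

    h: (n_tokens, d_model) activations; w_gate: (d_model, n_experts).
    """
    n_tokens = h.shape[0]
    n_experts = w_gate.shape[1]
    probs = F.softmax(h @ w_gate, dim=-1)          # router distribution
    topk_p, topk_e = probs.topk(k, dim=-1)         # (n_tokens, k)

    # Each expert accepts at most `capacity` tokens per batch.
    capacity = int(capacity_factor * n_tokens * k / n_experts)

    # Rank each assignment in arrival order within its expert's queue;
    # assignments past capacity are dropped (token takes the residual path).
    flat_e = topk_e.reshape(-1)                    # (n_tokens * k,)
    arrivals = F.one_hot(flat_e, n_experts)        # arrival-order matrix
    rank = (arrivals.cumsum(0) - 1).gather(1, flat_e[:, None]).squeeze(1)
    keep = (rank < capacity).reshape(n_tokens, k)

    # Renormalize the surviving routing weights per token.
    weights = topk_p * keep
    weights = weights / weights.sum(-1, keepdim=True).clamp_min(1e-9)
    return weights, topk_e, keep
```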
### 3. **Training Optimizations**
- **Factorized Embeddings**
Low-rank embeddings + projection layers reduce memory usage by 40% with <1% accuracy drop (sketched after this list).
- **Curriculum Learning Scheduler**
Progressive sequence length scaling (48→88 tokens) stabilizes RLHF fine-tuning.
- **Structured Dropout**
Block-wise dropout (20%), structural embedding dropout (20%), attention dropout (30%), and gradient clipping (norm=1.2) together prevent overfitting.
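As a rough sketch of the first two items (the dimensions, rank, and names are illustrative assumptions, not the repo's values):

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Low-rank token embedding: vocab -> rank -> d_model.

    Parameters drop from V*d to V*r + r*d; with r << d this yields the
    kind of ~40% memory saving cited above (exact figure depends on V, d, r).
    """
    def __init__(self, vocab_size: int, d_model: int, rank: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, rank)       # low-rank lookup
        self.proj = nn.Linear(rank, d_model, bias=False)  # up-projection

    def forward(self, ids):
        return self.proj(self.embed(ids))

def curriculum_block_size(step: int, total_steps: int,
                          start: int = 48, end: int = 88) -> int:
    """Progressive sequence-length scaling: 48 -> 88 tokens over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(start + frac * (end - start))
```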
### 4. **Controlled Generation**
- **Dramatic Structure Enforcement**
State-machine tracking enforces consistency of the dramatic structure tags (speaker, stage-direction, and act/scene markers).
- **Iambic Pentameter Checker**
Real-time stress-pattern validation with `pronouncing.py` integration (see the sketch after this list).
- **Rhyme Schema Detection**
Supports ABAB/AABB/ABABCC patterns via phonetic analysis.
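The repo's checker is not reproduced here; this is a minimal sketch of how both checks can be driven by the `pronouncing` package (CMU Pronouncing Dictionary). Folding secondary stress to unstressed and skipping out-of-vocabulary words are simplifying assumptions:

```python
import re
import pronouncing

def stress_pattern(line: str) -> str:
    """Concatenate per-word stress digits for a line of verse."""
    pattern = ""
    for word in re.findall(r"[a-z']+", line.lower()):
        phones = pronouncing.phones_for_word(word)
        if not phones:  # out-of-vocabulary word: skipped (assumption)
            continue
        # '2' (secondary stress) treated as unstressed here -- a choice.
        pattern += pronouncing.stresses(phones[0]).replace("2", "0")
    return pattern

def is_iambic_pentameter(line: str) -> bool:
    """Ten syllables alternating unstressed/stressed: 0101010101."""
    return stress_pattern(line) == "01" * 5

def rhymes(a: str, b: str) -> bool:
    """Two words rhyme if their CMU 'rhyming parts' match."""
    pa = pronouncing.phones_for_word(a.lower())
    pb = pronouncing.phones_for_word(b.lower())
    return bool(pa and pb) and (
        pronouncing.rhyming_part(pa[0]) == pronouncing.rhyming_part(pb[0]))

def rhyme_scheme(last_words: list[str]) -> str:
    """Label line endings A, B, C... so four lines may yield 'ABAB'."""
    labels, reps = "", []
    for w in last_words:
        for letter, rep in reps:
            if rhymes(w, rep):
                labels += letter
                break
        else:
            letter = chr(ord("A") + len(reps))
            reps.append((letter, w))
            labels += letter
    return labels
```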
### 5. **Data Pipeline**
- **Enhanced Shakespeare Cleaning**
Specialized regex patterns (sketched after this list) for:
- Speaker turn management (`...`)
- Stage direction isolation (`[...]`)
- Act/scene boundary detection (markers normalized to Roman numerals, e.g. `III`)
- **Gutenberg Corpus Blending**
10% non-Shakespearean text injection improves linguistic diversity.
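The actual patterns live in the repo's pipeline; the sketch below illustrates the three cleaning tasks under assumed raw-text conventions (`SPEAKER.` turn lines, bracketed stage directions, `ACT 3` headings) that may not match the repo's exact formats:

```python
import re

# Assumed raw-text conventions -- illustrative only.
SPEAKER_RE   = re.compile(r"^([A-Z][A-Z ]+)\.\s*$", re.MULTILINE)
STAGE_RE     = re.compile(r"\[[^\]]*\]")
ACT_SCENE_RE = re.compile(r"^(ACT|SCENE)\s+(\d+)", re.MULTILINE | re.IGNORECASE)

ROMAN = {1: "I", 2: "II", 3: "III", 4: "IV", 5: "V"}

def clean_play(text: str) -> str:
    # Isolate stage directions by stripping bracketed spans.
    text = STAGE_RE.sub("", text)
    # Normalize act/scene boundaries, e.g. "ACT 3" -> "ACT III".
    text = ACT_SCENE_RE.sub(
        lambda m: f"{m.group(1).upper()} {ROMAN.get(int(m.group(2)), m.group(2))}",
        text)
    # Tag speaker turns so the model can track who is speaking.
    text = SPEAKER_RE.sub(lambda m: f"<{m.group(1).strip()}>", text)
    return text
```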
---

## 🛠 Usage Examples
### Training
```bash
python MoE.py --train --batch_size 32 --block_size 96
```

### Generation
```bash
# Base model
python MoE.py --generate --temperature 0.7 --top_p 0.85

# RLHF-tuned model
python MoE.py --ftgenerate --temperature 0.6 --top_p 0.9
```

### RLHF Fine-tuning
```bash
python MoE.py --rlhf --checkpoint model_checkpoint.pth
```

## 📊 Data Pipeline Performance
| Metric | v2 | v3 |
|-----------------------|---------|---------|
| Batch Preparation Time | 420ms | **85ms**|
| Memory Footprint | 8.2GB | **3.1GB**|
| Effective Data Reuse | 68% | **92%** |
| Augmentation Variety | 3 types | **7 types** |

## 📊 Performance Metrics
| Metric | v2 | v3 |
|-----------------------|---------|---------|
| Validation Loss | 5.8 | **5.1**|
| Expert Utilization | 73% | **88%** |
| PPL (Shakespeare) | 18.9 | **14.2**|
| Training Speed (tok/s)| 1,420 | **2,310**|