https://github.com/yancotta/post_training_llms
Different post-training techniques for LLMs, including SFT, DPO, and online RL
- Host: GitHub
- URL: https://github.com/yancotta/post_training_llms
- Owner: YanCotta
- License: MIT
- Created: 2025-07-09T17:24:01.000Z
- Default Branch: main
- Last Pushed: 2025-09-05T12:59:37.000Z
- Last Synced: 2025-09-16T01:51:52.134Z
- Topics: alignment, dpo, fine-tuning, huggingface, huggingface-transformers, llm, pytorch, reinforcement-learning, sft, trl
- Language: Python
- Size: 96.7 KB
- Stars: 3
- Watchers: 0
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Post-Training Techniques for Large Language Models
A comprehensive implementation and educational resource for modern post-training techniques that enhance Large Language Model (LLM) capabilities and alignment.
[Python](https://www.python.org/downloads/) · [License: MIT](https://opensource.org/licenses/MIT) · [Code style: black](https://github.com/psf/black)

## 🎯 Overview
This repository provides production-ready implementations of three key post-training techniques:
- **🎓 Supervised Fine-Tuning (SFT)**: Enhance instruction-following capabilities
- **⚖️ Direct Preference Optimization (DPO)**: Align models with human preferences
- **🔄 Online Reinforcement Learning (GRPO)**: Improve task-specific performance with reward signals via Group Relative Policy Optimization

All implementations are based on the **DeepLearning.AI "Post-training LLMs" course**, enhanced with professional software engineering practices, comprehensive documentation, and an extensible architecture.
## 🌟 Key Features
- **🏗️ Modular Architecture**: Clean, extensible codebase with clear separation of concerns
- **📚 Educational Notebooks**: Step-by-step tutorials with detailed explanations
- **⚡ Production Ready**: Professional implementations suitable for real-world applications
- **🔧 Easy Configuration**: YAML-based configuration for all training parameters
- **📊 Comprehensive Evaluation**: Built-in metrics and benchmarking tools
- **🚀 Multiple Interfaces**: Command-line scripts, Python API, and Jupyter notebooks
- **🎛️ Flexible Models**: Support for various model architectures and sizes

## 📁 Repository Structure
```
post_training_llms/
├── src/ # Core implementation
│ ├── utils/ # Utility functions
│ │ ├── model_utils.py # Model loading, generation, evaluation
│ │ ├── data_utils.py # Dataset preparation and processing
│ │ ├── config.py # Unified configuration system
│ │ └── config_manager.py # Configuration management utilities
│ ├── training/ # Training pipelines
│ │ ├── sft_trainer.py # Supervised Fine-Tuning
│ │ ├── dpo_trainer.py # Direct Preference Optimization
│ │ └── rl_trainer.py # Online RL with GRPO
│ └── evaluation/ # Evaluation and metrics
│ ├── metrics.py # Performance metrics
│ └── benchmark.py # Comprehensive benchmarking
├── notebooks/ # Educational tutorials
│ ├── 01_supervised_fine_tuning.ipynb
│ ├── 02_direct_preference_optimization.ipynb
│ └── 03_online_reinforcement_learning.ipynb
├── examples/ # Example scripts
│ ├── run_sft.py # SFT training example
│ ├── run_dpo.py # DPO training example
│ ├── run_rl.py # RL training example
│ ├── run_benchmark.py # Model evaluation
│ └── config_utils.py # Configuration utilities
├── configs/ # Configuration files
│ ├── sft_config.yaml # SFT parameters
│ ├── dpo_config.yaml # DPO parameters
│ └── rl_config.yaml # RL parameters
├── data/ # Data storage (created at runtime)
└── models/ # Model storage (created at runtime)
```

## ⚙️ Configuration System Architecture
The unified configuration system provides a robust, type-safe way to manage all training parameters:
### Core Components
- **`BaseConfig`**: Abstract base class with common configuration fields
- **`SFTConfig`**: Configuration for Supervised Fine-Tuning
- **`DPOConfig`**: Configuration for Direct Preference Optimization
- **`RLConfig`**: Configuration for Reinforcement Learning
- **`ConfigManager`**: Utility class for configuration operations

### Key Features
- **🔒 Type Safety**: All configurations use Python dataclasses with validation
- **✅ Data Validation**: Automatic validation of parameter types and ranges
- **🔄 Inheritance**: Method-specific configs inherit from base configuration
- **📁 YAML Support**: Load/save configurations in human-readable YAML format
- **🎛️ Command Overrides**: Command-line arguments can override config values
- **🔧 Utility Functions**: Built-in tools for validation, merging, and conversion

### Configuration Structure
```python
# Example configuration hierarchy
BaseConfig
├── ModelConfig # Model settings (name, trust_remote_code)
├── TrainingConfig # Common training parameters
│ ├── SFTTrainingConfig
│ ├── DPOTrainingConfig (with beta parameter)
│ └── RLTrainingConfig (with num_generations)
├── DatasetConfig # Dataset settings
├── HardwareConfig # Hardware settings (GPU, mixed precision)
├── OutputConfig # Output settings
└── EvaluationConfig # Evaluation settings
```
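To make the hierarchy concrete, here is a minimal sketch of how such a dataclass-based configuration might look. The field set below is abbreviated; the real classes in `src/utils/config.py` carry more fields and validation logic:

```python
from dataclasses import asdict, dataclass, field

import yaml  # pip install pyyaml

@dataclass
class ModelConfig:
    name: str = "HuggingFaceTB/SmolLM2-135M"
    trust_remote_code: bool = False

@dataclass
class SFTTrainingConfig:
    learning_rate: float = 8.0e-5
    num_train_epochs: int = 1

    def __post_init__(self):
        # Illustrative range check, mirroring the system's validation step
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")

@dataclass
class SFTConfig:
    model: ModelConfig = field(default_factory=ModelConfig)
    training: SFTTrainingConfig = field(default_factory=SFTTrainingConfig)

    def to_yaml(self, path: str) -> None:
        # Nested dataclasses map cleanly onto YAML mappings
        with open(path, "w") as f:
            yaml.safe_dump(asdict(self), f)

SFTConfig().to_yaml("my_sft_config.yaml")
```

## 🚀 Quick Start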
### Installation
1. **Clone the repository**:
```bash
git clone https://github.com/YanCotta/post_training_llms.git
cd post_training_llms
```

2. **Install dependencies**:
```bash
pip install -r requirements.txt
```

3. **Verify installation**:
```bash
python -c "import torch; import transformers; import datasets; import trl; print('✅ All dependencies installed successfully!')"
```

### Configuration Management
The project now features a **unified configuration system** that eliminates code duplication and ensures consistency across all training methods.
#### Using Configuration Files
All training scripts now support configuration files with command-line overrides:
```bash
# Use configuration file with overrides
python examples/run_sft.py \
--config configs/sft_config.yaml \
--learning-rate 1e-4 \
--epochs 2
```
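Conceptually, the override mechanism amounts to "load the config defaults, then replace any field the user set on the command line". A hedged sketch of that flow (the flag names match the example above; the `config.training` attribute layout is an assumption, not the repository's exact code):

```python
import argparse
from dataclasses import replace

from src.utils.config import create_default_config

parser = argparse.ArgumentParser()
parser.add_argument("--learning-rate", type=float, default=None)
parser.add_argument("--epochs", type=int, default=None)
args = parser.parse_args()

config = create_default_config("sft")  # dataclass defaults
# Only fields the user explicitly passed override the defaults
if args.learning_rate is not None:
    config.training = replace(config.training, learning_rate=args.learning_rate)
if args.epochs is not None:
    config.training = replace(config.training, num_train_epochs=args.epochs)
```

#### Configuration Utilities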
Use the configuration utility script for common operations:
```bash
# Create a new configuration template
python examples/config_utils.py create --type sft --output configs/my_config.yaml

# Validate a configuration file
python examples/config_utils.py validate --config configs/sft_config.yaml

# List all available configurations
python examples/config_utils.py list --directory configs

# Convert configuration to training arguments
python examples/config_utils.py convert --config configs/sft_config.yaml
```

### Testing the Configuration System
The configuration system includes comprehensive testing:
```bash
# Test all configuration functionality
python -c "
from src.utils.config import create_default_config
from src.utils.config_manager import ConfigManager

# Create and validate configurations
sft_config = create_default_config('sft')
is_valid = ConfigManager.validate_config(sft_config)
print(f'Configuration system working: {is_valid}')
"
```

### Running Your First Training
#### Supervised Fine-Tuning (SFT)
```bash
python examples/run_sft.py \
--config configs/sft_config.yaml \
--max-samples 100
```

#### Direct Preference Optimization (DPO)
```bash
python examples/run_dpo.py \
--config configs/dpo_config.yaml \
--new-identity "My Assistant" \
--max-samples 50
```

#### Online Reinforcement Learning (GRPO)
```bash
python examples/run_rl.py \
--model "HuggingFaceTB/SmolLM2-135M-Instruct" \
--dataset "openai/gsm8k" \
--max-train-samples 20 \
--max-eval-samples 10 \
--output-dir "./models/my_rl_model"
```

## 📖 Tutorials
### Interactive Jupyter Notebooks
Explore the techniques through our comprehensive tutorial notebooks:
1. **[Supervised Fine-Tuning Tutorial](notebooks/01_supervised_fine_tuning.ipynb)**
- Learn how SFT improves instruction-following
- Hands-on training with real datasets
- Performance evaluation and analysis

2. **[Direct Preference Optimization Tutorial](notebooks/02_direct_preference_optimization.ipynb)**
- Understand preference-based training
- Identity modification example
- Consistency measurement and evaluation

3. **[Online Reinforcement Learning Tutorial](notebooks/03_online_reinforcement_learning.ipynb)**
- Reward-based model improvement
- Mathematical reasoning enhancement
- GRPO training and evaluation

### Running Notebooks
```bash
jupyter notebook notebooks/
```

## 🎛️ Configuration
All training parameters can be customized using YAML configuration files:
### SFT Configuration (`configs/sft_config.yaml`)
```yaml
model:
  name: "HuggingFaceTB/SmolLM2-135M"
training:
  learning_rate: 8.0e-5
  num_train_epochs: 1
  per_device_train_batch_size: 1
dataset:
  name: "banghua/DL-SFT-Dataset"
  max_samples: 1000
```

### DPO Configuration (`configs/dpo_config.yaml`)
```yaml
model:
  name: "HuggingFaceTB/SmolLM2-135M-Instruct"
training:
  beta: 0.2
  learning_rate: 5.0e-5
identity:
  positive_name: "Deep Qwen"
  organization_name: "Qwen"
```
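For context on the `beta` value above: in DPO, beta scales the log-probability margins between the policy and a frozen reference model, acting as an implicit KL penalty. A minimal PyTorch rendering of the standard DPO loss, with per-sequence log-probabilities assumed precomputed (this is illustrative, not the repository's trainer code):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.2):
    """Standard DPO objective: -log sigmoid(beta * (chosen margin - rejected margin))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

### RL Configuration (`configs/rl_config.yaml`)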
```yaml
model:
  name: "HuggingFaceTB/SmolLM2-135M-Instruct"
training:
  learning_rate: 5.0e-6
  num_generations: 4
dataset:
  name: "openai/gsm8k"
```

## 🔧 API Usage
### Python API Examples
```python
from src.training.sft_trainer import SFTTrainingPipeline
from src.training.dpo_trainer import DPOTrainingPipeline
from src.training.rl_trainer import RLTrainingPipeline

# Supervised Fine-Tuning
sft_pipeline = SFTTrainingPipeline("HuggingFaceTB/SmolLM2-135M")
sft_pipeline.setup_training(dataset, learning_rate=8e-5)
sft_pipeline.train()

# Direct Preference Optimization
dpo_pipeline = DPOTrainingPipeline("HuggingFaceTB/SmolLM2-135M-Instruct")
dpo_dataset = dpo_pipeline.create_preference_dataset(raw_dataset)
dpo_pipeline.setup_training(dpo_dataset, beta=0.2)
dpo_pipeline.train()

# Online Reinforcement Learning
rl_pipeline = RLTrainingPipeline("HuggingFaceTB/SmolLM2-135M-Instruct")
rl_pipeline.setup_training(train_dataset, reward_function)
rl_pipeline.train()
```
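The `reward_function` passed to the RL pipeline is not defined above. For GSM8K-style math tasks, an exact-match reward is the natural choice; here is a sketch under assumed conventions (a TRL-style signature returning one float per completion, and GSM8K reference solutions ending in `#### <answer>`):

```python
import re

def gsm8k_reward(completions, answers, **kwargs):
    """Hypothetical exact-match reward: 1.0 if the last number in the
    completion equals the GSM8K reference answer, else 0.0."""
    rewards = []
    for completion, answer in zip(completions, answers):
        # GSM8K reference solutions end with '#### <answer>'
        target = answer.split("####")[-1].strip().replace(",", "")
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        try:
            correct = bool(numbers) and float(numbers[-1]) == float(target)
        except ValueError:
            correct = False
        rewards.append(1.0 if correct else 0.0)
    return rewards
```

## 📊 Evaluation and Benchmarking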
### Comprehensive Model Evaluation
```bash
python examples/run_benchmark.py \
--model "path/to/your/model" \
--math-samples 50 \
--target-identity "Your Model Name" \
--output-file "benchmark_results.json"
```

### Available Metrics
- **Accuracy**: Task-specific performance measurement
- **Identity Consistency**: Model identity alignment
- **Safety Score**: Harmful content detection
- **Perplexity**: Language modeling quality (see the sketch after this list)
- **Math Reasoning**: Mathematical problem-solving ability
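As a concrete example of one metric, perplexity can be computed directly from a causal LM's cross-entropy loss. The sketch below is illustrative and may differ from the implementation in `src/evaluation/metrics.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_name: str, text: str) -> float:
    """Perplexity of `text` under a causal LM: exp(mean cross-entropy)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean token loss
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("HuggingFaceTB/SmolLM2-135M", "The capital of France is Paris."))
```

## 🎓 Educational Value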
This repository serves as both a practical implementation and an educational resource:
### Learning Objectives
- **Understand** the theory behind modern post-training techniques
- **Implement** production-ready training pipelines
- **Evaluate** model performance across multiple dimensions
- **Apply** best practices in ML engineering and experimentation

### Based on DeepLearning.AI Course
This implementation is based on and extends the **DeepLearning.AI "Post-training LLMs" course**, providing:
- Enhanced code organization and modularity
- Additional evaluation metrics and benchmarks
- Production-ready implementations
- Comprehensive documentation and examples

## 🔬 Research and Development
### Supported Models
- **Small Models**: SmolLM2-135M, SmolLM2-1.7B
- **Medium Models**: Qwen2.5-0.5B, Qwen2.5-1.5B
- **Large Models**: Any HuggingFace compatible model
- **Custom Models**: Easy integration with custom architectures (see the model-loading sketch below)
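Supporting "any HuggingFace compatible model" boils down to the standard `Auto*` loading path, which `src/utils/model_utils.py` presumably wraps; a minimal sketch:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal-LM checkpoint on the Hugging Face Hub loads the same way
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is post-training?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Datasets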
- **SFT**: banghua/DL-SFT-Dataset, custom instruction datasets
- **DPO**: mrfakename/identity, preference pair datasets
- **RL**: openai/gsm8k, custom reward-based datasets (see the dataset-loading sketch after this list)
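These datasets load through the standard `datasets` API; for example, GSM8K (with its `main` configuration) looks like this:

```python
from datasets import load_dataset

# GSM8K ships with a "main" configuration of question/answer pairs
gsm8k = load_dataset("openai/gsm8k", "main", split="train")
print(gsm8k[0]["question"])
print(gsm8k[0]["answer"])  # reference solution ending in '#### <answer>'
```

## 🤝 Contributing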
We welcome contributions! Please see our [contribution guidelines](CONTRIBUTING.md) for details.
### Development Setup
```bash
# Clone the repository
git clone https://github.com/YanCotta/post_training_llms.git
cd post_training_llms

# Create development environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # If available

# Run tests
python -m pytest tests/ # If tests are available
```

## 📝 Citation
If you use this repository in your research or projects, please cite:
```bibtex
@misc{cotta2025posttrainingllms,
  title={Post-Training Techniques for Large Language Models},
  author={Yan Cotta},
  year={2025},
  url={https://github.com/YanCotta/post_training_llms},
  note={Based on DeepLearning.AI Post-training LLMs course}
}
```

## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- **DeepLearning.AI** for the foundational "Post-training LLMs" course
- **Hugging Face** for the transformers library and model ecosystem
- **TRL Team** for the training utilities and implementations
- **Open Source Community** for the various datasets and tools used

## 📞 Support
- **Issues**: [GitHub Issues](https://github.com/YanCotta/post_training_llms/issues)
- **Discussions**: [GitHub Discussions](https://github.com/YanCotta/post_training_llms/discussions)
- **Email**: yanpcotta@gmail.com

---
⭐ **Star this repository** if you find it useful for your LLM post-training projects!