{"id":31073587,"url":"https://github.com/yancotta/post_training_llms","last_synced_at":"2025-10-05T19:59:24.367Z","repository":{"id":303848221,"uuid":"1016889883","full_name":"YanCotta/post_training_llms","owner":"YanCotta","description":"Different post-training techniques for LLMs, including:  SFT, DPO and Online RL","archived":false,"fork":false,"pushed_at":"2025-09-05T12:59:37.000Z","size":99,"stargazers_count":3,"open_issues_count":4,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-16T01:51:52.134Z","etag":null,"topics":["alignment","dpo","fine-tuning","huggingface","huggingface-transformers","llm","pytorch","reinforcement-learning","sft","trl"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/YanCotta.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-07-09T17:24:01.000Z","updated_at":"2025-09-10T00:52:42.000Z","dependencies_parsed_at":"2025-07-10T02:24:16.978Z","dependency_job_id":"21ee837b-33cb-41be-bd85-15eb272e8c6b","html_url":"https://github.com/YanCotta/post_training_llms","commit_stats":null,"previous_names":["yancotta/post_training_llms"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/YanCotta/post_training_llms","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YanCotta%2Fpost_training_llms","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/YanCotta%2Fpost_training_llms/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YanCotta%2Fpost_training_llms/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YanCotta%2Fpost_training_llms/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/YanCotta","download_url":"https://codeload.github.com/YanCotta/post_training_llms/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/YanCotta%2Fpost_training_llms/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278510913,"owners_count":25998997,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-05T02:00:06.059Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","dpo","fine-tuning","huggingface","huggingface-transformers","llm","pytorch","reinforcement-learning","sft","trl"],"created_at":"2025-09-16T01:50:54.773Z","updated_at":"2025-10-05T19:59:24.330Z","avatar_url":"https://github.com/YanCotta.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Post-Training Techniques for Large Language Models\n\nA comprehensive implementation and educational resource for modern post-training techniques that enhance Large Language Model (LLM) capabilities and alignment.\n\n[![Python 
3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n## 🎯 Overview\n\nThis repository provides production-ready implementations of three key post-training techniques:\n\n- **🎓 Supervised Fine-Tuning (SFT)**: Enhance instruction-following capabilities\n- **⚖️ Direct Preference Optimization (DPO)**: Align models with human preferences\n- **🔄 Online Reinforcement Learning (GRPO)**: Improve task-specific performance with reward signals\n\nAll implementations are based on the **DeepLearning.AI \"Post-training LLMs\" course**, enhanced with professional software engineering practices, comprehensive documentation, and extensible architecture.\n\n## 🌟 Key Features\n\n- **🏗️ Modular Architecture**: Clean, extensible codebase with clear separation of concerns\n- **📚 Educational Notebooks**: Step-by-step tutorials with detailed explanations\n- **⚡ Production Ready**: Professional implementations suitable for real-world applications\n- **🔧 Easy Configuration**: YAML-based configuration for all training parameters\n- **📊 Comprehensive Evaluation**: Built-in metrics and benchmarking tools\n- **🚀 Multiple Interfaces**: Command-line scripts, Python API, and Jupyter notebooks\n- **🎛️ Flexible Models**: Support for various model architectures and sizes\n\n## 📁 Repository Structure\n\n```\npost_training_llms/\n├── src/                          # Core implementation\n│   ├── utils/                    # Utility functions\n│   │   ├── model_utils.py       # Model loading, generation, evaluation\n│   │   ├── data_utils.py        # Dataset preparation and processing\n│   │   ├── config.py            # Unified configuration system\n│   │   └── config_manager.py    # Configuration management utilities\n│   ├── training/           
     # Training pipelines\n│   │   ├── sft_trainer.py       # Supervised Fine-Tuning\n│   │   ├── dpo_trainer.py       # Direct Preference Optimization\n│   │   └── rl_trainer.py        # Online RL with GRPO\n│   └── evaluation/              # Evaluation and metrics\n│       ├── metrics.py           # Performance metrics\n│       └── benchmark.py         # Comprehensive benchmarking\n├── notebooks/                   # Educational tutorials\n│   ├── 01_supervised_fine_tuning.ipynb\n│   ├── 02_direct_preference_optimization.ipynb\n│   └── 03_online_reinforcement_learning.ipynb\n├── examples/                    # Example scripts\n│   ├── run_sft.py              # SFT training example\n│   ├── run_dpo.py              # DPO training example\n│   ├── run_rl.py               # RL training example\n│   ├── run_benchmark.py        # Model evaluation\n│   └── config_utils.py         # Configuration utilities\n├── configs/                     # Configuration files\n│   ├── sft_config.yaml         # SFT parameters\n│   ├── dpo_config.yaml         # DPO parameters\n│   └── rl_config.yaml          # RL parameters\n├── data/                        # Data storage (created at runtime)\n└── models/                      # Model storage (created at runtime)\n```\n\n## ⚙️ Configuration System Architecture\n\nThe unified configuration system provides a robust, type-safe way to manage all training parameters:\n\n### Core Components\n\n- **`BaseConfig`**: Abstract base class with common configuration fields\n- **`SFTConfig`**: Configuration for Supervised Fine-Tuning\n- **`DPOConfig`**: Configuration for Direct Preference Optimization  \n- **`RLConfig`**: Configuration for Reinforcement Learning\n- **`ConfigManager`**: Utility class for configuration operations\n\n### Key Features\n\n- **🔒 Type Safety**: All configurations use Python dataclasses with validation\n- **✅ Data Validation**: Automatic validation of parameter types and ranges\n- **🔄 Inheritance**: Method-specific configs 
inherit from base configuration\n- **📁 YAML Support**: Load/save configurations in human-readable YAML format\n- **🎛️ Command Overrides**: Command-line arguments can override config values\n- **🔧 Utility Functions**: Built-in tools for validation, merging, and conversion\n\n### Configuration Structure\n\n```python\n# Example configuration hierarchy\nBaseConfig\n├── ModelConfig          # Model settings (name, trust_remote_code)\n├── TrainingConfig      # Common training parameters\n│   ├── SFTTrainingConfig\n│   ├── DPOTrainingConfig (with beta parameter)\n│   └── RLTrainingConfig (with num_generations)\n├── DatasetConfig       # Dataset settings\n├── HardwareConfig      # Hardware settings (GPU, mixed precision)\n├── OutputConfig        # Output settings\n└── EvaluationConfig    # Evaluation settings\n```\n\n## 🚀 Quick Start\n\n### Installation\n\n1. **Clone the repository**:\n   ```bash\n   git clone https://github.com/YanCotta/post_training_llms.git\n   cd post_training_llms\n   ```\n\n2. **Install dependencies**:\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n3. 
**Verify installation**:\n   ```bash\n   python -c \"import torch; import transformers; import datasets; import trl; print('✅ All dependencies installed successfully!')\"\n   ```\n\n### Configuration Management\n\nThe project uses a **unified configuration system**, so common settings are defined once and stay consistent across all training methods.\n\n#### Using Configuration Files\n\nThe training scripts accept a configuration file, and command-line arguments override its values:\n\n```bash\n# Use configuration file with overrides\npython examples/run_sft.py \\\n    --config configs/sft_config.yaml \\\n    --learning-rate 1e-4 \\\n    --epochs 2\n```\n\n#### Configuration Utilities\n\nUse the configuration utility script for common operations:\n\n```bash\n# Create a new configuration template\npython examples/config_utils.py create --type sft --output configs/my_config.yaml\n\n# Validate a configuration file\npython examples/config_utils.py validate --config configs/sft_config.yaml\n\n# List all available configurations\npython examples/config_utils.py list --directory configs\n\n# Convert configuration to training arguments\npython examples/config_utils.py convert --config configs/sft_config.yaml\n```\n\n### Testing the Configuration System\n\nA quick smoke check confirms that a default configuration can be created and validated:\n\n```bash\n# Smoke-check the configuration system with a default SFT config\npython -c \"\nfrom src.utils.config import create_default_config\nfrom src.utils.config_manager import ConfigManager\n\n# Create and validate configurations\nsft_config = create_default_config('sft')\nis_valid = ConfigManager.validate_config(sft_config)\nprint(f'Configuration system working: {is_valid}')\n\"\n```\n\n### Running Your First Training\n\n#### Supervised Fine-Tuning (SFT)\n```bash\npython examples/run_sft.py \\\n    --config configs/sft_config.yaml \\\n    --max-samples 100\n```\n\n#### Direct Preference Optimization (DPO)\n```bash\npython examples/run_dpo.py \\\n    --config configs/dpo_config.yaml \\\n    
--new-identity \"My Assistant\" \\\n    --max-samples 50\n```\n\n#### Online Reinforcement Learning (GRPO)\n```bash\npython examples/run_rl.py \\\n    --model \"HuggingFaceTB/SmolLM2-135M-Instruct\" \\\n    --dataset \"openai/gsm8k\" \\\n    --max-train-samples 20 \\\n    --max-eval-samples 10 \\\n    --output-dir \"./models/my_rl_model\"\n```\n\n## 📖 Tutorials\n\n### Interactive Jupyter Notebooks\n\nExplore the techniques through our comprehensive tutorial notebooks:\n\n1. **[Supervised Fine-Tuning Tutorial](notebooks/01_supervised_fine_tuning.ipynb)**\n   - Learn how SFT improves instruction-following\n   - Hands-on training with real datasets\n   - Performance evaluation and analysis\n\n2. **[Direct Preference Optimization Tutorial](notebooks/02_direct_preference_optimization.ipynb)**\n   - Understand preference-based training\n   - Identity modification example\n   - Consistency measurement and evaluation\n\n3. **[Online Reinforcement Learning Tutorial](notebooks/03_online_reinforcement_learning.ipynb)**\n   - Reward-based model improvement\n   - Mathematical reasoning enhancement\n   - GRPO training and evaluation\n\n### Running Notebooks\n\n```bash\njupyter notebook notebooks/\n```\n\n## 🎛️ Configuration\n\nAll training parameters can be customized using YAML configuration files:\n\n### SFT Configuration (`configs/sft_config.yaml`)\n```yaml\nmodel:\n  name: \"HuggingFaceTB/SmolLM2-135M\"\ntraining:\n  learning_rate: 8.0e-5\n  num_train_epochs: 1\n  per_device_train_batch_size: 1\ndataset:\n  name: \"banghua/DL-SFT-Dataset\"\n  max_samples: 1000\n```\n\n### DPO Configuration (`configs/dpo_config.yaml`)\n```yaml\nmodel:\n  name: \"HuggingFaceTB/SmolLM2-135M-Instruct\"\ntraining:\n  beta: 0.2\n  learning_rate: 5.0e-5\nidentity:\n  positive_name: \"Deep Qwen\"\n  organization_name: \"Qwen\"\n```\n\n### RL Configuration (`configs/rl_config.yaml`)\n```yaml\nmodel:\n  name: \"HuggingFaceTB/SmolLM2-135M-Instruct\"\ntraining:\n  learning_rate: 5.0e-6\n  
num_generations: 4\ndataset:\n  name: \"openai/gsm8k\"\n```\n\n## 🔧 API Usage\n\n### Python API Examples\n\n```python\nfrom src.training.sft_trainer import SFTTrainingPipeline\nfrom src.training.dpo_trainer import DPOTrainingPipeline\nfrom src.training.rl_trainer import RLTrainingPipeline\n\n# Supervised Fine-Tuning\nsft_pipeline = SFTTrainingPipeline(\"HuggingFaceTB/SmolLM2-135M\")\nsft_pipeline.setup_training(dataset, learning_rate=8e-5)\nsft_pipeline.train()\n\n# Direct Preference Optimization\ndpo_pipeline = DPOTrainingPipeline(\"HuggingFaceTB/SmolLM2-135M-Instruct\")\ndpo_dataset = dpo_pipeline.create_preference_dataset(raw_dataset)\ndpo_pipeline.setup_training(dpo_dataset, beta=0.2)\ndpo_pipeline.train()\n\n# Online Reinforcement Learning\nrl_pipeline = RLTrainingPipeline(\"HuggingFaceTB/SmolLM2-135M-Instruct\")\nrl_pipeline.setup_training(train_dataset, reward_function)\nrl_pipeline.train()\n```\n\n## 📊 Evaluation and Benchmarking\n\n### Comprehensive Model Evaluation\n\n```bash\npython examples/run_benchmark.py \\\n    --model \"path/to/your/model\" \\\n    --math-samples 50 \\\n    --target-identity \"Your Model Name\" \\\n    --output-file \"benchmark_results.json\"\n```\n\n### Available Metrics\n\n- **Accuracy**: Task-specific performance measurement\n- **Identity Consistency**: Model identity alignment\n- **Safety Score**: Harmful content detection\n- **Perplexity**: Language modeling quality\n- **Math Reasoning**: Mathematical problem-solving ability\n\n## 🎓 Educational Value\n\nThis repository serves as both a practical implementation and an educational resource:\n\n### Learning Objectives\n- **Understand** the theory behind modern post-training techniques\n- **Implement** production-ready training pipelines\n- **Evaluate** model performance across multiple dimensions\n- **Apply** best practices in ML engineering and experimentation\n\n### Based on DeepLearning.AI Course\nThis implementation is based on and extends the **DeepLearning.AI 
\"Post-training LLMs\" course**, providing:\n- Enhanced code organization and modularity\n- Additional evaluation metrics and benchmarks\n- Production-ready implementations\n- Comprehensive documentation and examples\n\n## 🔬 Research and Development\n\n### Supported Models\n- **Small Models**: SmolLM2-135M, SmolLM2-1.7B\n- **Medium Models**: Qwen2.5-0.5B, Qwen2.5-1.5B\n- **Large Models**: Any HuggingFace compatible model\n- **Custom Models**: Easy integration with custom architectures\n\n### Datasets\n- **SFT**: banghua/DL-SFT-Dataset, custom instruction datasets\n- **DPO**: mrfakename/identity, preference pair datasets\n- **RL**: openai/gsm8k, custom reward-based datasets\n\n## 🤝 Contributing\n\nWe welcome contributions! Please see our [contribution guidelines](CONTRIBUTING.md) for details.\n\n### Development Setup\n```bash\n# Clone the repository\ngit clone https://github.com/YanCotta/post_training_llms.git\ncd post_training_llms\n\n# Create development environment\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install development dependencies\npip install -r requirements.txt\npip install -r requirements-dev.txt  # If available\n\n# Run tests\npython -m pytest tests/  # If tests are available\n```\n\n## 📝 Citation\n\nIf you use this repository in your research or projects, please cite:\n\n```bibtex\n@misc{cotta2024posttrainingllms,\n  title={Post-Training Techniques for Large Language Models},\n  author={Yan Cotta},\n  year={2024},\n  url={https://github.com/YanCotta/post_training_llms},\n  note={Based on DeepLearning.AI Post-training LLMs course}\n}\n```\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- **DeepLearning.AI** for the foundational \"Post-training LLMs\" course\n- **Hugging Face** for the transformers library and model ecosystem\n- **TRL Team** for the training utilities and implementations\n- **Open Source 
Community** for the various datasets and tools used\n\n## 📞 Support\n\n- **Issues**: [GitHub Issues](https://github.com/YanCotta/post_training_llms/issues)\n- **Discussions**: [GitHub Discussions](https://github.com/YanCotta/post_training_llms/discussions)\n- **Email**: yanpcotta@gmail.com\n\n---\n\n⭐ **Star this repository** if you find it useful for your LLM post-training projects!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyancotta%2Fpost_training_llms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyancotta%2Fpost_training_llms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyancotta%2Fpost_training_llms/lists"}