{"id":25758061,"url":"https://github.com/martimfasantos/custompos-for-slms","last_synced_at":"2025-10-29T07:06:49.898Z","repository":{"id":270774606,"uuid":"911416749","full_name":"martimfasantos/CustomPOs-for-SLMs","owner":"martimfasantos","description":"Novel Preference Optimization Algorithms for state-of-the-art small LMs, enhancing performance in GenAI and NLP tasks","archived":false,"fork":false,"pushed_at":"2025-01-05T21:51:07.000Z","size":279,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-26T16:47:45.394Z","etag":null,"topics":["evaluation","gen-ai","human-preferences","llms","nlp","preference-learning","preference-optimization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/martimfasantos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-03T01:08:28.000Z","updated_at":"2025-01-18T21:12:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"778212e3-a31a-4346-ad04-3dde08a5d2e4","html_url":"https://github.com/martimfasantos/CustomPOs-for-SLMs","commit_stats":null,"previous_names":["martimfasantos/custompos-for-slms"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/martimfasantos/CustomPOs-for-SLMs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martimfasantos%2FCustomPOs-for-SLMs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martimfasantos%2FCustomPOs-for-SLMs/tags","releases_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/martimfasantos%2FCustomPOs-for-SLMs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martimfasantos%2FCustomPOs-for-SLMs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/martimfasantos","download_url":"https://codeload.github.com/martimfasantos/CustomPOs-for-SLMs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martimfasantos%2FCustomPOs-for-SLMs/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265265826,"owners_count":23737112,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","gen-ai","human-preferences","llms","nlp","preference-learning","preference-optimization"],"created_at":"2025-02-26T16:37:00.037Z","updated_at":"2025-10-29T07:06:44.846Z","avatar_url":"https://github.com/martimfasantos.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DPO-γ \u0026 SLiC-DPO - New Preference Optimization Methods\r\n\r\nThis repository contains the code and released models for **DPO-γ** and **SLiC-DPO**, two innovative methods for preference optimization. 
These approaches offer substantial improvements over state-of-the-art methods like Direct Preference Optimization (DPO) when applied to Small Language Models (SLMs)[^1], particularly in tasks such as machine translation and summarization.\r\n\r\nNotably, results demonstrate that our proposed approach, **DPO-γ, outperforms all other preference optimization methods considered in the machine translation and summarization tasks**. \r\n\r\n:robot: All trained models are available [here](https://huggingface.co/martimfasantos).\r\n\r\n:scroll: For a detailed exploration of the algorithms and results, refer to my [thesis repository](https://github.com/martimfasantos/MSc-Thesis). \r\n\r\n[^1]: The definition of “small” is subjective and relative. We take 5B parameters as the upper limit on the size of SLMs, following state-of-the-art papers.\r\n\r\n---\r\n\r\n\r\n## DPO-γ\r\n\r\nThis algorithm combines the policy objective from DPO with SimPO’s target reward margin term, γ \u003e 0, which is introduced into the Bradley-Terry objective to ensure that the reward for the winning response, _r(x, y\u003csub\u003ew\u003c/sub\u003e)_, exceeds the reward for the losing response, _r(x, y\u003csub\u003el\u003c/sub\u003e)_, by at least γ. The resulting objective function for DPO-γ is defined as follows:\r\n\r\n![alt text](res/dpo-gamma-objective.png)\r\n\r\nwhere π\u003csub\u003eref\u003c/sub\u003e = π\u003csub\u003eSFT\u003c/sub\u003e. Similarly to DPO, an implicit reward can be fitted in such a way that the optimal policy simply becomes π\u003csub\u003eθ\u003c/sub\u003e.\r\n\r\n---\r\n\r\n## SLiC-DPO\r\n\r\nAlso in line with DPO’s objective, we implement another custom approach, **SLiC-DPO**, which integrates DPO’s reward reparameterization with SLiC’s rank calibration loss. 
This combination results in the following loss function:\r\n\r\n![alt text](res/slic-dpo-objective.png)\r\n\r\n---\r\n\r\n\r\n### Hyperparameter tuning\r\nHyperparameter tuning is crucial for these algorithms (and for preference optimization algorithms in general). The three main hyperparameters of DPO-γ and SLiC-DPO to focus on are `learning_rate`, `beta`, and `gamma` (we recommend keeping the total batch size fixed at 128 for machine translation and 64 for summarization).\r\n- `learning_rate`: This is the most critical hyperparameter for preference optimization. A large learning rate (e.g., 1e-5) can significantly degrade performance, causing the model to produce incoherent sentences or completely repetitive responses. We recommend grid searching over 5e-8, 1e-7, 2e-7, and 3e-7, if resources allow. We find that a smaller learning rate (e.g., 1e-7) is more suitable for reasoning-intensive domains like math for DPO, DPO-γ, and SLiC-DPO.\r\n  \r\n- `beta`: Beta controls the reward scaling between winning and losing responses. DPO-γ requires a `beta` similar to DPO’s. In our work, we used a `beta` of `0.1`, but further analysis and tuning of this value could yield better results.\r\n  \r\n- `gamma`: Gamma controls the target reward margin. We recommend using `0.5` as a starting point for `gamma` and grid searching between `0` and `1`. 
A well-tuned `gamma` can provide a modest improvement for our algorithms, but it is not as critical as other hyperparameters.\r\n\r\n**Training Hyperparameters for Released Models for both tasks**\r\n| Setting           | β   | γ   | Learning rate |\r\n|-------------------|-----|-----|----------------|\r\n| [EuroLLM-1.7B-Instruct](https://huggingface.co/utter-project/EuroLLM-1.7B-Instruct)      | 0.1 | 0.5 | 1e-7           |\r\n| [Gemma-2-2B-IT](https://huggingface.co/google/gemma-2-2b-it)     | 0.1 | 0.5 | 1e-7           |\r\n| [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T)     | 0.1 | 0.5 | 1e-7           |\r\n\r\nThese parameters were obtained by comparing the stability and performance of the algorithms across all models. This ensured a reliable comparison of RL-free preference optimization methods on small LMs and enabled a thorough evaluation of the models’ ability to align with human preferences. \r\n\r\nWe conducted multiple initial experiments that allowed us to evaluate the effects of different hyperparameters on training stability, performance, and alignment with human preferences. \r\n\r\n**Full Hyperparameter Ranges**\r\n\r\n| Parameter                     | Values                            |\r\n|-------------------------------|------------------------------------|\r\n| Learning Rate             | {5 × 10⁻⁸, 1 × 10⁻⁷, 2 × 10⁻⁷, 3 × 10⁻⁷} |\r\n| Batch Size                | {16, 32, 64, 128}                |\r\n| Beta (β)                  | {0.01, 0.05, 0.1}                |\r\n| Number of Epochs          | {1, 2, 3}                        |\r\n| Warmup Ratio              | {0.1, 0.15}                      |\r\n| Gradient Accumulation Steps | {4, 8, 16, 32}                 |\r\n| Number of Devices         | {1, 2, 4}                        |\r\n\r\n\r\n---\r\n\r\n## Install Requirements\r\n\r\nThe codebase is built upon the [alignment-handbook repo](https://github.com/huggingface/alignment-handbook). 
The following steps will guide you through the installation process.\r\n\r\nFirst, create a Python virtual environment, e.g. with `venv`, and activate it:\r\n```shell\r\npython3 -m venv venv\r\nsource venv/bin/activate\r\n```\r\n\r\nNext, install PyTorch `v2.2.2`. Since this is hardware-dependent, we\r\ndirect you to the [PyTorch Installation Page](https://pytorch.org/get-started/previous-versions/).\r\n\r\n```shell\r\n# CUDA 12.1\r\npip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121\r\n```\r\n\r\nEnsure that CUDA `12.1` is installed on your system. If it's not already installed, you can download it from the official NVIDIA website: [CUDA 12.1 Download Archive](https://developer.nvidia.com/cuda-12-1-0-download-archive).\r\n\r\nYou can then install the remaining package dependencies of [alignment-handbook](https://github.com/huggingface/alignment-handbook) as follows:\r\n\r\n```shell\r\ngit clone https://github.com/huggingface/alignment-handbook.git\r\ncd ./alignment-handbook/\r\npython -m pip install .\r\n```\r\n\r\nYou will also need Flash Attention 2 installed, which can be done by running:\r\n\r\n```shell\r\npython -m pip install flash-attn --no-build-isolation\r\n```\r\n\r\n## Training Scripts\r\n\r\nWe provide example training config files for the training setups reported in our work. The training configs are optimized for 4 NVIDIA RTX A6000 48GB GPUs. You may need to adjust `num_processes` and `per_device_train_batch_size` based on your computation environment. 
\r\n\r\n#### Machine Translation\r\n* EuroLLM 1.7B:\r\n```shell\r\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml algorithms/run_custom_po.py training_configs/eurollm-1.7b/mt/eurollm-1.7b-it-mt-dpo-gamma.yaml\r\n```\r\n* Gemma-2 2B:\r\n```shell\r\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml algorithms/run_custom_po.py training_configs/gemma-2-2b/mt/gemma-2-2b-it-mt-dpo-gamma.yaml\r\n```\r\n* TinyLlama 1.1B:\r\n```shell\r\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml algorithms/run_custom_po.py training_configs/tinyllama-1.1b/mt/tinyllama-1.1b-mt-dpo-gamma.yaml\r\n```\r\n\r\n#### Summarization\r\n* Gemma-2 2B:\r\n```shell\r\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml algorithms/run_custom_po.py training_configs/gemma-2-2b/sum/gemma-2-2b-sum-dpo-gamma.yaml\r\n```\r\n* TinyLlama 1.1B:\r\n```shell\r\nACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml algorithms/run_custom_po.py training_configs/tinyllama-1.1b/sum/tinyllama-1.1b-sum-dpo-gamma.yaml\r\n```\r\n\r\n## Evaluation\r\n\r\nWe compare our models against state-of-the-art models using commonly used automatic metrics:\r\n\r\n| Task                  | Metrics                                                    |\r\n|-----------------------|------------------------------------------------------------|\r\n| **Machine Translation** | chrF ([Popović, 2015](#)) ↑                              |\r\n|                       | BLEU ([Papineni et al., 2002](#)) ↑                        |\r\n|                       | COMET-22 ([Rei et al., 2020](#)) ↑                         |\r\n|                       | XCOMET ([Guerreiro et al., 2023](#)) ↑                     |\r\n| **Summarization**      | ROUGE ([Lin, 2004](#)) ↑                                  |\r\n|                 
      | METEOR ([Banerjee and Lavie, 2005](#)) ↑                   |\r\n|                       | BERTScore ([Zhang et al., 2020](#)) ↑                      |\r\n|                       | Reward ([Song et al., 2023](#)) ↑                          |\r\n\r\nWe also study the effectiveness of preference learning algorithms by **assessing their alignment with human preferences**, especially for small language models.\r\n\r\n:dart: Our results show that **DPO-γ outperforms all preference optimization methods considered in the machine translation and summarization tasks** and **is capable of further aligning the models with human preferences.**\r\n\r\nFor further results and details on the evaluation and alignment with human preferences, please refer to [this repo](https://github.com/martimfasantos/MSc-Thesis).\r\n\r\n\r\n## Bugs or Questions?\r\nIf you encounter any problems when using the code, or want to report a bug, feel free to open an issue! Please describe the problem in detail so we can help you better and more quickly!\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartimfasantos%2Fcustompos-for-slms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmartimfasantos%2Fcustompos-for-slms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartimfasantos%2Fcustompos-for-slms/lists"}