Post fine-tuning LLMs with Direct Preference Optimization

Thanh Long Phan

Since the publication of our previous blog on Reinforcement Learning from Human Feedback (RLHF), an alternative algorithm has been introduced, that does not require the use of a reward model to fine-tune Large Language Models (LLMs) based on human preferences. This method is called Direct Preference Optimization (DPO) and was introduced in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model, which was one of the best papers at NeurlPS 2023. Well-known open-source models such as Mixtral 8x7B have been optimized using DPO. At the time of writing this blog, Meta also released their new Llama 3 models, which also utilize DPO for fine-tuning. And one week later, Microsoft has unveiled Phi-3, also leveraging DPO for its optimization processes.


Fine tuning LLMs through instruction dataset and human-written completions can significantly enhance their performance in various tasks and ensure alignment with user intent. While instruction fine-tuning has shown promise, it often demands multiple experts to create completions. Another effective method involves leveraging human judgements to guide model refinement, where users determine their preferred completions, forming the basis for fine-tuning via approaches such as Reinforcement Learning with Human Feedback (RLHF), which notably requires only relative human judgment, making data collection more manageable.

However, this process entails two additional steps to fine-tune a pre-trained LLMs. Reinforcement learning, though powerful, demands substantial data input. Firstly, a reward model must be trained, followed by optimizing the pre-trained model alongside the reward model without drifting too far from the original pretrained model. Notably, such training may pose challenges on consumer-grade GPUs due to resource constraints, and it is a complex and often unstable procedure.

In contrast, DPO directly optimizes the model with a simple classification objective without the need of training a reward model.

How does DPO work?

Similar to RLHF, DPO requires a pre-trained LLM. However, the training process of DPO is conducted in a single step. We proceed directly to the loss function, which represents the core of DPO.

$$\begin{aligned} \mathcal{L}_{DPO}(\pi_\theta;\pi_{ref})=-\mathbb{E}_{(x\ , y_w\ , y_l) \sim D}&\Bigl(\log\sigma\Bigl(\\ &\beta \log \frac{\pi_\theta(y_w\ |\ x)}{\pi_{ref}(y_w\ |\ x)} -\beta\log\frac{\pi_\theta(y_l\ |\ x)}{\pi_{ref}(y_l\ |\ x)} \Bigl)\Bigl) \end{aligned}$$

The dataset $$D$$ comprises triplets, where $$x$$ represents the input prompt, and $$y_w$$ and $$y_l$$ denote the 2 completions. $$y_w$$ indicates the user's preferred completion, while $$y_l$$ the non-preferred completion. $$\beta$$ represents a non-negative constant, while $$\sigma$$ denotes the sigmoid activation function. And $$\mathbb{E}$$ is the expectation.

In the process of fine-tuning a pre-trained LLM, the approach may be analogous to RLHF. This entails the creation of an additional copy of the pre-trained LLM, one of which has its weights frozen, $$\pi_{ref}$$ , also known as the reference model, while the other, $$\pi_\theta$$, is updated.

By defining the reward function as 

$$r_\theta(x,\ y)= \beta \log\frac{\pi_\theta(y\ |\ x)}{\pi_{ref}(y\ |\ x)},$$

we can see that, to minimize the loss function, the reward function has to assign a higher reward score to the preferred completion than to the non-preferred completion. By defining the reward this way, your language model is secretly a reward model.

This approach ensures that preferred completions are rewarded more favorably, thereby aligning with the main objective of the implicit reward model, which is to elevate the reward for preferred completions overall.


The training process can be efficiently managed using the trl library from Hugging Face ecosystem. We have created a simple script for you to utilize in training your LLM.  However, please note that the provided script is currently tailored to work with the argilla/dpo-mix-7k dataset, if you intend to employ a different dataset, you will need to adjust the code within the prepare_examples_for_training function accordingly.

Moreover, the training process can be improved by employing the HfArgumentParser, facilitating dynamic parameters adjustments. The script is designed to support training across multiple GPUs, as including DeepSpeed Zero. If you wish to offload the parameters during training to another GPUs/CPUs, it is advisable to utilize ZeRo Stage-2, as Stage-3 is not yet compatible with 4-bit quantization.

Test your LLM

You can load your fine-tuned model and test the generation with the following script:

from transformers import BitsAndBytesConfig
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
quantization_config = BitsAndBytesConfig(
# if you fine-tuned model with LORA
# otherwise you can just load the weights directly with
# AutoModelForCausalLM.from_pretrained
peft_config = PeftConfig.from_pretrained("path_to_your_trained_model")
# load the original weights
base_model = AutoModelForCausalLM.from_pretrained(
# combine lora with the original weights
model = PeftModel.from_pretrained(
pipe = pipeline(
   # temperature=temperature,
question = “Your question here”
completion = pipe(tokenizer.apply_chat_template(

The full code for the training can be found on github.

Additional Details

During the period in which this blog was written, PyTorch released a library, Torchtune, which enables users to easily fine-tune and experiment with LLM. At present, it supports Llama2, Mistral, Gemma and the new Llama model, Llama3 on full training and DPO. Torch tune is a user-friendly tool that enables users to easily download and start their DPO training with just a YAML file. For more information on how to start the training, please refer to the torchtune documentation.

While DPO has been shown to offer advantages over RLHF, it is not necessary to use DPO exclusively. The post-training process for the new Llama 3 combines supervised fine-tuning (SFT), reject sampling, proximal policy optimization (PPO), also known as reinforcement learning from human feedback (RLHF), and DPO.

Last month Jiwoo Hong et al. released the paper ORPO: Monolithic Preference Optimization without Reference Model, which introduces a new preference alignment algorithm that does not require a reference model during training. The HuggingFace library trl ,with the version trl>=0.8.2 , supports this training method with its ORPOTrainer. With a small change in the training code via replacing DPOTrainer with ORPOTrainer, you can fine-tune your models using ORPO-method. The code I test out still uses an old version of trl, so do not forget to update trl if you want to test ORPO also.