Getting Started with Hugging Face TRL: Fine‑tune LLaVA using DPO

This guide introduces Hugging Face's TRL library, explains how to install it alongside Transformers, and walks through modifying LLaVA's trainer, dataset, and data collator to apply the DPO reinforcement‑learning algorithm for multimodal model fine‑tuning.


Hugging Face's TRL (Transformer Reinforcement Learning) library is an open‑source toolkit designed for reinforcement‑learning‑based fine‑tuning of large language models, supporting models such as Qwen and LLaMA and algorithms like PPO and DPO.

Because TRL builds on the Transformers library, both packages must be installed: pip install transformers trl. This guide uses the multimodal LLaVA model as a concrete example and adopts the RLHF-V preference dataset for training.
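As a quick sanity check, the preference pairs can be inspected with the datasets library. The Hub path and field names below are assumptions based on the published RLHF-V data and may differ from the copy used in the original article:

```python
from datasets import load_dataset

# Hypothetical Hub path; substitute the location of the RLHF-V copy you use.
ds = load_dataset("HaoyeZhang/RLHF-V-Dataset", split="train")

sample = ds[0]
# Each record should carry an image reference plus a preference pair:
# the question, the preferred ("chosen") answer, and the rejected answer.
print(sample.keys())
```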

First, the llava_trainer.py file is edited: the original LLaVATrainer, a subclass of transformers.Trainer, is rebased onto trl.DPOTrainer so that the trainer optimizes the DPO preference loss instead of the supervised objective.
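A minimal sketch of the swap; the real file keeps LLaVA's sampler and checkpointing overrides unchanged, only the base class moves:

```python
# llava/train/llava_trainer.py (sketch)
from trl import DPOTrainer


class LLaVATrainer(DPOTrainer):
    # Was: class LLaVATrainer(transformers.Trainer)
    # Inheriting from DPOTrainer swaps the SFT cross-entropy objective for
    # the DPO preference loss; LLaVA-specific overrides stay as they were.
    pass
```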

Next, the LazySupervisedDataset class is updated. Its __getitem__ method now reads an image path and both the chosen and the rejected answer from the RLHF-V dataset, wrapping each answer in LLaVA's conversation template.
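A sketch of the reworked method follows. The record field names (image, question, chosen, rejected) are assumed from the RLHF-V schema; conversation_lib is LLaVA's own conversation module:

```python
import os

from PIL import Image
from llava import conversation as conversation_lib


def __getitem__(self, i):
    sample = self.list_data_dict[i]
    image = Image.open(
        os.path.join(self.image_folder, sample["image"])).convert("RGB")
    image_tensor = self.image_processor(
        image, return_tensors="pt")["pixel_values"][0]

    def build_prompt(answer):
        # Render question + answer with LLaVA's dialogue template.
        conv = conversation_lib.default_conversation.copy()
        conv.append_message(conv.roles[0], sample["question"])
        conv.append_message(conv.roles[1], answer)
        return conv.get_prompt()

    chosen_prompt = build_prompt(sample["chosen"])
    rejected_prompt = build_prompt(sample["rejected"])
    # ...tokenization continues in the next snippet...
```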

Following the LLaVA SFT data-processing pipeline, both the chosen and the rejected conversation are tokenized into input IDs, labels are derived by masking everything before the assistant's answer, and __getitem__ returns a single dictionary carrying both sequences plus the image.
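Continuing the sketch above; encode_pair and the returned key names are illustrative, but whatever keys are chosen must match what the collator later reads:

```python
import torch

IGNORE_INDEX = -100  # same sentinel the LLaVA SFT pipeline uses


def encode_pair(tokenizer, full_prompt, prompt_without_answer):
    """Tokenize a full conversation and mask the non-answer prefix."""
    input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids[0]
    labels = input_ids.clone()
    prefix_len = len(tokenizer(prompt_without_answer).input_ids)
    labels[:prefix_len] = IGNORE_INDEX  # loss is computed on the answer only
    return input_ids, labels


# Inside __getitem__, after building the two prompts
# (question_prompt is the conversation rendered without any answer):
chosen_ids, chosen_labels = encode_pair(
    tokenizer, chosen_prompt, question_prompt)
rejected_ids, rejected_labels = encode_pair(
    tokenizer, rejected_prompt, question_prompt)
item = dict(
    chosen_input_ids=chosen_ids,
    chosen_labels=chosen_labels,
    rejected_input_ids=rejected_ids,
    rejected_labels=rejected_labels,
    image=image_tensor,
)
```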

The DataCollatorForSupervisedDataset class in train.py is also revised. Its __call__ method now pads the chosen and rejected sequences separately, derives attention masks, and stacks the image tensors into a single training batch.
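A sketch of the revised collator, assuming the key names produced by the dataset above:

```python
from dataclasses import dataclass

import torch
import transformers

IGNORE_INDEX = -100


@dataclass
class DataCollatorForSupervisedDataset:
    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances):
        batch = {}
        for key in ("chosen", "rejected"):
            ids = [inst[f"{key}_input_ids"] for inst in instances]
            labels = [inst[f"{key}_labels"] for inst in instances]
            batch[f"{key}_input_ids"] = torch.nn.utils.rnn.pad_sequence(
                ids, batch_first=True,
                padding_value=self.tokenizer.pad_token_id)
            batch[f"{key}_labels"] = torch.nn.utils.rnn.pad_sequence(
                labels, batch_first=True, padding_value=IGNORE_INDEX)
            batch[f"{key}_attention_mask"] = \
                batch[f"{key}_input_ids"].ne(self.tokenizer.pad_token_id)
        # One image per preference pair; duplicated later for both halves.
        batch["images"] = torch.stack([inst["image"] for inst in instances])
        return batch
```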

Because the TRL version used here does not support multimodal models, the concatenated_forward method of DPOTrainer is overridden so that the image batch is passed through to the model during training.
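A sketch of the override, written against the method and key names of older TRL releases (concatenated_inputs, _get_batch_logps); newer TRL versions have reorganized these internals and ship multimodal support of their own, so treat this as illustrative:

```python
from trl import DPOTrainer


class LLaVADPOTrainer(DPOTrainer):
    def concatenated_forward(self, model, batch):
        # Stack chosen and rejected examples along the batch dimension.
        concat = self.concatenated_inputs(batch)
        len_chosen = batch["chosen_input_ids"].shape[0]

        # The actual patch: repeat the images so both halves of the
        # concatenated batch see them, and forward them through LLaVA's
        # `images` keyword argument.
        images = batch["images"].repeat(2, 1, 1, 1)
        all_logits = model(
            input_ids=concat["concatenated_input_ids"],
            attention_mask=concat["concatenated_attention_mask"],
            images=images,
        ).logits

        all_logps = self._get_batch_logps(
            all_logits, concat["concatenated_labels"], average_log_prob=False)
        return (all_logps[:len_chosen], all_logps[len_chosen:],
                all_logits[:len_chosen], all_logits[len_chosen:])
```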

After these modifications, the DPO fine‑tuning pipeline for LLaVA is complete and can be executed using LLaVA’s existing training scripts.

Reference material:

https://huggingface.co/docs/trl/index

https://github.com/haotian-liu/LLaVA

https://github.com/YiyangZhou/CSR


Tags: RLHF, Multimodal LLM, DPO, Hugging Face, LLaVA, TRL
Written by: Network Intelligence Research Center (NIRC)

NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems) and is dedicated to solving real-world problems, building top-tier systems, publishing high-impact papers, and advancing China's network technology.
