Mitigating Hallucinations in Large Vision-Language Models via DPO:
On-Policy Data Hold the Key

Zhihe Yang1,3  Xufang Luo2  Dongqi Han2  Yunjian Xu1,3   Dongsheng Li2
1The Chinese University of Hong Kong, Hong Kong SAR, China
2Microsoft Research Asia, Shanghai, China
3The Chinese University of Hong Kong, Shenzhen Research Institute (SZRI), Guangdong, China
The work was conducted during the internship of Zhihe Yang (zhyang@link.cuhk.edu.hk) at Microsoft Research Asia. Corresponding author (xufang.luo@microsoft.com). Corresponding author (yjxu@mae.cuhk.edu.hk).
Abstract

Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data is on-policy w.r.t. the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data samples, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.

1 Introduction

Figure 1: (a) OPA-DPO motivation: Naive adoption of DPO struggles to learn off-policy preferred responses due to the substantial reverse KL-divergence constraint (induced by unmatched supports). Our OPA operation aligns these responses on-policy, enabling effective learning with subsequent DPO. (b) Data scale vs. performance: We present the AMBER hallucination rates for various DPO-based algorithms and their training data volume. OPA-DPO (star markers) achieves SOTA performance with a minimal amount of data. (c) Impact of OPA: Using LLaVA-1.5-13B with 4.8k data, we evaluate the performance of DPO with and without OPA operations. The inclusion of OPA significantly enhances performance compared to DPO alone.

Recent advancements in instruction-following Large Vision-Language Models (LVLMs) have achieved significant milestones [1, 2, 3, 4]. By integrating pre-trained vision encoders with Large Language Models (LLMs) and fine-tuning them on instruction-based datasets, the combined models demonstrate remarkable image understanding capabilities [5, 6, 7]. This technology shows considerable potential across various fields, including image captioning [6], pathology recognition [8, 9], and medical imaging diagnostics [10, 11]. Nevertheless, a significant barrier hinders their practical application: hallucinations [12, 13, 14], which refer to discrepancies between the image’s actual content and the model-generated text. This issue is pronounced in LVLMs. Even the most advanced GPT-4V [4] exhibits hallucinations in 45.9% of responses for certain tasks [15].

Figure 2: We categorize existing DPO-based algorithms for addressing hallucination issues in LVLMs into 3 classes: (1) Hallucination Injection (POVID [16] and HALVA [17]). The ground-truth response is preferred, while the rejected response contains injected hallucinations. Since the errors do not originate from the model itself, the policy is unlikely to benefit from training. (2) Hallucination Recognition (RLHF-V [15], HA-DPO [18] and HSA-DPO [19]). The model generates responses, after which experts (AI or human) identify errors and make revisions. The off-policy nature of the revised responses makes them challenging to learn effectively. (3) Self Evolution (RLAIF-V [20]). Both preferred and rejected responses are generated by the initial policy. A superior model assesses hallucinations, preferring the response with fewer errors. However, hallucinations may exist in both responses, thereby affecting the learning efficiency.

Among numerous studies aimed at reducing hallucinations in LVLMs, a notable approach is to further fine-tune the models using Reinforcement Learning from Human Feedback (RLHF) [15, 21] or AI Feedback (RLAIF) [19, 18, 20]. RLH(AI)F steers the model in the correct direction by fine-tuning it with constructed preference pairs, where the winning response contains fewer hallucinations than the losing one for the same image and prompt. RLH(AI)F algorithms can be broadly categorized into two classes: Proximal Policy Optimization (PPO) [22] and Direct Preference Optimization (DPO) [23]. DPO is generally simpler than PPO in practice because it eliminates the need for reward model training and the online rollout process in dataset generation, relying solely on pre-collected offline data.

Indeed, PPO [22] and DPO [23] share the same learning objective: to maximize the reward derived from the Bradley-Terry model [24] while constraining the Kullback-Leibler (KL) divergence [25] between the updated policy and the initial (reference) policy. However, a key distinction arises during the training process: PPO, an online and on-policy algorithm, requires the use of online rollout data, whereas DPO relies entirely on offline datasets, which may be collected by any policy in practice [26, 27]. Therefore, owing to the constraint of reverse KL-divergence, the data used in the PPO training process is predominantly on-policy in relation to the initial (reference) policy. (In this paper, on-policy denotes data with high sampling probability under the initial policy, while off-policy indicates low or near-zero probability.) In contrast, for DPO, where the policy is trained completely offline, the alignment of training data with the reference policy is rarely considered, especially for LVLM works.

In this paper, we reveal a key insight: the on-policy property of training data, which was neglected by DPO-based algorithms used to train LVLMs, plays a crucial role in enabling effective DPO training. As illustrated in Figure 1a left, strictly off-policy preferred responses cannot be learned by naive DPO. The limitation stems from the fact that assigning even a small positive probability to such off-policy data leads to substantially large KL-divergence between updated policy and reference policy, due to the mismatches in their support (see detailed analysis in Chapter 3).

Based on the on-policy property, we classify existing algorithms that employ DPO to tackle the hallucination issues for LVLMs into three distinct categories, as shown in Figure 2. Despite minor differences in datasets and training parameters, Method 3 significantly outperforms the other two methods. This can be attributed to Method 3’s exclusive use of on-policy preference pairs, whereas in the other two methods either the preferred or the rejected response is off-policy. Notably, Method 3 has an inherent limitation: persistent hallucinations may exist in both preferred and rejected responses, since all responses are generated by the policy to be updated. This shortcoming results in inefficient learning and the need for a large volume of data to achieve satisfactory performance.

Converting these insights into solutions, we propose a novel framework: On-Policy Alignment (OPA)-DPO (cf. Figure 1a right), which ensures that data remains on-policy while simultaneously leveraging expert guidance to improve learning efficiency of LVLMs. We first utilize GPT-4V [4] to recognize hallucination and deliver fine-grained revisions to the model-generated responses. Then we align these off-policy adjustments on-policy through fine-tuning the initial policy. The operation allows the subsequent DPO training to circumvent the constraints imposed by KL-divergence and effectively incorporate these changes.

Our contributions are threefold: (1) We identify an intrinsic property of DPO: its high reliance on on-policy data. (2) We summarize the inherent flaws of existing algorithms that employ DPO to address the hallucination problem. (3) Building upon the identified shortcomings of existing methods, we propose OPA-DPO, a novel framework that utilizes 4.8k data to achieve state-of-the-art (SOTA) performance on hallucination benchmarks, surpassing previous methods relying on larger datasets (Figure 1b,c).

2 Preliminary

Large Vision-Language Models.

LVLMs represent a class of multimodal models that integrate visual and linguistic information to generate outputs in natural language [14]. Typically, LVLMs comprise three components [5]: a visual encoder, a modality connection module, and an LLM. The visual encoder transforms input images ($\mathbf{m}$) into visual tokens. The connection module aligns these visual tokens with the LLM’s word embedding space. Combined with a user-provided linguistic prompt ($\mathbf{x}$), the LLM generates a response ($\mathbf{y}$) in an auto-regressive manner.

Direct Preference Optimization.

To further enhance the performance of LVLMs, RLHF/RLAIF necessitates a reward model $r(\mathbf{x},\mathbf{y},\mathbf{m})$, which evaluates human preferences for the response $\mathbf{y}$ given the prompt $\mathbf{x}$ and the image $\mathbf{m}$. The fundamental learning objective is expressed as

$$
\max_{\pi_\theta} \mathbb{E}_{\mathcal{D}}\left[r(\mathbf{x},\mathbf{y},\mathbf{m})\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(\cdot|\mathbf{x},\mathbf{m}) \,\|\, \pi_{\mathrm{ref}}(\cdot|\mathbf{x},\mathbf{m})\right], \tag{1}
$$

where $\mathcal{D}$ represents the dataset from which the prompts and images are sampled, $\mathbb{D}_{\mathrm{KL}}$ stands for the KL-divergence, and $\beta$ controls the degree of regularization. DPO [23] derives the closed-form optimal solution for Eq. (1) and identifies that the reward function can be analytically expressed via

$$
r(\mathbf{x},\mathbf{y},\mathbf{m}) = \beta \log \frac{\pi_\theta(\mathbf{y}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{ref}}(\mathbf{y}|\mathbf{x},\mathbf{m})} + \beta \log Z(\mathbf{x},\mathbf{m}), \tag{2}
$$

where $Z(\mathbf{x},\mathbf{m})$ is a partition function that depends only on the prompt $\mathbf{x}$ and the image $\mathbf{m}$. Combined with the Bradley-Terry model [24] and a dataset comprising preference pairs ($\mathbf{y}_w$ preferred over $\mathbf{y}_l$) for the same prompt $\mathbf{x}$ and image $\mathbf{m}$, the model can be directly optimized through

$$
\begin{aligned}
\mathcal{L}_{\mathrm{DPO}} &= -\mathbb{E}_{\mathcal{D}}\left[\log \sigma\left(r(\mathbf{x},\mathbf{y}_w,\mathbf{m}) - r(\mathbf{x},\mathbf{y}_l,\mathbf{m})\right)\right] \\
&= -\mathbb{E}_{\mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(\mathbf{y}_w|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{ref}}(\mathbf{y}_w|\mathbf{x},\mathbf{m})} - \beta \log \frac{\pi_\theta(\mathbf{y}_l|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{ref}}(\mathbf{y}_l|\mathbf{x},\mathbf{m})}\right)\right],
\end{aligned} \tag{3}
$$

where $\sigma(\cdot)$ denotes the sigmoid function.
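As a rough illustration of Eq. (3), the following sketch evaluates the DPO loss from sequence-level log-probabilities. The function names, the dictionary keys, and the value `beta=0.1` are assumptions made for this example and are not taken from the paper. Note that the $\beta \log Z(\mathbf{x},\mathbf{m})$ term of Eq. (2) cancels in the reward difference, so it never needs to be computed.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pairs, beta=0.1):
    """Average DPO loss of Eq. (3) over a list of preference pairs.

    Each pair is a dict of sequence-level log-probabilities:
    policy_w, policy_l (under pi_theta) and ref_w, ref_l (under pi_ref).
    The key names are hypothetical.
    """
    total = 0.0
    for p in pairs:
        r_w = beta * (p["policy_w"] - p["ref_w"])  # implicit reward of y_w
        r_l = beta * (p["policy_l"] - p["ref_l"])  # implicit reward of y_l
        total += -log_sigmoid(r_w - r_l)
    return total / len(pairs)


# Before any update pi_theta equals pi_ref, so both implicit rewards are 0
# and the loss equals -log(0.5) = log(2).
initial = [{"policy_w": -12.0, "ref_w": -12.0, "policy_l": -9.0, "ref_l": -9.0}]
loss_at_init = dpo_loss(initial)

# Raising log pi_theta(y_w) and lowering log pi_theta(y_l) reduces the loss.
improved = [{"policy_w": -10.0, "ref_w": -12.0, "policy_l": -11.0, "ref_l": -9.0}]
loss_improved = dpo_loss(improved)
```

In practice the log-probabilities would come from summing token-level log-probabilities of the LVLM over each response; the scalar inputs here only keep the sketch self-contained.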

Supervised Fine Tuning.

As the most commonly used technique for LLMs and multimodal LLMs, Supervised Fine Tuning (SFT) is a simple and efficient method to align pre-trained models with downstream tasks. Given a dataset $\mathcal{D}$ including prompts $\mathbf{x}$, images $\mathbf{m}$, and the corresponding standard responses $\mathbf{y}$, the training loss for SFT is

$$
\mathcal{L}_{\mathrm{SFT}} = -\mathbb{E}_{\mathcal{D}}\left[\sum_{i}^{L} \sum_{c}^{C} \mathbb{I}(y_i^c) \log \pi_\theta(y_i^c|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})\right], \tag{4}
$$

where $L$ is the length of the response, $C$ is the number of possible classes or tokens, $\mathbb{I}(y_i^c)$ is an indicator function that equals 1 if the $i$-th token is of class $c$ and 0 otherwise, and $\pi_\theta(y_i^c|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})$ represents the model’s predicted probability of the $i$-th token given the prompt $\mathbf{x}$, image $\mathbf{m}$, and the sequence of preceding tokens $\mathbf{y}_{<i}$.
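To make the role of the indicator in Eq. (4) concrete, here is a minimal sketch for a single response. Because $\mathbb{I}(y_i^c)$ is non-zero for exactly one class per position, the double sum collapses to the negative log-likelihood of the reference tokens. The function name and the toy probabilities are assumptions for illustration only.

```python
import math


def sft_loss(token_distributions, target_ids):
    """Negative log-likelihood of one response, following Eq. (4).

    token_distributions[i] is the model's probability vector over the C
    vocabulary entries at position i (already conditioned on x, m, y_<i).
    target_ids[i] is the class index c of the reference token at position i.
    """
    loss = 0.0
    for probs, target in zip(token_distributions, target_ids):
        # The indicator is 1 only for c == target, so the sum over c
        # keeps a single term per position.
        loss -= math.log(probs[target])
    return loss


# Toy vocabulary of size C = 3 and a response of length L = 2.
distributions = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
loss = sft_loss(distributions, target_ids=[0, 2])
```

The expectation over $\mathcal{D}$ would then be approximated by averaging this quantity over a batch of responses.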

On-Policy Data.

In the realm of reinforcement learning (RL), on-policy data is sampled from the current policy and becomes off-policy after the policy is updated [28]. As a fine-tuning process, policy updates for LLMs do not significantly change their sampling probabilities. In this paper, we define a response $\mathbf{y}$ to a prompt $\mathbf{x}$ and image $\mathbf{m}$ as on-policy if $\pi_{\mathrm{ref}}(\mathbf{y}|\mathbf{x},\mathbf{m}) > \epsilon$, where $\epsilon$ is a small positive threshold.
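The definition above can be checked in log space, since the sequence probability of an auto-regressive model is the product of its per-token probabilities. The sketch below is only illustrative: the helper names, the per-token log-probabilities, and the value chosen for $\epsilon$ are all assumptions, as the text does not specify a numeric threshold.

```python
import math


def sequence_log_prob(token_log_probs):
    """log pi_ref(y | x, m) as the sum of per-token log-probabilities."""
    return sum(token_log_probs)


def is_on_policy(token_log_probs, epsilon):
    """True if pi_ref(y | x, m) > epsilon, compared in log space."""
    return sequence_log_prob(token_log_probs) > math.log(epsilon)


# Hypothetical per-token log-probabilities under the reference policy.
sampled = [-0.2, -0.5, -0.1]      # a response sampled from pi_ref itself
rewritten = [-0.2, -14.0, -9.5]   # a response containing externally written tokens

epsilon = 1e-6  # hypothetical threshold
sampled_on_policy = is_on_policy(sampled, epsilon)
rewritten_on_policy = is_on_policy(rewritten, epsilon)
```

Working in log space avoids floating-point underflow for long responses, whose raw probabilities can be extremely small.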

3 Problem Analysis

Three questions reflect our thinking path in this work:

  • Q1: How does the dataset distribution relative to the initial/reference policy affect the performance of DPO?

  • Q2: What are the inherent flaws of other algorithms that employ DPO to tackle hallucination problems?

  • Q3: What adjustments can be made to current frameworks to rectify their intrinsic deficiencies?

Question 1

Note that the minimizer of Eq. (3) corresponds to the optimal solution for Eq. (1). Nevertheless, by reconsidering the definition of the KL-divergence

$$
\mathbb{D}_{\mathrm{KL}}[P \,\|\, Q] := \sum_{y \in \mathcal{Y}} P(y) \log \frac{P(y)}{Q(y)}, \tag{5}
$$

where $P$ and $Q$ represent two distinct probability distributions, we can deduce the following fact.

Fact 1.

Given a prompt $\mathbf{x}$ and an image $\mathbf{m}$, suppose there exists a response $\mathbf{y}$ such that $\pi_\theta(\mathbf{y}|\mathbf{x},\mathbf{m}) > 0$, whereas $\pi_{\mathrm{ref}}(\mathbf{y}|\mathbf{x},\mathbf{m}) \rightarrow 0$. Then the KL-divergence between the two policies satisfies $\mathbb{D}_{\mathrm{KL}}[\pi_\theta(\cdot|\mathbf{x},\mathbf{m}) \,\|\, \pi_{\mathrm{ref}}(\cdot|\mathbf{x},\mathbf{m})] \rightarrow \infty$.

Fact 1 illustrates that if the preferred response has a near-zero probability with respect to the initial/reference policy, i.e., it is strictly off-policy data, then it can never be learned by any policy that begins from the learning objective outlined in Eq. (1). In other words, denoting the support for the updated policy as $\mathcal{Y}_\theta$, the support for the initial policy as $\mathcal{Y}_{\mathrm{ref}}$, and the global sampling space as $\mathcal{Y}_{\mathrm{global}}$, we always have the relationship $\mathcal{Y}_\theta \subseteq \mathcal{Y}_{\mathrm{ref}} \subseteq \mathcal{Y}_{\mathrm{global}}$. Any response falling into the set $\mathcal{Y}_{\mathrm{global}} \setminus \mathcal{Y}_{\mathrm{ref}}$ cannot be learned by $\pi_\theta$. It should be noted that this issue arises only with DPO, as PPO samples responses online from $\mathcal{Y}_\theta$.
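Fact 1 can be illustrated numerically with a toy discrete distribution. The sketch below assumes a response space of only three candidates and hypothetical probabilities; it shows the KL-divergence of Eq. (5) growing without bound as the reference probability of one response shrinks while the updated policy keeps a fixed mass on it.

```python
import math


def kl_divergence(p, q):
    """D_KL[P || Q] of Eq. (5) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_with_rare_response(q3, p_theta=(0.6, 0.3, 0.1)):
    """KL between a fixed updated policy and a reference policy that
    assigns probability q3 to the third response (toy construction)."""
    rest = 1.0 - q3
    p_ref = (rest * 2 / 3, rest / 3, q3)  # remaining mass split 2:1
    return kl_divergence(p_theta, p_ref)


for q3 in (1e-2, 1e-8, 1e-32):
    print(f"pi_ref(y3) = {q3:.0e}  ->  KL = {kl_with_rare_response(q3):.3f}")
```

The dominant term is $0.1 \cdot \log(0.1 / \pi_{\mathrm{ref}}(y_3))$, which diverges as $\pi_{\mathrm{ref}}(y_3) \rightarrow 0$, so any objective that penalizes this KL term discourages placing mass on such a response.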

A natural question arises: how does DPO deal with these off-policy preferred responses? By taking the partial derivative of Eq. (3), the gradient with respect to the policy parameters $\theta$ can be expressed as

$$
\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(\mathbf{y}_w,\mathbf{y}_l,\mathbf{x},\mathbf{m})\sim\mathcal{D}} \bigg[ &\beta \cdot \sigma\bigg(-\underbrace{\beta \log \frac{\pi_\theta(\mathbf{y}_w|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{ref}}(\mathbf{y}_w|\mathbf{x},\mathbf{m})}}_{r_w} + \underbrace{\beta \log \frac{\pi_\theta(\mathbf{y}_l|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{ref}}(\mathbf{y}_l|\mathbf{x},\mathbf{m})}}_{r_l}\bigg) \cdot \\
&\big(\underbrace{\nabla_\theta \log \pi_\theta(\mathbf{y}_w|\mathbf{x},\mathbf{m}) - \nabla_\theta \log \pi_\theta(\mathbf{y}_l|\mathbf{x},\mathbf{m})}_{\text{log-likelihood}}\big)\bigg],
\end{aligned} \tag{6}
$$

where the weight $\sigma(r_l - r_w) = 0.5$ prior to the policy update, since $\pi_\theta = \pi_{\mathrm{ref}}$ implies $r_w = r_l = 0$. Assuming that the preferred response $\mathbf{y}_w$ is off-policy, $\pi_\theta(\mathbf{y}_w|\mathbf{x},\mathbf{m})$ undergoes a single step of log-likelihood maximization with a coefficient of $0.5\beta$. Following this update, $r_w$ tends to become substantially large because the reference probability $\pi_{\mathrm{ref}}(\mathbf{y}_w|\mathbf{x},\mathbf{m})$ in its denominator is near zero, thereby causing $\sigma(r_l - r_w) \rightarrow 0$. Nonetheless, the increment in probability induced by this single-step update proves insufficient for the preferred response to be sampled during the auto-regressive generation process. In summary, the low likelihood of off-policy preferred responses (relative to the reference policy) drives the DPO updating weight toward zero, thereby rendering effective learning nearly impossible.
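The vanishing-weight argument can be sketched numerically. The snippet below computes the scalar factor $\beta \cdot \sigma(r_l - r_w)$ that multiplies the log-likelihood gradient in Eq. (6). All log-probability values and `beta=0.1` are hypothetical and chosen only to make the effect visible; they are not measurements from the paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dpo_update_weight(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """Scalar weight beta * sigma(r_l - r_w) in front of the gradient in Eq. (6)."""
    r_w = beta * (logp_w - logp_ref_w)
    r_l = beta * (logp_l - logp_ref_l)
    return beta * sigmoid(r_l - r_w)


# At initialization pi_theta equals pi_ref, so the weight is 0.5 * beta.
w_init = dpo_update_weight(-200.0, -200.0, -10.0, -10.0)

# Hypothetical later state: log pi_theta(y_w) rose from -200 to -100.
# The response is still far too unlikely to be sampled, yet
# r_w = 0.1 * 100 = 10 already pushes the weight close to zero.
w_after = dpo_update_weight(-100.0, -200.0, -10.0, -10.0)
```

Under these assumed numbers the weight drops by several orders of magnitude even though the preferred response remains practically unreachable by sampling, which mirrors the argument in the paragraph above.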

Question 2

For algorithms adopting Method 1 (Hallucination Injection) outlined in Figure 2, a ground-truth (GT) response is deemed on-policy if it has been incorporated into the SFT dataset used for fine-tuning. The shortcoming is evident: the injected hallucinations do not originate from the model itself. While the probability associated with the GT response is increased, the model-intrinsic hallucinations are neither explicitly identified nor is their probability substantially diminished.

Method 2, Hallucination Recognition, is the most widely adopted approach, with the majority of studies opting to use GPT-4 with ground-truth image captions or GPT-4V as the recognizer. Nevertheless, a significant challenge persists: the preferred response often remains off-policy, as highlighted in our answers to Question 1.

Method 3, Self Evolution, is exclusively employed by RLAIF-V [20], which significantly outperforms the previous two methods in hallucination benchmarks. However, it has a notable shortcoming: since the method relies on the model generating two responses to form a preference pair, it cannot effectively address intrinsic hallucinations present in both responses. As a result, this approach requires a substantial amount of data and multiple iterative updates.

Question 3

Method 2 employs domain experts to construct preferred responses, establishing a robust paradigm but encountering the off-policy issue. Although Method 3 addresses this challenge, the reliability of the preferred responses is compromised. To combine the strengths of both approaches, namely aligning expert-revised preferred responses with the on-policy framework, it is essential to consider modifications to the model itself before commencing DPO training. A promising method is to apply Low-Rank Adaptation (LoRA) SFT to the expert revisions, which our experimental evidence demonstrates to be exceptionally effective. In conjunction with our adjusted DPO training loss, OPA-DPO is capable of achieving SOTA performance with a minimal data requirement.

4 On-Policy Alignment DPO

Figure 3: Our proposed OPA-DPO comprises four essential steps: ① Collect responses from the original policy based on the images and corresponding prompts. ② Utilize GPT-4V to correct any hallucinations in the generated responses with minimal modifications. ③ Conduct LoRA-SFT on the GT responses and revised responses. ④ Initiate OPA-DPO training from the policy obtained in step 3.

As illustrated in Figure 3, our proposed OPA-DPO framework encompasses four essential steps. The initial two steps, designated as data collection, along with the third step, on-policy alignment, are detailed in Chapter 4.1. The final step, OPA-DPO training, is elaborated in Chapter 4.2.

4.1 Data Collection and On-Policy Alignment

Initially, we instruct the model (slated for training) to generate responses based on pre-collected images and prompts, using a combination of “top-k” and “top-p” sampling methods. Following this, we supply GPT-4V with the generated responses $\mathbf{y}_{\mathrm{Gen}}$, the original prompts $\mathbf{x}$, the images $\mathbf{m}$, and the GT responses $\mathbf{y}_{\mathrm{GT}}$. GPT-4V is tasked with identifying hallucinations by evaluating the generated responses at the sentence level. Each sentence within a response is assigned a score, $S_{\mathrm{hal}}$, which indicates the severity of the hallucination. Moreover, GPT-4V is required to categorize sentences with incorrect descriptions as either image recognition errors or language comprehension errors, with the classification results represented by $S_{\mathrm{img}}$. Finally, GPT-4V is instructed to make minimal revisions to any erroneous sentences, and the aggregate of these revised sentences is denoted as $\mathbf{y}_{\mathrm{Rev}}$ (refer to Appendix for further details).

Subsequently, we integrate the GT responses with the GPT-4V revised responses to construct an instruction-following dataset, which includes $\mathbf{y}_{\mathrm{GT}}, \mathbf{y}_{\mathrm{Rev}}, \mathbf{x}, \mathbf{m}$. We then perform LoRA-SFT on this dataset, utilizing the loss function in Eq. (4). We denote the resulting policy from this phase as $\pi_{\mathrm{OPA}}$. Note that this policy serves as the reference (initial) policy for the subsequent OPA-DPO training.
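The data flow of this subsection can be sketched with a small data structure. Everything in the snippet is hypothetical: the class names, field names, score encodings, and example sentences are assumptions made for illustration, and the actual annotation format produced by GPT-4V is described in the paper's appendix rather than here. The sketch only shows how sentence-level annotations could be aggregated into $\mathbf{y}_{\mathrm{Rev}}$ and how both $\mathbf{y}_{\mathrm{GT}}$ and $\mathbf{y}_{\mathrm{Rev}}$ become SFT targets for the same prompt and image.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceAnnotation:
    original: str  # sentence from the generated response y_Gen
    revised: str   # minimally revised sentence (unchanged if correct)
    s_hal: float   # hallucination severity S_hal (scale is an assumption)
    s_img: int     # error-type label S_img (encoding is an assumption)


@dataclass
class Sample:
    prompt: str                            # x
    image_path: str                        # m
    y_gt: str                              # ground-truth response y_GT
    annotations: List[SentenceAnnotation]  # sentence-level expert feedback

    @property
    def y_gen(self) -> str:
        return " ".join(a.original for a in self.annotations)

    @property
    def y_rev(self) -> str:
        return " ".join(a.revised for a in self.annotations)


def build_sft_examples(samples):
    """One SFT target for y_GT and one for y_Rev per (x, m)."""
    examples = []
    for s in samples:
        for target in (s.y_gt, s.y_rev):
            examples.append(
                {"prompt": s.prompt, "image": s.image_path, "response": target}
            )
    return examples


sample = Sample(
    prompt="Describe the image.",
    image_path="example.jpg",  # placeholder path
    y_gt="A dog sits on a sofa.",
    annotations=[
        SentenceAnnotation("A dog sits on a sofa.", "A dog sits on a sofa.", 0.0, 0),
        SentenceAnnotation("A cat sleeps beside it.", "A cushion lies beside it.", 1.0, 1),
    ],
)
sft_examples = build_sft_examples([sample])
```

The resulting examples would then be passed to a LoRA-SFT routine trained with the loss of Eq. (4); that training step is omitted here because it depends on the chosen framework.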

4.2 OPA-DPO Training

Compared to classical DPO, which forms only a single language preference pair, our OPA-DPO loss comprises three distinct components, each containing two pairs:

Language Corrections.

As the most basic component of DPO, the language-level preference is naturally formed between the GT response and the generated response, as well as between the revised response and the generated response. Following the approach outlined in RLHF-V [15], we aim to concentrate the policy update on the erroneous sections and their respective corrections. To achieve this, we construct a mapping from the GPT-4V marked hallucination scores to establish the update weight $W_{\mathrm{hal}}(S_{\mathrm{hal}})$. Then the hallucination-weighted log-policy is defined as $\log \pi^{\mathrm{hw}}(\mathbf{y}|\mathbf{x},\mathbf{m}) = \sum_{i}^{L} W_{\mathrm{hal}}(S^{i}_{\mathrm{hal}}) \log \pi(y_i|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})$, where $L$ represents the response length, and $S^{i}_{\mathrm{hal}}$ denotes the hallucination score for token $y_i$. Note that $S^{i}_{\mathrm{hal}}$ remains consistent for tokens within the same sentence but may vary between different sentences. Subsequently, we can form language correction preference pairs

$$
\begin{aligned}
\mathcal{L}_{\mathrm{LC}} = -\mathbb{E}_{(\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}},\mathbf{y}_{\mathrm{Gen}},\mathbf{x},\mathbf{m})\sim\mathcal{D}} \bigg[ & \log\sigma\bigg(\beta\log\tfrac{\pi_{\theta}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m})}-\beta\log\tfrac{\pi_{\theta}(\mathbf{y}_{\mathrm{Gen}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}(\mathbf{y}_{\mathrm{Gen}}|\mathbf{x},\mathbf{m})}\bigg) \\
& +\log\sigma\bigg(\beta\log\tfrac{\pi_{\theta}^{\mathrm{hw}}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}^{\mathrm{hw}}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m})}-\beta\log\tfrac{\pi_{\theta}^{\mathrm{hw}}(\mathbf{y}_{\mathrm{Gen}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}^{\mathrm{hw}}(\mathbf{y}_{\mathrm{Gen}}|\mathbf{x},\mathbf{m})}\bigg)\bigg].
\end{aligned}
\tag{7}
$$
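As a rough illustration of Eq. (7), the following Python sketch evaluates the per-sample language correction loss from per-token log-probabilities. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`weighted_logp`, `pair_term`, `lc_loss`), the toy log-probabilities, and the weight values standing in for $W_{\mathrm{hal}}$ are all hypothetical, since the actual score-to-weight mapping is not given in this section.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def weighted_logp(token_logps, weights=None):
    """sum_i W_i * log pi(y_i | ...); weights default to 1 (plain sequence log-prob)."""
    if weights is None:
        weights = [1.0] * len(token_logps)
    return sum(w * lp for w, lp in zip(weights, token_logps))

def pair_term(beta, pol_win, ref_win, pol_lose, ref_lose):
    """log sigma(beta * [(pol - ref) on preferred side - (pol - ref) on rejected side])."""
    return log_sigmoid(beta * ((pol_win - ref_win) - (pol_lose - ref_lose)))

def lc_loss(pol, ref, w_rev, w_gen, beta=0.1):
    """Per-sample L_LC. `pol`/`ref` map 'GT', 'Rev', 'Gen' to per-token log-probs."""
    # First pair: GT vs. generated response, unweighted log-policy.
    plain = pair_term(beta,
                      weighted_logp(pol["GT"]), weighted_logp(ref["GT"]),
                      weighted_logp(pol["Gen"]), weighted_logp(ref["Gen"]))
    # Second pair: revised vs. generated response, hallucination-weighted log-policy.
    weighted = pair_term(beta,
                         weighted_logp(pol["Rev"], w_rev), weighted_logp(ref["Rev"], w_rev),
                         weighted_logp(pol["Gen"], w_gen), weighted_logp(ref["Gen"], w_gen))
    return -(plain + weighted)

# Toy per-token log-probs and weights (hypothetical numbers, for illustration only).
ref = {"GT": [-1.0, -2.0], "Rev": [-1.5, -0.5, -1.0], "Gen": [-0.3, -0.4, -0.2]}
pol = {"GT": [-0.8, -1.5], "Rev": [-1.5, -0.2, -1.0], "Gen": [-0.3, -0.9, -0.2]}
w_rev = [1.0, 2.0, 1.0]  # assumed W_hal values: the corrected token is up-weighted
w_gen = [1.0, 2.0, 1.0]  # assumed W_hal values: the hallucinated token is up-weighted

loss_init = lc_loss(ref, ref, w_rev, w_gen)     # policy equals the reference policy
loss_trained = lc_loss(pol, ref, w_rev, w_gen)  # policy shifted toward GT/Rev
print(loss_init, loss_trained)
```

When the policy equals the reference, both margins are zero and the loss equals $2\log 2$; moving probability mass toward the GT and revised tokens, and away from the hallucinated token, lowers it.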

Image Focus Mechanism.

A critical obstacle in ensuring LVLMs properly engage with images is their innate tendency to ignore the visual modality during the optimization phase [27, 16]. Intuitively, when the image data is compromised, the probability that the model produces the correct response diminishes. Building upon mDPO [27], we form preference pairs between the original images $\mathbf{m}$ and distorted images $\mathbf{m'}$, using the same prompts and GT/revised responses. Furthermore, we expect this mechanism to be more effective for sentences where understanding of the image itself is biased. To accomplish this, we create another mapping from the GPT-4V marked categorization results $S_{\mathrm{img}}$ to determine the update weight $W_{\mathrm{img}}(S_{\mathrm{img}})$. We then define the image-weighted log-policy as $\log\pi^{\mathrm{iw}}(\mathbf{y}|\mathbf{x},\mathbf{m})=\sum_{i}^{L}W_{\mathrm{img}}(S^{i}_{\mathrm{img}})\log\pi(y_{i}|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})$.
This allows us to establish the image focus preference pairs:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{IF}} = -\mathbb{E}_{(\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}},\mathbf{x},\mathbf{m},\mathbf{m'})\sim\mathcal{D}} \bigg[ & \log\sigma\bigg(\beta\log\tfrac{\pi_{\theta}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m})}-\beta\log\tfrac{\pi_{\theta}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m'})}{\pi_{\mathrm{OPA}}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m'})}\bigg) \\
& +\log\sigma\bigg(\beta\log\tfrac{\pi_{\theta}^{\mathrm{iw}}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}^{\mathrm{iw}}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m})}-\beta\log\tfrac{\pi_{\theta}^{\mathrm{iw}}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m'})}{\pi_{\mathrm{OPA}}^{\mathrm{iw}}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m'})}\bigg)\bigg].
\end{aligned}
\tag{8}
$$

Anchored Preference.

Numerous studies [29, 27, 18] document a reduced likelihood of the preferred response during DPO training. This trend may be attributed to an intrinsic characteristic of DPO, which concentrates on relative preferences. Our findings align with these studies, and we observe that the reduction adversely affects downstream performance. Following mDPO [27], we employ two anchors to constrain the preferred responses:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{Anc}} = -\mathbb{E}_{(\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}},\mathbf{x},\mathbf{m})\sim\mathcal{D}} \bigg[ & \log\sigma\bigg(\beta\log\tfrac{\pi_{\theta}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m})}-\delta\bigg) \\
& +\log\sigma\bigg(\beta\log\tfrac{\pi_{\theta}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m})}{\pi_{\mathrm{OPA}}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m})}-\delta\bigg)\bigg].
\end{aligned}
\tag{9}
$$

Combining Eqs. (7), (8), and (9), we obtain the loss for OPA-DPO:

$$
\mathcal{L}_{\mathrm{OPA-DPO}}=\mathcal{L}_{\mathrm{LC}}+\gamma_{1}\mathcal{L}_{\mathrm{IF}}+\gamma_{2}\mathcal{L}_{\mathrm{Anc}}.
\tag{10}
$$

It should be noted that each component is crucial and cannot be omitted. See Chapter 5.4 for ablation studies.
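To make the structure of Eqs. (7) to (10) concrete, here is a minimal Python sketch that combines the three components for a single sample. It assumes the scaled log-ratios $\beta\,(\log\pi_\theta-\log\pi_{\mathrm{OPA}})$ have already been computed for each response and image; the dictionary keys and the function name `opa_dpo_loss` are hypothetical, and the second anchor is assumed to be the revised response. The default values of $\gamma_1$, $\gamma_2$, and $\delta$ follow the settings reported in the implementation details.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def opa_dpo_loss(r, gamma1=0.2, gamma2=1.0, delta=0.0):
    """Per-sample total loss; r[k] = beta * (log pi_theta - log pi_OPA) for entry k.

    Suffixes: '_hw' hallucination-weighted, '_iw' image-weighted,
    '_distorted' evaluated on the distorted image m'.
    """
    # Language corrections: GT vs. Gen, and Rev vs. Gen (hallucination-weighted).
    l_lc = -(log_sigmoid(r["gt"] - r["gen"])
             + log_sigmoid(r["rev_hw"] - r["gen_hw"]))
    # Image focus: same response, original image vs. distorted image.
    l_if = -(log_sigmoid(r["gt"] - r["gt_distorted"])
             + log_sigmoid(r["rev_iw"] - r["rev_iw_distorted"]))
    # Anchors: keep the preferred responses' log-ratios above delta.
    l_anc = -(log_sigmoid(r["gt"] - delta)
              + log_sigmoid(r["rev"] - delta))
    return l_lc + gamma1 * l_if + gamma2 * l_anc

keys = ["gt", "gen", "rev", "rev_hw", "gen_hw",
        "gt_distorted", "rev_iw", "rev_iw_distorted"]

# At initialization pi_theta == pi_OPA, so every log-ratio is zero.
at_init = opa_dpo_loss({k: 0.0 for k in keys})

# Hypothetical log-ratios after some training: preferred up, rejected/distorted down.
improved = opa_dpo_loss({
    "gt": 0.3, "rev": 0.2, "rev_hw": 0.25, "rev_iw": 0.2,
    "gen": -0.4, "gen_hw": -0.6,
    "gt_distorted": -0.1, "rev_iw_distorted": -0.2,
})
print(at_init, improved)
```

With all log-ratios at zero, each of the six sigmoid terms contributes $\log 2$, so the total is $(2 + 2\gamma_1 + 2\gamma_2)\log 2 = 4.4\log 2$ under the default weights.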

5 Experiments

5.1 Experimental Setup

Models and Datasets.

We apply OPA-DPO to two LVLMs with distinct parameter sizes: LLaVA-v1.5-7B and LLaVA-v1.5-13B, both using CLIP ViT-L-336px as the vision encoder. The 7B model is based on Vicuna-7B, and the 13B model on Vicuna-13B. Each model underwent pretraining on 558K image-text pairs and was subsequently fine-tuned on 665K instruction-based samples. As for the training data, we randomly select 4.8K samples from the RLAIF-V [20] dataset, using their preferred responses as the ground truth.

Evaluation Metrics.

We focus on mitigating hallucinations in LVLMs, with experiments conducted on four benchmarks: 1) AMBER [30]: A benchmark with detailed object annotations, featuring 1004 images in a generative task. Using the official codebase, we evaluate the CHAIR score, object coverage, hallucination rate, and alignment with human cognition. 2) MMHalBench [21]: A question-answering benchmark with 96 images across 12 object categories. Following the official protocol, we use GPT-4 to rate responses from zero to six, calculating the hallucination rate as the proportion of responses rated below three. 3) Object HalBench [31]: A widely used benchmark for assessing object hallucination. We evaluate across 300 instances using the codebase of Yu et al. [15], reporting hallucination rates at both the response level (CHAIRs) and the object level (CHAIRi). (Some studies report results based on 50 instances; we exclude these outcomes to prevent potential confusion.) 4) POPE [32]: A yes/no question-answering benchmark for object hallucination evaluation. We report accuracy and precision on its Adversarial set, consisting of 3000 cases.
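The response-level and object-level hallucination rates, and the MMHalBench hallucination rate, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the official evaluation code: it assumes object mentions have already been extracted and matched to the annotation vocabulary, treats each response's mentions as a set, and uses hypothetical function names (`chair_scores`, `mmhal_hal_rate`) and toy data.

```python
def chair_scores(responses):
    """responses: list of (mentioned_objects, ground_truth_objects) pairs of sets.

    Returns (CHAIRs, CHAIRi):
      CHAIRs = fraction of responses with at least one hallucinated object,
      CHAIRi = fraction of mentioned objects that are hallucinated.
    """
    hallucinated_responses = 0
    mentioned_total = 0
    hallucinated_total = 0
    for mentioned, truth in responses:
        hallucinated = mentioned - truth  # objects mentioned but not in the image
        mentioned_total += len(mentioned)
        hallucinated_total += len(hallucinated)
        hallucinated_responses += 1 if hallucinated else 0
    chair_s = hallucinated_responses / len(responses)
    chair_i = hallucinated_total / mentioned_total if mentioned_total else 0.0
    return chair_s, chair_i

def mmhal_hal_rate(ratings, threshold=3):
    """Proportion of GPT-4 ratings (0 to 6) that fall below the threshold."""
    return sum(1 for r in ratings if r < threshold) / len(ratings)

# Toy data (hypothetical, for illustration only).
toy_responses = [
    ({"dog", "frisbee"}, {"dog", "frisbee", "grass"}),  # no hallucination
    ({"cat", "sofa", "lamp"}, {"cat", "sofa"}),         # "lamp" is hallucinated
]
chair_s, chair_i = chair_scores(toy_responses)
hal_rate = mmhal_hal_rate([0, 2, 3, 5, 6, 1])
print(chair_s, chair_i, hal_rate)
```

In the toy data, one of two responses contains a hallucinated object and one of five mentioned objects is hallucinated; three of the six ratings fall below three.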

Baseline Algorithms.

We mainly compare OPA-DPO with algorithms based on RLHF/RLAIF. As mentioned in Chapter 1, most of these algorithms, such as HALVA [17], POVID [16], RLHF-V [15], HA-DPO [18], HSA-DPO [19], RLAIF-V [20], and mDPO [27], use DPO, while LLaVA-RLHF [21] uses PPO.

Implementation Details.

For both the 7B and 13B models, we start with OPA training (LoRA-SFT) for 2 epochs, using a cosine learning rate schedule beginning at 2e-5. We set the batch size to 128, with a LoRA rank of 256 and alpha of 512. Following this, we perform OPA-DPO training on the SFT-tuned LoRA module for 4 more epochs, using a batch size of 32 and a cosine learning rate starting at 1e-6. We set $\beta=0.1$ in Eqs. (7), (8), and (9), and $\gamma_{1}=0.2$, $\gamma_{2}=1.0$ in Eq. (10). For the distorted images $\mathbf{m'}$ in Eq. (8), we randomly mask 30% of pixels. For the anchors in Eq. (9), we set $\delta=0$. See the Appendix for pseudocode and more detailed settings.
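The image distortion step can be illustrated with a short Python sketch. The text only states that 30% of pixels are randomly masked, so the details here are assumptions: masking is done independently per pixel, masked pixels are set to a fill value of 0, and the image is represented as a nested list. The function name `mask_pixels` is hypothetical.

```python
import random

def mask_pixels(image, ratio=0.3, fill=0, seed=None):
    """Return a copy of `image` (H x W nested lists) with `ratio` of pixels set to `fill`."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    n_mask = round(ratio * height * width)
    # Sample distinct flat pixel indices to mask.
    masked = set(rng.sample(range(height * width), n_mask))
    return [
        [fill if (row * width + col) in masked else image[row][col]
         for col in range(width)]
        for row in range(height)
    ]

# Toy 10x10 single-channel image of ones (hypothetical input).
original = [[1] * 10 for _ in range(10)]
distorted = mask_pixels(original, ratio=0.3, seed=0)
n_masked = sum(pixel == 0 for row in distorted for pixel in row)
print(n_masked)
```

The original image is left untouched, so the same prompt and response can be scored against both $\mathbf{m}$ and $\mathbf{m'}$ when forming the pairs in Eq. (8).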

5.2 Policy Distribution over Revised Responses

To demonstrate that off-policy preferred responses are not effectively learned through DPO, we visualize the response-averaged log probabilities of tokens, denoted as $\frac{1}{L}\sum_{i}^{L}\log\pi(\mathbf{y}_{i}|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})$, across 200 significantly revised responses from GPT-4V datasets, as shown in Figure 4. The distribution shows negligible change after DPO training without OPA, whereas a significant increase is observed with our proposed OPA-DPO.

Figure 4: Distribution of response-averaged log probabilities for 200 significantly revised responses across different models.
Figure 5: Impact of data amount on hallucination-rate metrics.

To emphasize that the constraint arises from the reverse KL-divergence, we measure the average KL-divergence $\frac{1}{L}\sum_{i}^{L}\mathbb{D}_{\mathrm{KL}}[\pi_{P}(\cdot|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})\,\|\,\pi_{Q}(\cdot|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})]$ and the maximum KL-divergence $\max_{i}\mathbb{D}_{\mathrm{KL}}[\pi_{P}(\cdot|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})\,\|\,\pi_{Q}(\cdot|\mathbf{x},\mathbf{m},\mathbf{y}_{<i})]$ between different policies, averaging the results over the same 200 samples in Table 1.
We observe that the reverse KL changes only slightly after DPO ($\mathbb{D}_{\mathrm{KL}}[\pi_{\mathrm{DPO}}\,\|\,\pi_{\mathrm{base}}]$ and $\mathbb{D}_{\mathrm{KL}}[\pi^{\mathrm{OPA}}_{\mathrm{DPO}}\,\|\,\pi_{\mathrm{OPA}}]$); however, the divergence gap between $\pi^{\mathrm{OPA}}_{\mathrm{DPO}}$ and $\pi_{\mathrm{base}}$ is nearly an order of magnitude larger, indicating that naive DPO is insufficient to bridge this gap.
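The per-response average and maximum KL-divergence, and their aggregation over samples, can be sketched as follows. This is an illustrative sketch rather than the authors' measurement code: it assumes the next-token distributions of both policies are available as probability lists at every position of the evaluated response, and the function names and the two-token toy vocabulary are hypothetical. The statistic names "mean-mean" and "max-mean" are read here as the per-response mean or maximum over tokens, averaged over responses.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists (q_i > 0)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def response_kl_stats(dists_p, dists_q):
    """Mean and max over token positions of KL(pi_P(.|prefix) || pi_Q(.|prefix))."""
    per_token = [kl(p, q) for p, q in zip(dists_p, dists_q)]
    return sum(per_token) / len(per_token), max(per_token)

def dataset_kl_stats(responses):
    """responses: list of (dists_p, dists_q). Returns (mean-mean, max-mean)."""
    stats = [response_kl_stats(p, q) for p, q in responses]
    n = len(stats)
    return sum(s[0] for s in stats) / n, sum(s[1] for s in stats) / n

# Toy example: one response with two positions over a two-token vocabulary.
uniform = [0.5, 0.5]
sharp = [0.9, 0.1]
toy = [([uniform, sharp], [uniform, uniform])]
mean_mean, max_mean = dataset_kl_stats(toy)
print(mean_mean, max_mean)
```

In the toy example the two policies agree at the first position and differ at the second, so the per-response mean is half of the maximum.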

Table 1: Comparison of average and maximum KL-divergence $\mathbb{D}_{\mathrm{KL}}[P\,\Vert\,Q]$ between different policies. Results are averaged over the same 200 significantly revised responses as in Figure 4.

| Model | Statistic | $P=\pi_{\mathrm{DPO}}$, $Q=\pi_{\mathrm{base}}$ | $P=\pi^{\mathrm{OPA}}_{\mathrm{DPO}}$, $Q=\pi_{\mathrm{base}}$ | $P=\pi^{\mathrm{OPA}}_{\mathrm{DPO}}$, $Q=\pi_{\mathrm{OPA}}$ | $P=\pi_{\mathrm{OPA}}$, $Q=\pi_{\mathrm{base}}$ |
|---|---|---|---|---|---|
| 7B | mean-mean | 0.039 | 0.371 | 0.025 | 0.276 |
| 7B | max-mean | 0.396 | 2.288 | 0.161 | 1.839 |
| 13B | mean-mean | 0.044 | 0.261 | 0.036 | 0.174 |
| 13B | max-mean | 0.421 | 1.944 | 0.349 | 1.364 |

5.3 Benchmark Evaluation Results

Table 2: Comparison of RLAIF/RLHF-based algorithms for enhancing LVLMs across various benchmarks. For baseline algorithms with available official checkpoints, we retest the models, and these results are marked with §. For algorithms without official checkpoints, results are sourced from the respective papers: † denotes results from [19], ‡ from [17], and ⋆ from [27]. To ensure a fair comparison, greedy sampling is used in all evaluations to avoid potential randomness. The best result for each metric within each group is highlighted in bold.

Benchmarks (number of samples): AMBER (1004), MMHal-Bench (96), Object Hal (300), POPE Adversarial (3000).

| Algorithm | Data Size | Feedback | AMBER CHAIR↓ | AMBER Cover↑ | AMBER HalRate↓ | AMBER Cog↓ | MMHal Score↑ | MMHal HalRate↓ | Object Hal CHAIRs↓ | Object Hal CHAIRi↓ | POPE Acc.↑ | POPE Pre.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat -34B [7] | | | 6.6 | 53.2 | 31.0 | 2.9 | 2.89 | 0.43 | 36 | 21.3 | - | - |
| +Silkie [26] | 80k | GPT-4V | 5.4 | 55.8 | 29.0 | 2.0 | 3.01 | 0.41 | 25.3 | 13.9 | - | - |
| LLaVA-Instruct-1.5-7B [5, 6]§ | | | 7.7 | 51.6 | 34.7 | 4.2 | 2.01 | 0.61 | 55.67 | 15.96 | 84.93% | 89.10% |
| +LLaVA-RLHF [21]§ | 122k | Self-Reward | 9.7 | 53.2 | 46.6 | 5.3 | 1.88 | 0.71 | 58.00 | 15.61 | 80.00% | 87.19% |
| +HALVA [17] | 21.5k | GPT-4V | 6.6 | 53.0 | 32.2 | 3.4 | 2.25 | 0.54 | 41.40 | 11.70 | - | - |
| +mDPO [27] | 10k | GPT-4V | 4.4 | 52.4 | 24.5 | 2.4 | 2.39 | 0.54 | 35.70 | 9.80 | - | 95.36% |
| +HA-DPO [18]§ | 6k | GPT-4 | 7.8 | 52.1 | 35.6 | 4.2 | 1.89 | 0.65 | 54.00 | 14.45 | 84.90% | 90.42% |
| +POVID [16]§ | 17k | GPT-4V | 7.4 | 51.3 | 34.3 | 3.9 | 2.08 | 0.60 | 50.67 | 15.28 | 84.77% | 89.01% |
| +RLAIF-V [20]§ | 16k | LLaVA-Next | 3.0 | 50.4 | 16.2 | 1.0 | 3.00 | 0.38 | 16.00 | 3.70 | 81.57% | 94.97% |
| +OPA (ours) | 4.8k | GPT-4V | 5.6 | 52.8 | 23.2 | 2.3 | 2.41 | 0.52 | 28.00 | 9.48 | 82.53% | 95.36% |
| +OPA-DPO (ours) | 4.8k | GPT-4V | 2.2 | 47.9 | 11.6 | 0.9 | 2.83 | 0.45 | 13.00 | 4.25 | 82.60% | 95.61% |
| LLaVA-Instruct-1.5-13B [5, 6]§ | | | 6.8 | 51.9 | 31.8 | 3.3 | 2.48 | 0.52 | 51.00 | 13.71 | 85.50% | 90.31% |
| +LLaVA-RLHF [21]§ | 122k | Self-Reward | 7.7 | 52.3 | 38.6 | 4.0 | 2.27 | 0.64 | 44.67 | 11.83 | 82.47% | 90.25% |
| +RLHF-V (HD) [15] | 1.4k | Human | 6.3 | 46.1 | 25.1 | 2.1 | 2.81 | 0.49 | - | - | - | - |
| +HSA-DPO [19] | 8k | GPT-4/4V | 2.1 | 47.3 | 13.4 | 1.2 | 2.61 | 0.48 | - | - | 84.00% | 80.20% |
| +HALVA [17] | 21.5k | GPT-4V | 6.4 | 52.6 | 30.4 | 3.2 | 2.58 | 0.45 | 45.40 | 12.80 | - | - |
| +OPA (ours) | 4.8k | GPT-4V | 5.2 | 54.1 | 21.4 | 2.2 | 2.75 | 0.45 | 31.33 | 8.88 | 83.60% | 96.24% |
| +OPA-DPO (ours) | 4.8k | GPT-4V | 2.4 | 48.3 | 12.8 | 0.9 | 3.07 | 0.39 | 16.33 | 5.48 | 82.63% | 96.31% |

The experimental results across various benchmarks are presented in Table 2. For LLaVA-Instruct-1.5-7B, OPA-DPO achieves SOTA performance on 50% of the hallucination metrics, rising to 70% for LLaVA-Instruct-1.5-13B. OPA-DPO particularly excels in metrics that measure the occurrence of hallucinations, such as CHAIR and HalRate. However, this improvement comes with a slight drop in the coverage metric (Cover). On the yes-or-no benchmark (POPE), precision improves significantly while accuracy stays at a similar level, because the model tends to give fewer 'yes' answers: its 'yes' answers are more often correct, but it misses more questions whose correct answer is 'yes'. All indicators suggest that OPA-DPO-trained models adopt a slightly conservative strategy, avoiding uncertain assertions. This strategy significantly enhances the credibility of responses but may omit some ambiguous details, which necessitates a trade-off.
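The interaction between precision and accuracy on a yes/no benchmark can be illustrated with a small worked example. The confusion counts below are hypothetical and are not taken from the paper's evaluation; they only assume a balanced set of 3000 questions and contrast a model that answers 'yes' liberally with one that answers 'yes' conservatively. The function name `pope_metrics` is likewise an assumption.

```python
def pope_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and 'yes' ratio from yes/no confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "yes_ratio": (tp + fp) / total,
    }

# Hypothetical counts on 1500 'yes' and 1500 'no' questions.
liberal = pope_metrics(tp=1350, fp=300, tn=1200, fn=150)       # says 'yes' often
conservative = pope_metrics(tp=1050, fp=50, tn=1450, fn=450)   # says 'yes' rarely

for name, metrics in (("liberal", liberal), ("conservative", conservative)):
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

In this toy setting the conservative model answers 'yes' far less often, its precision rises sharply, and its accuracy changes by less than two points, because the gain on 'no' questions is offset by the additional missed 'yes' questions.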

To demonstrate the scalability of OPA-DPO, we present its performance under varying amounts of training data in Figure 5. Even with only 600 samples, OPA-DPO surpasses the majority of baseline algorithms in metrics related to hallucinations. Notably, increasing the data volume does not lead to significant performance improvements in $\pi_{\mathrm{OPA}}$, the policy after LoRA-SFT. In contrast, the performance of OPA-DPO improves markedly as the data volume increases.

5.4 Ablation Studies

We emphasize that each component of our OPA-DPO, as detailed in Chapter 4, is important. The OPA operation, specifically the LoRA-SFT on the GT responses and the GPT-4V-revised responses, is the most critical element.

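Table 3: Ablation studies on On-Policy Alignment operation.

| Model size | Data size | Algo. | AMBER CHAIR↓ | AMBER Cover↑ | AMBER HalRate↓ | AMBER Cog↓ | Object Hal CHAIRs↓ | Object Hal CHAIRi↓ |
|---|---|---|---|---|---|---|---|---|
| 7B | 4.8k | w OPA | 2.2 | 47.9 | 11.6 | 0.9 | 13.00 | 4.25 |
| 7B | 4.8k | w/o OPA | 3.8 | 48.0 | 22.6 | 2.2 | 23.00 | 7.64 |
| 7B | 2.4k | w OPA | 3.3 | 47.8 | 15.1 | 1.3 | 18.67 | 5.63 |
| 7B | 2.4k | w/o OPA | 4.6 | 48.6 | 26.8 | 1.8 | 34.67 | 9.81 |
| 13B | 4.8k | w OPA | 2.4 | 48.3 | 12.8 | 0.9 | 16.33 | 5.48 |
| 13B | 4.8k | w/o OPA | 5.7 | 50.4 | 27.5 | 2.7 | 32.67 | 9.45 |
| 13B | 2.4k | w OPA | 4.1 | 49.8 | 15.7 | 1.4 | 24.67 | 7.38 |
| 13B | 2.4k | w/o OPA | 5.2 | 49.7 | 25.6 | 2.7 | 38.33 | 11.98 |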

To highlight the significance of our proposed On-Policy Alignment framework in training DPO, we conduct ablation studies on the OPA operation, as shown in Table 3. The results indicate that the performance of the policy trained without OPA is nearly identical to that of RLHF-V and mDPO, neither of which accounts for on-policy data. However, integrating the OPA operation results in a nearly 50% reduction in the AMBER HalRate and Object Hal CHAIRs metrics compared to the policy trained without OPA.
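The size of this reduction can be checked directly from the 4.8k rows of Table 3. The short Python sketch below computes the relative reduction for each model size; the helper name `relative_reduction` and the dictionary layout are only illustrative, while the numbers are copied from the table.

```python
def relative_reduction(without_opa, with_opa):
    """Fractional reduction of a lower-is-better metric when OPA is added."""
    return (without_opa - with_opa) / without_opa

# (without OPA, with OPA) values from the 4.8k rows of Table 3.
table3_4p8k = {
    ("7B", "AMBER HalRate"): (22.6, 11.6),
    ("7B", "Object Hal CHAIRs"): (23.00, 13.00),
    ("13B", "AMBER HalRate"): (27.5, 12.8),
    ("13B", "Object Hal CHAIRs"): (32.67, 16.33),
}

reductions = {
    key: relative_reduction(without, with_)
    for key, (without, with_) in table3_4p8k.items()
}
for (model, metric), value in reductions.items():
    print(f"{model} {metric}: {value:.1%}")
```

The four reductions fall between roughly 43% and 54%, which is consistent with the "nearly 50%" summary.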

Table 4: Ablation studies on the Image Focus mechanism (IF), Anchored preference (Anc), and the hallucination-weighted (hw) and image-weighted (iw) policy updating. The metric “repeat” indicates the frequency of generating sentence- or phrase-level repetitions without an EOS token across 1004 AMBER samples.

| Model size | Ablation | AMBER CHAIR↓ | AMBER Cover↑ | AMBER HalRate↓ | AMBER Cog↓ | AMBER repeat | Object Hal CHAIRs↓ | Object Hal CHAIRi↓ |
|---|---|---|---|---|---|---|---|---|
| 7B | OPA-DPO | 2.2 | 47.9 | 11.6 | 0.9 | 0.6% | 13.00 | 4.25 |
| 7B | w/o IF | 5.1 | 50.7 | 15.4 | 1.1 | 15.7% | 14.67 | 9.87 |
| 7B | w/o Anc | 2.3 | 45.6 | 13.2 | 1.0 | 6.7% | 14.33 | 4.21 |
| 7B | w/o IF&Anc | 4.2 | 50.4 | 16.2 | 1.2 | 15.1% | 13.00 | 9.63 |
| 7B | w/o hw&iw | 2.4 | 46.2 | 12.6 | 0.9 | 0.4% | 17.00 | 4.68 |
| 13B | OPA-DPO | 2.4 | 48.3 | 12.8 | 0.9 | 0.8% | 16.33 | 5.48 |
| 13B | w/o IF | 3.2 | 53.1 | 16.9 | 1.3 | 17.1% | 21.33 | 9.82 |
| 13B | w/o Anc | 2.4 | 48.5 | 13.5 | 0.9 | 4.6% | 17.33 | 5.38 |
| 13B | w/o IF&Anc | 3.5 | 52.9 | 16.7 | 1.2 | 15.9% | 21.33 | 12.36 |
| 13B | w/o hw&iw | 2.8 | 48.8 | 14.6 | 1.1 | 0.9% | 18.33 | 6.02 |

In addition, we present ablation studies on the three components of our OPA-DPO training loss as described in Chapter 4.2. The evaluation results, shown in Table 4, indicate that each term in Eq. (10), as well as the hallucination-weighted and image-weighted policy updates in Eqs. (7) and (8), plays a crucial role in reducing hallucination. Notably, in the absence of the Image Focus mechanism (IF) or Anchored Preference (Anc), the policy tends to repeat its last sentence or words and fails to generate an EOS token when using greedy sampling. This phenomenon is particularly pronounced in long-form QA tasks, such as the AMBER generation task. However, when all three components are employed together, the repetition issue is resolved.

5.5 Case Study

To provide an intuitive understanding of our OPA-DPO, we present a qualitative example in Figure 6. The initial model’s generation contains numerous hallucinations and flawed reasoning. This issue persists after training naive DPO without OPA. However, after implementing OPA on 4.8k samples, the hallucinations are nearly eliminated, though some minor instances remain. Subsequent OPA-DPO completely resolves these issues, albeit at the cost of omitting some details present in the original description, which aligns with our discussion in Chapter 5.3.

6 Related Works

Figure 6: Qualitative example of responses from different models with the same prompt and image. Hallucinated parts are marked in red, flawed reasoning is highlighted in blue, and missing details are highlighted in yellow. This example illustrates a common case; additional examples can be found in the Appendix.

RLHF.

As a fundamental technique driving the advancements of LLMs and LVLMs in recent years, RLHF [33, 34, 35] has been demonstrated to be effective in aligning fine-tuned large models with human preferences. By leveraging vast amounts of human preference data and RL methodologies, numerous language models have benefited from this approach and have been widely adopted. Notable examples include GPT [36, 37, 38, 39], LLaMA [40, 41, 42], Qwen [43, 44, 45], Gemini [46, 47], and Claude [48]. PPO [22] is the original RL algorithm used in RLHF. While stable, its reliance on a dependable reward model and numerous hyper-parameters has led to the exploration of alternatives. DPO [23] has attracted attention for its strong performance and for removing the need for a separate reward model. However, it has not yet matched PPO’s performance [29], motivating efforts to close this gap. Methods such as ORPO [49], CPO [50], TPO [51], and SimPO [52] aim to better align models with preference data by removing reference policy constraints. Nevertheless, these approaches lack comprehensive validation across various datasets and modalities. More pertinent to our work, iterative DPO [53] and SPPO [54] address off-policy issues by sampling preferred responses on-policy. This approach is adopted by RLAIF-V [20], but it is inefficient at addressing persistent hallucinations in multimodal contexts.

Hallucination for LVLMs.

Hallucination reduction in LVLMs has garnered significant attention as a major misalignment issue [55, 12, 14]. We categorize the methodologies into two classes. The first class, termed RL-free, primarily investigates the decoding mechanisms of translating visual information into language output, with some studies focusing on attention patterns [56, 57, 58] and others examining distribution shifts when image information is distorted [59, 60]. Additionally, some research explores the intriguing effects of special tokens on hallucinations, such as 'EOS' [61] and '\n' [62].

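Table 5: Comparison of algorithms utilizing DPO to address hallucinations.

| Algorithm | Expert Correction | On-policy Data | Image Focus |
|---|---|---|---|
| HALVA [17] | | | |
| POVID [16] | | | |
| RLHF-V [15] | | | |
| HA-DPO [18] | | | |
| HSA-DPO [19] | | | |
| RLAIF-V [20] | | | |
| mDPO [27] | | | |
| OPA-DPO | | | |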

The second class, RL-based, employs the RLHF framework to gather feedback from humans or AI systems with superhuman capabilities. Compared with RL-free methods, RL-based methods generally demonstrate superior results on benchmarks designed to assess the reduction of hallucination. Within this class, only a few studies adopt PPO [21], while the majority, like our work, choose DPO [17, 16, 15, 18, 19, 20, 27, 26]. Given the inherent vulnerabilities associated with the naive adoption of DPO in LVLMs, we summarize the characteristics of various algorithms across three dimensions: expert correction, on-policy data, and image focus, as presented in Table 5. Our OPA-DPO is the only algorithm that considers all three aspects, thereby achieving SOTA performance across multiple metrics.

7 Conclusions

In conclusion, our study uncovers a crucial characteristic of DPO: its heavy reliance on on-policy data. By examining dataset distribution, we identify and systematically summarize the inherent flaws in existing DPO-based algorithms for addressing hallucination issues. To address these shortcomings, we introduce On-Policy Alignment (OPA)-DPO, a framework that integrates the strengths of various approaches. OPA-DPO leverages expert feedback to correct hallucinated responses and ensures that both the original and expert-revised responses are aligned on-policy. Remarkably, with only 4.8k training samples, LLaVA-1.5-7B and LLaVA-1.5-13B trained with OPA-DPO achieve SOTA performance on over half of the hallucination-related benchmarks, surpassing other DPO-based algorithms for mitigating hallucination, which generally require more than 10k samples.

References

  • [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
  • [2] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems, 2024.
  • [3] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning. PMLR, 2023.
  • [4] OpenAI. GPT-4V(ision) System Card. 2023.
  • [5] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2024.
  • [6] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [7] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • [8] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems, 2024.
  • [9] Shenghuan Sun, Gregory M Goldgof, Alexander Schubert, Zhiqing Sun, Thomas Hartvigsen, Atul J Butte, and Ahmed Alaa. Dr-LLaVA: Visual instruction tuning with symbolic clinical grounding. arXiv preprint arXiv:2405.19567, 2024.
  • [10] Stephanie L Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, et al. MAIRA-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668, 2023.
  • [11] Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Anton Schwaighofer, Anja Thieme, Sam Bond-Taylor, Maximilian Ilse, Fernando Pérez-García, Valentina Salvatelli, Harshita Sharma, Felix Meissen, Mercy Ranjit, Shaury Srivastav, Julia Gong, Noel C. F. Codella, Fabian Falck, Ozan Oktay, Matthew P. Lungren, Maria Teodora Wetscherek, Javier Alvarez-Valle, and Stephanie L. Hyland. MAIRA-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449, 2024.
  • [12] Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930, 2024.
  • [13] Wei Lan, Wenyi Chen, Qingfeng Chen, Shirui Pan, Huiyu Zhou, and Yi Pan. A survey of hallucination in large visual language models. arXiv preprint arXiv:2410.15359, 2024.
  • [14] Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253, 2024.
  • [15] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. RLHF-V: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [16] Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Aligning modalities in vision large language models via preference fine-tuning. In ICLR 2024 Workshop on Reliable and Responsible Foundation Models, 2024.
  • [17] Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö. Arık, and Tomas Pfister. Data-augmented phrase-level alignment for mitigating object hallucination. arXiv preprint arXiv:2405.18654, 2024.
  • [18] Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, and Conghui He. Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.
  • [19] Wenyi Xiao, Ziwei Huang, Leilei Gan, Wanggui He, Haoyuan Li, Zhelun Yu, Hao Jiang, Fei Wu, and Linchao Zhu. Detecting and mitigating hallucination in large vision language models via fine-grained AI feedback. arXiv preprint arXiv:2404.14233, 2024.
  • [20] Tianyu Yu, Haoye Zhang, Yuan Yao, Yunkai Dang, Da Chen, Xiaoman Lu, Ganqu Cui, Taiwen He, Zhiyuan Liu, Tat-Seng Chua, et al. RLAIF-V: Aligning mllms through open-source AI feedback for super gpt-4v trustworthiness. arXiv preprint arXiv:2405.17220, 2024.
  • [21] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525, 2023.
  • [22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [23] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2024.
  • [24] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • [25] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
  • [26] Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, and Qi Liu. VLFeedback: A large-scale AI feedback dataset for large vision-language models alignment. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
  • [27] Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mDPO: Conditional preference optimization for multimodal large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
  • [28] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • [29] Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, and Yi Wu. Is DPO superior to PPO for llm alignment? a comprehensive study. International Conference on Machine Learning, 2024.
  • [30] Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. An LLM-free multi-dimensional benchmark for MLLMs hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023.
  • [31] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018.
  • [32] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
  • [33] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
  • [34] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  • [35] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
  • [36] Alec Radford. Improving language understanding by generative pre-training. OpenAI blog, 2018.
  • [37] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
  • [38] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
  • [39] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [40] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  • [41] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [42] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [43] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • [44] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023.
  • [45] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
  • [46] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [47] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • [48] Anthropic. The Claude 3 model family: Opus, sonnet, haiku.
  • [49] Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
  • [50] Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. In International Conference on Machine Learning, 2024.
  • [51] Amir Saeidi, Shivanshu Verma, Aswin RRV, and Chitta Baral. Triple preference optimization: Achieving better alignment with less data in a single step optimization. arXiv preprint arXiv:2405.16681, 2024.
  • [52] Yu Meng, Mengzhou Xia, and Danqi Chen. SimPO: Simple preference optimization with a reference-free reward. In Advances in Neural Information Processing Systems, 2024.
  • [53] Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason E Weston. Iterative reasoning preference optimization. Advances in Neural Information Processing Systems, 2024.
  • [54] Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. arXiv preprint arXiv:2405.00675, 2024.
  • [55] Anku Rani, Vipula Rawte, Harshad Sharma, Neeraj Anand, Krishnav Rajbangshi, Amit Sheth, and Amitava Das. Visual hallucination: Definition, quantification, and prescriptive remediations. arXiv preprint arXiv:2403.17306, 2024.
  • [56] Xuan Gong, Tianshi Ming, Xinpeng Wang, and Zhihua Wei. DAMRO: Dive into the attention mechanism of LVLM to reduce object hallucination. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
  • [57] Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. OPERA: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [58] Fan Yuan, Chi Qin, Xiaogang Xu, and Piji Li. Helpd: Mitigating hallucination of lvlms by hierarchical feedback learning with vision-enhanced penalty decoding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
  • [59] Xintong Wang, Jingheng Pan, Liang Ding, and Chris Biemann. Mitigating hallucinations in large vision-language models with instruction contrastive decoding. arXiv preprint arXiv:2403.18715, 2024.
  • [60] Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • [61] Zihao Yue, Liang Zhang, and Qin Jin. Less is more: Mitigating multimodal hallucination from an EOS decision perspective. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024.
  • [62] Zongbo Han, Zechen Bai, Haiyang Mei, Qianli Xu, Changqing Zhang, and Mike Zheng Shou. Skip \n: A simple method to reduce hallucination in large vision-language models. arXiv preprint arXiv:2402.01345, 2024.

Appendix

The Appendix is organized as follows:

  • In Chapter A, we offer a comprehensive description of our implementation details, complementing the information presented in Chapter 5.1. The training details and hyperparameter settings are reported in Chapter A.1, while the GPT-4V prompt and corresponding examples are provided in Chapter A.2. (Footnote 3: Our implementation is available at https://github.com/zhyang2226/OPA-DPO. We also provide OPA-DPO trained LLaVA-v1.5-7B and 13B models, along with the corresponding training dataset.)

  • In Chapter B, we supply additional experimental results. The helpfulness-related benchmark evaluations are reported in Chapter B.1, and additional ablation studies on hyperparameter choices are presented in Chapter B.2.

  • In Chapter C, we provide additional analytical examples to complement the example presented in Figure 6.

Appendix A Implementation Details

A.1 Training Details.

Importantly, we emphasize that OPA-DPO does not depend on detailed hyperparameter tuning for different base models or training datasets. In our experiments, we apply OPA-DPO to two LVLMs with different parameter sizes: LLaVA-v1.5-7B and LLaVA-v1.5-13B. We maintain consistent hyperparameter settings for OPA-DPO across different models and training datasets.

As shown in Figure 3, the initial step in OPA-DPO involves instructing the model (slated for training) to generate responses $\mathbf{y}_{\mathrm{Gen}}$ based on pre-collected images $\mathbf{m}$ and prompts $\mathbf{x}$. Notably, we employ a combination of "top-k" and "top-p" sampling methods to select tokens with relatively high sampling probabilities according to the initial policy, thereby revealing the intrinsic hallucinations of the policy itself. For token sampling, we set $\mathrm{topk}=30$, $\mathrm{topp}=0.95$, and use a temperature of $1.0$.
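
The sampling configuration above can be illustrated with a minimal sketch in plain Python. It assumes one common convention (temperature scaling, then top-k, then top-p applied to the renormalized top-k distribution); the paper does not state the order in which the filters are applied, and the function names `top_k_top_p_filter` and `sample_token` are hypothetical, not taken from the released implementation.

```python
import math
import random


def top_k_top_p_filter(logits, top_k=30, top_p=0.95, temperature=1.0):
    """Return {token_id: prob} after temperature scaling, top-k, then top-p."""
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most probable tokens (descending order), renormalize.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)
    k_probs = {i: probs[i] / k_mass for i in ranked}

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += k_probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    p_mass = sum(k_probs[i] for i in kept)
    return {i: k_probs[i] / p_mass for i in kept}


def sample_token(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = top_k_top_p_filter(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

With the defaults (`top_k=30`, `top_p=0.95`, `temperature=1.0`), low-probability tail tokens are excluded, so sampled responses stay close to what the initial policy is likely to produce.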

Algorithm 1 OPA-DPO Training
Phase 1 Training: On-Policy Alignment
Initial policy $\pi_{\theta}$, datasets $\mathcal{D}=\{\mathbf{x},\mathbf{m},\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}}\}^{N}$
1: for SFT epochs do
2:     for $\{\mathbf{x},\mathbf{m},\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}}\}^{M_{1}}\sim\mathcal{D}$ do
3:         Calculate loss in Eq. (4) for $\pi_{\theta}(\mathbf{y}_{\mathrm{GT}}|\mathbf{x},\mathbf{m})$ and $\pi_{\theta}(\mathbf{y}_{\mathrm{Rev}}|\mathbf{x},\mathbf{m})$
4:         Update $\pi_{\theta}$
5:     end for
6: end for
7: return OPA policy $\pi_{\mathrm{OPA}}=\pi_{\theta}^{\mathrm{final}}$
Phase 2 Training: OPA-DPO
Initial policy $\pi_{\theta'}=\pi_{\mathrm{OPA}}$; hyperparameters $\beta,\gamma_{1},\gamma_{2},\delta$; datasets $\mathcal{D}=\{\mathbf{x},\mathbf{m},\mathbf{y}_{\mathrm{Gen}},\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}},S_{\mathrm{hal}},S_{\mathrm{img}}\}^{N}$
8: for OPA-DPO epochs do
9:     for $\{\mathbf{x},\mathbf{m},\mathbf{y}_{\mathrm{Gen}},\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}},S_{\mathrm{hal}},S_{\mathrm{img}}\}^{M_{2}}\sim\mathcal{D}$ do
10:         Calculate loss in Eq. (7) with $\{\mathbf{x},\mathbf{m},\mathbf{y}_{\mathrm{Gen}},\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}},S_{\mathrm{hal}}\}^{M_{2}}$
11:         Produce distorted image $\mathbf{m}'=\mathbf{m}\odot\mathrm{pixel\_mask}$
12:         Calculate loss in Eq. (8) with $\{\mathbf{x},\mathbf{m},\mathbf{m}',\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}},S_{\mathrm{img}}\}^{M_{2}}$
13:         Calculate loss in Eq. (9) with $\{\mathbf{x},\mathbf{m},\mathbf{y}_{\mathrm{GT}},\mathbf{y}_{\mathrm{Rev}}\}^{M_{2}}$
14:         Combine the losses as in Eq. (10)
15:         Update $\pi_{\theta'}$
16:     end for
17: end for
18: return OPA-DPO policy $\pi_{\mathrm{OPA}}^{\mathrm{DPO}}=\pi_{\theta'}^{\mathrm{final}}$

Following that, GPT-4V is tasked with identifying hallucinations by evaluating the generated responses at the sentence level. Each sentence in a response is assigned a hallucination severity score, $S_{\mathrm{hal}}$, on a scale from one to four, indicating the severity of any hallucination present. As introduced in Eq. (7), this score is incorporated into hallucination-weighted policy updating, with the corresponding mapping between scores and weights provided in Table 6. Additionally, GPT-4V is required to categorize sentences containing incorrect descriptions as either image recognition errors or language comprehension errors. The classification result $S_{\mathrm{img}}$ is used for image-weighted policy updating, as defined in Eq. (8). Table 7 outlines the mapping between these classifications and their respective updating weights. Note that both $S_{\mathrm{hal}}$ and $S_{\mathrm{img}}$ are evaluated at the sentence level, so $W_{\mathrm{hal}}$ and $W_{\mathrm{img}}$ take the same value for every token within a sentence. Last but most importantly, GPT-4V is also instructed to make minimal revisions to any erroneous sentences, and the aggregate of these revised sentences is denoted as $\mathbf{y}_{\mathrm{Rev}}$. Please refer to Chapter A.2 for the detailed prompt and an example. In our implementation, we utilize the GPT-4V version from 2024-02-15, with the generation temperature set to 0.

Table 6: GPT-4V assigned hallucination scores and the corresponding update weights for language correction loss, as described in Eq. (7).
Hallucination Severity | Score from GPT-4V ($S_{\mathrm{hal}}$) | Updating Weight ($W_{\mathrm{hal}}$)
Not at all | 4 | 2.5
Minor | 3 | 2.0
Major | 2 | 1.5
Totally | 1 | 1.0
Table 7: GPT-4V labeled error types and the corresponding update weights for image focus loss, as described in Eq. (8).
Label from GPT-4V ($S_{\mathrm{img}}$) | Updating Weight ($W_{\mathrm{img}}$)
correct | 1.0
language_comprehension_error | 1.0
image_recognition_error | 3.0
Figure 7: Example of feedback from GPT-4V. Hallucinated parts in the base-model generated responses are marked in red, and missing details are highlighted in yellow. Note that the feedback from GPT-4V also contains hallucinations, as highlighted in green.
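
As a minimal Python sketch of how the sentence-level labels could be turned into per-token weights, the snippet below transcribes the mappings of Tables 6 and 7 and broadcasts each sentence's weight to all of its tokens. The function name `token_weights` and the per-sentence token counts passed to it are illustrative assumptions, not part of the released implementation.

```python
# Score/label -> weight mappings transcribed from Tables 6 and 7.
W_HAL = {4: 2.5, 3: 2.0, 2: 1.5, 1: 1.0}
W_IMG = {
    "correct": 1.0,
    "language_comprehension_error": 1.0,
    "image_recognition_error": 3.0,
}


def token_weights(sentence_token_counts, hal_scores, img_labels):
    """Broadcast sentence-level weights to every token of each sentence.

    sentence_token_counts: hypothetical list with the number of tokens per sentence.
    hal_scores:            S_hal per sentence (integers 1..4).
    img_labels:            S_img per sentence (keys of W_IMG).
    """
    if not (len(sentence_token_counts) == len(hal_scores) == len(img_labels)):
        raise ValueError("need one score and one label per sentence")
    w_hal, w_img = [], []
    for n_tokens, score, label in zip(sentence_token_counts, hal_scores, img_labels):
        # Every token in a sentence shares that sentence's weight.
        w_hal.extend([W_HAL[score]] * n_tokens)
        w_img.extend([W_IMG[label]] * n_tokens)
    return w_hal, w_img
```

For a response with two sentences of 2 and 3 tokens, scored 4 ("correct") and 2 ("image_recognition_error"), the sketch yields five hallucination weights and five image weights, constant within each sentence.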

After completing the data collection, we proceed with a two-phase training for the initial models as detailed in Algorithm 1. The first training phase (lines 1-7), termed On-Policy Alignment (OPA), involves performing a 2-epoch LoRA-SFT on both ground-truth responses and GPT-4V revised responses. The entire backbone model, including the vision encoder and multimodal connection layers, is wrapped with LoRA modules. We employ a cosine learning rate schedule beginning at 2e-5 with a batch size of 128. The LoRA rank is set to 256, and LoRA alpha is set to 512. The updated policy from this phase is denoted as $\pi_{\mathrm{OPA}}$, which serves as the initial (reference) policy for the subsequent OPA-DPO training. The second training phase (lines 8-18) uses the same LoRA module as in phase 1, extending over 4 additional epochs with a batch size of 32 and a cosine learning rate schedule starting at 1e-6. In our equations, we set $\beta=0.1$ in Eqs. (7), (8), and (9), $\delta=0$ in Eq. (9), and $\gamma_{1}=0.2$, $\gamma_{2}=1.0$ in Eq. (10). For the distorted images $\mathbf{m}'$ in Eq. (8), we randomly mask 30% of pixels, assigning the masked areas the average pixel values. For ablation studies on the relevant hyperparameters, please refer to Chapter B.2.
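
The image distortion step can be sketched as follows, assuming NumPy and an image stored as an (H, W) or (H, W, C) array. Two details are assumptions rather than statements from the paper: the mask is drawn per pixel position with an exact count of masked positions, and "average pixel values" is read as the per-channel mean over the whole image. The function name `mask_image` is hypothetical.

```python
import numpy as np


def mask_image(image, mask_ratio=0.3, rng=None):
    """Replace a random fraction of pixel positions with the image's mean pixel."""
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image, dtype=np.float32)
    h, w = image.shape[:2]

    # Choose round(mask_ratio * H * W) distinct pixel positions to mask.
    n_masked = int(round(mask_ratio * h * w))
    flat_idx = rng.choice(h * w, size=n_masked, replace=False)
    mask = np.zeros(h * w, dtype=bool)
    mask[flat_idx] = True
    mask = mask.reshape(h, w)

    # Per-channel mean over all pixels (assumed meaning of "average pixel values").
    mean_pixel = image.reshape(h * w, -1).mean(axis=0)

    distorted = image.copy()
    distorted[mask] = mean_pixel  # broadcast the mean over masked positions
    return distorted, mask
```

Passing a seeded generator, for example `mask_image(img, 0.3, np.random.default_rng(0))`, makes the distortion reproducible; unmasked pixels are returned unchanged.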

A.2 Prompts for GPT-4V

To obtain fine-grained feedback from GPT-4V, we crafted a detailed prompt, as shown in the text box below. To establish a one-to-one correspondence between the revised and original responses, we instruct GPT-4V to first copy the generated sentence before proceeding with assessment and revision. Additionally, we request that GPT-4V provide the rationale behind its assigned score or revision. It is important to note that GPT-4V may itself produce hallucinations, which can affect the reliability of its feedback. An example is provided in Figure 7.

GPT-4V Prompt for Fine-Grained Sentence-Level Revision of Generated Responses

Your role is as a discerning assistant tasked with evaluating and refining responses for multimodal tasks. Upon being presented with a question that requires the interpretation of both text and images, you will receive two distinct responses. The first is crafted by our sophisticated multimodal model, while the second represents an approximate ideal answer; it may be incomplete or incorrect. You will also be provided with the images pertinent to the question.

Your objective is to meticulously assess these responses. You are to enhance the model-generated response by making precise, minimal modifications that bring it into closer alignment with both the image and the approximate ideal answer. Your revisions should preserve the integrity of the original response as much as possible.

Be mindful that the approximate ideal response may not contain all the necessary information to fully address the question or may include mistakes. In such cases, you must carefully evaluate the accuracy of the model-generated response by consulting the image, which serves as the primary reference.

Your analysis should prioritize the information provided in the image to ascertain the accuracy and completeness of the model-generated response. The ultimate goal is to ensure that the final response is both accurate in relation to the images and as informative as possible while remaining true to the content originally produced by the model.

Your task involves meticulous scrutiny of the generated response to a multimodal task, sentence by sentence. Here’s how you should approach the revision process:

Evaluate each sentence within the generated response.
- If a sentence is both accurate and relevant to the task, it should remain unchanged.
- If you encounter a sentence that is only partially correct, carefully adjust the erroneous or incomplete segments to improve its precision. Ensure that these modifications are minimal and directly address the inaccuracies.
- If you find any sentences that contain hallucinations or extraneous information, these must be either rephrased or replaced entirely. Use the image and the approximate ideal response as your sources for correction, aiming to retain the essence of the original content when possible.

You are to present your output in a structured JSON format. Begin with the key “image_description” where a comprehensive description of the provided images should be articulated. Following this, evaluate the generated response sentence by sentence. For each sentence, craft a JSON object that contains the original sentence, your refined version, and a brief commentary explaining your revisions. The format is as follows:

1. “copied_content”: Copy and paste the original sentence as it appears in the generated response.
2. “score”: Provide a score between 1 and 4, reflecting the sentence’s accuracy and relevance to the image and question:
- 4 for a sentence that is completely accurate and relevant, aligning perfectly with the image information and the approximate ideal answer, requiring no adjustments.
- 3 for a sentence that is largely correct but needs minor tweaks, like an accurate object described with an incorrect count or size.
- 2 for a sentence with substantial issues requiring significant changes, such as incorrect object recognition or incorrect relationships between objects.
- 1 for a sentence that is completely irrelevant or incorrect, with no relation to the image or the question at hand.
3. “error_type”: Specify the type of error detected in the sentence:
- “correct” if the sentence is accurate or requires only minor adjustments, applicable only to a score of 4.
- “image_recognition_error” when the error arises from an incorrect interpretation of the visual content, like mistaking an apple for a pear.
- “language_comprehension_error” when the image is correctly understood, but the language used is incorrect, such as placing the Eiffel Tower in Berlin instead of Paris.
4. “object”: List any objects that are hallucinated or misidentified, and provide the correct identification. Leave this field empty if there are no hallucinations or misidentifications.
- For instance, if the sentence inaccurately identifies a cat sleeping on a table as a dog standing on a blanket, the “object” should be [“dog -> cat”, “standing -> sleeping”, “blanket -> table”].
5. “rewritten_content”: Present the corrected sentence after applying necessary adjustments, considering all information from the image captions and the approximate ideal answer.
6. “reason”: Explain the rationale for the given score, the identified error type, and any modifications made. This should include the reasoning behind changes and the decision to maintain certain parts of the original sentence.

If the rewritten sentences still lack essential information necessary for answering the given questions, add the missing part to the “Added” section and incorporate that missing information minimally. Only do this if absolutely necessary.

You should never bring other hallucinations into the rewritten parts. Only do the modifications when you are one hundred percent sure that the original sentence is incorrect or irrelevant. Please note that the rewritten sentence should retain as much of the generated response as possible. All unnecessary changes should be minimized.
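
To make the output format described in the prompt concrete, here is a hedged Python sketch that parses one feedback object and assembles the revised response from the "rewritten_content" fields. The prompt names the per-sentence keys but not the container that holds the sentence objects, so the top-level "sentences" key used here is purely hypothetical, as are the function name `parse_feedback` and the choice to join sentences with a single space.

```python
import json

# Per-sentence keys named in the prompt that this sketch relies on.
REQUIRED_KEYS = {"copied_content", "score", "error_type", "rewritten_content"}


def parse_feedback(raw_json):
    """Parse GPT-4V-style feedback into a revised response plus labels."""
    feedback = json.loads(raw_json)
    # "sentences" is a hypothetical container key; the prompt does not name one.
    records = feedback["sentences"]
    for rec in records:
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        if rec["score"] not in (1, 2, 3, 4):
            raise ValueError(f"score out of range: {rec['score']}")
    # The revised response is the concatenation of the rewritten sentences.
    y_rev = " ".join(rec["rewritten_content"] for rec in records)
    scores = [rec["score"] for rec in records]
    labels = [rec["error_type"] for rec in records]
    return y_rev, scores, labels
```

Because each record keeps "copied_content" alongside "rewritten_content", the one-to-one correspondence between original and revised sentences is preserved, which is what allows sentence-level scores to be attached to the generated response.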

Appendix B Additional Experiments

B.1 Helpfulness Benchmark Evaluations.

To demonstrate that the strong performance of OPA-DPO on hallucination-related metrics does not come at the cost of helpfulness, we evaluated various RLHF/RLAIF-based algorithms designed to enhance LVLMs on LLaVA-Bench [5], as shown in Table 8. The results indicate that the OPA-DPO-trained model performs at an upper-middle level. With the exception of LLaVA-RLHF, the algorithms show minimal variation in performance on LLaVA-Bench. However, LLaVA-RLHF is significantly less effective than the other algorithms on hallucination-related metrics.

Table 8: Comparison of RLAIF/RLHF-based algorithms for enhancing LVLMs on LLaVA-Bench.
LLaVA-Bench
Algorithm Conv.↑ Detail↑ Comp.↑ All↑
LLaVA-Instruct-1.5-7B [5, 6] 84.1 74.4 89.8 83.0
+ LLaVA-RLHF [21] 84.1 75.3 106.8 88.9
+ HA-DPO [18] 80.7 74.5 88.4 81.4
+ POVID [16] 84.9 77.3 90.3 84.3
+ RLAIF-V [20] 75.8 83.7 90.7 83.5
+ OPA-DPO (ours) 82.1 79.5 87.9 83.2
LLaVA-Instruct-1.5-13B [5, 6] 79.6 77.3 91.4 82.9
+ LLaVA-RLHF [21] 93.1 76.2 105.6 91.8
+ RLHF-V [15] 93.1 75.3 91.6 86.7
+ HSA-DPO [19] 76.0 71.8 88.2 80.5
+ OPA-DPO (ours) 87.1 78.3 90.7 85.5

B.2 Additional Ablation Studies.

Table 9: Ablation studies on the mask ratio of the distorted image and the term coefficient $\gamma_{1}$ in the image focus mechanism.
AMBER MMHal-Bench Object Hal
Model Size Mask Ratio IF Coef $\gamma_{1}$ Cover↑ HalRate↓ repeat Score↑ HalRate↓ CHAIRs↓ CHAIRi↓
7B 0.1 0.2 46.1 12.5 0.3% 2.60 0.49 14.67 4.28
7B 0.3 0.2 47.9 11.6 0.6% 2.83 0.45 13.00 4.25
7B 0.5 0.2 46.3 11.6 0.3% 2.73 0.47 14.67 4.18
7B 0.7 0.2 45.6 12.2 0.2% 2.69 0.47 13.33 4.45
7B 0.3 0.1 46.2 12.3 0.6% 2.70 0.47 14.67 4.05
7B 0.3 0.5 44.3 11.1 5.3% 2.79 0.45 14.67 4.32
7B 0.3 1.0 43.5 9.3 20.8% 2.26 0.59 9.67 2.98
13B 0.1 0.2 48.3 13.9 0.4% 2.84 0.45 19.00 6.16
13B 0.3 0.2 48.3 12.8 0.8% 3.07 0.39 16.33 5.48
13B 0.5 0.2 48.2 13.4 0.8% 2.99 0.41 18.00 5.41
13B 0.7 0.2 48.3 13.9 0.4% 2.84 0.45 19.00 6.16
13B 0.3 0.1 48.3 13.0 0.2% 2.95 0.42 18.33 5.89
13B 0.3 0.5 48.1 12.3 2.6% 2.97 0.44 16.67 5.35
13B 0.3 1.0 46.1 10.3 7.9% 2.72 0.44 16.33 5.00

As an important component of OPA-DPO, the image focus mechanism (see Chapter 4.2) involves two hyperparameters that require tuning: the term coefficient $\gamma_{1}$ in Eq. 10 and the mask ratio of the distorted image. We found that setting $\gamma_{1}$ to 0.2 and randomly masking 30% of the pixels works best for both the LLaVA-1.5-7B and LLaVA-1.5-13B models. In contrast, mDPO [27], the algorithm that first introduced this mechanism, sets $\gamma_{1}$ to 1.0 and randomly masks a variable 0-20% of the pixels. We find that the mask ratio has only a slight impact on the model’s performance, whereas the coefficient $\gamma_{1}$ has a more significant effect. In particular, setting the coefficient too high yields excellent results on hallucination-rate metrics, but makes the model overly conservative and leaves its responses severely lacking in explanatory detail. Additionally, under greedy decoding the model tends to repeat its last sentence or words and fails to generate an EOS token. As a compromise, we set $\gamma_{1}=0.2$. Ablation studies supporting these findings are presented in Table 9.
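
The mask ratio discussed above can be illustrated with a short sketch of random pixel masking. This is an illustration under stated assumptions rather than the implementation used in our experiments: the function name mask_random_pixels is hypothetical, the image is assumed to be an (H, W, C) NumPy array, and masked pixels are assumed to be set to zero, which the text does not specify. The sketch covers only the construction of the distorted image, not the loss term weighted by $\gamma_{1}$.

```python
import numpy as np


def mask_random_pixels(image, mask_ratio=0.3, seed=None):
    """Return (distorted, mask): a copy of `image` with a random fraction
    of its pixel positions zeroed, and the boolean (H, W) mask that was used."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = np.random.default_rng(seed)
    height, width = image.shape[:2]
    num_masked = int(round(mask_ratio * height * width))

    # Sample distinct positions so that exactly num_masked pixels are masked.
    flat_idx = rng.choice(height * width, size=num_masked, replace=False)
    mask = np.zeros(height * width, dtype=bool)
    mask[flat_idx] = True
    mask = mask.reshape(height, width)

    distorted = image.copy()
    distorted[mask] = 0  # assumption: masked pixels are replaced by zeros
    return distorted, mask


# Toy 10x10 RGB image; with mask_ratio=0.3, 30 of the 100 pixels are masked.
image = np.ones((10, 10, 3), dtype=np.float32)
distorted, mask = mask_random_pixels(image, mask_ratio=0.3, seed=0)
```

Sampling positions without replacement keeps the realized ratio exact, which makes settings such as 0.1, 0.3, 0.5, and 0.7 directly comparable.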

Appendix C Additional Qualitative Examples

Image Descriptions.

As introduced in Chapter 5, OPA-DPO is particularly effective in preventing hallucinations because it adopts a somewhat conservative strategy that avoids uncertain assertions. Such a strategy significantly enhances the credibility of the responses but may lead to the omission of some ambiguous details, so a trade-off is necessary. In addition to the case presented in Chapter 5.5, we offer further examples involving detailed image descriptions, as illustrated in Figures 8, 9, 10, and 11. In these cases, the initial model’s output contained numerous hallucinations and flawed reasoning. This issue persisted even after training with naive DPO without OPA. However, after applying OPA to 4.8k samples, hallucinations were nearly eliminated, with only minor instances remaining. The subsequent implementation of OPA-DPO completely resolved these issues, although some details from the original description were omitted. It is important to note that the omitted details are often not central to the image’s main information, and their omission does not cause the overall description to deviate.

False Premise Queries.

Another interesting phenomenon we observed in our experiments is that LVLMs consistently hallucinate when presented with queries based on false premises. These queries refer to objects or details that do not exist in the image or are irrelevant to it. For example, the LVLM is asked to describe the girl’s outfit given a picture of a basketball. As demonstrated in Figures 12, 13, and 14, the base model consistently produces absurd responses to nonsensical questions due to linguistic inertia. Applying DPO without OPA generally does not change these responses. Furthermore, applying the OPA operation in isolation is sometimes insufficient to address the issue. However, when both methods are combined through training with OPA-DPO, the model is able to discern false premises in queries or prompts and provide reasoned responses.

Figure 8: Qualitative results of different models. Hallucinated parts are marked in red, and missing details are highlighted in yellow.
Figure 9: Qualitative results of different models. Hallucinated parts are marked in red, flawed reasoning is highlighted in blue, and missing details are highlighted in yellow.
Figure 10: Qualitative results of different models. Hallucinated parts are marked in red, and missing details are highlighted in yellow.
Figure 11: Qualitative results of different models. Hallucinated parts are marked in red, and missing details are highlighted in yellow.
Figure 12: Qualitative results of different models. Hallucinated parts are marked in red.
Figure 13: Qualitative results of different models. Hallucinated parts are marked in red.
Figure 14: Qualitative results of different models. Flawed reasoning is highlighted in blue, and missing details are highlighted in yellow.