HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning

1Columbia University, 2Google DeepMind

Abstract

Hallucination has been a major problem for large language models, and it remains a critical challenge in the multimodal setting, where vision-language models (VLMs) must handle not just textual but also visual inputs. Despite rapid progress in VLMs, resources for evaluating and addressing multimodal hallucination are limited and mostly focused on evaluation. This work introduces HaloQuest, a novel visual question answering dataset that captures various aspects of multimodal hallucination such as false premises, insufficient contexts, and visual challenges. A novel idea from HaloQuest is to leverage synthetic images, alongside real ones, to enable dataset creation at scale. With over 7.7K examples spanning a wide variety of categories, HaloQuest was designed to be both a challenging benchmark for VLMs and a fine-tuning dataset for advancing multimodal reasoning. Our experiments reveal that current models struggle with HaloQuest, with all open-source VLMs achieving below 36% accuracy. On the other hand, fine-tuning on HaloQuest significantly reduces hallucination rates while preserving performance on standard reasoning tasks. Our results also show that model performance on generated images is highly correlated (r = 0.97) with performance on real images. Last but not least, we propose a novel Auto-Eval mechanism whose judgments are highly correlated with human ratings (r = 0.99). In sum, this work makes concrete strides towards understanding, evaluating, and mitigating hallucination in VLMs, serving as an important step towards more reliable multimodal AI systems in the future.

Dataset Description

Summary

HaloQuest is a novel visual question answering (VQA) dataset that focuses on multimodal hallucination in vision-language models (VLMs). It contains 7,748 examples combining real and synthetically generated images, each annotated with questions and answers designed to trigger and evaluate hallucination.

Supported Tasks

HaloQuest supports tasks related to hallucination detection and reduction in VLMs, providing a challenging benchmark for Visual Question Answering. The dataset is useful for both evaluation and fine-tuning purposes, aiming to advance multimodal reasoning.

Dataset Details

Data Collection

HaloQuest includes a mix of real images from the Open Images dataset and synthetic images generated using Midjourney and Stable Diffusion. Images were curated based on interest and comprehensibility. Questions and answers were crafted by humans and large language models (LLMs), focusing on false premises, visually challenging questions, and questions with insufficient context.

Data Instances

Each entry in HaloQuest pairs an image with a question that requires nuanced reasoning and a detailed reference answer.
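For illustration, a single entry can be thought of as a record like the one below. The field names here are hypothetical, chosen to reflect the dataset description above; the actual schema may differ.

```python
# Hypothetical illustration of a HaloQuest record; the real schema may differ.
example = {
    "image_source": "generated",           # "real" (Open Images) or "generated" (Midjourney / Stable Diffusion)
    "question": "What color is the dragon's saddle?",   # a false-premise question: there is no saddle
    "question_type": "false_premise",      # false_premise | visually_challenging | insufficient_context
    "answers": ["The dragon is not wearing a saddle."], # reference answer(s)
    "split": "train",
}

print(example["question_type"])
```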

Data Splits

The dataset is split into training and evaluation sets. The following table provides detailed statistics for each subset.

|           | Real Images | Synthetic Images | False Premise Questions | Visually Challenging Questions | Insufficient Context Questions | Total Entries |
|-----------|------------:|-----------------:|------------------------:|-------------------------------:|-------------------------------:|--------------:|
| Train Set | 2,985       | 4,155            | 2,698                   | 2,973                          | 1,469                          | 7,140         |
| Eval Set  | 217         | 391              | 304                     | 183                            | 121                            | 608           |
| Total     | 3,202       | 4,546            | 3,002                   | 3,156                          | 1,590                          | 7,748         |
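The split statistics above are internally consistent: train and eval counts sum to the overall totals, real and synthetic images account for every entry in each split, and the three question types partition each split. A quick arithmetic check:

```python
# Split statistics copied from the table above.
train = {"real": 2985, "synthetic": 4155, "false_premise": 2698,
         "visually_challenging": 2973, "insufficient_context": 1469, "total": 7140}
eval_ = {"real": 217, "synthetic": 391, "false_premise": 304,
         "visually_challenging": 183, "insufficient_context": 121, "total": 608}

# Train + eval counts sum to the overall totals for every column.
totals = {k: train[k] + eval_[k] for k in train}
assert totals == {"real": 3202, "synthetic": 4546, "false_premise": 3002,
                  "visually_challenging": 3156, "insufficient_context": 1590, "total": 7748}

# Real + synthetic images account for every entry in each split.
assert train["real"] + train["synthetic"] == train["total"]
assert eval_["real"] + eval_["synthetic"] == eval_["total"]

# Likewise, the three question types partition each split.
assert train["false_premise"] + train["visually_challenging"] + train["insufficient_context"] == train["total"]
assert eval_["false_premise"] + eval_["visually_challenging"] + eval_["insufficient_context"] == eval_["total"]

print("split statistics are consistent")
```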


HaloQuest Leaderboard

(Gemini 1.0 Pro was used for Auto-Eval)

The leaderboard lists each model (with parameter count) and its rank, with paired Human Eval and Auto-Eval scores for the Overall benchmark and for the Generated, Real, False Premise, Visually Challenging, and Insufficient Context subsets.
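Auto-Eval scores come from an LLM (Gemini 1.0 Pro) judging model responses against reference answers. The sketch below shows the general shape of such an LLM-as-judge loop; the `judge` callable is a placeholder for the actual model call, and the prompt wording is an assumption, not the exact prompt used by HaloQuest.

```python
def build_judge_prompt(question: str, reference: str, response: str) -> str:
    """Format a yes/no judging prompt; the actual HaloQuest prompt may differ."""
    return (
        "You are grading a visual question answering response.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {response}\n"
        "Does the response agree with the reference answer? Answer 'yes' or 'no'."
    )

def auto_eval(examples, judge) -> float:
    """Return accuracy as scored by `judge`, a callable wrapping an LLM (e.g. Gemini)."""
    correct = sum(
        judge(build_judge_prompt(ex["question"], ex["answer"], ex["response"]))
        .strip().lower().startswith("yes")
        for ex in examples
    )
    return correct / len(examples)

# Toy run with a stub judge that checks string containment instead of calling an LLM.
stub = lambda p: "yes" if "Reference answer: blue" in p and "Model response: blue" in p else "no"
examples = [
    {"question": "What color is the car?", "answer": "blue", "response": "blue"},
    {"question": "What color is the car?", "answer": "blue", "response": "red"},
]
print(auto_eval(examples, stub))  # 0.5
```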

Contact

For any questions about HaloQuest, please contact olinzhecanwang@gmail.com.

BibTeX

@article{wang2024haloquest,
  title={HaloQuest: A Visual Hallucination Dataset for Advancing Multimodal Reasoning},
  author={Wang, Zhecan and Bingham, Garrett and Yu, Adams and Le, Quoc and Luong, Thang and Ghiasi, Golnaz},
  journal={arXiv preprint arXiv:2407.15680},
  year={2024}
}