MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

ICML 2024 Oral
Dongping Chen1*, Ruoxi Chen2*, Shilin Zhang1*, Yaochen Wang1*, Yinuo Liu1*, Huichi Zhou1*, Qihui Zhang1*, Yao Wan1, Pan Zhou1, Lichao Sun3
1Huazhong University of Science and Technology,
2Zhejiang University of Technology, 3Lehigh University

In this work, we introduce MLLM-as-a-Judge, which thoroughly explores three types of Multimodal LLM-as-a-Judge in vision-language settings. Specifically, our contributions are three-fold:
  1. A Benchmark. We are the first to develop a comprehensive benchmark, MLLM-AS-A-JUDGE, in multimodal domains, with human annotations to assess the judging capability of MLLMs on the tasks of Scoring Evaluation, Pair Comparison, and Batch Ranking.
  2. Two Datasets. We curate two human preference datasets: MLLM-AS-A-JUDGE-HQ, with high-quality questions, and MLLM-AS-A-JUDGE-HARD, containing hallucination instances. Together they serve as a rigorous testing ground to facilitate the development of MLLMs.
  3. Findings and Implications. Our evaluation of mainstream MLLMs reveals that while MLLMs exhibit alignment with human judgments in pair comparison tasks, notable discrepancies can be found in scoring evaluation and batch ranking. Furthermore, our findings reveal that MLLMs exhibit a range of biases and hallucinations, along with inconsistent judgments during the evaluation process, representing significant hurdles in establishing MLLMs as reliable judges.

Abstract

Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Drawing inspiration from the concept of LLM-as-a-Judge within LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs in assisting judges across diverse modalities, encompassing three distinct tasks: *Scoring Evaluation*, *Pair Comparison*, and *Batch Ranking*. Our study reveals that, while MLLMs demonstrate remarkable human-like discernment in *Pair Comparison*, there is a significant divergence from human preferences in *Scoring Evaluation* and *Batch Ranking*. Furthermore, a closer examination reveals persistent challenges in the judgment capacities of LLMs, including diverse biases, hallucinatory responses, and inconsistencies in judgment, even in advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research efforts to be undertaken before regarding MLLMs as fully reliable evaluators. In light of this, we advocate for additional efforts dedicated to supporting the continuous development within the domain of MLLM functioning as judges.

Takeaway

While individual MLLMs (e.g., LLaVA and GPT-4V) perform better on some datasets and worse on others, we wish to underscore that our **MLLM-as-a-Judge** benchmark leads to a solid conclusion: GPT-4V consistently outperforms the baselines across diverse datasets on average. However, it remains noteworthy that GPT-4V does not entirely supplant human judges on particular datasets, as elaborated in Section 4.1 of our paper. This overarching perspective on benchmarking **MLLM-as-a-Judge** underscores the central focus of our study, which is to assess MLLM judging performance from a comprehensive standpoint rather than evaluating individual MLLM performance on specific datasets.

Experiment Setups

  1. Models. We evaluate the judging performance of eleven leading MLLMs, including GPT-4V, Gemini-Pro-Vision-1.0, LLaVA-1.5-13b, LLaVA-1.6-7b/13b/34b, Qwen-VL-Plus/Max, and CogVLM, across three distinct evaluation settings. Adopting the “Analyze-then-Judge” paradigm, a one-step CoT approach, we first ask MLLMs to analyze responses and then provide a judgment based on their analysis.
  2. Metrics. After collecting responses from MLLM judges, we quantify their alignment with human annotations across the three settings, employing distinct metrics as follows (a minimal code sketch of these metrics follows this list):
    • Scoring Evaluation: Following LLM-as-a-Judge, we compute the Pearson similarity between the MLLMs’ judgments and human ratings across different sub-datasets.
    • Pair Comparison: We measure the similarity between MLLM judgments and human decisions using accuracy, F1-score, and recall.
    • Batch Ranking: We consolidate the ranking results into a single sequence and employ the Normalized Levenshtein distance to evaluate the similarity between MLLM judgments and human annotations.
  3. Human Evaluation. Apart from traditional metrics for assessing the similarity between MLLM and human judgments, we further evaluate the judgments provided by MLLMs to uncover latent bias and hallucination across 10 datasets. We also invite human annotators for further validation, focusing on the following aspects:
    • Human Agreement: This involves a simple ‘yes’ or ‘no’ response to assess agreement with the MLLM judgments. While some judgments might appear reasonable, they may still be considered incorrect due to unique human perspectives. Hence, we conduct experiments on human agreement to address situations that traditional metrics may not adequately capture.
    • Analysis Grading: Each MLLM analysis is assigned a score from 1 to 5, considering relevance, accuracy, creativity, and response granularity, detailed in Appendix F.
    • Hallucination Detection: Given the propensity for hallucination in the complex reasoning chains and long-term vision-language contexts of MLLMs, we task human annotators with identifying any hallucinations in the analyses of MLLM judgments, adhering to established definitions of vision and language hallucination.
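The following minimal sketch (our illustration, not the authors' released code) shows how the three alignment metrics above can be computed, assuming judgments and human labels have already been collected as plain Python lists; the exact Levenshtein normalization used in the paper may differ.

```python
# Hedged sketch of the three alignment metrics described above.
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, recall_score
import Levenshtein  # pip install python-Levenshtein


def scoring_alignment(mllm_scores, human_scores):
    """Pearson similarity between MLLM scores and human ratings."""
    r, p_value = pearsonr(mllm_scores, human_scores)
    return r, p_value


def pair_alignment(mllm_choices, human_choices):
    """Accuracy / F1 / recall between MLLM and human pairwise decisions
    (labels such as 'A', 'B', or 'tie')."""
    return {
        "accuracy": accuracy_score(human_choices, mllm_choices),
        "f1": f1_score(human_choices, mllm_choices, average="macro"),
        "recall": recall_score(human_choices, mllm_choices, average="macro"),
    }


def batch_alignment(mllm_ranking, human_ranking):
    """Normalized Levenshtein distance between two ranking strings,
    e.g. 'ABCD' vs. 'ACBD'; lower means closer to the human ranking."""
    dist = Levenshtein.distance(mllm_ranking, human_ranking)
    return dist / max(len(mllm_ranking), len(human_ranking))
```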
The overall performance of different MLLMs in judging, compared with human annotations on different datasets. We sample all the data three times and report the average to reduce the effect of randomness. w. and w.o. tie denote tie and non-tie settings, respectively. We omit Gemini’s results on the diffusion task because of its difficulty in processing AI-generated images. All reported Pearson similarity values have a p-value below 0.05, indicating a statistically significant level of confidence.
Settings MLLM COCO C.C. Diff. Graphics Math Text WIT Chart VisIT CC-3M M2W SciQA Aes MM-Vet Ave.
Score (↑) LLaVA-1.5-13b 0.247 0.227 0.060 0.242 0.093 0.245 0.109 0.237 0.177 0.071 0.424 0.279 0.414 0.322 0.225
LLaVA-1.6-34b 0.285 0.251 -0.012 0.262 0.238 0.258 0.151 0.318 0.198 0.109 0.022 0.206 0.025 0.265 0.184
Gemini 0.262 0.408 - 0.400 0.228 0.222 0.418 0.343 0.336 0.374 0.324 0.073 0.360 0.207 0.304
GPT-4V 0.454 0.507 0.458 0.645 0.606 0.624 0.579 0.645 0.620 0.431 0.185 0.383 0.401 0.326 0.490
Qwen-vl-max 0.311 0.117 0.072 0.218 0.175 0.196 0.028 0.312 0.151 0.045 0.244 0.115 0.177 0.216 0.170
Pair w. Tie (↑) LLaVA-1.5-13b 0.273 0.478 0.286 0.273 0.657 0.510 0.369 0.383 0.456 0.484 0.347 0.223 0.389 0.254 0.384
LLaVA-1.6-34b 0.493 0.600 0.570 0.300 0.374 0.551 0.543 0.254 0.398 0.392 0.513 0.434 0.524 0.499 0.460
Gemini 0.616 0.787 - 0.650 0.436 0.664 0.605 0.500 0.660 0.560 0.370 0.262 0.190 0.312 0.509
GPT-4V 0.696 0.824 0.847 0.639 0.564 0.673 0.679 0.657 0.640 0.612 0.521 0.415 0.606 0.529 0.636
Qwen-vl-max 0.403 0.464 0.372 0.494 0.438 0.500 0.533 0.479 0.421 0.421 0.411 0.392 0.325 0.474 0.438
Pair w.o. Tie (↑) LLaVA-1.5-13b 0.327 0.537 0.302 0.300 0.726 0.684 0.600 0.610 0.648 0.583 0.449 0.443 0.498 0.344 0.504
LLaVA-1.6-34b 0.607 0.824 0.855 0.402 0.587 0.750 0.758 0.381 0.503 0.564 0.712 0.679 0.694 0.762 0.648
Gemini 0.717 0.840 - 0.770 0.678 0.793 0.688 0.658 0.711 0.652 0.471 0.358 0.265 0.400 0.615
GPT-4V 0.804 0.870 0.922 0.807 0.801 0.805 0.734 0.849 0.761 0.703 0.699 0.647 0.755 0.659 0.773
Qwen-vl-max 0.657 0.674 0.556 0.667 0.635 0.732 0.647 0.638 0.560 0.586 0.608 0.646 0.741 0.662 0.644
Batch (↓) LLaVA-1.5-13b 0.577 0.492 0.562 0.535 0.598 0.650 0.616 0.644 0.620 0.563 0.639 0.563 0.650 0.652 0.597
LLaVA-1.6-34b 0.449 0.411 0.500 0.561 0.575 0.544 0.483 0.552 0.542 0.479 0.529 0.437 0.500 0.450 0.501
Gemini 0.287 0.299 - 0.473 0.462 0.430 0.344 0.520 0.426 0.357 0.613 0.412 0.467 0.529 0.432
GPT-4V 0.318 0.353 0.070 0.385 0.348 0.319 0.290 0.347 0.300 0.402 0.597 0.462 0.453 0.411 0.361
Qwen-vl-max 0.477 0.407 0.500 0.480 0.507 0.515 0.493 0.539 0.468 0.407 0.563 0.503 0.444 0.500 0.486

Empirical Results

MLLM Judgment vs Human Annotation

  1. Scoring Evaluation: GPT-4V demonstrated the highest similarity to human scoring, with a similarity score of 0.557. In contrast, Gemini achieved only 0.332, with LLaVA and CogVLM scoring even lower. This discrepancy is primarily due to Gemini’s tendency to assign scores around 4 points, seldom giving 1 or 2 points. LLaVA and CogVLM show a similar pattern, predominantly assigning scores around 4 points. We attribute this to a ‘High-Score’ bias, akin to the ‘Yes/No’ bias, which may result from an imbalance of positive and negative judging instructions in their training data and severely limits their ability to provide fair and varied scores in the scoring setting. In comparison, GPT-4V’s scores are more evenly distributed and align closely with human preferences.
  2. Pair Comparison: GPT-4V outshines other MLLMs in pair comparison, achieving 0.683 in the tie setting and 0.806 in the non-tie setting, and surpassing 0.8 on many datasets, indicating strong alignment with human preferences. Gemini, LLaVA, and CogVLM show a marked preference for declaring a clear winner, possibly due to a lack of tie situations in their training, leading to biased judgments. It is also noteworthy that the frequency of ties given by GPT-4V closely mirrors that of human judges, suggesting similar thresholds for tie decisions.
  3. Batch Ranking: GPT-4V aligns most closely with human ranking results, showing a significant lead with a mean Levenshtein distance of 0.313; however, there is still substantial room for improvement on this task for all MLLMs. Notably, CogVLM is unable to provide a full ranking in this setting, offering only its top choice, so it was excluded from this comparison. LLaVA also exhibits position bias influenced by prompt structure, often replicating the judgments seen in example prompts, which undermines its ability to produce fair judgments.

MLLM Judging Consistency

To be a reliable judge, a model must make consistent decisions across repeated evaluations of the same query. For this purpose, we conducted six repeated tests of MLLM judgments and calculated the weighted average consistency scores and Majority Consistency Criterion ratios for GPT-4V and Gemini. Despite a higher temperature setting, GPT-4V substantially outperforms Gemini across all tasks. In Pair Comparison, GPT-4V achieves a consistency score of 0.675, but it struggles to maintain similar levels of consistency in Scoring and Batch Ranking, where its scores drop to 0.611 and 0.418, highlighting the challenge of producing reliable and convincing judgments in these settings.
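The sketch below illustrates one plausible way to quantify consistency over repeated runs: the fraction of repeats that agree with the majority judgment, and the share of queries passing a majority threshold. This is our hedged reading; the paper's exact weighted-consistency and Majority Consistency Criterion definitions may differ.

```python
# Illustrative consistency measures over repeated judgments (assumed definitions).
from collections import Counter


def consistency_score(repeated_judgments):
    """Fraction of repeated judgments that agree with the majority answer
    for a single query (six repeats in our setting)."""
    label, count = Counter(repeated_judgments).most_common(1)[0]
    return count / len(repeated_judgments)


def majority_consistency_ratio(per_query_judgments, threshold=0.5):
    """Share of queries whose majority answer covers more than `threshold`
    of the repeats."""
    passed = [q for q in per_query_judgments if consistency_score(q) > threshold]
    return len(passed) / len(per_query_judgments)


# Example: six repeated pair-comparison judgments for one query.
print(consistency_score(["A", "A", "tie", "A", "B", "A"]))  # 0.666...
```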

Vision Perception benefits Judging

We explore the feasibility of using LLMs to judge text-based responses without directly analyzing the original images. This involves two approaches: omitting vision information entirely, and providing a detailed description of the picture. Surprisingly, we find that LLMs’ performance in multimodal judging tasks improves significantly with picture descriptions, achieving a Pearson similarity of 0.435 in Scoring Evaluation and markedly outperforming judgments made without any visual information. Notably, in non-tie Pair Comparison, LLMs given detailed vision descriptions even exceed the standard judging performance of MLLMs. This suggests that MLLMs may lack certain human-like judging capabilities, while LLMs can effectively judge multimodal tasks when provided with comprehensive task-related descriptions.

Human Agreement

Our manual evaluation of MLLM judgments, focusing on agreement and analysis grading, reveals notable findings. GPT-4V achieved around 70% human agreement across all settings, excelling in the Pair Comparison task with 79.3% agreement. Specifically, GPT-4V reached 78% human agreement on Pair Comparison, with Gemini close behind at 72%, indicating strong performance on most sample pairs and supporting the idea that large models excel at pairwise distinctions (Zheng et al., 2023b), though improvements are needed in the other judging settings. In the Scoring Evaluation task, GPT-4V achieved a 70% human agreement rate, peaking at 79.9% on MS-COCO, while Gemini maintained an average rate of 67.7%.

To assess the consistency of MLLM judging quality across multiple responses to a single image-instruction pair, we employ the Mean Absolute Deviation (MAD) metric, which measures the average absolute difference between individual scores and their mean, thereby gauging quality variability. Figure 16 shows that GPT-4V exhibits lower variation in its quality assessments than Gemini, indicating more consistent and reliable judgment, which is further evidenced by its superior performance.

In Batch Ranking, however, both models showed lower human agreement: GPT-4V managed 69%, and Gemini only 47%. Their analyses also received lower grades, especially on complex tasks such as math and graphics reasoning, suggesting that the models’ inherent capabilities may not fully support understanding and completing intricate user instructions to provide accurate judgments.
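As a reference, the Mean Absolute Deviation used above can be computed in a few lines; this is a generic sketch of the standard MAD formula, not code from the paper.

```python
# Mean Absolute Deviation (MAD) of judge scores given to multiple responses
# for the same image-instruction pair: lower MAD = more consistent quality.
def mean_absolute_deviation(scores):
    """Average absolute difference between each score and the mean score."""
    mean = sum(scores) / len(scores)
    return sum(abs(s - mean) for s in scores) / len(scores)


# Example with illustrative scores (not the paper's data):
print(mean_absolute_deviation([4, 5, 4, 3, 5]))  # 0.64
```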

Bias and Hallucination


Egocentric Bias

Egocentric bias means that a model assigns higher scores to its own responses while scoring others lower. GPT-4V exhibits a slight degree of egocentricity. This contrasts with Gemini, which tends to judge each response more equitably, displaying a similar scoring distribution across different sources. Further investigation into the rationale behind GPT-4V’s self-favoring behavior indicates that its judgments align closely with its own ethical guidelines. For instance, when faced with questions involving user privacy, GPT-4V’s responses typically emphasize privacy preservation and refuse to engage, leading to higher self-scoring in these scenarios. Despite efforts in prompt engineering to encourage impartiality, these models inherently rely on the built-in judgment criteria retained from post-alignment, which can lead to a divergence from human preferences. Such a discrepancy highlights the complexity of aligning MLLM judgments with human standards.

Position Bias

Position bias means that a model consistently favors answers in specific positions, often as a result of training data that typically places correct responses at the beginning or end of prompts. Figure 4 illustrates this bias in LLaVA and CogVLM, which show a distinct preference for one particular option in Pair Comparison tasks, habitually selecting the answer in their favored position. Such bias might arise from their limited instruction-following capabilities, making their judgments disproportionately influenced by the structure of prompts. For example, when a Batch Ranking prompt includes a sample answer sequence such as ‘ABCD’, LLaVA tends to replicate this sequence in its responses 88.2% of the time, significantly more than any other sequence. However, introducing multiple examples in the prompt appears to lessen this bias, as evidenced by a reduced position-bias score of 53.3% when two examples are provided. This suggests that augmenting prompts with more examples can help guide these models to adhere more closely to the given instructions.
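The replication rate reported above can be measured with a small helper like the one below; this is a hedged illustration with hypothetical outputs, not the authors' evaluation script.

```python
# Fraction of batch-ranking outputs that exactly copy the example sequence
# shown in the prompt, used here as a simple copy/position-bias indicator.
def replication_rate(model_rankings, example_sequence="ABCD"):
    copies = sum(1 for r in model_rankings if r == example_sequence)
    return copies / len(model_rankings)


# Hypothetical outputs for illustration only:
outputs = ["ABCD", "ABCD", "BACD", "ABCD"]
print(replication_rate(outputs))  # 0.75
```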

Length Bias

Length bias means that models prefer longer answers over concise but correct ones, also known as verbosity bias (Zheng et al., 2023b). As illustrated in Figure 6, both GPT-4V and Gemini are inclined to award higher scores and preference to longer content. To probe this bias further, we conducted an expanded scoring experiment using GPT-4, which lacks vision perception, to semantically lengthen answers without altering their original meaning. As shown in Figure 7, the scores assigned by GPT-4V and Gemini increased noticeably, with average gains of 0.6 and 0.75 points, respectively. This finding demonstrates the presence of verbosity bias and suggests that longer responses can act as a backdoor to higher scores from MLLM judges.
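The score-shift computation behind this probe is straightforward; the sketch below, with made-up numbers, shows how the average gain after lengthening can be obtained once the original and lengthened answers have both been re-judged.

```python
# Average per-item score increase after answers are semantically lengthened.
def average_score_gain(original_scores, lengthened_scores):
    assert len(original_scores) == len(lengthened_scores)
    diffs = [after - before
             for before, after in zip(original_scores, lengthened_scores)]
    return sum(diffs) / len(diffs)


# Illustrative numbers only (not the paper's data):
print(average_score_gain([3.0, 4.0, 3.5], [3.5, 4.5, 4.5]))  # 0.666...
```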

Hallucination Detection and Mitigation

We observe a higher incidence of hallucinations in Batch Ranking than in Pair Comparison and Scoring Evaluation, which may stem from misunderstandings of the long-term context. Delving deeper, we encounter more severe language hallucinations, including miscomprehension of textual meaning and errors in text retrieval, which significantly impact the accuracy and reliability of the final judgments. To mitigate hallucination, we apply multi-step CoT on MLLM-AS-A-JUDGE-HARD by instructing MLLMs to judge step by step, performing extra reasoning before the standard “Analyze-then-Judge” setting on: 1) the image-instruction pair, 2) the image, and 3) the instruction. As shown in Table 6 of the paper, hallucinations are mitigated across all settings, with extra reasoning over image information yielding the most notable improvement in both the scoring and pair tasks. Notably, in the Batch Ranking task, which involves analyzing longer texts, additional reasoning steps significantly reduce hallucinations.
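For illustration, the sketch below shows one way such an extra reasoning step could be prepended to an “Analyze-then-Judge” judging prompt; the step wordings are our assumptions, and the actual prompts used in the paper (see its appendix) may differ.

```python
# Hedged sketch of a multi-step CoT judging prompt builder (assumed wording).
EXTRA_STEPS = {
    "image-instruction": "First, describe the image and restate the instruction in your own words.",
    "image": "First, describe the key visual content of the image.",
    "instruction": "First, restate the instruction and what a good answer must contain.",
}


def build_judge_prompt(instruction, responses, extra_step="image"):
    """Compose a prompt: extra reasoning step, then analyze each response,
    then output a judgment (score / preference / ranking)."""
    numbered = "\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    steps = [
        EXTRA_STEPS[extra_step],
        "Then analyze each response with respect to the image and the instruction.",
        "Finally, give your judgment (a score, a preferred answer, or a ranking).",
    ]
    return f"Instruction: {instruction}\n{numbered}\n\n" + "\n".join(steps)


# Example usage with placeholder inputs:
print(build_judge_prompt("Describe the chart.", ["Answer A text", "Answer B text"]))
```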

Models

Model Model Size Open-Weight Version Creator Source Link
GPT-4V(ision) unknown No - OpenAI OpenAI API
Gemini-Pro-Vision unknown No v1.0 Google Google API
Gemini-Pro-Vision unknown No v1.0-latest Google Google API
Qwen-VL-Max unknown No - Alibaba Alibaba API
Qwen-VL-Plus unknown No - Alibaba Alibaba API
Qwen-VL-Chat 9.6b Yes - Alibaba HuggingFace
LLaVA-1.6-34b 34b Yes v1.6 Microsoft HuggingFace
LLaVA-1.6-13b 13b Yes v1.6 Microsoft HuggingFace
LLaVA-1.6-7b 7b Yes v1.6 Microsoft HuggingFace
LLaVA-1.5-13b 13b Yes v1.5 Microsoft HuggingFace
CogVLM 16b Yes - Tsinghua HuggingFace

Detailed Selected Dataset


Datasets and corresponding tasks in benchmark construction; each task is matched with several required abilities (Rec.-Recognition, Comp.-Comprehension, Inf.-Inferential, Mul.-Multilingual)
Dataset Image type Task Ability Required Image-Inst. Pair Batch Score Pair
Conceptual Captions Web Image Captioning Rec.&Comp. 300 100 398 597
ChartQA Chart Chart reasoning Rec.&Comp. 300 100 400 600
InfographicVQA Infographics Graph reasoning Rec.&Comp. 300 100 398 573
MathVista Mathematics Math reasoning Rec.&Comp.&Inf. 300 200 793 1185
TextVQA Text Text reading Rec.&Comp. 300 100 399 582
WIT Multilingual text Transcription Rec.&Mul. 300 100 399 582
MS COCO Real-life scene Image Segmentation Rec.&Comp. 300 100 398 617
DiffusionDB Diffusion Comprehensive Rec.&Comp.&Inf. 300 100 299 300
CC-3M Concept-balanced Comprehensive Comprehensive Rec.&Comp.&Inf. 300 100 396 597
VisIT-Bench Comprehensive Instruction Following Rec.&Comp.&Inf. 300 100 398 594
Mind2Web WebUI screenshot Instruction Following Rec.&Comp. 300 100 399 600
ScienceQA Comprehensive Comprehensive Rec.&Comp.&Inf. 300 100 398 588
AesBench Diffusion Image Assessment Rec.&Comp.&Inf. 300 100 397 553
MM-Vet Comprehensive Instruction Following Rec.&Comp.&Inf. 214 70 259 336

BibTeX

@article{chen2024mllm,
  title={MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark},
  author={Chen, Dongping and Chen, Ruoxi and Zhang, Shilin and Liu, Yinuo and Wang, Yaochen and Zhou, Huichi and Zhang, Qihui and Zhou, Pan and Wan, Yao and Sun, Lichao},
  journal={arXiv preprint arXiv:2402.04788},
  year={2024}
}

MLLM-as-a-Judge Team