Setting | MLLM | COCO | C.C. | Diff. | Graphics | Math | Text | WIT | Chart | VisIT | CC-3M | M2W | SciQA | Aes | MM-Vet | Ave.
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Score (↑) | LLaVA-1.5-13b | 0.247 | 0.227 | 0.060 | 0.242 | 0.093 | 0.245 | 0.109 | 0.237 | 0.177 | 0.071 | 0.424 | 0.279 | 0.414 | 0.322 | 0.225
Score (↑) | LLaVA-1.6-34b | 0.285 | 0.251 | -0.012 | 0.262 | 0.238 | 0.258 | 0.151 | 0.318 | 0.198 | 0.109 | 0.022 | 0.206 | 0.025 | 0.265 | 0.184
Score (↑) | Gemini | 0.262 | 0.408 | - | 0.400 | 0.228 | 0.222 | 0.418 | 0.343 | 0.336 | 0.374 | 0.324 | 0.073 | 0.360 | 0.207 | 0.304
Score (↑) | GPT-4V | 0.454 | 0.507 | 0.458 | 0.645 | 0.606 | 0.624 | 0.579 | 0.645 | 0.620 | 0.431 | 0.185 | 0.383 | 0.401 | 0.326 | 0.490
Score (↑) | Qwen-vl-max | 0.311 | 0.117 | 0.072 | 0.218 | 0.175 | 0.196 | 0.028 | 0.312 | 0.151 | 0.045 | 0.244 | 0.115 | 0.177 | 0.216 | 0.170
Pair w. Tie (↑) | LLaVA-1.5-13b | 0.273 | 0.478 | 0.286 | 0.273 | 0.657 | 0.510 | 0.369 | 0.383 | 0.456 | 0.484 | 0.347 | 0.223 | 0.389 | 0.254 | 0.384
Pair w. Tie (↑) | LLaVA-1.6-34b | 0.493 | 0.600 | 0.570 | 0.300 | 0.374 | 0.551 | 0.543 | 0.254 | 0.398 | 0.392 | 0.513 | 0.434 | 0.524 | 0.499 | 0.460
Pair w. Tie (↑) | Gemini | 0.616 | 0.787 | - | 0.650 | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.370 | 0.262 | 0.190 | 0.312 | 0.509
Pair w. Tie (↑) | GPT-4V | 0.696 | 0.824 | 0.847 | 0.639 | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.521 | 0.415 | 0.606 | 0.529 | 0.636
Pair w. Tie (↑) | Qwen-vl-max | 0.403 | 0.464 | 0.372 | 0.494 | 0.438 | 0.500 | 0.533 | 0.479 | 0.421 | 0.421 | 0.411 | 0.392 | 0.325 | 0.474 | 0.438
Pair w.o. Tie (↑) | LLaVA-1.5-13b | 0.327 | 0.537 | 0.302 | 0.300 | 0.726 | 0.684 | 0.600 | 0.610 | 0.648 | 0.583 | 0.449 | 0.443 | 0.498 | 0.344 | 0.504
Pair w.o. Tie (↑) | LLaVA-1.6-34b | 0.607 | 0.824 | 0.855 | 0.402 | 0.587 | 0.750 | 0.758 | 0.381 | 0.503 | 0.564 | 0.712 | 0.679 | 0.694 | 0.762 | 0.648
Pair w.o. Tie (↑) | Gemini | 0.717 | 0.840 | - | 0.770 | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.471 | 0.358 | 0.265 | 0.400 | 0.615
Pair w.o. Tie (↑) | GPT-4V | 0.804 | 0.870 | 0.922 | 0.807 | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.699 | 0.647 | 0.755 | 0.659 | 0.773
Pair w.o. Tie (↑) | Qwen-vl-max | 0.657 | 0.674 | 0.556 | 0.667 | 0.635 | 0.732 | 0.647 | 0.638 | 0.560 | 0.586 | 0.608 | 0.646 | 0.741 | 0.662 | 0.644
Batch (↓) | LLaVA-1.5-13b | 0.577 | 0.492 | 0.562 | 0.535 | 0.598 | 0.650 | 0.616 | 0.644 | 0.620 | 0.563 | 0.639 | 0.563 | 0.650 | 0.652 | 0.597
Batch (↓) | LLaVA-1.6-34b | 0.449 | 0.411 | 0.500 | 0.561 | 0.575 | 0.544 | 0.483 | 0.552 | 0.542 | 0.479 | 0.529 | 0.437 | 0.500 | 0.450 | 0.501
Batch (↓) | Gemini | 0.287 | 0.299 | - | 0.473 | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.613 | 0.412 | 0.467 | 0.529 | 0.432
Batch (↓) | GPT-4V | 0.318 | 0.353 | 0.070 | 0.385 | 0.348 | 0.319 | 0.290 | 0.347 | 0.300 | 0.402 | 0.597 | 0.462 | 0.453 | 0.411 | 0.361
Batch (↓) | Qwen-vl-max | 0.477 | 0.407 | 0.500 | 0.480 | 0.507 | 0.515 | 0.493 | 0.539 | 0.468 | 0.407 | 0.563 | 0.503 | 0.444 | 0.500 | 0.486
To be a reliable judge, an MLLM must make consistent decisions across repeated evaluations of the same query. To measure this, we ran each judgment six times and calculated the weighted-average consistency scores and Majority Consistency Criterion ratios for GPT-4V and Gemini. Despite a higher temperature setting, GPT-4V substantially outperforms Gemini across all tasks. It is most consistent in Pair Comparison, with a score of 0.675, but struggles to maintain a similar level of consistency in Scoring and Batch Ranking, where the scores drop to 0.611 and 0.418, underscoring how challenging it is to produce stable, convincing judgments.
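As a concrete illustration, a minimal Python sketch of how repeat consistency might be computed over six judgment runs is given below. The majority threshold and the unweighted average are assumptions for illustration; they are not the paper's exact definitions of the weighted-average consistency score or the Majority Consistency Criterion.

```python
from collections import Counter

def consistency_score(judgments):
    """Fraction of repeated runs that agree with the majority verdict."""
    majority_count = Counter(judgments).most_common(1)[0][1]
    return majority_count / len(judgments)

def meets_majority_criterion(judgments, threshold=4):
    """Assumed reading of the Majority Consistency Criterion: one verdict
    must appear in at least `threshold` of the repeated runs."""
    majority_count = Counter(judgments).most_common(1)[0][1]
    return majority_count >= threshold

# Six repeated Pair Comparison verdicts for one query (toy data).
runs = ["A", "A", "B", "A", "A", "Tie"]
print(consistency_score(runs))         # 0.666...
print(meets_majority_criterion(runs))  # True
```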
We explore the feasibility of using LLMs to judge text-based responses without directly analyzing the original images. This involves two approaches: omitting vision information entirely, or providing a detailed description of the picture. Surprisingly, we find that LLM performance in multimodal judging tasks improves significantly with picture descriptions, reaching a Pearson similarity of 0.435 in the Scoring Evaluation task and markedly outperforming judgments made without any vision information. Notably, in non-tie Pair Comparison, judges given detailed vision descriptions even exceed the standard performance of MLLMs judging the image directly. This suggests that MLLMs may lack certain human-like judging capabilities, while LLMs can judge multimodal tasks effectively when provided with comprehensive task-related descriptions.
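For reference, the Pearson similarity used in the Scoring Evaluation setting is an ordinary Pearson correlation between judge scores and human scores. A minimal sketch with hypothetical ratings (not the paper's data):

```python
from scipy.stats import pearsonr

# Hypothetical 1-5 ratings for the same set of responses.
judge_scores = [4, 3, 5, 2, 4, 1, 3, 5]
human_scores = [5, 3, 4, 2, 4, 2, 2, 5]

r, p_value = pearsonr(judge_scores, human_scores)
print(f"Pearson similarity: {r:.3f} (p = {p_value:.3f})")
```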
Our manual evaluation of MLLMs as judges, focusing on agreement and scoring, revealed notable findings. GPT-4V achieved around 70% human agreement across all settings, excelling in the Pair Comparison task with 79.3% agreement. Specifically, GPT-4V reached 78% human agreement for Pair Comparison, with Gemini close behind at 72%, indicating strong performance on most sample pairs and supporting the idea that large models excel at pairwise distinctions (Zheng et al., 2023b), though improvements are needed in the other judging settings. In the Scoring Evaluation task, GPT-4V achieved a 70% human agreement rate, peaking at 79.9% on MS-COCO, while Gemini maintained an average rate of 67.7%. To assess how consistently an MLLM judges multiple responses to a single image-instruction pair, we employed the Mean Absolute Deviation (MAD) metric, which measures the average absolute difference between individual scores and their mean and thereby gauges quality variability. Figure 16 shows that GPT-4V exhibits lower variation in its quality assessments than Gemini, indicating more consistent and reliable judgment, which is further evidenced by its superior performance. However, in Batch Ranking, both models showed lower human agreement: GPT-4V managed 69%, and Gemini only 47%. Their analyses also received lower scores, especially on complex tasks such as Math and Graphics. This suggests that the models' inherent capabilities may not fully support understanding and completing intricate user instructions well enough to provide accurate judgments.
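The MAD computation itself is straightforward; a minimal sketch with hypothetical scores for several responses to one image-instruction pair:

```python
import numpy as np

def mean_absolute_deviation(scores):
    """Average absolute difference between each score and the mean score,
    used to gauge variability in judging quality."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(np.abs(scores - scores.mean())))

print(mean_absolute_deviation([4, 5, 4, 3, 5]))  # 0.64
```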
Egocentric bias means a model assigns higher scores to its own responses while scoring others lower. GPT-4V exhibits a slight degree of egocentricity. This contrasts with Gemini, which tends to judge each response more equitably, displaying a similar scoring distribution across different sources. Further investigation into GPT-4V's self-favoring behavior indicates that its judgments align closely with its own ethical guidelines. For instance, when faced with questions involving user privacy, GPT-4V's responses typically emphasize privacy preservation and refuse to engage, leading to higher self-scoring in these scenarios. Despite prompt-engineering efforts to encourage impartiality, these models inherently rely on the built-in judgment criteria retained from alignment, which can diverge from human preferences. Such a discrepancy highlights the complexity of aligning MLLM judgments with human standards.
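One simple way to probe egocentric bias is to group a judge's scores by the source of each response and compare its mean score for its own outputs against everyone else's. The records below are hypothetical, purely to show the bookkeeping:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (response_source, score) records produced by a GPT-4V judge.
records = [
    ("GPT-4V", 5), ("Gemini", 3), ("LLaVA", 3),
    ("GPT-4V", 4), ("Gemini", 4), ("LLaVA", 2),
]

scores_by_source = defaultdict(list)
for source, score in records:
    scores_by_source[source].append(score)

own = mean(scores_by_source["GPT-4V"])
others = mean(s for src, scores in scores_by_source.items()
              if src != "GPT-4V" for s in scores)
print(f"Self: {own:.2f}  Others: {others:.2f}  Gap: {own - others:+.2f}")
```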
Position bias means a model consistently favors answers in specific positions, often influenced by training data that typically places correct responses at the beginning or end of prompts. Figure 4 illustrates this bias in LLaVA and CogVLM, which show a distinct preference for one particular option in Pair Comparison tasks, habitually selecting the answer in their favored position. Such bias might arise from their limited instruction-following capabilities, making their judgments disproportionately influenced by the structure of the prompt. For example, when a Batch Ranking prompt includes a sample answer sequence such as 'ABCD', LLaVA tends to replicate this sequence in its responses 88.2% of the time, significantly more often than any other sequence. However, introducing multiple examples in the prompt appears to lessen this bias, as evidenced by a reduced Position Bias score of 53.3% when two examples are provided. This suggests that augmenting prompts with more examples can help guide these models to adhere more closely to the given instructions.
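A simple proxy for this effect is the rate at which a judge's batch-ranking output exactly reproduces the sample sequence shown in the prompt. The sketch below uses hypothetical outputs and is not the paper's exact Position Bias metric:

```python
from collections import Counter

def replication_rate(rankings, template="ABCD"):
    """Fraction of batch-ranking outputs that copy the prompt's sample sequence."""
    return Counter(rankings)[template] / len(rankings)

# Hypothetical ranking strings returned by a judge across many queries.
outputs = ["ABCD", "ABCD", "BACD", "ABCD", "ABDC", "ABCD"]
print(f"Replication rate: {replication_rate(outputs):.1%}")  # 66.7%
```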
Length bias means models prefer longer answers over concise but correct ones, also known as verbosity bias (Zheng et al., 2023b). As illustrated in Figure 6, both GPT-4V and Gemini are inclined to award higher scores and stronger preference to longer content. To probe this bias further, we conducted an expanded scoring experiment in which GPT-4, which lacks vision perception, was used to lengthen answers without altering their original meaning. As shown in Figure 7, GPT-4V and Gemini assigned noticeably higher scores to the lengthened answers, with average gains of 0.6 and 0.75 points, respectively. This clearly demonstrates the presence of verbosity bias and suggests that longer responses can serve as a backdoor for obtaining higher scores from MLLM judges.
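The lengthening experiment reduces to comparing the judge's scores for the same answers before and after the rewrite. A minimal sketch with hypothetical scores:

```python
from statistics import mean

# Hypothetical judge scores for the same answers before and after GPT-4
# rewrote them to be longer without changing their meaning.
original_scores   = [3, 4, 2, 3, 4, 3]
lengthened_scores = [4, 4, 3, 4, 5, 3]

gain = mean(l - o for l, o in zip(lengthened_scores, original_scores))
print(f"Average gain after lengthening: {gain:+.2f} points")  # +0.67
```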
We observe a higher incidence of hallucination in Batch Ranking than in Pair Comparison and Scoring Evaluation, which may stem from misunderstanding the longer context. Delving deeper, we encounter more severe language hallucinations, including misreadings of textual meaning and errors in text retrieval, which significantly impact the accuracy and reliability of the final judgments. To mitigate hallucination, we apply multi-step CoT on MLLM-AS-A-JUDGE-HARD, instructing MLLMs to judge step by step and to perform an extra reasoning pass before the normal “Analyze-then-Judge” setting on: 1) the image-instruction pair, 2) the image only, or 3) the instruction only. As shown in Table 6 of the paper, hallucinations are mitigated across all settings, with extra reasoning over the image showing the most notable improvement in both the Score and Pair tasks. Notably, in the Batch Ranking task, which involves analyzing longer texts, more reasoning steps significantly reduce hallucinations.
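A rough sketch of how such a multi-step prompt could be assembled is shown below. The wording and the `build_judge_prompt` helper are illustrative assumptions, not the paper's actual prompts:

```python
# Illustrative prompt assembly: an extra reasoning pass (over the image, the
# instruction, or both) is requested before the usual "Analyze-then-Judge" step.
EXTRA_STEPS = {
    "image-instruction": "First, describe the image and restate the instruction in your own words.",
    "image": "First, describe the image in detail.",
    "instruction": "First, restate the instruction in your own words.",
}

def build_judge_prompt(instruction, responses, extra="image"):
    listed = "\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(responses))
    return (
        f"{EXTRA_STEPS[extra]}\n"
        "Then analyze each response against the instruction step by step, "
        "and finally give your judgment.\n\n"
        f"Instruction: {instruction}\n{listed}"
    )

print(build_judge_prompt("Describe the chart.", ["The chart shows...", "It depicts..."]))
```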
Dataset | Image type | Task | Ability Required | Image-Inst. Pair | Batch | Score | Pair |
---|---|---|---|---|---|---|---|
Conceptual Captions | Web Image | Captioning | Rec.&Comp. | 300 | 100 | 398 | 597 |
ChartQA | Chart | Chart reasoning | Rec.&Comp. | 300 | 100 | 400 | 600 |
InfographicVQA | Infographics | Graph reasoning | Rec.&Comp. | 300 | 100 | 398 | 573 |
MathVista | Mathematics | Math reasoning | Rec.&Comp.&Inf. | 300 | 200 | 793 | 1185 |
TextVQA | Text | Text reading | Rec.&Comp. | 300 | 100 | 399 | 582 |
WIT | Multilingual text | Transcription | Rec.&Mul. | 300 | 100 | 399 | 582 |
MS COCO | Real-life scene | Image segmentation | Rec.&Comp. | 300 | 100 | 398 | 617
DiffusionDB | Diffusion | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 299 | 300 |
CC-3M Concept-balanced | Comprehensive | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 396 | 597 |
VisIT-Bench | Comprehensive | Instruction following | Rec.&Comp.&Inf. | 300 | 100 | 398 | 594
Mind2Web | WebUI screenshot | Instruction following | Rec.&Comp. | 300 | 100 | 399 | 600
ScienceQA | Comprehensive | Comprehensive | Rec.&Comp.&Inf. | 300 | 100 | 398 | 588 |
AesBench | Diffusion | Image assessment | Rec.&Comp.&Inf. | 300 | 100 | 397 | 553
MM-Vet | Comprehensive | Instruction following | Rec.&Comp.&Inf. | 214 | 70 | 259 | 336
@article{chen2024mllm,
title={MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark},
author={Chen, Dongping and Chen, Ruoxi and Zhang, Shilin and Liu, Yinuo and Wang, Yaochen and Zhou, Huichi and Zhang, Qihui and Zhou, Pan and Wan, Yao and Sun, Lichao},
journal={arXiv preprint arXiv:2402.04788},
year={2024}
}