Leaderboard


Disclaimer

Although we have incorporated as many datasets as possible, the assessment cannot be exhaustive, and the results may still contain some bias. The evaluation outcomes do not represent individual positions. We also strongly discourage using the test set as training data to improve model performance, as this would significantly impede progress in the field. More trustworthy LLMs are expected to score higher on metrics marked ↑ and lower on metrics marked ↓.

Score Evaluation (↑)

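| Model  | COCO  | C.C.   | Diffusion | Graphics | Math  | Text  | WIT    | Chart  | VisIT | CC-3M | Average Score |
|--------|-------|--------|-----------|----------|-------|-------|--------|--------|-------|-------|---------------|
| LLaVA  | 0.247 | 0.227  | 0.060     | 0.242    | 0.093 | 0.245 | 0.109  | 0.237  | 0.177 | 0.071 | 0.171         |
| CogVLM | 0.107 | -0.048 | 0.049     | -0.158   | 0.065 | 0.097 | -0.131 | -0.135 | 0.278 | 0.157 | 0.028         |
| Gemini | 0.262 | 0.408  | -         | 0.400    | 0.228 | 0.222 | 0.418  | 0.343  | 0.336 | 0.374 | 0.332         |
| GPT-4V | 0.454 | 0.507  | 0.458     | 0.645    | 0.606 | 0.624 | 0.579  | 0.645  | 0.620 | 0.431 | 0.557         |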
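The Average Score column appears to be the unweighted mean over the datasets that have a value, with the missing Gemini entry for Diffusion left out of the mean rather than counted as zero. This is an inference from the numbers, not something the page states. A minimal sketch that recomputes the column from the Score Evaluation rows, with illustrative names (`SCORES`, `average_score`) that are not part of the benchmark:

```python
from statistics import mean

# Column order of the Score Evaluation table.
DATASETS = ["COCO", "C.C.", "Diffusion", "Graphics", "Math",
            "Text", "WIT", "Chart", "VisIT", "CC-3M"]

# Per-dataset values copied from the table; None stands for the "-" entry.
SCORES = {
    "LLaVA":  [0.247, 0.227, 0.060, 0.242, 0.093, 0.245, 0.109, 0.237, 0.177, 0.071],
    "CogVLM": [0.107, -0.048, 0.049, -0.158, 0.065, 0.097, -0.131, -0.135, 0.278, 0.157],
    "Gemini": [0.262, 0.408, None, 0.400, 0.228, 0.222, 0.418, 0.343, 0.336, 0.374],
    "GPT-4V": [0.454, 0.507, 0.458, 0.645, 0.606, 0.624, 0.579, 0.645, 0.620, 0.431],
}


def average_score(values):
    """Mean over the datasets that have a value, rounded to three decimals.

    Assumption: missing entries are skipped, not treated as 0.
    """
    present = [v for v in values if v is not None]
    return round(mean(present), 3)


averages = {model: average_score(vals) for model, vals in SCORES.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.3f}")
```

Under that assumption the recomputed values match the Average Score column for all four models, which suggests the Gemini average is taken over nine datasets instead of ten.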

Pair Comparison w. Tie (↑)

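| Model  | COCO  | C.C.  | Diffusion | Graphics | Math  | Text  | WIT   | Chart | VisIT | CC-3M | Average Score |
|--------|-------|-------|-----------|----------|-------|-------|-------|-------|-------|-------|---------------|
| LLaVA  | 0.273 | 0.478 | 0.286     | 0.273    | 0.657 | 0.510 | 0.369 | 0.383 | 0.456 | 0.484 | 0.417         |
| CogVLM | 0.548 | 0.409 | 0.562     | 0.613    | 0.412 | 0.250 | 0.273 | 0.262 | 0.324 | 0.433 | 0.409         |
| Gemini | 0.616 | 0.787 | -         | 0.650    | 0.436 | 0.664 | 0.605 | 0.500 | 0.660 | 0.560 | 0.609         |
| GPT-4V | 0.696 | 0.824 | 0.847     | 0.639    | 0.564 | 0.673 | 0.679 | 0.657 | 0.640 | 0.612 | 0.683         |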

Pair Comparison w.o. Tie (↑)

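| Model  | COCO  | C.C.  | Diffusion | Graphics | Math  | Text  | WIT   | Chart | VisIT | CC-3M | Average Score |
|--------|-------|-------|-----------|----------|-------|-------|-------|-------|-------|-------|---------------|
| LLaVA  | 0.327 | 0.537 | 0.302     | 0.300    | 0.726 | 0.684 | 0.600 | 0.610 | 0.648 | 0.583 | 0.532         |
| CogVLM | 0.654 | 0.450 | 0.643     | 0.704    | 0.481 | 0.292 | 0.500 | 0.423 | 0.500 | 0.591 | 0.524         |
| Gemini | 0.717 | 0.840 | -         | 0.770    | 0.678 | 0.793 | 0.688 | 0.658 | 0.711 | 0.652 | 0.723         |
| GPT-4V | 0.804 | 0.870 | 0.922     | 0.807    | 0.801 | 0.805 | 0.734 | 0.849 | 0.761 | 0.703 | 0.806         |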

Batch Ranking (↓)

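| Model  | COCO  | C.C.  | Diffusion | Graphics | Math  | Text  | WIT   | Chart | VisIT | CC-3M | Average Score |
|--------|-------|-------|-----------|----------|-------|-------|-------|-------|-------|-------|---------------|
| LLaVA  | 0.577 | 0.492 | 0.562     | 0.535    | 0.598 | 0.650 | 0.616 | 0.644 | 0.620 | 0.563 | 0.586         |
| Gemini | 0.287 | 0.299 | -         | 0.473    | 0.462 | 0.430 | 0.344 | 0.520 | 0.426 | 0.357 | 0.400         |
| GPT-4V | 0.318 | 0.353 | 0.070     | 0.385    | 0.348 | 0.319 | 0.290 | 0.347 | 0.300 | 0.402 | 0.313         |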
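Because the arrows differ between tables, ranking models requires sorting in the direction each metric indicates: descending for ↑ metrics and ascending for ↓ metrics. The sketch below applies that rule to the Average Score values from the four tables. The names `AVERAGES` and `rank_models` are illustrative and not part of the benchmark, and CogVLM is absent from Batch Ranking because the table has no row for it.

```python
# Average Score per model for each setting, copied from the tables above.
# The first tuple element records the arrow: True for "↑", False for "↓".
AVERAGES = {
    "Score Evaluation": (True, {
        "LLaVA": 0.171, "CogVLM": 0.028, "Gemini": 0.332, "GPT-4V": 0.557}),
    "Pair Comparison w. Tie": (True, {
        "LLaVA": 0.417, "CogVLM": 0.409, "Gemini": 0.609, "GPT-4V": 0.683}),
    "Pair Comparison w.o. Tie": (True, {
        "LLaVA": 0.532, "CogVLM": 0.524, "Gemini": 0.723, "GPT-4V": 0.806}),
    "Batch Ranking": (False, {
        "LLaVA": 0.586, "Gemini": 0.400, "GPT-4V": 0.313}),
}


def rank_models(higher_is_better, scores):
    """Return model names ordered from best to worst for one metric."""
    return sorted(scores, key=scores.get, reverse=higher_is_better)


rankings = {
    setting: rank_models(higher_is_better, scores)
    for setting, (higher_is_better, scores) in AVERAGES.items()
}

for setting, order in rankings.items():
    print(f"{setting}: {' > '.join(order)}")
```

On these averages GPT-4V ranks first and Gemini second in every setting, with LLaVA ahead of CogVLM wherever both appear.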