
Abstract
Large vision-language models have recently achieved remarkable progress, exhibiting great perception and reasoning abilities concerning visual information. However, how to effectively evaluate these large vision-language models remains a major obstacle, hindering future model development. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but suffer from a lack of fine-grained ability assessment and non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model’s abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline, primarily comprised of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in terms of the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and incorporates the use of ChatGPT. This implementation is designed to convert free-form predictions into pre-defined choices, thereby facilitating a more robust evaluation of the model’s predictions. MMBench is a systematically-designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will assist the research community in better evaluating
their models and encourage future advancements in this domain. Project page:
https://opencompass.org.cn/MMBench.

Introduction
Recently, notable progress has been achieved within the realm of large language models (LLMs). For instance, the latest large language models, such as OpenAI's ChatGPT and GPT-4, have demonstrated remarkable reasoning capabilities that are comparable to, and in some cases even surpass, human capabilities. Drawing inspiration from these promising advancements in LLMs, the field of large vision-language models (LVLMs) has experienced a revolutionary transformation. Notable works, such as MiniGPT-4, Otter, and LLaVA, have demonstrated enhanced capabilities in terms of image content recognition and reasoning within the domain of vision-language models, surpassing the achievements of earlier models. Nevertheless, the majority of current studies tend to emphasize showcasing qualitative examples rather than undertaking comprehensive quantitative experiments to thoroughly assess model performance. The lack of quantitative assessment poses a considerable challenge for comparing various models.

[Figure: Results of six representative large vision-language models across the 20 ability dimensions defined in MMBench. For more comprehensive evaluation results on additional models, please refer to Table 6 and Table 5, as well as the appendix.]

Addressing this concern, recent studies have explored two approaches. The first approach involves utilizing existing public datasets for objective quantitative evaluation. Alternatively, some studies leverage human resources for subjective quantitative evaluation. However, it is worth noting that both approaches exhibit inherent limitations.

A multitude of public datasets, such as VQAv2, COCO Caption, GQA, and OK-VQA, have long served as valuable resources for the quantitative evaluation of vision-language models. These datasets offer objective metrics, including accuracy, BLEU, CIDEr, etc. However, when employed to evaluate more advanced LVLMs, these benchmarks encounter the following challenges:

(1). The existing evaluation metrics require an exact match between the prediction and the reference target, leading to potential limitations. For instance, in the Visual Question Answering (VQA) task, even if the prediction is “bicycle” while the reference answer is “bike”, the existing metric would assign a negative score to the current prediction, despite its correctness. Consequently, this issue results in a considerable number of false-negative samples.
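To make this failure mode concrete, here is a minimal Python sketch (a hypothetical illustration, not code from any of the benchmarks above) showing how a strict exact-match scorer rejects the semantically correct prediction from the example:

```python
# Hypothetical exact-match scorer: scores 1 only when the normalized strings
# are identical, so a correct but differently worded answer is marked wrong.
def exact_match(prediction: str, reference: str) -> int:
    return int(prediction.strip().lower() == reference.strip().lower())

print(exact_match("bike", "bicycle"))     # 0 -> false negative despite being correct
print(exact_match("bicycle", "bicycle"))  # 1
```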

(2). Current public datasets predominantly focus on evaluating a model's performance on specific tasks, offering limited insights into the fine-grained capabilities of these models. Additionally, they provide minimal feedback regarding potential avenues for future optimization.

Given the aforementioned challenges, recent studies, such as mPLUG-Owl and LVLM-eHub, propose human-involved subjective evaluation strategies, aiming to address the limitations of existing methods by incorporating human judgment and perception into the evaluation process. mPLUG-Owl comprises artificially constructed open-ended questions related to 50 images sourced from existing datasets. After predictions are generated by both mPLUG-Owl and another vision-language (VL) model, human annotators assess the quality of these predictions. Similarly, inspired by FastChat, LVLM-eHub develops an online platform where two models are prompted to answer a question related to an image, and a participant then compares the answers provided by the two models. These subjective evaluation strategies offer several advantages, including accurate matching (humans can accurately match a prediction to the target, even if presented in different formats) and comprehensive assessment (humans tend to compare two predictions based on various ability dimensions, such as the model's ability to correctly recognize objects in the image or comprehend the relationships between them). The final score is calculated as the average score across different abilities, enabling a comprehensive evaluation of various model capabilities.

While subjective evaluation allows for a more comprehensive assessment of a model's abilities, it also introduces new challenges. Firstly, human evaluations are inherently biased; consequently, it becomes challenging to reproduce the results presented in a paper with a different group of annotators. Secondly, existing subjective evaluation strategies face scalability issues: employing annotators for model evaluation after each experiment is an expensive endeavor. Moreover, small-scale evaluation datasets can result in statistical instability; ensuring a robust evaluation requires collecting more data, which in turn demands a significant amount of human labor.

In light of the challenges faced by conventional objective and subjective benchmarks, we propose MMBench, a systematically designed objective evaluation benchmark to robustly evaluate different abilities of large vision-language models. Currently, MMBench contains approximately 3,000 single-choice questions covering 20 different ability dimensions, such as object localization and social reasoning, for vision-language models. Each ability dimension includes more than 75 questions, enabling a balanced and comprehensive evaluation of various abilities. The ability dimensions are not static and will expand as we continue to work on them. Since the instruction-following ability of current vision-language models is weak and they cannot directly output choice labels (A, B, C, etc.), we cannot directly compare their output to the ground truth. To reduce the number of false-negative samples, we employ ChatGPT to match a model's prediction to one of the choices in a multiple-choice question and then output the label of the matched choice.
We compare ChatGPT-based choice matching to human judgment and find that ChatGPT matches human evaluations in 87% of ambiguous cases, demonstrating its good alignment and robustness as an evaluator. Besides, to make the evaluation more robust, we propose a novel evaluation strategy named CircularEval (details in Sec. 4.1). We comprehensively evaluate 14 well-known vision-language models on MMBench and report their performance on different ability dimensions. Additionally, we conduct comparative assessments between Bard, the largest contemporary multimodal model, and the open-source VLMs benchmarked in this work. The performance ranking offers a direct comparison between various models and provides valuable feedback for future optimization. In summary, our main contributions are three-fold:
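To illustrate the ChatGPT-based choice matching described above, the following minimal Python sketch assumes a generic helper query_chatgpt(prompt) -> str (hypothetical; MMBench's actual prompts and API calls may differ). The idea is to give ChatGPT the question, the candidate choices, and the VLM's free-form prediction, and ask it to return the label of the best-matching choice, or "X" if none matches.

```python
# Minimal sketch of LLM-assisted choice matching (assumed prompt format, not
# MMBench's exact implementation). `query_chatgpt` is a hypothetical callable
# that sends a prompt to a chat model and returns its text reply.

def build_matching_prompt(question: str, choices: dict, prediction: str) -> str:
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    return (
        "You are an answer matcher. Given a question, its options, and a model's "
        "free-form answer, reply with only the option label that best matches the "
        "answer, or 'X' if none matches.\n"
        f"Question: {question}\nOptions:\n{options}\nAnswer: {prediction}\nLabel:"
    )

def match_choice(question: str, choices: dict, prediction: str, query_chatgpt) -> str:
    """Map a free-form VLM prediction to one of the choice labels (or 'X')."""
    label = query_chatgpt(build_matching_prompt(question, choices, prediction)).strip()
    return label if label in choices else "X"

# Example with a stub in place of a real ChatGPT call:
# match_choice("What vehicle is shown?", {"A": "bicycle", "B": "car"},
#              "It looks like a bike.", query_chatgpt=lambda p: "A")
```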

  • Systematically-constructed Dataset: In order to thoroughly evaluate the capacity of a VLM, we have meticulously crafted a systematic framework that encompasses three distinct levels of abilities. Within these defined ability levels, we have carefully curated a dataset comprising a total of 2,974
    meticulously selected questions, which collectively cover a diverse spectrum of 20 fine-grained abilities.
  • Robust Evaluation: We introduce a novel circular evaluation strategy (CircularEval) to improve
    the robustness of our evaluation process. In addition, ChatGPT is employed to match a model's
    prediction with the given choices, which can successfully extract a choice even from the prediction
    of a VLM with poor instruction-following capability (see the sketch after this list).
  • Analysis and Observations: We perform a comprehensive evaluation of 14 well-known vision-language models using MMBench and report the results across various ability dimensions.
    The observations derived from this analysis, covering training data selection, model architecture design, and fine-tuning strategies, provide insights to the research community for future exploration.
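The following minimal Python sketch outlines a CircularEval-style scoring loop, based on our reading of the strategy summarized above (full details are in Sec. 4.1): each question is presented N times with its N choices circularly shifted, and it counts as correct only if the model answers correctly under every shift. The callable ask_vlm(question, labeled_choices) -> str is a hypothetical stand-in that returns the predicted label, e.g. after the choice matching sketched earlier.

```python
from collections import deque

# Sketch of circular evaluation under the stated assumptions; not the official
# MMBench code. Choices are assumed to be distinct strings.

def circular_eval(question: str, choices: list, answer_idx: int, ask_vlm) -> bool:
    labels = [chr(ord("A") + i) for i in range(len(choices))]
    rotated = deque(choices)
    for _ in range(len(choices)):
        shifted = list(rotated)
        # The correct option moves with the rotation; recompute its label.
        correct_label = labels[shifted.index(choices[answer_idx])]
        if ask_vlm(question, dict(zip(labels, shifted))) != correct_label:
            return False  # a single failed pass fails the whole question
        rotated.rotate(1)
    return True

# Overall accuracy is then the fraction of questions that pass all circular passes.
```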