Vision-language models (VLMs), models that generate text given a hybrid text/visual prompt, have a wide range of use cases, including visual question answering, text-driven image creation and alteration, image captioning, and robotics. However, our current understanding of them is limited by incomplete reporting: results are missing for certain models on specific benchmarks, and the technical reports and blog posts released by model developers say little about the prompting methodology they used.
To increase transparency and better understand VLMs, we introduce the first version of the Holistic Evaluation of Vision-Language Models (VHELM v1.0), which extends the HELM framework to assess the performance of prominent VLMs.
We now have HELM Classic and Lite for language models, HEIM for text-to-image models, and VHELM for vision-language models. For this initial version of the VLM evaluation, we implemented three scenarios and added support for 6 state-of-the-art VLMs in HELM.
Vision-language models take text and visual input as a prompt to produce outputs, often in the form of text or visual content. In this initial effort, we focus on VLMs that can process interleaved text and images and generate text. VLMs can be applied to a diverse array of tasks, such as multimodal sentiment analysis, visual commonsense reasoning, and object detection given textual descriptions.
To evaluate VLMs, we start with the following scenarios (datasets):

- VQAv2: open-ended visual question answering over natural images.
- MMMU: multiple-choice, college-level questions spanning a broad range of academic subjects.
- VizWiz: visual questions asked by blind or low-vision users about photos they took.
In HELM, adaptation is the procedure that transforms a model, along with training instances, into a system that can make predictions on new instances. To evaluate VLMs, we initially reuse HELM's multiple-choice and generation adaptation methods, supporting both zero-shot prompting and in-context learning, but with multimodal inputs.
This expansion is feasible due to HELM’s modular design, which allows us to support new types of evaluation in the framework.
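To make the adaptation step concrete, here is a minimal sketch of how a zero-shot multiple-choice instance with an image might be turned into an interleaved text/image prompt. The `MediaObject`, `MultimodalPrompt`, and `adapt_multiple_choice` names below are illustrative simplifications for this post, not the actual HELM interfaces.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MediaObject:
    """A single piece of prompt content: plain text or a reference to an image file."""
    content_type: str  # "text" or "image"
    content: str       # the text itself, or a local path / URL to the image


@dataclass
class MultimodalPrompt:
    """An interleaved sequence of text and image objects sent to a VLM."""
    media_objects: List[MediaObject] = field(default_factory=list)

    def add_text(self, text: str) -> "MultimodalPrompt":
        self.media_objects.append(MediaObject("text", text))
        return self

    def add_image(self, image_path: str) -> "MultimodalPrompt":
        self.media_objects.append(MediaObject("image", image_path))
        return self


def adapt_multiple_choice(question: str, image_path: str, options: List[str]) -> MultimodalPrompt:
    """Zero-shot multiple-choice adaptation: image, question, lettered options, answer cue."""
    prompt = MultimodalPrompt().add_image(image_path).add_text(question)
    for letter, option in zip("ABCDEFGH", options):
        prompt.add_text(f"{letter}. {option}")
    prompt.add_text("Answer with the letter of the correct option.")
    return prompt
```

In-context learning works the same way, except that solved examples (their images, questions, and answers) are prepended to the prompt before the test instance.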
With VHELM, we evaluate the following 6 vision-language models from 4 different model developers, varying in size and access:
| Model | Developer | # of parameters | Access |
|---|---|---|---|
| GPT-4V | OpenAI | Unknown | Closed |
| Gemini 1.5 Pro | Google | Unknown | Closed |
| Gemini 1.0 Pro Vision | Google | Unknown | Closed |
| Claude 3 Sonnet | Anthropic | Unknown | Closed |
| Claude 3 Opus | Anthropic | Unknown | Closed |
| IDEFICS 2 | HuggingFace | 8B | Open |
In a zero-shot setting, we evaluate the models using up to 1000 examples from the three benchmarks. For MMMU, we evaluate only the multiple-choice questions, across all subjects, as they make up the vast majority of the examples; to compute accuracy, we check whether the model outputs the letter corresponding to the correct answer. For VizWiz and VQAv2, we normalize (lowercase; remove punctuation, articles, and extra whitespace) both the model's output and the correct answers before checking whether they are equal.
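For concreteness, the sketch below shows the kind of normalization and exact-match check described above; the specific regular expression and article list are assumptions for illustration and may differ from the exact rules used in VHELM.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, and collapse extra whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop articles
    return " ".join(text.split())                # collapse whitespace


def exact_match(prediction: str, references: list) -> float:
    """Return 1.0 if the normalized prediction equals any normalized reference, else 0.0."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


# Normalization makes the comparison robust to casing and punctuation, but a
# semantically correct answer phrased differently still scores 0 (see Limitations below).
assert exact_match("A red bus.", ["red bus"]) == 1.0
assert exact_match("two", ["2"]) == 0.0
```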
All the inputs and outputs can be found here for full transparency. Through our initial analysis, we uncovered the following key empirical findings:
Limitations. Although VQAv2 is a very popular benchmark for evaluating VLMs, open-ended QA benchmarks like it have limitations when scored with purely automated metrics such as exact match accuracy: a model response that is semantically correct but phrased differently from the expected answers is counted as wrong (Figure 2). A complete list of examples of these limitations can be found in GPT-4V's raw predictions to VQAv2 prompts.
VHELM v1.0 is by no means holistic, as we are only testing a few capabilities of these models with the three benchmarks. These findings are preliminary: we plan to continuously update and expand VHELM with more scenarios and models, filling in missing results for specific scenario/model pairs as the project progresses.
We have presented VHELM v1.0, which extends HELM to evaluate vision-language models. We started with 3 scenarios (VQAv2, MMMU, and VizWiz), which were used to evaluate 6 vision-language models. For complete transparency, we release all the raw prompts and predictions in addition to the numbers.
Supporting the evaluation of VLMs in HELM was just the first step. We plan to add more scenarios, models, and metrics to measure the different capabilities and risks of vision-language models. We also encourage the community to contribute new models, scenarios, and metrics to improve the evaluation of vision-language models.
We would like to thank OpenAI, Anthropic, and Google for API credits to evaluate their models and Google for funding.
If our work resonates with you, we kindly ask you to reference this blog post:
@misc{vhelm_initial,
author = {Tony Lee and Yifan Mai and Chi Heem Wong and Josselin Somerville Roberts and Michihiro Yasunaga and Faarzan Kaiyom and Rishi Bommasani and Percy Liang},
title = {The First Steps to Holistic Evaluation of Vision-Language Models},
month = {May},
year = {2024},
url = {https://crfm.stanford.edu/2024/05/08/vhelm-initial.html},
}