VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

We introduce VIVA, a benchmark for vision-grounded decision-making driven by human values. It is the first to examine the multimodal capabilities of large vision-language models (VLMs) in leveraging human values to make decisions in a vision-depicted situation.

Large vision-language models (VLMs) have demonstrated significant potential for integration into daily life, making it crucial for them to incorporate human values when making decisions in real-world situations. This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large VLMs focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions in a vision-depicted situation. VIVA contains 1,240 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image, the model should select the most appropriate action to address the situation and provide the relevant human values and the reason underlying the decision. Extensive experiments on VIVA reveal the limitations of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.

VIVA is a pioneering benchmark aimed at evaluating the vision-grounded decision-making capabilities of VLMs with human values for real-world scenarios. Concretely, VIVA contains 1,240 images covering a broad spectrum of real-life situations pertinent to human values, e.g., providing assistance, handling emergencies, addressing social challenges, and safeguarding vulnerable populations. Each image is meticulously annotated with potential courses of action, pertinent human values influencing decision-making, and accompanying reasons.
Based on our annotations, we construct tasks at two levels of human-centered decision-making:
(1) Level-1 Task: Action Selection

Given an image depicting a situation, the model must select the most suitable action from a set of candidates that includes distractors.

(2) Level-2 Tasks: Value Inference & Reason Generation

The model is required to base its Level-1 decision on accurate human values and to provide appropriate reasoning that justifies the selection.
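The two task levels can be illustrated with a small data structure and a Level-1 scoring function. This is only a sketch under stated assumptions: the record layout, the field names (`image_path`, `actions`, `answer_index`, `values`, `reason`), and the helper `action_selection_accuracy` are hypothetical and do not reflect the released VIVA format or the authors' evaluation code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VivaSample:
    """Hypothetical record for one annotated image (field names are assumptions)."""
    image_path: str
    actions: List[str]          # candidate actions, including distractors
    answer_index: int           # index of the annotated most suitable action
    values: List[str] = field(default_factory=list)  # Level-2: relevant human values
    reason: str = ""            # Level-2: annotated reason for the action


def action_selection_accuracy(samples: List[VivaSample],
                              predictions: List[int]) -> float:
    """Fraction of samples where the predicted action index matches the annotation."""
    if len(samples) != len(predictions):
        raise ValueError("need exactly one prediction per sample")
    if not samples:
        return 0.0
    correct = sum(1 for s, p in zip(samples, predictions) if s.answer_index == p)
    return correct / len(samples)
```

A model's Level-1 output would be mapped to an option index before scoring; how Level-2 value inference and reason generation are scored is not specified here, so it is left out of the sketch.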

(1) Commercial models typically yield better results than open-source models;
(2) Yet all VLMs still face challenges on this task.

Predicting Consequences in Advance Can Improve Model Decision Making
We incorporate consequences predicted by different models into the Level-1 action selection. The results show that including an accurate consequence for each action can significantly improve the model's performance.

However, using the consequences predicted by smaller open-source models does not yield performance gains and sometimes even decreases performance. This indicates that smaller models often lack the ability to accurately predict the consequences of each action, which limits effective decision-making.
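One way to picture the consequence-augmented setting is a prompt builder that attaches a predicted consequence to each candidate action before the model chooses. The sketch below is an assumption for illustration only: the function name `format_action_prompt` and the prompt wording are hypothetical and are not the prompts used in the experiments.

```python
from string import ascii_uppercase
from typing import List, Optional


def format_action_prompt(actions: List[str],
                         consequences: Optional[List[str]] = None) -> str:
    """Build a multiple-choice text prompt, optionally with one consequence per action."""
    if consequences is not None and len(consequences) != len(actions):
        raise ValueError("need exactly one consequence per action")
    # Hypothetical instruction text; the image itself is passed to the VLM separately.
    lines = ["Select the most appropriate action for the situation in the image."]
    for i, action in enumerate(actions):
        lines.append(f"{ascii_uppercase[i]}. {action}")
        if consequences is not None:
            lines.append(f"   Predicted consequence: {consequences[i]}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```

Calling the function without `consequences` gives the plain Level-1 prompt, so the same builder can serve both the baseline and the augmented condition. Since the consequences come from a separate prediction step, their quality bounds how much the augmentation can help, in line with the observation above about smaller models.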
Incorporation of Relevant Human Values Enhances Model Decision Making
Intuitively, humans often make decisions based on their beliefs and values when choosing a course of action. We incorporate human values into the Level-1 action selection, and the results show that augmenting with relevant values significantly enhances the performance of all models compared to the results without values.

While incorporating either oracle or GPT4-predicted values improves the models' performance, augmenting with values generated by smaller models does not lead to performance gains. This implies that current open-source VLMs still face challenges in associating situations with relevant human values.
Error Analysis
We analyze errors of Level-1 action selection by examining the underlying reasons for incorrect predictions and presenting common types of action selection errors.

Copyright and License
All images in the VIVA benchmark are sourced from publicly available content on social media platforms. We ensure compliance with copyright regulations by using the original link to each image rather than redistributing it. Additionally, we commit to openly sharing our annotated benchmark and providing the corresponding link to each image. Throughout the image collection process, we meticulously review samples and filter out any potentially offensive or harmful content.
Data Annotations with GPT
Our data annotation involves leveraging GPT to produce initial versions of each component, which are then verified and revised by human annotators. Despite our best efforts to ensure the quality of the annotations, we acknowledge that using large language models may introduce bias. The generated results may tend to favor certain majority groups. Furthermore, our annotation and task design prioritize collective norms and values. For instance, when presented with a scenario involving a visually impaired individual struggling to cross the road, our action selection favors providing assistance rather than ignoring the situation and taking no action. To mitigate bias, our annotation process includes rigorous quality checks, with each sample annotated and reviewed by different human annotators to reduce ambiguity.
Data Annotation and Potential Bias
Six annotators are engaged in our annotation process. All annotators are proficient English speakers and are based in English-speaking areas. Before the annotation, we conducted thorough training and a task briefing for our annotators, as well as a trial annotation, to ensure they had a clear understanding of the research background and the use of the data. We compensate these annotators with an average hourly wage of $10, ensuring fair remuneration for their contributions. The data collection process is conducted under the guidance of the organization's ethics review system to ensure the positive societal impact of the project.

Citation

If you find our work helpful, please consider citing us:


          @inproceedings{hu-etal-2024-viva,
            title = "{VIVA}: A Benchmark for Vision-Grounded Decision-Making with Human Values",
            author = "Hu, Zhe  and
              Ren, Yixiao  and
              Li, Jing  and
              Yin, Yu",
            editor = "Al-Onaizan, Yaser  and
              Bansal, Mohit  and
              Chen, Yun-Nung",
            booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
            month = nov,
            year = "2024",
            address = "Miami, Florida, USA",
            publisher = "Association for Computational Linguistics",
            url = "https://aclanthology.org/2024.emnlp-main.137",
            pages = "2294--2311",
            abstract = "This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large vision-language models (VLMs) focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,062 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image there, the model should select the most appropriate action to address the situation and provide the relevant human values and reason underlying the decision. Extensive experiments based on VIVA show the limitation of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.",
        }
