VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values

1The Hong Kong Polytechnic University
2Case Western Reserve University
EMNLP 2024 (main)

We introduce VIVA, a benchmark for vision-grounded decision-making driven by human values. It is the first benchmark to examine the multimodal capabilities of large vision-language models (VLMs) in leveraging human values to make decisions in a vision-depicted situation.

Abstract

Large vision language models (VLMs) have demonstrated significant potential for integration into daily life, making it crucial for them to incorporate human values when making decisions in real-world situations. This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large VLMs focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,240 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image there, the model should select the most appropriate action to address the situation and provide the relevant human values and reason underlying the decision. Extensive experiments based on VIVA show the limitation of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.

VIVA Benchmark Overview





VIVA is a pioneering benchmark aimed at evaluating the vision-grounded decision-making capabilities of VLMs with human values for real-world scenarios. Concretely, VIVA contains 1,240 images covering a broad spectrum of real-life situations pertinent to human values, e.g., providing assistance, handling emergencies, addressing social challenges, and safeguarding vulnerable populations. Each image is meticulously annotated with potential courses of action, pertinent human values influencing decision-making, and accompanying reasons.
Based on our annotations, we construct tasks at two levels of human-centered decision-making:
(1) Level-1 Task: Action Selection

  • Given an image depicting a situation, the model must select the most appropriate action from a set of candidates that includes distractors.
(2) Level-2 Tasks: Value Inference & Reason Generation
  • The model is required to ground its Level-1 decision in the relevant human values and to provide appropriate reasoning that justifies the selection (a minimal data-format sketch follows below).
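
To make the task setup concrete, below is a minimal sketch of what a VIVA-style sample and the Level-1 evaluation loop might look like. The field names and the query_vlm helper are illustrative assumptions, not the released data schema or evaluation code.

    # Hypothetical sketch of a VIVA-style sample and Level-1 scoring.
    # Field names and query_vlm() are assumptions; see the released benchmark for the actual schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VivaSample:
        image_path: str                                   # image depicting the situation
        actions: List[str]                                # candidate actions (one correct, rest distractors)
        answer_idx: int                                   # index of the most appropriate action (Level-1 label)
        values: List[str] = field(default_factory=list)   # annotated human values behind the decision (Level-2)
        reason: str = ""                                  # annotated justification (Level-2)

    def level1_accuracy(samples, query_vlm):
        """query_vlm(image_path, prompt) -> predicted action index (model-specific wrapper)."""
        correct = 0
        for s in samples:
            prompt = ("Select the most appropriate action for the situation:\n"
                      + "\n".join(f"({i}) {a}" for i, a in enumerate(s.actions)))
            if query_vlm(s.image_path, prompt) == s.answer_idx:
                correct += 1
        return correct / len(samples)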


Main Results



Two main observations from the results:
(1) Commercial models typically yield better results than open-source models.
(2) Nevertheless, all VLMs still face challenges on this task.

Error Analysis and Future Directions for Improvement

  • Predicting Consequences in Advance Can Improve Model Decision Making
    We incorporated consequences predicted by different models into the Level-1 action selection, and the results show that including the proper consequence of each action can significantly improve model performance.

    However, using consequences predicted by smaller open-source models does not yield performance gains and sometimes even leads to a decrease. This indicates that smaller models often lack the ability to accurately predict the consequences of each action, which limits effective decision-making. A rough sketch of this augmentation is shown below.
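
    The sketch below illustrates one way such consequence augmentation could be wired up: a model first predicts a likely consequence for each candidate action, and the consequences are then appended to the Level-1 prompt. The prompts and the query_vlm wrapper are assumptions for illustration, not the exact setup used in the paper.

        # Hypothetical two-step pipeline: predict a consequence per action, then
        # fold the consequences into the Level-1 action-selection prompt.
        def predict_consequences(query_vlm, image_path, actions):
            return [
                query_vlm(image_path, f"Briefly describe the likely consequence of this action: {a}")
                for a in actions
            ]

        def consequence_augmented_prompt(actions, consequences):
            lines = ["Select the most appropriate action, considering the likely consequence of each:"]
            for i, (a, c) in enumerate(zip(actions, consequences)):
                lines.append(f"({i}) {a} -- likely consequence: {c}")
            return "\n".join(lines)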


  • Incorporation of Relevant Human Values Enhances Model Decision Making
    Intuitively, humans often make decisions based on their beliefs and values when choosing a course of action. We incorporate human values into the Level-1 action selection, and the results show that augmenting with relevant values significantly enhances the performance of all models compared to the results without values.

    While incorporating both oracle and GPT-4-predicted values improves model performance, augmenting with values generated by smaller models does not lead to performance gains. This implies that current open-source VLMs still struggle to associate situations with the relevant human values. A rough sketch of value-augmented selection follows below.
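
    The sketch below shows how value augmentation could look, reusing the hypothetical VivaSample fields and query_vlm wrapper from the earlier sketch; the prompts and the value_predictor interface are assumptions, not the paper's exact implementation.

        # Hypothetical value-augmented selection: prepend relevant human values
        # (predicted, or oracle values from the annotation) to the Level-1 prompt.
        def value_augmented_prompt(actions, values):
            header = "Relevant human values: " + "; ".join(values)
            options = "\n".join(f"({i}) {a}" for i, a in enumerate(actions))
            return header + "\nConsidering these values, select the most appropriate action:\n" + options

        def select_with_values(query_vlm, sample, value_predictor=None):
            # Use predicted values when a predictor is given; otherwise fall back to oracle annotations.
            values = value_predictor(sample.image_path) if value_predictor else sample.values
            return query_vlm(sample.image_path, value_augmented_prompt(sample.actions, values))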


  • Error Analysis
    We analyze errors in the Level-1 action selection by examining the underlying reasons for incorrect predictions and presenting common types of action-selection errors.


Ethics Statement

  • Copyright and License
    All images in the VIVA benchmark are sourced from publicly available content on social media platforms. We ensure compliance with copyright regulations by using the original link to each image without infringement. Additionally, we commit to openly sharing our annotated benchmark, providing the corresponding link to each image. Throughout the image collection process, we carefully review samples and filter out any potentially offensive or harmful content.

  • Data Annotations with GPT
    Our data annotation involves leveraging GPT to produce initial versions of each component, which are then verified and revised by human annotators. Despite our best efforts to ensure the quality of the annotations, we acknowledge that utilizing large language models may introduce potential bias, and the generated results may favor certain majority groups. Furthermore, our annotation and task design prioritize collective norms and values. For instance, when presented with a scenario involving a visually impaired individual struggling to cross the road, our action selection favors providing assistance rather than ignoring the situation and taking no action. To mitigate bias, our annotation process includes rigorous quality checks, with each sample annotated and reviewed by different human annotators to reduce ambiguity.

  • Data Annotation and Potential Bias
    Six annotators were engaged in our annotation process. All annotators are proficient English speakers based in English-speaking regions. Before the annotation, we conducted thorough training and task briefings for our annotators, as well as a trial annotation, to ensure they had a clear understanding of the research background and the use of the data. We compensated the annotators with an average hourly wage of $10, ensuring fair remuneration for their contributions. The data collection process was conducted under the guidance of our organization's ethics review system to ensure the positive societal impact of the project.

Citation

If you find our work helpful, please consider citing us:


          @inproceedings{hu-etal-2024-viva,
            title = "{VIVA}: A Benchmark for Vision-Grounded Decision-Making with Human Values",
            author = "Hu, Zhe and Ren, Yixiao and Li, Jing and Yin, Yu",
            editor = "Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung",
            booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
            month = nov,
            year = "2024",
            address = "Miami, Florida, USA",
            publisher = "Association for Computational Linguistics",
            url = "https://aclanthology.org/2024.emnlp-main.137",
            pages = "2294--2311",
            abstract = "This paper introduces VIVA, a benchmark for VIsion-grounded decision-making driven by human VAlues. While most large vision-language models (VLMs) focus on physical-level skills, our work is the first to examine their multimodal capabilities in leveraging human values to make decisions under a vision-depicted situation. VIVA contains 1,062 images depicting diverse real-world situations and the manually annotated decisions grounded in them. Given an image there, the model should select the most appropriate action to address the situation and provide the relevant human values and reason underlying the decision. Extensive experiments based on VIVA show the limitation of VLMs in using human values to make multimodal decisions. Further analyses indicate the potential benefits of exploiting action consequences and predicted human values.",
          }