VIVA+: Human-Centered Situational Decision-Making


Figure 1 from the paper, showcasing the three core abilities evaluated by VIVA+.

What is VIVA+?

We introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning.

Key Features:
  • Human-centered focus: As Multimodal Large Language Models (MLLMs) become more integrated into our lives, their ability to navigate complex, human-centered environments is crucial.
  • Grounded in theory: Built upon Naturalistic Decision-Making (NDM), VIVA+ goes beyond simple action prediction to assess perception, reasoning, and socially meaningful action.
  • Benchmark scale: 1,317 real-world images paired with 6,373 multiple-choice questions.
  • Cognitive coverage: 7 distinct tasks across 3 core cognitive abilities, enabling rigorous and holistic evaluation of MLLMs.

Three Core Cognitive Abilities

1. Foundational Situation Comprehension

Assesses whether the model accurately perceives and interprets a situation: does it notice crucial visual details and context?

2. Context-Driven Action Justification

Justifies and selects appropriate actions for handling the perceived situation under constraints such as social roles and physical limits.

3. Reflective Reasoning


Captures the higher-order, deliberative reasoning necessary for navigating ambiguous or complex social situations.

Data Annotation & Construction Pipeline

Figure 2. Data construction pipeline: Phase 1 (Question Annotation) and Phase 2 (Verification & Quality Check).

Phase 1 · Question Annotation

  • Brainstorm: craft situation frames & prompts.
  • Human annotation: write MCQs, options, and answer keys.

Abilities covered: Foundational Situation Comprehension · Context-Driven Action Justification · Reflective Reasoning

Output: Annotated Questions

Phase 2 · Verification & Quality Check

  • Cross verification: independent review & adjudication.
  • Sample check: stratified / random audits.
  • Refinement
  • Bias reduction
  • Quality check

Output: Final Data

Dataset Statistics

Figure: Dataset overview, showing the total number and average question length (in tokens) for each question type (Q1–Q7).

At a glance

  • Images: 1,317
  • Questions: 6,373
  • Question Types: 7 (Q1–Q7)
  • Core Abilities: 3



  • Q1–Q2: Perception
  • Q3–Q4: Decision under constraints
  • Q5–Q7: Reflective reasoning
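The Q1–Q7 to core-ability mapping above can be made concrete in code. A minimal Python sketch, assuming hypothetical record dicts with a `question_type` field (this is an illustrative schema, not the released data format):

```python
from collections import Counter

# Mapping from question type to core ability, as described above.
ABILITY = {
    "Q1": "Foundational Situation Comprehension",
    "Q2": "Foundational Situation Comprehension",
    "Q3": "Context-Driven Action Justification",
    "Q4": "Context-Driven Action Justification",
    "Q5": "Reflective Reasoning",
    "Q6": "Reflective Reasoning",
    "Q7": "Reflective Reasoning",
}

def ability_counts(records):
    """Count questions per core ability for a list of record dicts."""
    return Counter(ABILITY[r["question_type"]] for r in records)

# Toy usage with made-up records:
toy = [{"question_type": "Q1"}, {"question_type": "Q3"}, {"question_type": "Q7"}]
counts = ability_counts(toy)
```

The same mapping could then be used to aggregate per-ability scores from per-question results.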

Benchmark Examples

Woman driving a car and throwing something out of the window

Q1: Visual Detail Recognition

Q: Based on the image, which statement about the woman's movement is incorrect?

  • The woman is using her right hand to throw rubbish.
  • The woman is taking her left hand off the wheel. (Correct)
  • The woman faces the window, staring at the rubbish she throws.
  • Her left arm is resting steadily on the door handle.

Q2: Critical Information Identification

Q: What key information is most critical to judge whether intervention is needed?

  • Check if the green bottle contains alcohol. (Correct)
  • Determine if the car is moving or parked.
  • See if there are passengers who might be affected.
  • Identify the brand and model of the car.
Driver holding a green bottle
An injured man on the road; paramedic vs passerby roles

Q3: Social Role-Based Action Selection

Q: Which action is appropriate only for the paramedic, not the passerby?

  • Call emergency services and secure a safety perimeter.
  • Warn oncoming traffic to slow down.
  • Try to stabilize the injured leg with first aid. (Correct)
  • Provide basic reassurance until help arrives.

Q4: Environment-Constrained Action Selection

Constraints: help is ~15 minutes away; fire is spreading; another vehicle ~5m.
Tools: extinguisher (≈50m), rope, smartphone.
Q: What is the most suitable action?

  • Call emergency, set a ~20m safety perimeter, monitor from distance. (Correct)
  • Push the nearby vehicle and retrieve the extinguisher, then wait beside the scene.
  • Approach the fire to record a close-up video for evidence.
  • Attempt to tow the burning car away using the rope.
A person near a burning car
A reflective-vest officer stopping a driver

Q5: Behavioral Role Inference

Q: The person in a reflective vest fines the driver for phone use. Who is this most likely?

  • Construction worker.
  • Parking attendant.
  • Security guard.
  • Traffic officer. (Correct)

Q6: Situational Misinterpretation Analysis

Q: A bystander panics, thinking a child is drowning, but the child was only play-acting. Why was the situation most likely misread?

  • There were many people around, causing confusion.
  • The water looked dark from a distance.
  • Raised arm & low posture looked like distress. (Correct)
  • The pool music was loud and distracting.
Child in pool, playful pose misread as distress
A child in a crowd with nearby parents observing

Q7: Counterfactual and Norm-Deviant Reasoning

Q: Why might an adult ignore a situation where a child seems to need help?

  • The adult assumes it's fine since no one else is reacting.
  • The adult thinks the child is just playing a game.
  • The adult feels unqualified to intervene safely.
  • Parents are nearby and observing. (Correct)

Key Results & Insights

Model Performance

Model                          | Situation Comp. | Action Justif. | Reflective Reas. | Overall Avg.
GPT-4.1 (Commercial)           | 79.58%          | 87.79%         | 85.89%           | 84.63%
Qwen2.5-VL-72B (Open)          | 79.32%          | 84.37%         | 83.59%           | 82.59%
Gemini-2.0-flash (Commercial)  | 76.74%          | 80.86%         | 82.00%           | 80.17%
InternVL3-38B (Open)           | 76.37%          | 75.14%         | 78.90%           | 77.10%
Llama3.2-Vision-11B (Open)     | 55.36%          | 59.27%         | 63.74%           | 60.07%
LLaVA-1.6-13B (Open)           | 58.41%          | 50.11%         | 61.28%           | 57.27%
Qwen2.5-VL-7B (Open)           | 67.84%          | 50.31%         | 65.05%           | 61.63%
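Per-ability numbers like those above are grouped multiple-choice accuracy. A minimal scoring sketch, assuming predictions and gold answers are dicts from question ID to option letter (the IDs, field format, and function name here are assumptions, not the benchmark's official harness):

```python
def score_by_ability(predictions, gold, abilities):
    """Return accuracy (%) per ability.

    predictions, gold: dicts mapping question ID -> option letter.
    abilities: dict mapping question ID -> ability name.
    A missing prediction counts as wrong.
    """
    correct, total = {}, {}
    for qid, answer in gold.items():
        ability = abilities[qid]
        total[ability] = total.get(ability, 0) + 1
        if predictions.get(qid, "").strip().upper() == answer:
            correct[ability] = correct.get(ability, 0) + 1
    return {a: 100.0 * correct.get(a, 0) / total[a] for a in total}

# Toy usage: one correct and one wrong answer within the same ability.
gold = {"q1": "B", "q2": "A"}
preds = {"q1": "b", "q2": "C"}
abil = {"q1": "Perception", "q2": "Perception"}
acc = score_by_ability(preds, gold, abil)  # {'Perception': 50.0}
```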

Error Analysis

Figure 5: Common model errors by question type (Q1–Q7)
Figure 5. Common model errors by question type.

Where models break — perception → decision → reflection

  1. Perception (Q1–Q2): Detail Misinterpretation · Spatial Misinterpretation · Critical Information Oversight
  2. Constraint & Decision (Q3–Q4): “Safe Choice” Bias · Constraint Neglect
  3. Reflective Reasoning (Q5–Q7): Role Inference / Authority Bias · Superficial & Context-Independent Reasoning
Methods that help

Supervised fine-tuning (SFT) can boost model performance. Incorporating Chain-of-Thought prompting with forward thinking, where the model predicts the consequences of candidate actions before choosing, further improves action justification.
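As an illustration, forward-thinking prompting can be implemented as a simple template that asks the model to predict each option's consequence before committing to an answer. The wording below is a hypothetical sketch, not the paper's actual prompt:

```python
# Hypothetical forward-thinking Chain-of-Thought template (an assumption,
# not the prompt used in the paper).
FORWARD_COT = (
    "Situation: {situation}\n"
    "Question: {question}\n"
    "Options:\n{options}\n\n"
    "First, for each option, briefly predict its likely consequence in this "
    "situation. Then choose the option whose consequence best resolves the "
    "situation, and answer with its letter."
)

prompt = FORWARD_COT.format(
    situation="A car is on fire; another vehicle is parked about 5m away.",
    question="What is the most suitable action?",
    options=(
        "A. Call emergency services and keep a ~20m safety perimeter.\n"
        "B. Approach the fire to record a close-up video."
    ),
)
```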

Team & Citation

Zhe Hu¹, Yixiao Ren¹, Guanzhong Liu¹, Jing Li¹,², Yu Yin³

¹Department of Computing, The Hong Kong Polytechnic University

²Research Centre for Data Science & Artificial Intelligence

³Department of Computer and Data Sciences, Case Western Reserve University

BibTeX Citation

@inproceedings{hu2025vivaplus,
  title={{VIVA+}: Human-Centered Situational Decision-Making},
  author={Hu, Zhe and Ren, Yixiao and Liu, Guanzhong and Li, Jing and Yin, Yu},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
  year={2025}
}