VIVA+:
Human-Centered Situational Decision-Making

Figure 1 from the paper, showcasing the three core abilities evaluated by VIVA+.
What is VIVA+?
VIVA+ is a benchmark for evaluating how Multimodal Large Language Models (MLLMs) perceive, reason about, and act in human-centered, real-world situations.
Key Features:
- Human-centered focus: as MLLMs become more integrated into daily life, their ability to navigate complex, human-centered environments is crucial.
- Grounded in theory: Built upon Naturalistic Decision-Making (NDM), VIVA+ goes beyond simple action prediction to assess perception, reasoning, and socially meaningful action.
- Benchmark scale: 1,317 real-world images paired with 6,373 multiple-choice questions (see the record sketch after this list).
- Cognitive coverage: 7 distinct tasks across 3 core cognitive abilities, enabling rigorous and holistic evaluation of MLLMs.
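To make the item format concrete, below is a minimal sketch of how a single benchmark record might be represented; the field names and schema are illustrative assumptions, not the released VIVA+ data format.

```python
from dataclasses import dataclass

@dataclass
class VivaPlusItem:
    """One image-grounded multiple-choice question.

    Hypothetical schema for illustration; field names are assumptions,
    not the released VIVA+ data format.
    """
    image: str          # path or URL of the situation image
    task: str           # one of the 7 tasks, e.g. "Visual Detail Recognition"
    question: str       # question stem
    options: list[str]  # four answer candidates
    answer: int         # index of the correct option

# Example in the spirit of Q1 below:
item = VivaPlusItem(
    image="images/driver_littering.jpg",  # hypothetical path
    task="Visual Detail Recognition",
    question="Based on the image, which statement about the woman's movement is incorrect?",
    options=[
        "The woman is using her right hand to throw rubbish.",
        "The woman is taking her left hand off the wheel.",
        "The woman faces the window, staring at the rubbish she throws.",
        "Her left arm is resting steadily on the door handle.",
    ],
    answer=1,  # matches the answer key in the Q1 example below
)
```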
Three Core Cognitive Abilities
1. Foundational Situation Comprehension
Assesses whether the model accurately perceives and interprets a situation: does it notice crucial visual details and context?
2. Context-Driven Action Justification
Assesses whether the model can justify and select appropriate actions for the perceived situation under constraints (social roles, physical limits).
3. Reflective Reasoning
Captures the higher-order, deliberative reasoning needed to navigate ambiguous or complex social situations.
Data Annotation & Construction Pipeline

Phase 1 · Question Annotation
- Brainstorm: craft situation frames & prompts.
- Human annotation: write MCQs, options, and answer keys.
Abilities covered: Foundational Situation Comprehension · Context-Driven Action Justification · Reflective Reasoning
Phase 2 · Verification & Quality Check
- Cross verification: independent review & adjudication (see the sketch after this list).
- Sample check: stratified / random audits.
- Refinement: bias reduction and final quality checks.
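As a rough sketch of what the cross-verification step could look like in code, the function below accepts an item only when independent reviewers agree on the answer key and flags disagreements for adjudication; the agreement threshold and function names are assumptions, not the paper's exact procedure.

```python
from collections import Counter

def cross_verify(reviewer_keys: list[int], min_agreement: float = 1.0) -> tuple[str, int]:
    """Accept an item when independent reviewers agree on the answer key.

    Illustrative sketch only; the paper's actual criteria may differ.
    Returns ("accept" | "adjudicate", majority_key).
    """
    key, votes = Counter(reviewer_keys).most_common(1)[0]
    if votes / len(reviewer_keys) >= min_agreement:
        return "accept", key
    # Disagreements are escalated for adjudication/refinement, not dropped.
    return "adjudicate", key

print(cross_verify([1, 1, 1]))  # ('accept', 1)
print(cross_verify([1, 2, 1]))  # ('adjudicate', 1)
```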
Dataset Statistics

At a glance:
- Q1–Q2: Perception (Foundational Situation Comprehension)
- Q3–Q4: Decision under constraints (Context-Driven Action Justification)
- Q5–Q7: Reflective Reasoning
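For reference, this task-to-ability grouping can be captured as a small lookup table; the dictionary below is illustrative, not a released artifact.

```python
# Task -> core cognitive ability, per the grouping above (illustrative).
TASK_TO_ABILITY = {
    "Q1": "Foundational Situation Comprehension",  # Visual Detail Recognition
    "Q2": "Foundational Situation Comprehension",  # Critical Information Identification
    "Q3": "Context-Driven Action Justification",   # Social Role-Based Action Selection
    "Q4": "Context-Driven Action Justification",   # Environment-Constrained Action Selection
    "Q5": "Reflective Reasoning",                  # Behavioral Role Inference
    "Q6": "Reflective Reasoning",                  # Situational Misinterpretation Analysis
    "Q7": "Reflective Reasoning",                  # Counterfactual and Norm Deviant Reasoning
}
```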
Benchmark Examples

Q1: Visual Detail Recognition
Q: Based on the image, which statement about the woman's movement is incorrect?
- The woman is using her right hand to throw rubbish.
- The woman is taking her left hand off the wheel. (Correct)
- The woman faces the window, staring at the rubbish she throws.
- Her left arm is resting steadily on the door handle.
Q2: Critical Information Identification
Q: What key information is most critical to judge whether intervention is needed?
- Check if the green bottle contains alcohol. (Correct)
- Determine if the car is moving or parked.
- See if there are passengers who might be affected.
- Identify the brand and model of the car.


Q3: Social Role-Based Action Selection
Q: Which action is appropriate only for the paramedic, not the passerby?
- Call emergency services and secure a safety perimeter.
- Warn oncoming traffic to slow down.
- Try to stabilize the injured leg with first aid. (Correct)
- Provide basic reassurance until help arrives.
Q4: Environment-Constrained Action Selection
Constraints: help is ~15 minutes away; the fire is spreading; another vehicle is ~5 m away.
Tools: fire extinguisher (≈50 m away), rope, smartphone.
Q: What is the most suitable action?
- Call emergency, set a ~20m safety perimeter, monitor from distance. (Correct)
- Push the nearby vehicle and retrieve the extinguisher, then wait beside the scene.
- Approach the fire to record a close-up video for evidence.
- Attempt to tow the burning car away using the rope.


Q5: Behavioral Role Inference
Q: A person in a reflective vest fines the driver for phone use. Who is this person most likely to be?
- Construction worker.
- Parking attendant.
- Security guard.
- Traffic officer. (Correct)
Q6: Situational Misinterpretation Analysis
Q: A bystander panics, thinking a child is drowning, but the child was play-acting. What most likely caused the misinterpretation?
- There were many people around, causing confusion.
- The water looked dark from a distance.
- Raised arm & low posture looked like distress. (Correct)
- The pool music was loud and distracting.


Q7: Counterfactual and Norm Deviant Reasoning
Q: Why might an adult ignore a situation where a child seems to need help?
- The adult assumes it's fine since no one else is reacting.
- The adult thinks the child is just playing a game.
- The adult feels unqualified to intervene safely.
- Parents are nearby and observing. (Correct)
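Since every task is four-way multiple choice, evaluating a model reduces to formatting each item as a lettered prompt and parsing the chosen letter from the response. A minimal sketch, assuming a generic `model.generate(image, prompt)` wrapper (hypothetical; not any specific MLLM API):

```python
import re

LETTERS = ["A", "B", "C", "D"]

def build_prompt(question: str, options: list[str]) -> str:
    """Format an item as a lettered multiple-choice prompt."""
    lines = [question]
    lines += [f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)]
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)

def parse_choice(response: str) -> int | None:
    """Return the index of the first standalone option letter, if any."""
    match = re.search(r"\b([ABCD])\b", response)
    return LETTERS.index(match.group(1)) if match else None

# Hypothetical usage:
# response = model.generate(image=item.image, prompt=build_prompt(q, opts))
# predicted = parse_choice(response)
```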
Key Results & Insights
Model Performance
| Model | Situation Comp. | Action Justif. | Reflective Reas. | Overall Avg. |
|---|---|---|---|---|
| GPT-4.1 (Commercial) | 79.58% | 87.79% | 85.89% | 84.63% |
| Qwen2.5-VL-72B (Open) | 79.32% | 84.37% | 83.59% | 82.59% |
| Gemini-2.0-flash (Commercial) | 76.74% | 80.86% | 82.00% | 80.17% |
| InternVL3-38B (Open) | 76.37% | 75.14% | 78.90% | 77.10% |
| Llama3.2-Vision-11B (Open) | 55.36% | 59.27% | 63.74% | 60.07% |
| LLaVA-1.6-13B (Open) | 58.41% | 50.11% | 61.28% | 57.27% |
| Qwen2.5-VL-7B (Open) | 67.84% | 50.31% | 65.05% | 61.63% |
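Overall Avg. appears to be a question-weighted (micro) average rather than the simple mean of the three ability columns: GPT-4.1's columns average to 84.42%, not the reported 84.63%, which fits weighting by question counts since Reflective Reasoning spans three of the seven tasks. A sketch of that aggregation, with the per-question result format as an assumption:

```python
def benchmark_scores(results: list[tuple[str, bool]]) -> tuple[dict[str, float], float]:
    """Per-ability accuracy and a question-weighted overall average.

    `results` holds one (ability, is_correct) pair per question; this
    weighting is an assumption consistent with the table above.
    """
    by_ability: dict[str, list[bool]] = {}
    for ability, correct in results:
        by_ability.setdefault(ability, []).append(correct)
    per_ability = {a: 100 * sum(v) / len(v) for a, v in by_ability.items()}
    overall = 100 * sum(ok for _, ok in results) / len(results)
    return per_ability, overall
```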
Error Analysis

Where models break: perception → decision → reflection
1. Perception (Q1–Q2): Detail Misinterpretation · Spatial Misinterpretation · Critical Information Oversight
2. Constraint & Decision (Q3–Q4): "Safe Choice" Bias · Constraint Neglect
3. Reflective Reasoning (Q5–Q7): Role Inference / Authority Bias · Superficial & Context-Independent Reasoning
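One way to reproduce this kind of breakdown is to tag each wrong prediction with a failure category and tally the shares; the category names below mirror the list above, while the tagging step itself would be manual or model-assisted (an assumption, not the paper's stated protocol).

```python
from collections import Counter

# Failure categories from the breakdown above.
CATEGORIES = [
    "Detail Misinterpretation", "Spatial Misinterpretation",
    "Critical Information Oversight", '"Safe Choice" Bias',
    "Constraint Neglect", "Role Inference / Authority Bias",
    "Superficial & Context-Independent Reasoning",
]

def error_profile(tagged_errors: list[str]) -> dict[str, float]:
    """Percentage share of each failure category among tagged errors."""
    counts = Counter(tagged_errors)
    total = sum(counts.values()) or 1  # avoid division by zero
    return {c: 100 * counts[c] / total for c in CATEGORIES}
```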
Team & Citation
Zhe Hu¹, Yixiao Ren¹, Guanzhong Liu¹, Jing Li¹,², Yu Yin³
¹Department of Computing, The Hong Kong Polytechnic University
²Research Centre for Data Science & Artificial Intelligence
³Department of Computer and Data Sciences, Case Western Reserve University
BibTeX Citation
@inproceedings{hu2025vivaplus,
  title     = {{VIVA+}: Human-Centered Situational Decision-Making},
  author    = {Hu, Zhe and Ren, Yixiao and Liu, Guanzhong and Li, Jing and Yin, Yu},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  year      = {2025}
}