VIVA+: Human-Centered Situational Decision-Making


Figure 1 from the paper, showcasing the three core abilities evaluated by VIVA+.

What is VIVA+?

We introduce VIVA+, a cognitively grounded benchmark for evaluating the reasoning and decision-making of MLLMs in human-centered situations. VIVA+ consists of 1,317 real-world situations paired with 6,373 multiple-choice questions, targeting three core abilities for decision-making: (1) Foundational Situation Comprehension, (2) Context-Driven Action Justification, and (3) Reflective Reasoning.

Key Features:
  • Human-centered focus: As Multimodal Large Language Models (MLLMs) become more integrated into our lives, their ability to navigate complex, human-centered environments is crucial.
  • Grounded in theory: Built upon Naturalistic Decision-Making (NDM), VIVA+ goes beyond simple action prediction to assess perception, reasoning, and socially meaningful action.
  • Benchmark scale: 1,317 real-world images paired with 6,373 multiple-choice questions.
  • Cognitive coverage: 7 distinct tasks across 3 core cognitive abilities, enabling rigorous and holistic evaluation of MLLMs.

Three Core Cognitive Abilities

1. Foundational Situation Comprehension

Assesses whether the model accurately perceives and interprets a situation: does it notice crucial visual details and context?

2. Context-Driven Action Justification

Justifies and selects appropriate actions for handling the perceived situation under constraints such as social roles and physical limits.

3. Reflective Reasoning


Captures the higher-order, deliberative reasoning necessary for navigating ambiguous or complex social situations.

Data Annotation & Construction Pipeline

Figure 2. Data construction pipeline: Phase 1 (Question Annotation) and Phase 2 (Verification & Quality Check).

Phase 1 · Question Annotation

  • Brainstorm: craft situation frames & prompts.
  • Human annotation: write MCQs, options, and answer keys.

Abilities covered: Foundational Situation Comprehension · Context-Driven Action Justification · Reflective Reasoning

Output: Annotated Questions

Phase 2 · Verification & Quality Check

  • Cross verification: independent review & adjudication.
  • Sample check: stratified / random audits.
  • Refinement
  • Bias reduction
  • Quality check

Output: Final Data

Dataset Statistics

Figure: Dataset overview, showing the total number and average question length (in tokens) for each question type (Q1–Q7).

At a glance

  • Images: 1,317
  • Questions: 6,373
  • Question Types: 7 (Q1–Q7)
  • Core Abilities: 3



  • Q1–Q2: Perception
  • Q3–Q4: Decision under constraints
  • Q5–Q7: Reflective reasoning
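The Q1–Q7 to core-ability mapping above can be made concrete in code. A minimal Python sketch, assuming hypothetical record dicts with a `question_type` field (this is an illustrative schema, not the released data format):

```python
from collections import Counter

# Mapping from question type to core ability, as described above.
ABILITY = {
    "Q1": "Foundational Situation Comprehension",
    "Q2": "Foundational Situation Comprehension",
    "Q3": "Context-Driven Action Justification",
    "Q4": "Context-Driven Action Justification",
    "Q5": "Reflective Reasoning",
    "Q6": "Reflective Reasoning",
    "Q7": "Reflective Reasoning",
}

def ability_counts(records):
    """Count questions per core ability for a list of record dicts."""
    return Counter(ABILITY[r["question_type"]] for r in records)

# Toy usage with made-up records:
toy = [{"question_type": "Q1"}, {"question_type": "Q3"}, {"question_type": "Q7"}]
counts = ability_counts(toy)
```

The same mapping could then be used to aggregate per-ability scores from per-question results.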

Benchmark Examples

Woman driving a car and throwing something out of the window

Q1: Visual Detail Recognition

Q: Based on the image, which statement about the woman's movement is incorrect?

  • The woman is using her right hand to throw rubbish.
  • The woman is taking her left hand off the wheel. (Correct)
  • The woman faces the window, staring at the rubbish she throws.
  • Her left arm is resting steadily on the door handle.

Q2: Critical Information Identification

Q: What key information is most critical to judge whether intervention is needed?

  • Check if the green bottle contains alcohol. (Correct)
  • Determine if the car is moving or parked.
  • See if there are passengers who might be affected.
  • Identify the brand and model of the car.
Driver holding a green bottle
An injured man on the road; paramedic vs passerby roles

Q3: Social Role-Based Action Selection

Q: Which action is appropriate only for the paramedic, not the passerby?

  • Call emergency services and secure a safety perimeter.
  • Warn oncoming traffic to slow down.
  • Try to stabilize the injured leg with first aid. (Correct)
  • Provide basic reassurance until help arrives.

Q4: Environment-Constrained Action Selection

Constraints: help is ~15 minutes away; fire is spreading; another vehicle ~5m.
Tools: extinguisher (≈50m), rope, smartphone.
Q: What is the most suitable action?

  • Call emergency, set a ~20m safety perimeter, monitor from distance. (Correct)
  • Push the nearby vehicle and retrieve the extinguisher, then wait beside the scene.
  • Approach the fire to record a close-up video for evidence.
  • Attempt to tow the burning car away using the rope.
A person near a burning car
A reflective-vest officer stopping a driver

Q5: Behavioral Role Inference

Q: The person in a reflective vest fines the driver for phone use. Who is this most likely?

  • Construction worker.
  • Parking attendant.
  • Security guard.
  • Traffic officer. (Correct)

Q6: Situational Misinterpretation Analysis

Q: A bystander panics, thinking a child is drowning, but the child was only play-acting. Why was the situation most likely misread?

  • There were many people around, causing confusion.
  • The water looked dark from a distance.
  • Raised arm & low posture looked like distress. (Correct)
  • The pool music was loud and distracting.
Child in pool, playful pose misread as distress
A child in a crowd with nearby parents observing

Q7: Counterfactual and Norm-Deviant Reasoning

Q: Why might an adult ignore a situation where a child seems to need help?

  • The adult assumes it's fine since no one else is reacting.
  • The adult thinks the child is just playing a game.
  • The adult feels unqualified to intervene safely.
  • Parents are nearby and observing. (Correct)

Key Results & Insights

Model Performance

Model                          | Situation Comp. | Action Justif. | Reflective Reas. | Overall Avg.
GPT-4.1 (Commercial)           | 79.58%          | 87.79%         | 85.89%           | 84.63%
Qwen2.5-VL-72B (Open)          | 79.32%          | 84.37%         | 83.59%           | 82.59%
Gemini-2.0-flash (Commercial)  | 76.74%          | 80.86%         | 82.00%           | 80.17%
InternVL3-38B (Open)           | 76.37%          | 75.14%         | 78.90%           | 77.10%
Llama3.2-Vision-11B (Open)     | 55.36%          | 59.27%         | 63.74%           | 60.07%
LLaVA-1.6-13B (Open)           | 58.41%          | 50.11%         | 61.28%           | 57.27%
Qwen2.5-VL-7B (Open)           | 67.84%          | 50.31%         | 65.05%           | 61.63%
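Per-ability numbers like those above are grouped multiple-choice accuracy. A minimal scoring sketch, assuming predictions and gold answers are dicts from question ID to option letter (the IDs, field format, and function name here are assumptions, not the benchmark's official harness):

```python
def score_by_ability(predictions, gold, abilities):
    """Return accuracy (%) per ability.

    predictions, gold: dicts mapping question ID -> option letter.
    abilities: dict mapping question ID -> ability name.
    A missing prediction counts as wrong.
    """
    correct, total = {}, {}
    for qid, answer in gold.items():
        ability = abilities[qid]
        total[ability] = total.get(ability, 0) + 1
        if predictions.get(qid, "").strip().upper() == answer:
            correct[ability] = correct.get(ability, 0) + 1
    return {a: 100.0 * correct.get(a, 0) / total[a] for a in total}

# Toy usage: one correct and one wrong answer within the same ability.
gold = {"q1": "B", "q2": "A"}
preds = {"q1": "b", "q2": "C"}
abil = {"q1": "Perception", "q2": "Perception"}
acc = score_by_ability(preds, gold, abil)  # {'Perception': 50.0}
```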

Error Analysis

Figure 5: Common model errors by question type (Q1–Q7)
Figure 5. Common model errors by question type.

Where models break — perception → decision → reflection

  1. Perception (Q1–Q2): Detail Misinterpretation · Spatial Misinterpretation · Critical Information Oversight
  2. Constraint & Decision (Q3–Q4): “Safe Choice” Bias · Constraint Neglect
  3. Reflective Reasoning (Q5–Q7): Role Inference / Authority Bias · Superficial & Context-Independent Reasoning
Methods that help

Supervised fine-tuning (SFT) can boost model performance. Incorporating Chain-of-Thought prompting with forward thinking, where the model predicts the consequences of candidate actions before choosing, further improves action justification.
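As an illustration, forward-thinking prompting can be implemented as a simple template that asks the model to predict each option's consequence before committing to an answer. The wording below is a hypothetical sketch, not the paper's actual prompt:

```python
# Hypothetical forward-thinking Chain-of-Thought template (an assumption,
# not the prompt used in the paper).
FORWARD_COT = (
    "Situation: {situation}\n"
    "Question: {question}\n"
    "Options:\n{options}\n\n"
    "First, for each option, briefly predict its likely consequence in this "
    "situation. Then choose the option whose consequence best resolves the "
    "situation, and answer with its letter."
)

prompt = FORWARD_COT.format(
    situation="A car is on fire; another vehicle is parked about 5m away.",
    question="What is the most suitable action?",
    options=(
        "A. Call emergency services and keep a ~20m safety perimeter.\n"
        "B. Approach the fire to record a close-up video."
    ),
)
```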

Team & Citation

Zhe Hu¹, Yixiao Ren¹, Guanzhong Liu¹, Jing Li¹,², Yu Yin³

¹Department of Computing, The Hong Kong Polytechnic University

²Research Centre for Data Science & Artificial Intelligence

³Department of Computer and Data Sciences, Case Western Reserve University

BibTeX Citation

@inproceedings{hu2025vivaplus,
  title={{VIVA+}: Human-Centered Situational Decision-Making},
  author={Hu, Zhe and Ren, Yixiao and Liu, Guanzhong and Li, Jing and Yin, Yu},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
  year={2025}
}