
Which AI Sees Like Us? Investigating Cognitive Plausibility of Vision Models

MDPI Sensors 2025, Vol. 25, Issue 15 • Ghamati, Dehkordi & Zaraki • University of Hertfordshire

About the Study

Khashayar Ghamati, Maryam Banitalebi Dehkordi, Abolfazl Zaraki
University of Hertfordshire, UK
Published in MDPI Sensors 2025, Volume 25, Issue 15, Article 4687

As large language models (LLMs) and vision-language models (VLMs) are increasingly used in robotics, a crucial question arises: to what extent do these models replicate human-like cognitive processes, particularly in socially interactive contexts? This study addresses that question by using human visual attention as a behavioural proxy for cognition in naturalistic human-robot interaction (HRI) scenarios.

Eye-tracking data were collected from participants engaging in social human-human interactions, providing frame-level gaze fixations as human attentional ground truth. We prompted a state-of-the-art VLM (LLaVA) to generate scene descriptions, which were processed by four LLMs (DeepSeek-R1-Distill-Qwen-7B, Qwen1.5-7B-Chat, LLaMA-3.1-8b-instruct, and Gemma-7b-it) to infer saliency points. Each model was evaluated in both stateless and memory-augmented (short-term memory, STM) modes.

Key Finding

Whilst stateless LLaVA most closely replicates human gaze patterns, short-term memory (STM) confers measurable benefits only for DeepSeek, whose lexical anchoring mirrors human rehearsal mechanisms. Other models exhibited degraded performance with memory due to prompt interference or limited contextual integration.

Figure 1: Overview of the evaluation pipeline. The vision-language model processes each video frame to generate textual descriptions, which are passed to four large language models. The system operates in two modes: stateless (frame-by-frame) and memory-augmented (STM), where concatenated frame descriptions are provided. Human eye-tracking data serves as ground truth for evaluating cognitive alignment.
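
The two modes in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_contexts`, the space-joined concatenation, and the choice to count the current frame as part of the window are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List, Optional


def build_contexts(descriptions: List[str], window: Optional[int] = None) -> List[str]:
    """Return the text context an LLM would receive at each frame.

    window=None -> stateless mode: only the current frame description.
    window=k    -> STM mode: the last k descriptions (current one included,
                   an assumption), concatenated in temporal order.
    """
    if window is None:
        return list(descriptions)
    # deque(maxlen=k) evicts the oldest description automatically.
    memory: Deque[str] = deque(maxlen=window)
    contexts = []
    for description in descriptions:
        memory.append(description)
        contexts.append(" ".join(memory))
    return contexts


# Hypothetical frame descriptions standing in for VLM output.
frames = ["A speaks.", "B nods.", "A points at the table."]
stateless = build_contexts(frames)
stm = build_contexts(frames, window=2)
```

With `window=5` the same function would correspond to a 5-frame sliding window; a window of 2 is used here only to keep the example short.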

Research Motivation: Why Cognitive Plausibility Matters

The integration of AI models into robotic systems raises fundamental questions about cognitive alignment. While these models demonstrate impressive performance on benchmarks, their ability to replicate the underlying computational processes characteristic of human cognition remains largely unexplored. This is particularly crucial in human-robot interaction, where socially appropriate responses depend on understanding and mimicking human attentional patterns.

Core Research Questions

  • Cognitive Plausibility: Do AI models achieve human-like outcomes through mechanisms that align with human cognitive architecture?
  • Visual Attention Modeling: Can current AI models predict where humans naturally direct their attention in social scenarios?
  • Memory Integration: How does short-term memory influence the cognitive plausibility of AI attention predictions?
  • Model Comparison: Which AI architectures most closely replicate human visual attention patterns?

Scientific Impact

This work establishes the first systematic framework for evaluating cognitive plausibility in AI vision systems, with implications for developing truly human-aligned robotic intelligence.

Methodology: Eye-Tracking Meets AI Vision

Our evaluation framework combines empirical human behaviour data with computational model assessment. We collected eye-tracking data from 11 participants observing a naturalistic social interaction (a video of a dyadic conversation), providing frame-level gaze fixations as ground truth for human visual attention patterns.

Experimental Design

Human Study
11 participants from the Technical University of Munich (9 male, 2 female, mean age 27.3 years) viewed a 7-minute 20-second video of a dyadic conversation while their gaze was recorded with a DIKABLIS eye-tracking system.
Data Processing
Gaze recordings were annotated frame-by-frame in ELAN, marking each fixation as falling on Person A (left), Person B (right), or the Environment.
AI Pipeline
LLaVA-1.5-7b processed video frames for scene descriptions, which were fed to four 7-8B parameter LLMs for saliency inference.
Memory Implementation
STM-5 configuration with 5-frame sliding window for LLMs; LLaVA tested with both bounded and unlimited memory accumulation.
Evaluation Metrics
TF-IDF cosine similarity between AI predictions and human gaze annotations, validated with non-parametric statistical tests (Friedman test).
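
As a rough illustration of the evaluation metric, the following Python sketch computes TF-IDF cosine similarity between short texts. The source does not specify the tokenisation, the TF-IDF weighting variant, or how gaze annotations are turned into text, so the smoothed IDF formula, the helper names, and the example strings below are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on any non-alphanumeric character."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors (term -> weight) over a small corpus."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Assumed weighting: raw term count times smoothed IDF,
        # idf = ln((1 + N) / (1 + df)) + 1, so shared terms keep weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical texts: a human gaze label and three model predictions.
human, same, partial, other = tfidf_vectors(
    ["Person A", "Person A", "Person B speaks", "Environment"]
)
sim_same = cosine(human, same)
sim_partial = cosine(human, partial)
sim_other = cosine(human, other)
```

Identical texts score close to 1, texts with no shared terms score 0, and partial overlap falls in between, which is the property that makes the measure usable as a per-frame alignment score.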

Methodological Innovation

This study introduces the first systematic approach to evaluating AI cognitive plausibility using naturalistic eye-tracking data, establishing new benchmarks for human-AI cognitive alignment research.

Key Findings: Cognitive Alignment Across AI Models

Our comprehensive evaluation revealed significant differences in how well various AI models align with human visual attention patterns. The results provide important insights for selecting appropriate models for human-robot interaction applications and highlight the complexity of achieving true cognitive plausibility.

Primary Results

  • LLaVA cosine similarity: 0.311
  • DeepSeek STM effect: r = 0.55
  • Memory overload effect: r = 0.70
  • Qwen Mandarin output: ~40%

Statistical Validation

Comprehensive statistical analysis using non-parametric approaches (Friedman test: chi-squared(9) = 215.8, p < 0.001) confirms significant cognitive alignment differences across model-regime conditions. Effect sizes quantify practical significance: LLaVA's immediate attention superiority (r = 0.63), memory overload effects (r = 0.70), and DeepSeek's memory benefits (r = 0.55) all demonstrate large, meaningful differences in cognitive processing strategies.
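
For readers unfamiliar with the Friedman test, the Python sketch below computes its chi-squared statistic from a table of scores. A statistic with 9 degrees of freedom corresponds to 10 compared conditions, since the test has k - 1 degrees of freedom. What constitutes a block (for example a frame or a participant) is not stated here, so the block layout, the function names, and the toy scores are assumptions, and this simple version omits the tie correction.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_chi_squared(blocks: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return (statistic, degrees of freedom) for blocks[i][j], the score of
    condition j in block i. No tie correction is applied."""
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    statistic = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return statistic, k - 1


# Toy data: 4 blocks x 3 conditions, ranked identically in every block.
toy_scores = [
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
    [0.0, 0.5, 0.9],
    [0.1, 0.4, 0.6],
]
stat, df = friedman_chi_squared(toy_scores)
```

In the toy data every block orders the conditions the same way, so the statistic reaches its maximum of n(k - 1) = 8 with 2 degrees of freedom.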

Research Impact

This work introduces a novel, empirically grounded framework for assessing cognitive plausibility in generative models and underscores the role of short-term memory in shaping human-like visual attention in robotic systems.

Technical Innovation: Memory-Augmented Attention Modeling

Our study introduces a novel framework for assessing cognitive plausibility that goes beyond simple performance matching. By incorporating short-term memory mechanisms and comparing stateless versus memory-augmented conditions, we provide insights into the temporal dynamics of AI attention modeling and their relationship to human cognitive processes.

Cognitively-Informed Memory Architecture

Technical Achievement

This work establishes the first systematic methodology for evaluating memory effects in AI cognitive plausibility, opening new avenues for neurologically-inspired AI development.

Implications for Human-Robot Interaction

The findings have significant implications for developing socially intelligent robotic systems. Understanding which AI models most closely replicate human attention patterns enables better selection of computational frameworks for HRI applications, potentially improving robot social awareness and interaction quality.

Practical Applications

Hybrid AI Systems
Results suggest optimal HRI may require combining LLaVA's immediate attention capabilities with DeepSeek's temporal integration strengths.
Memory-Constrained Deployment
Bounded memory windows of 40-50 contextual elements represent an optimal balance between temporal coherence and cognitive interference.
Adaptive Processing
Context-dependent performance patterns enable meta-cognitive systems that assess environmental conditions and select appropriate processing strategies dynamically.
Cognitive Capacity Management
LLaVA's capacity threshold at frame 43 provides practical guidance for implementing cognitively plausible memory constraints.

Future Impact

This research enables the development of cognitively-aligned robotic systems that can understand and predict human attention, leading to more intuitive and effective human-robot collaboration.

Limitations and Future Directions

While this study establishes a foundational framework for cognitive plausibility assessment, several limitations and future directions merit consideration. The controlled experimental setting, while enabling rigorous comparison, limits immediate generalizability to diverse HRI contexts.

Study Limitations

Future Research Directions

Open Science & Collaboration

We support open, reproducible research and welcome collaborations to advance understanding of cognitive plausibility in AI systems. This work provides a foundation for developing more human-aligned AI agents for robotics and interactive systems.

Full Paper & Resources

Download Full Paper (Open Access)
Access Code & Implementation

This research contributes to the growing field of cognitive AI and human-robot interaction, establishing new methodologies for evaluating AI-human cognitive alignment.