Which AI Sees Like Us? Cognitive Plausibility of Vision Models

About the Study

Khashayar Ghamati, Maryam Banitalebi Dehkordi, Abolfazl Zaraki

University of Hertfordshire, UK

Published in MDPI Sensors 2025, Volume 25, Issue 15, Article 4687

As large language models (LLMs) and vision-language models (VLMs) become increasingly used in robotics, a crucial question arises: to what extent do these models replicate human-like cognitive processes, particularly within socially interactive contexts? This study addresses this gap by using human visual attention as a behavioural proxy for cognition in naturalistic human-robot interaction (HRI) scenarios.

Eye-tracking data were collected from participants engaging in social human-human interactions, providing frame-level gaze fixations as human attentional ground truth. We prompted a state-of-the-art VLM (LLaVA) to generate scene descriptions, which were processed by four LLMs (DeepSeek-R1-Distill-Qwen-7B, Qwen1.5-7B-Chat, LLaMA-3.1-8b-instruct, and Gemma-7b-it) to infer saliency points. Each model was evaluated in both stateless and memory-augmented (short-term memory, STM) modes.

Key Finding

Whilst stateless LLaVA most closely replicates human gaze patterns, short-term memory (STM) confers measurable benefits only for DeepSeek, whose lexical anchoring mirrors human rehearsal mechanisms. Other models exhibited degraded performance with memory due to prompt interference or limited contextual integration.

Research Motivation: Why Cognitive Plausibility Matters

The integration of AI models into robotic systems raises fundamental questions about cognitive alignment. While these models demonstrate impressive performance on benchmarks, their ability to replicate the underlying computational processes characteristic of human cognition remains largely unexplored. This is particularly crucial in human-robot interaction, where socially appropriate responses depend on understanding and mimicking human attentional patterns.

Core Research Questions

Cognitive Plausibility: Do AI models achieve human-like outcomes through mechanisms that align with human cognitive architecture?
Visual Attention Modeling: Can current AI models predict where humans naturally direct their attention in social scenarios?
Memory Integration: How does short-term memory influence the cognitive plausibility of AI attention predictions?
Model Comparison: Which AI architectures most closely replicate human visual attention patterns?

Scientific Impact

This work establishes the first systematic framework for evaluating cognitive plausibility in AI vision systems, with implications for developing truly human-aligned robotic intelligence.

Methodology: Eye-Tracking Meets AI Vision

Our evaluation framework combines empirical human behaviour data with computational model assessment. We collected eye-tracking data from 11 participants observing a naturalistic human-robot interaction scenario, providing frame-level gaze fixations as ground truth for human visual attention patterns.

Experimental Design

Human Study

Eleven participants from the Technical University of Munich (9 male, 2 female, mean age 27.3 years) viewed a 7-minute 20-second video of a dyadic conversation while their gaze was recorded with a DIKABLIS eye-tracking system.

Data Processing

Gaze recordings were annotated frame-by-frame in ELAN, marking fixations on Person A (left), Person B (right), or Environment regions.

AI Pipeline

LLaVA-1.5-7b processed video frames for scene descriptions, which were fed to four 7-8B parameter LLMs for saliency inference.

Memory Implementation

STM-5 configuration with 5-frame sliding window for LLMs; LLaVA tested with both bounded and unlimited memory accumulation.
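
The 5-frame sliding window can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `build_prompts`, the prompt wording, and the choice to keep the five previous descriptions separate from the current one are all assumptions made for illustration.

```python
from collections import deque


def build_prompts(frame_descriptions, window=5):
    """Yield (stateless_prompt, stm_prompt) for each frame description.

    Hypothetical helper: the prompt wording and the decision to hold the
    `window` previous descriptions (excluding the current one) are assumptions.
    """
    memory = deque(maxlen=window)  # the oldest entry is dropped automatically
    for description in frame_descriptions:
        stateless = f"Current scene: {description}"
        if memory:
            history = "\n".join(f"- {past}" for past in memory)
            stm = f"Previous scenes:\n{history}\nCurrent scene: {description}"
        else:
            stm = stateless  # nothing to remember on the first frame
        yield stateless, stm
        memory.append(description)
```

A bounded `deque` keeps the buffer size fixed without explicit bookkeeping, which is the property the STM-5 condition relies on. An unbounded variant would simply omit `maxlen`.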

Evaluation Metrics

TF-IDF cosine similarity between AI predictions and human gaze patterns, with rigorous statistical validation using non-parametric approaches.
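
As a rough illustration of the metric, the sketch below computes TF-IDF cosine similarity between short texts using only the Python standard library. The whitespace tokenisation, the smoothed IDF formula, and the function names are assumptions, and the study's actual preprocessing may differ.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in documents]  # naive tokeniser (assumption)
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed IDF (assumption): terms found in every document keep a non-zero weight.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenised
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

In use, a human gaze label such as "Person A (left)" and a model's saliency text for the same frame would be vectorised together and compared. Because the score depends on shared surface tokens, output in a different language from the ground truth scores low even when the meaning matches.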

Methodological Innovation

This study introduces the first systematic approach to evaluating AI cognitive plausibility using naturalistic eye-tracking data, establishing new benchmarks for human-AI cognitive alignment research.

Key Findings: Cognitive Alignment Across AI Models

Our comprehensive evaluation revealed significant differences in how well various AI models align with human visual attention patterns. The results provide important insights for selecting appropriate models for human-robot interaction applications and highlight the complexity of achieving true cognitive plausibility.

Primary Results

LLaVA Cosine Sim.: 0.311
DeepSeek STM Effect: r = 0.55
Memory Overload: r = 0.70
Qwen Mandarin Output: ~40%

LLaVA (Stateless): Achieved the highest cognitive alignment with human gaze patterns (mean cosine similarity = 0.311), demonstrating superior immediate visual-linguistic integration
DeepSeek with STM: Only model to significantly benefit from bounded memory integration (0.038 to 0.057, r = 0.55), using lexical anchoring mechanisms that mirror human rehearsal processes
Memory Interference: LLaVA's performance degraded with unlimited context accumulation (0.311 to 0.121, r = 0.70), demonstrating cognitively plausible capacity limitations
Cross-lingual Processing: Qwen produced ~40% Mandarin outputs despite English prompting, revealing sophisticated multilingual cognitive processing capabilities
Architectural Constraints: Gemma and LLaMA showed systematic degradation with temporal context, indicating fundamental capacity limitations

Statistical Validation

Comprehensive statistical analysis using non-parametric approaches (Friedman test: chi-squared(9) = 215.8, p < 0.001) confirms significant cognitive alignment differences across model-regime conditions. Effect sizes quantify practical significance: LLaVA's immediate attention superiority (r = 0.63), memory overload effects (r = 0.70), and DeepSeek's memory benefits (r = 0.55) all demonstrate large, meaningful differences in cognitive processing strategies.
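
For readers unfamiliar with the Friedman test, the following sketch computes the basic statistic from a table of repeated measurements. It is an illustration under stated assumptions rather than the study's analysis code: the function names are hypothetical, no tie correction is applied, and what counts as a block (for example a frame or a participant) is not specified here. The degrees of freedom equal the number of conditions minus one, so the reported value of 9 would be consistent with ten model-regime conditions.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(scores):
    """scores[i][j] is the score of block i under condition j.

    Returns (chi_squared, degrees_of_freedom), without tie correction.
    """
    n = len(scores)      # number of blocks
    k = len(scores[0])   # number of conditions
    rank_sums = [0.0] * k
    for row in scores:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    chi_squared = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi_squared, k - 1
```

Ranking within each block is what makes the test non-parametric: only the ordering of conditions matters, not the raw similarity values.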

Research Impact

This work introduces a novel, empirically grounded framework for assessing cognitive plausibility in generative models and underscores the role of short-term memory in shaping human-like visual attention in robotic systems.

Technical Innovation: Memory-Augmented Attention Modeling

Our study introduces a novel framework for assessing cognitive plausibility that goes beyond simple performance matching. By incorporating short-term memory mechanisms and comparing stateless versus memory-augmented conditions, we provide insights into the temporal dynamics of AI attention modeling and their relationship to human cognitive processes.

Cognitively-Informed Memory Architecture

Bounded Context Window: 5-frame sliding window maintains bounded memory buffer reflecting capacity limitations observed in human working memory [Cowan, 2001]
Selective Information Processing: DeepSeek demonstrated strategic lexical recycling, reusing relevant phrases across temporal windows in 25 of the final 30 frames
Capacity Limitations: LLaVA showed systematic performance degradation beyond 43 frames, following predictable exponential decay patterns (R-squared = 0.89)
Interference Patterns: Memory overload produced predictable interference effects quantifiable through performance metrics, aligning with computational principles of bounded processing systems
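
One way to check for an exponential decay pattern like the one reported for LLaVA is to fit the curve by linear regression on the logarithm of the scores and then compute R-squared on the original scale. The sketch below shows that approach. The fitting method, the function name, and any data passed to it are assumptions, and the study may have used a different procedure.

```python
import math


def fit_exponential_decay(frames, scores):
    """Fit score ~ a * exp(-b * frame) by least squares on log(score).

    Returns (a, b, r_squared). Requires strictly positive scores and at
    least two distinct frame indices.
    """
    logs = [math.log(s) for s in scores]
    n = len(frames)
    mean_x = sum(frames) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in frames)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(frames, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    a, b = math.exp(intercept), -slope
    # R-squared is measured against the untransformed scores.
    predicted = [a * math.exp(-b * x) for x in frames]
    mean_score = sum(scores) / n
    ss_res = sum((s - p) ** 2 for s, p in zip(scores, predicted))
    ss_tot = sum((s - mean_score) ** 2 for s in scores)
    return a, b, 1.0 - ss_res / ss_tot
```

Applied to per-frame similarity scores after the capacity threshold, a positive `b` with a high R-squared would indicate decay of the kind described above.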

Technical Achievement

This work establishes the first systematic methodology for evaluating memory effects in AI cognitive plausibility, opening new avenues for neurologically-inspired AI development.

Implications for Human-Robot Interaction

The findings have significant implications for developing socially intelligent robotic systems. Understanding which AI models most closely replicate human attention patterns enables better selection of computational frameworks for HRI applications, potentially improving robot social awareness and interaction quality.

Practical Applications

Hybrid AI Systems

Results suggest optimal HRI may require combining LLaVA's immediate attention capabilities with DeepSeek's temporal integration strengths.

Memory-Constrained Deployment

Bounded memory windows of 40-50 contextual elements represent an optimal balance between temporal coherence and cognitive interference.

Adaptive Processing

Context-dependent performance patterns enable meta-cognitive systems that assess environmental conditions and select appropriate processing strategies dynamically.

Cognitive Capacity Management

LLaVA's capacity threshold at frame 43 provides practical guidance for implementing cognitively plausible memory constraints.

Future Impact

This research enables the development of cognitively-aligned robotic systems that can understand and predict human attention, leading to more intuitive and effective human-robot collaboration.

Limitations and Future Directions

While this study establishes a foundational framework for cognitive plausibility assessment, several limitations and future directions merit consideration. The controlled experimental setting, while enabling rigorous comparison, limits immediate generalizability to diverse HRI contexts.

Study Limitations

Controlled Scope: Single-video dyadic scenario enables rigorous methodological validation but limits immediate generalizability to diverse HRI contexts
Sample Size: 11-participant study aligns with established eye-tracking research practices but represents a modest sample for broad population claims
Memory Implementation: Asymmetric memory implementations (STM-5 for LLMs vs. unlimited for LLaVA) necessitated by computational constraints limit direct comparability
Cross-lingual Evaluation: TF-IDF framework systematically disadvantages multilingual models like Qwen, highlighting need for language-agnostic evaluation approaches

Future Research Directions

Scale-Up Studies: Diverse interaction contexts, larger participant samples, and varied demographic populations to establish broader framework applicability
Sophisticated Memory Mechanisms: Implementation of capacity-limited, selective, and temporally sensitive memory systems that better approximate human cognitive architecture
Multilingual Evaluation: Cross-lingual sentence transformers and language-specific ground truth generation to fairly assess multilingual cognitive capabilities
Real-World Deployment: Field validation in naturalistic HRI environments including multi-agent scenarios and task-oriented behaviours

Open Science & Collaboration

We support open, reproducible research and welcome collaborations to advance understanding of cognitive plausibility in AI systems. This work provides a foundation for developing more human-aligned AI agents for robotics and interactive systems.

Full Paper & Resources

This research contributes to the growing field of cognitive AI and human-robot interaction, establishing new methodologies for evaluating AI-human cognitive alignment.