About the Study
As large language models (LLMs) and vision-language models (VLMs) are increasingly used in robotics, a crucial question arises: to what extent do these models replicate human-like cognitive processes, particularly in socially interactive contexts? This study addresses that question by using human visual attention as a behavioural proxy for cognition in naturalistic human-robot interaction (HRI) scenarios.
Eye-tracking data were collected from participants engaging in social human-human interactions, providing frame-level gaze fixations as human attentional ground truth. We prompted a state-of-the-art VLM (LLaVA) to generate scene descriptions, which were processed by four LLMs (DeepSeek-R1-Distill-Qwen-7B, Qwen1.5-7B-Chat, LLaMA-3.1-8b-instruct, and Gemma-7b-it) to infer saliency points. Each model was evaluated in both stateless and memory-augmented (short-term memory, STM) modes.
Key Finding
Whilst stateless LLaVA most closely replicates human gaze patterns, short-term memory (STM) confers measurable benefits only for DeepSeek, whose lexical anchoring mirrors human rehearsal mechanisms. Other models exhibited degraded performance with memory due to prompt interference or limited contextual integration.
Research Motivation: Why Cognitive Plausibility Matters
The integration of AI models into robotic systems raises fundamental questions about cognitive alignment. While these models demonstrate impressive performance on benchmarks, whether they replicate the computational processes characteristic of human cognition remains largely unexplored. Such alignment is particularly crucial in human-robot interaction, where socially appropriate responses depend on understanding and mimicking human attentional patterns.
Core Research Questions
- Cognitive Plausibility: Do AI models achieve human-like outcomes through mechanisms that align with human cognitive architecture?
- Visual Attention Modeling: Can current AI models predict where humans naturally direct their attention in social scenarios?
- Memory Integration: How does short-term memory influence the cognitive plausibility of AI attention predictions?
- Model Comparison: Which AI architectures most closely replicate human visual attention patterns?
Scientific Impact
This work establishes the first systematic framework for evaluating cognitive plausibility in AI vision systems, with implications for developing truly human-aligned robotic intelligence.
Methodology: Eye-Tracking Meets AI Vision
Our evaluation framework combines empirical human behaviour data with computational model assessment. We collected eye-tracking data from 11 participants observing a naturalistic human-robot interaction scenario, providing frame-level gaze fixations as ground truth for human visual attention patterns.
Experimental Design
Methodological Innovation
This study introduces the first systematic approach to evaluating AI cognitive plausibility using naturalistic eye-tracking data, establishing new benchmarks for human-AI cognitive alignment research.
Key Findings: Cognitive Alignment Across AI Models
Our comprehensive evaluation revealed significant differences in how well various AI models align with human visual attention patterns. The results provide important insights for selecting appropriate models for human-robot interaction applications and highlight the complexity of achieving true cognitive plausibility.
Primary Results
- LLaVA (Stateless): Achieved the highest cognitive alignment with human gaze patterns (mean cosine similarity = 0.311), demonstrating superior immediate visual-linguistic integration
- DeepSeek with STM: The only model to benefit significantly from bounded memory integration (mean cosine similarity rose from 0.038 to 0.057, r = 0.55), using lexical anchoring mechanisms that mirror human rehearsal processes
- Memory Interference: LLaVA's performance degraded with unlimited context accumulation (mean cosine similarity fell from 0.311 to 0.121, r = 0.70), demonstrating cognitively plausible capacity limitations
- Cross-lingual Processing: Qwen produced ~40% Mandarin outputs despite English prompting, revealing sophisticated multilingual cognitive processing capabilities
- Architectural Constraints: Gemma and LLaMA showed systematic degradation with temporal context, indicating fundamental capacity limitations
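The alignment scores above are reported as cosine similarities, and the limitations section refers to a TF-IDF framework. A minimal sketch of how such a score could be computed is shown here. It assumes that both the human gaze ground truth and the model output are available as short text descriptions per frame; the function names, the tokenizer, and the smoothed IDF weighting are illustrative choices, not the study's actual implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical per-frame texts, for illustration only.
human_text = "face of the woman and the hand of the robot"
model_text = "the robot hand reaching toward the woman"
human_vec, model_vec = tfidf_vectors([human_text, model_text])
frame_score = cosine_similarity(human_vec, model_vec)
```

Averaging `frame_score` over all frames would give a mean cosine similarity per model and regime, comparable in form to the figures listed above.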
Statistical Validation
Comprehensive statistical analysis using non-parametric approaches (Friedman test: chi-squared(9) = 215.8, p < 0.001) confirms significant cognitive alignment differences across model-regime conditions. Effect sizes quantify practical significance: LLaVA's immediate attention superiority (r = 0.63), memory overload effects (r = 0.70), and DeepSeek's memory benefits (r = 0.55) all demonstrate large, meaningful differences in cognitive processing strategies.
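For readers unfamiliar with the test, the sketch below computes a Friedman statistic from a table of per-block scores. With ten model-regime conditions (five models, two regimes each) the test has 9 degrees of freedom, consistent with the value reported above. The blocking unit (for example, frames) is an assumption, as is the effect-size convention r = |Z| / sqrt(N), which is commonly used with Wilcoxon-type pairwise follow-ups; the source does not state how r was derived. This version omits the tie correction.

```python
import math


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """scores[b][c] is the score of condition c in block b.

    Returns (chi_squared, degrees_of_freedom) without tie correction.
    """
    n_blocks = len(scores)
    n_conditions = len(scores[0])
    rank_sums = [0.0] * n_conditions
    for row in scores:
        for c, rank in enumerate(average_ranks(row)):
            rank_sums[c] += rank
    scale = 12.0 / (n_blocks * n_conditions * (n_conditions + 1))
    chi_squared = scale * sum(r * r for r in rank_sums)
    chi_squared -= 3.0 * n_blocks * (n_conditions + 1)
    return chi_squared, n_conditions - 1


def effect_size_r(z, n):
    """Effect size r = |Z| / sqrt(N) for a pairwise rank-based test."""
    return abs(z) / math.sqrt(n)
```

By Cohen's conventional thresholds, r values of about 0.5 or more are usually read as large, which is how the effect sizes above are interpreted.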
Research Impact
This work introduces a novel, empirically grounded framework for assessing cognitive plausibility in generative models and underscores the role of short-term memory in shaping human-like visual attention in robotic systems.
Technical Innovation: Memory-Augmented Attention Modeling
Our study introduces a novel framework for assessing cognitive plausibility that goes beyond simple performance matching. By incorporating short-term memory mechanisms and comparing stateless versus memory-augmented conditions, we provide insights into the temporal dynamics of AI attention modeling and their relationship to human cognitive processes.
Cognitively-Informed Memory Architecture
- Bounded Context Window: A 5-frame sliding window maintains a bounded memory buffer, reflecting the capacity limitations observed in human working memory [Cowan, 2001]
- Selective Information Processing: DeepSeek demonstrated strategic lexical recycling, reusing relevant phrases across temporal windows in 25 of the final 30 frames
- Capacity Limitations: LLaVA showed systematic performance degradation beyond 43 frames, following predictable exponential decay patterns (R-squared = 0.89)
- Interference Patterns: Memory overload produced predictable interference effects quantifiable through performance metrics, aligning with computational principles of bounded processing systems
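The bounded context window described in this list can be illustrated with a fixed-length buffer. The sketch below is a minimal example, not the study's code: the class name, the prompt wording, and the choice to store the five previous descriptions (rather than four plus the current one) are all assumptions.

```python
from collections import deque


class ShortTermMemory:
    """Keeps the most recent scene descriptions in a bounded buffer."""

    def __init__(self, window=5):
        # A deque with maxlen drops the oldest entry automatically.
        self._frames = deque(maxlen=window)

    def build_prompt(self, description):
        """Return the prompt for the current frame, then store it."""
        context = "\n".join(
            f"[t-{len(self._frames) - i}] {past}"
            for i, past in enumerate(self._frames)
        )
        self._frames.append(description)
        if not context:
            return f"Current scene: {description}"
        return f"Recent scenes:\n{context}\nCurrent scene: {description}"


# Hypothetical scene descriptions, for illustration only.
stm = ShortTermMemory(window=5)
for text in ["robot waves", "woman smiles", "robot offers a cup"]:
    prompt = stm.build_prompt(text)
```

Setting `window=0` yields a buffer that never retains anything, which gives a stateless condition through the same interface. An unbounded variant (`maxlen=None`) would correspond to unlimited context accumulation.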
Technical Achievement
This work establishes the first systematic methodology for evaluating memory effects in AI cognitive plausibility, opening new avenues for neurologically-inspired AI development.
Implications for Human-Robot Interaction
The findings have significant implications for developing socially intelligent robotic systems. Understanding which AI models most closely replicate human attention patterns enables better selection of computational frameworks for HRI applications, potentially improving robot social awareness and interaction quality.
Practical Applications
Future Impact
This research enables the development of cognitively-aligned robotic systems that can understand and predict human attention, leading to more intuitive and effective human-robot collaboration.
Limitations and Future Directions
While this study establishes a foundational framework for cognitive plausibility assessment, several limitations and future directions merit consideration.
Study Limitations
- Controlled Scope: Single-video dyadic scenario enables rigorous methodological validation but limits immediate generalizability to diverse HRI contexts
- Sample Size: The 11-participant study aligns with established eye-tracking research practices but represents a modest sample for broad population claims
- Memory Implementation: Asymmetric memory implementations (STM-5 for LLMs vs. unlimited for LLaVA), necessitated by computational constraints, limit direct comparability
- Cross-lingual Evaluation: TF-IDF framework systematically disadvantages multilingual models like Qwen, highlighting need for language-agnostic evaluation approaches
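One practical consequence of the cross-lingual limitation is that non-English outputs should be identified before a token-overlap metric is applied, since an English ground truth and a Mandarin output share almost no tokens. The sketch below estimates the share of outputs containing Han characters. It is an illustrative helper under stated assumptions, not part of the study: the function names are hypothetical, and the check covers only the CJK Unified Ideographs block (U+4E00 to U+9FFF), so it cannot distinguish Mandarin from other languages written with Han characters.

```python
def contains_cjk(text):
    """True if the text has any character in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def cjk_output_share(outputs):
    """Fraction of outputs that contain at least one Han character."""
    if not outputs:
        return 0.0
    return sum(contains_cjk(text) for text in outputs) / len(outputs)


# Hypothetical model outputs, for illustration only.
outputs = ["the robot raises its hand", "机器人 raises its hand", "a cup", "看"]
share = cjk_output_share(outputs)
```

Outputs flagged this way could be routed to a language-agnostic comparison instead of being scored against English text directly.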
Future Research Directions
- Scale-Up Studies: Diverse interaction contexts, larger participant samples, and varied demographic populations to establish broader framework applicability
- Sophisticated Memory Mechanisms: Implementation of capacity-limited, selective, and temporally sensitive memory systems that better approximate human cognitive architecture
- Multilingual Evaluation: Cross-lingual sentence transformers and language-specific ground truth generation to fairly assess multilingual cognitive capabilities
- Real-World Deployment: Field validation in naturalistic HRI environments including multi-agent scenarios and task-oriented behaviours
Open Science & Collaboration
We support open, reproducible research and welcome collaborations to advance understanding of cognitive plausibility in AI systems. This work provides a foundation for developing more human-aligned AI agents for robotics and interactive systems.
Full Paper & Resources
This research contributes to the growing field of cognitive AI and human-robot interaction, establishing new methodologies for evaluating AI-human cognitive alignment.