Learning to Gaze: Bio-Inspired Attention Adaptation Strategy for Social Robots

IEEE TCDS 2025 • Ghamati, Dehkordi, Nohooji, Voos, Amirabdollahian & Zaraki • University of Hertfordshire

Authors

Khashayar Ghamati¹,*, Maryam Banitalebi Dehkordi¹, Hamed Rahimi Nohooji², Holger Voos², Farshid Amirabdollahian¹, and Abolfazl Zaraki¹,*

¹ School of Physics, Engineering and Computer Science (SPECS), Robotics Research Group, University of Hertfordshire, AL10 9AB, UK
² Automation Robotics Research Group, Interdisciplinary Centre for Security, Reliability and Trust (SnT), University of Luxembourg, Luxembourg

Abstract

Key Innovation: This study presents a bio-inspired reinforcement learning framework for robotic gaze control that incorporates a habituation mechanism to regulate the exploration-exploitation trade-off, mirroring how biological attention systems filter redundant stimuli whilst remaining responsive to novel events.

Adaptive attention allocation in dynamic social environments remains a fundamental challenge for autonomous robots, requiring the integration of perceptual saliency, social context, and real-time decision-making.

Through a comprehensive ablation study comparing Deep Q-Learning (DQL), Vanilla Q-Learning (VQL), and Multi-Objective Q-Learning (MOL), we uncover a critical insight: habituation significantly enhances DQL performance, improving response efficiency and policy stability, yet causes systematic degradation in MOL due to fundamental incompatibilities between fixed-threshold resets and the extended episodes required for multi-objective optimisation.

This differential effect reveals that bio-inspired mechanisms cannot be applied universally across learning architectures but must be carefully matched to algorithmic characteristics. Real-world deployment on the ARI humanoid robot validates the framework's practical applicability, achieving 95.1% accuracy (95% CI: [92.7%, 96.7%]) across 448 trials with well-calibrated confidence metrics that reliably distinguish correct from incorrect predictions.

Research Context & Motivation

Social robotics is expanding rapidly across assistive care, education, and entertainment, demanding robots with bio-inspired, human-like characteristics. Among these, the ability to direct and regulate attention stands as a fundamental challenge. Real-time attention allocation in multiparty scenarios requires identifying and prioritising salient stimuli in dynamic, ambiguous environments — essential for effective human-robot interaction.

Current approaches suffer from fundamental constraints that limit real-world deployment. Performance degrades beyond 2-3 participants as computational complexity grows exponentially, systems demonstrate poor temporal modelling, and reliance on deterministic handcrafted mappings constrains adaptability to novel scenarios. This gap stems from the underdeveloped translation of biological attention mechanisms to robotics.

Among missing mechanisms, habituation — fundamental to biological learning and attention regulation — is particularly promising yet severely underutilised. Current implementations employ simplistic exponential decay, failing to capture stimulus specificity, spontaneous recovery, and dishabituation responses. A critical question emerges: under what conditions do bio-inspired mechanisms enhance learning, and when might they interfere with algorithmic requirements?

76.9% RASA (dyadic only)
94.2% Rule-Based Controller
12.5% Random Policy (1/8)
95.1% RLBAM (this work)

Figure 1: A triadic social HRI scenario between the ARI humanoid robot and two human participants, demonstrating real-world deployment of the bio-inspired attention adaptation framework at the Robot House, University of Hertfordshire, UK.

Methodology: Bio-Inspired Attention Framework

Habituation Mechanism

Implements stimulus specificity, spontaneous recovery, and dishabituation responses within the RL exploration-exploitation framework. When the agent becomes stuck (exceeding a step threshold), dishabituation temporarily restores full exploration; spontaneous recovery prevents erasing prior learning progress after a reset.

54-Experiment Ablation

Comprehensive 3 x 2 factorial design: three RL methods (DQL, VQL, MOL) x two exploration modes (standard epsilon decay vs. habituation) x 9 independent runs per configuration, yielding over 55,000 test episodes.

Elicited Attention Reward

Reward function grounded in empirical human eye-tracking data using the Elicited Attention model. Integrates social features (Gaze Control Scores) with proxemic zones (personal, social, public) for ecologically valid learned behaviours.

Real-Time Social Adaptation

Discrete 8-action gaze control (6 people + environment + objects) with real-time inference at 30 Hz on ARI's onboard computer. Processes social cues, gestures, and proximity for human-like attention allocation in multiparty scenarios.

Algorithm 1: Bio-Inspired Habituation Mechanism

Initialise: ε ← 1.0, ε_prev ← 1.0, τ ← 10
Parameters: decay δ = 0.995, minimum ε_min = 0.01
for each episode e = 1, 2, … do
    steps ← 0, reset_occurred ← False
    while not terminal state do
        Select action via ε-greedy policy
        Execute action, observe reward and next state
        steps ← steps + 1
        if steps > τ and not goal_reached then
            ε_prev ← ε; ε ← 1.0    {Dishabituation}
            reset_occurred ← True
        end if
    end while
    if goal_reached and reset_occurred then
        ε ← ε_prev    {Spontaneous recovery}
    else
        ε ← max(ε_min, ε × δ)    {Normal decay}
    end if
end for

The mechanism operates through three key biological properties: Habituation corresponds to the standard exponential decay of the exploration rate. Dishabituation occurs when the agent becomes stuck (exceeding τ steps without reaching the goal state), temporarily restoring full exploration. Spontaneous recovery occurs after successful goal achievement following a dishabituation reset, restoring the previous ε value to prevent erasing prior learning progress.
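
The three properties above can be sketched as a small epsilon schedule. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, and the guard that lets dishabituation fire at most once per episode is an assumption added so that ε_prev is not overwritten with 1.0 on later steps.

```python
class HabituationSchedule:
    """Epsilon schedule with dishabituation and spontaneous recovery (sketch)."""

    def __init__(self, tau=10, decay=0.995, eps_min=0.01):
        self.eps = 1.0
        self.eps_prev = 1.0
        self.tau = tau
        self.decay = decay
        self.eps_min = eps_min
        self.steps = 0
        self.reset_occurred = False

    def start_episode(self):
        self.steps = 0
        self.reset_occurred = False

    def on_step(self, goal_reached):
        """Call once after every environment step."""
        self.steps += 1
        # Assumption: trigger at most once per episode.
        if self.steps > self.tau and not goal_reached and not self.reset_occurred:
            self.eps_prev = self.eps  # remember the habituated rate
            self.eps = 1.0            # dishabituation: full exploration
            self.reset_occurred = True

    def end_episode(self, goal_reached):
        if goal_reached and self.reset_occurred:
            self.eps = self.eps_prev  # spontaneous recovery
        else:
            self.eps = max(self.eps_min, self.eps * self.decay)  # normal decay


def run_episode(schedule, n_steps, goal_on_last_step):
    """Hypothetical driver: n_steps steps, goal optionally reached on the last."""
    schedule.start_episode()
    for i in range(1, n_steps + 1):
        schedule.on_step(goal_reached=goal_on_last_step and i == n_steps)
    schedule.end_episode(goal_reached=goal_on_last_step)


sched = HabituationSchedule()
for _ in range(100):
    run_episode(sched, 3, True)   # short successful episodes: plain decay
eps_before = sched.eps
run_episode(sched, 12, True)      # exceeds tau, resets, then recovers
eps_after = sched.eps
```

In this run the long episode exceeds τ, so ε jumps to 1.0 mid-episode, and because the goal is then reached, ε returns to its pre-reset value instead of being decayed.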


Reward Structure: Elicited Attention Model

The reward function is grounded in empirical human eye-tracking data from a study with 11 participants at the Technical University of Munich, who viewed a 7-minute dyadic conversation video recorded with synchronised HD and Kinect RGB-D cameras. The Elicited Attention (EA) formula integrates social features, proxemics, orientation, and attention memory:

$$EA_{s,j}(t) = F_{s,j} + P(d) + O(\theta) + EAM_{s,j}$$

The total reward is simplified to $r_t(s, a) = F_{s,j} + P(d)$, combining social features with proximity. The high discount factor ($\gamma = 0.988$) ensures the agent plans over ~83 steps into the future, capturing temporal dynamics of social attention.
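
The ~83-step figure matches the usual effective-horizon rule of thumb for a discounted return. The worked calculation below is an inference from γ, not a derivation given in the source:

```latex
H_{\mathrm{eff}} \approx \frac{1}{1-\gamma}
  = \frac{1}{1-0.988}
  = \frac{1}{0.012}
  \approx 83.3
```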

Gaze Control Scores (Social Feature Priorities)

| Priority | Social Cue | Gaze Control Score |
|---|---|---|
| 1 | Entering | 100 |
| 2 | Speaking | 100 |
| 3 | Hand motion / Gesture | 65 |
| 4 | Leaving | 55 |
| 5 | Facial expression | 45 |

Social cue priorities from the Elicited Attention model, defining $F_{s,j}$ in the reward function.

Proxemic Zone Weights

Personal Space: P(d) = 1000
Social Space: P(d) = 100
Public Space: P(d) = 10
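
A minimal sketch of the simplified reward r = F + P(d), using the scores and zone weights listed above. The function and dictionary names are hypothetical. The source does not say how several simultaneous cues from one person are combined or how distance is mapped to a zone, so the sketch assumes a single cue per person and a zone that has already been classified.

```python
# Gaze Control Scores (F) and proxemic zone weights (P) from the tables above.
GAZE_CONTROL_SCORE = {
    "entering": 100,
    "speaking": 100,
    "gesture": 65,
    "leaving": 55,
    "facial_expression": 45,
}
PROXEMIC_WEIGHT = {"personal": 1000, "social": 100, "public": 10}


def reward(cue, zone):
    """Simplified reward r = F + P(d) for gazing at one person."""
    return GAZE_CONTROL_SCORE[cue] + PROXEMIC_WEIGHT[zone]


def best_target(people):
    """Return the person whose (cue, zone) pair yields the highest reward."""
    return max(people, key=lambda name: reward(*people[name]))


# Hypothetical scene: one cue and one zone per person.
people = {
    "person_1": ("speaking", "social"),             # 100 + 100
    "person_2": ("facial_expression", "personal"),  # 45 + 1000
}
chosen = best_target(people)
```

With these weights the proxemic term dominates: a person in personal space outranks a speaker in social space (1045 vs. 200).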

Six Key Contributions

  • Bio-inspired habituation mechanism: Implements stimulus specificity, spontaneous recovery, and dishabituation — addressing the critical gap in bio-inspired attention mechanisms that rely on simplistic exponential decay
  • Comprehensive ablation study: 54 independent experiments with rigorous statistical analysis (paired t-tests, one-way ANOVA, Cohen's d) revealing the differential impact of habituation across learning architectures
  • Systematic baseline comparisons: Results contextualised against a rule-based controller (94.2% success), standard epsilon-greedy exploration, and random policy baseline (12.5%)
  • Empirically-grounded reward structure: Derived from human eye-tracking data using the Elicited Attention model, ensuring ecological validity in learned policies
  • Real-time performance: 30 Hz inference on ARI's onboard computer, with training completing in approximately 45 minutes per 10,000-episode run
  • Real-world deployment: 95.1% accuracy (95% CI: [92.7%, 96.7%]) across 448 trials with 3 experimenters, with per-class F1-scores of 0.63-0.78 for human-directed attention
RLBAM framework architecture showing habituation mechanism and multi-objective learning

Figure 2: The RLBAM framework architecture demonstrating the integration of habituation mechanisms with multi-objective Q-learning for adaptive robotic attention control.

System Architecture & Training

Three RL Architectures Compared

Deep Q-Learning (DQL)

Neural network function approximator with two hidden layers (128 and 64 units), ReLU activations, and experience replay for decorrelating sequential samples. A target network stabilises training. Naturally suited to the discrete 8-action gaze decision space with value enumeration and interpretable Q-value confidence metrics.

Vanilla Q-Learning (VQL)

Tabular Q-table with discrete state representation and learning rate α = 0.1. Provides a non-parametric baseline — its tabular structure produces near-uniform softmax distributions regardless of policy quality, yielding low confidence scores (0.22-0.23) but perfect success rates.

Multi-Objective QL (MOL)

Vector Q-table maintaining separate Q-values for six objectives: task success, proximity to target, gaze direction alignment, social appropriateness, movement smoothness, and energy efficiency. Naturally requires longer episodes, creating fundamental incompatibility with the fixed habituation threshold (τ = 10).
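
For reference, the standard tabular Q-learning update that a baseline such as VQL relies on can be sketched as follows. The state keys are hypothetical, and applying γ = 0.988 to the tabular baseline is an assumption; the source states only α = 0.1 for VQL.

```python
from collections import defaultdict

N_ACTIONS = 8
ALPHA = 0.1    # VQL learning rate given in the text
GAMMA = 0.988  # assumption: same discount factor as the other methods

# Q-table: state key -> list of 8 action values, zero-initialised.
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)


def q_update(state, action, reward, next_state, done):
    """One standard Q-learning step: Q <- Q + alpha * (target - Q)."""
    bootstrap = 0.0 if done else max(q_table[next_state])
    target = reward + GAMMA * bootstrap
    q_table[state][action] += ALPHA * (target - q_table[state][action])
    return q_table[state][action]


# Hypothetical transition: gaze at Person 1 (action index 0), episode ends.
new_value = q_update("p1_speaking_social", 0, 200.0, "terminal", done=True)
```

Starting from a zero table, a terminal reward of 200 moves the entry one tenth of the way to the target, to about 20.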

State & Action Spaces

Each state s ∈ ℝ^n encodes person activities, proximity measurements, and person count derived from Kinect-based sensing. The environment comprises interactive states (one or more people present) and non-interactive states (no individuals or no active engagement). The action space comprises 8 discrete gaze control options:

Actions 1-6: Gaze at Person 1-6
Action 7: Gaze at Object
Action 8: Gaze at Environment
Random Baseline: 12.5% (1/8 chance)

Hyperparameters & Training Configuration

Learning Rate (α): 0.0016
Discount Factor (γ): 0.988
Decay Rate (δ): 0.995
Min Epsilon (ε_min): 0.01
Step Threshold (τ): 10 steps
Training Episodes: 10,000 per run
Test States: 51 x 20 iterations
Planning Horizon: ~83 steps
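
As a rough guide to what the decay settings imply, the number of episodes that plain exponential decay needs to take ε from 1.0 down to ε_min can be worked out directly. The calculation ignores dishabituation resets and spontaneous recovery, so under habituation it is only a lower bound:

```latex
\varepsilon_n = \delta^{\,n} \le \varepsilon_{\min}
\;\Longrightarrow\;
n \ge \frac{\ln \varepsilon_{\min}}{\ln \delta}
  = \frac{\ln 0.01}{\ln 0.995}
  \approx 918.7
```

On these values, ε would reach its floor after roughly 919 of the 10,000 training episodes.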

Simulation Environment

Training was conducted in NVIDIA IsaacSim, constructing environments featuring humans and a social robot receptionist with scenarios including person entry, interaction activities (hand-waving, speech), and multiparty conversations. Each 10,000-episode training run completes in approximately 45 minutes on a standard workstation, with a memory footprint below 2 GB. Real-time inference operates at 30 Hz on the ARI robot's onboard computer.

Six Evaluation Metrics

Success Rate

Proportion of test episodes where the agent successfully directs gaze to the appropriate target.

Avg Steps to Goal

Response efficiency — directly relevant to real-time HRI where delays disrupt interaction flow.

Softmax Confidence

Probability mass assigned to the selected action, reflecting decisiveness of the learned policy.

Q-Margin

Gap between best and second-best Q-values — confidence metric independent of the softmax function.

Transfer Score

Composite: success (40%) + efficiency (30%) + confidence (20%) + normalised reward (10%).

Reset Frequency

How often habituation triggers dishabituation — reveals compatibility between mechanism and algorithm.
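
Three of these metrics can be sketched in a few lines. This is an illustration under stated assumptions: the softmax uses temperature 1, the "selected action" is taken to be the greedy one, and the efficiency and reward inputs to the transfer score are assumed to be already normalised to [0, 1], since the source does not give the normalisation. All function names are hypothetical.

```python
import math


def softmax(q_values):
    """Numerically stable softmax over a list of Q-values."""
    m = max(q_values)
    exps = [math.exp(q - m) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_confidence(q_values):
    """Probability mass on the greedy (highest-Q) action."""
    return max(softmax(q_values))


def q_margin(q_values):
    """Gap between the best and second-best Q-values."""
    top, second = sorted(q_values, reverse=True)[:2]
    return top - second


def transfer_score(success, efficiency, confidence, norm_reward):
    """Weighted composite: 40% success, 30% efficiency, 20% confidence, 10% reward."""
    return 0.4 * success + 0.3 * efficiency + 0.2 * confidence + 0.1 * norm_reward


flat = [0.0] * 8            # indistinguishable actions
peaked = [5.0] + [0.0] * 7  # one clearly preferred action
flat_conf = softmax_confidence(flat)
peaked_conf = softmax_confidence(peaked)
```

A flat Q-vector gives the uniform confidence 1/8 and a zero margin, while a single dominant Q-value pushes confidence towards 1, which is the qualitative contrast between near-uniform and decisive policies described for the confidence metrics.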

Results & Experimental Validation

Performance Highlights:

  • DQL-HAB achieves a 100% success rate across 9,180 test episodes with the highest transfer score (0.963 +/- 0.010)
  • 95.1% real-world accuracy (95% CI: [92.7%, 96.7%]) across 448 trials with 3 experimenters on the ARI humanoid robot
  • Critical finding: Habituation enhances DQL but causes systematic degradation in MOL (97.8% success, 164x more resets) — bio-inspired mechanisms are architecture-dependent
  • Well-calibrated confidence: Spearman correlation between confidence and correctness of 0.42 (p < 0.001) enables principled online error detection

Ablation Study: Simulation Results (54 Experiments)

The experimental design follows a 3 x 2 factorial structure: three RL methods (DQL, VQL, MOL), two exploration modes (standard epsilon decay EPS vs. habituation HAB), and 9 independent runs per configuration. Each run comprises 10,000 training episodes followed by evaluation on 51 test states with 20 iterations each, yielding over 55,000 test episodes across all conditions.
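
The experiment and episode counts follow directly from the design as described:

```latex
\underbrace{3}_{\text{methods}} \times \underbrace{2}_{\text{modes}} \times \underbrace{9}_{\text{runs}} = 54 \text{ experiments},
\qquad
54 \times \underbrace{51 \times 20}_{1{,}020 \text{ test episodes per run}} = 55{,}080 \text{ test episodes}
```

The same arithmetic gives 9 × 1,020 = 9,180 test episodes for any single configuration.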

Table II: Complete Ablation Results (mean +/- SD across 9 runs)

| Method | Mode | Success Rate | Avg Steps | Confidence | Q-Margin | Transfer Score |
|---|---|---|---|---|---|---|
| DQL | HAB | 1.000 +/- 0.000 | 2.08 +/- 0.56 | 0.878 +/- 0.043 | 5.00 +/- 1.70 | 0.963 +/- 0.010 |
| DQL | EPS | 1.000 +/- 0.000 | 2.31 +/- 0.73 | 0.848 +/- 0.096 | 4.89 +/- 2.23 | 0.956 +/- 0.021 |
| VQL | EPS | 1.000 +/- 0.000 | 3.70 +/- 0.23 | 0.227 +/- 0.056 | 0.29 +/- 0.65 | 0.748 +/- 0.018 |
| VQL | HAB | 1.000 +/- 0.000 | 3.72 +/- 0.10 | 0.224 +/- 0.025 | 0.25 +/- 0.31 | 0.747 +/- 0.008 |
| MOL | EPS | 0.999 +/- 0.001 | 7.84 +/- 1.52 | 0.200 +/- 0.000 | 0.001 +/- 0.001 | 0.717 +/- 0.005 |
| MOL | HAB | 0.978 +/- 0.013 | 16.14 +/- 5.02 | 0.200 +/- 0.000 | 0.000 +/- 0.000 | 0.684 +/- 0.020 |

DQL-HAB is best (or tied for best) on every metric; MOL-HAB is worst (or tied for worst) on every metric.

Key Findings

  • DQL dominance: Both DQL configurations achieved perfect 100% success rates. DQL-HAB shows the best overall transfer score (0.963), the highest confidence (0.878), and less run-to-run variation in average steps than DQL-EPS (SD 0.56 vs. 0.73). DQL assigns 85-88% probability to its chosen action, producing confident, purposeful gaze shifts within roughly 70-80 ms (2.08-2.31 steps at about 33 ms per step), well within the 200-300 ms window of natural human gaze shifts
  • VQL robustness: Also achieved 100% success rate; habituation has essentially zero effect on VQL performance (negligible Cohen's d values < 0.2), consistent with its tabular structure that doesn't benefit from adaptive exploration bursts
  • MOL degradation: Habituation causes catastrophic interference in MOL — success drops from 99.9% to 97.8% (paired t-test: t = 4.89, p = 0.001, Cohen's d = 2.28), with average steps more than doubling from 7.84 to 16.14. At 30 Hz, MOL-HAB's 16.14 steps translate to >500 ms — perceptibly unnatural in social interaction
  • Root cause: MOL averages 7,368 habituation resets per run (164x more than DQL's 44.9), because the fixed step threshold (τ = 10) misinterprets the legitimately longer episodes required for multi-objective optimisation as stuck states. In 73.7% of MOL-HAB training episodes, habituation is triggered — preventing stable exploitation
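
The latency figures quoted in these findings come from converting step counts to time at the 30 Hz inference rate. A small sketch of that conversion, assuming one decision step per 30 Hz tick as the text does (the function name is hypothetical):

```python
CONTROL_RATE_HZ = 30.0  # inference rate reported for ARI


def steps_to_ms(steps, rate_hz=CONTROL_RATE_HZ):
    """Convert a step count to milliseconds at a fixed control rate."""
    return steps * 1000.0 / rate_hz


# Mean steps-to-goal from the ablation table.
mean_steps = {"DQL-HAB": 2.08, "DQL-EPS": 2.31, "MOL-EPS": 7.84, "MOL-HAB": 16.14}
latency_ms = {name: steps_to_ms(s) for name, s in mean_steps.items()}

for name, ms in latency_ms.items():
    print(f"{name}: {ms:.0f} ms")
```

On these numbers the DQL configurations respond in about 69-77 ms, MOL-EPS in about 261 ms, and MOL-HAB in about 538 ms.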

Habituation Reset Statistics

Reset frequency reveals fundamental compatibility between the habituation mechanism and the learning architecture:

| Method | Resets / Run | Reset Ratio (vs DQL) | Interpretation |
|---|---|---|---|
| DQL | 44.9 +/- 9.8 | 1.0x | Efficient |
| VQL | 93.3 +/- 7.3 | 2.1x | Acceptable |
| MOL | 7,368.2 +/- 28.6 | 164x | Pathological |

DQL's reset pattern represents efficient learning; MOL's 164x ratio represents a fundamental mismatch between mechanism and algorithm.
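
The ratios in the table, and the share of MOL-HAB training episodes that trigger habituation, can be checked from the mean reset counts. The last line assumes at most one reset is counted per episode, which is consistent with the reported 73.7% but is not stated explicitly in the source:

```latex
\frac{93.3}{44.9} \approx 2.1,
\qquad
\frac{7368.2}{44.9} \approx 164,
\qquad
\frac{7368.2 \text{ resets}}{10{,}000 \text{ episodes}} \approx 73.7\%
```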

Statistical Analysis (HAB vs. EPS within each method)

| Method | Metric | t-statistic | p-value | Cohen's d | Effect Size |
|---|---|---|---|---|---|
| DQL | Avg Steps | 0.59 | 0.573 | 0.35 | Small |
| DQL | Confidence | -0.74 | 0.482 | 0.40 | Small |
| DQL | Transfer Score | -0.62 | 0.551 | 0.37 | Small |
| VQL | Avg Steps | -0.21 | 0.836 | 0.10 | Negligible |
| VQL | Confidence | 0.15 | 0.882 | 0.07 | Negligible |
| VQL | Transfer Score | 0.21 | 0.840 | 0.09 | Negligible |
| MOL | Success Rate | 4.89 | 0.001** | 2.28 | Large |
| MOL | Avg Steps | -5.40 | <0.001** | 2.24 | Large |
| MOL | Transfer Score | 3.56 | 0.007** | 1.67 | Large |

**p < 0.05 indicates statistical significance. Only MOL shows significant differences, all with large effect sizes (d >= 0.8) confirming systematic degradation.
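
The source does not state which Cohen's d formula was used. As a sketch, an equal-n pooled-SD definition applied to the rounded means and SDs from the ablation table reproduces the reported d for the MOL success-rate and average-steps rows; treat the formula choice as an assumption rather than the authors' method. The effect-size labels follow Cohen's conventional cut-offs (0.2, 0.5, 0.8), and the function names are hypothetical.

```python
import math


def cohens_d_pooled(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d with an equal-n pooled SD: |diff| / sqrt((sd_a^2 + sd_b^2) / 2)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd


def effect_label(d):
    """Conventional labels for the magnitude of Cohen's d."""
    if d < 0.2:
        return "Negligible"
    if d < 0.5:
        return "Small"
    if d < 0.8:
        return "Medium"
    return "Large"


# MOL, EPS vs. HAB, using mean +/- SD from the ablation table.
d_success = cohens_d_pooled(0.999, 0.001, 0.978, 0.013)
d_steps = cohens_d_pooled(7.84, 1.52, 16.14, 5.02)
print(f"success: d = {d_success:.2f} ({effect_label(d_success)})")
print(f"steps:   d = {d_steps:.2f} ({effect_label(d_steps)})")
```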

Cross-Method ANOVA (DQL vs. VQL vs. MOL)

All metrics show significant method effects (p < 0.05 for success rate under EPS, p < 0.001 everywhere else), confirming that architecture choice strongly affects attention control performance:

| Metric | EPS F-stat | EPS p-value | HAB F-stat | HAB p-value |
|---|---|---|---|---|
| Success Rate | 4.00 | 0.032* | 25.07 | <0.001*** |
| Avg Steps | 77.23 | <0.001*** | 62.61 | <0.001*** |
| Confidence | 292.33 | <0.001*** | 1580.88 | <0.001*** |
| Q-Margin | 34.12 | <0.001*** | 98.47 | <0.001*** |
| Transfer Score | 189.56 | <0.001*** | 277.34 | <0.001*** |

*p < 0.05, ***p < 0.001. The dramatically higher F-statistic for confidence under HAB (1580.88 vs 292.33) suggests habituation further amplifies architectural differences between neural and tabular value functions.

Transfer Score Rankings (with 95% Confidence Intervals)

#1 DQL-HAB: 0.963
#2 DQL-EPS: 0.956
#3 VQL-EPS: 0.748
#4 VQL-HAB: 0.747
#5 MOL-EPS: 0.717
#6 MOL-HAB: 0.684

Non-overlapping confidence intervals between the DQL configurations and all other methods indicate, at the 95% level, that both DQL configurations outperform every non-DQL configuration on transfer score in these experiments.


Real-World Deployment (448 Trials)

The trained DQL-HAB model was deployed on the ARI humanoid robot at the Robot House, University of Hertfordshire, transitioning from the controlled precision of simulation to the messy complexity of physical embodiment. The experiment involved 3 experimenters, all affiliated with the university, engaged in structured social interactions through 448 distinct trials across four state categories.

Accuracy by Scenario Category

| Category | Trials | Accuracy | 95% CI |
|---|---|---|---|
| Two experimenters | 157 (35%) | 98.1% | [94.5%, 99.3%] |
| Three experimenters | 112 (25%) | 96.4% | [91.2%, 98.6%] |
| Single experimenter | 112 (25%) | 93.8% | [87.7%, 96.9%] |
| Low saliency | 67 (15%) | 88.1% | [78.2%, 93.8%] |
| Overall | 448 | 95.1% | [92.7%, 96.7%] |

Multi-experimenter scenarios outperform single-experimenter because clearly differentiated engagement scores enable decisive DQL predictions (mean softmax: 0.782 for two-experimenter states).
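
The source does not name the interval method, but the overall figure is reproduced by a Wilson score interval. The sketch below assumes that method and assumes 426 correct trials, a count inferred from 95.1% of 448 rather than reported directly; the function name is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


correct, trials = 426, 448  # 426 is inferred from the reported 95.1%
low, high = wilson_interval(correct, trials)
print(f"{correct / trials:.1%} [{low:.1%}, {high:.1%}]")
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and is asymmetric near the boundary, which matches the asymmetric intervals in the table.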

Per-Class Performance Metrics

| Action Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Gaze_At_Experimenter_2 | 0.667 | 0.952 | 0.784 | 105 |
| Gaze_At_Experimenter_1 | 0.724 | 0.728 | 0.726 | 169 |
| Gaze_At_Experimenter_3 | 0.609 | 0.654 | 0.631 | 107 |
| Gaze_At_Environment | 0.400 | 0.030 | 0.056 | 67 |
| Gaze_At_Object | 0.000 | 0.000 | 0.000 | 0 |
| Macro Average | 0.600 | 0.591 | 0.549 | 448 |

F1-scores of 0.63-0.78 for human-directed gaze reflect strong social attention capability. Low environmental gaze F1 (0.056) stems from the agent's bias toward human targets.
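
The macro-average row is reproduced if the zero-support Gaze_At_Object class is left out of the average. That exclusion is an observation from the numbers, not something the source states. A short check using the rounded per-class values:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) per action class, from the table above.
per_class = {
    "Gaze_At_Experimenter_2": (0.667, 0.952, 105),
    "Gaze_At_Experimenter_1": (0.724, 0.728, 169),
    "Gaze_At_Experimenter_3": (0.609, 0.654, 107),
    "Gaze_At_Environment": (0.400, 0.030, 67),
    "Gaze_At_Object": (0.000, 0.000, 0),
}

# Assumption: only classes that actually occur in the 448 trials are averaged.
observed = [(p, r) for p, r, support in per_class.values() if support > 0]
macro_precision = sum(p for p, _ in observed) / len(observed)
macro_recall = sum(r for _, r in observed) / len(observed)
macro_f1 = sum(f1(p, r) for p, r in observed) / len(observed)
print(f"{macro_precision:.3f} {macro_recall:.3f} {macro_f1:.3f}")
```

Averaging over all five classes instead would lower each macro figure by a fifth, so the choice of which classes to include matters when comparing against other systems.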

Confidence Calibration & Error Detection

Correct Predictions: confidence 0.752 +/- 0.221
Incorrect Predictions: confidence 0.562 +/- 0.318
Separation Test: t = 3.21, p = 0.002
Spearman Correlation: ρ = 0.42, p < 0.001
Q-Margin (Correct): 2.906
Q-Margin (Incorrect): 0.867 (3.35x gap)

The moderate but statistically significant correlation between confidence and correctness (ρ = 0.42) suggests that the uncertainty estimates are informative enough to support adaptive behaviours: the robot could request clarification when confidence drops below 0.60, or flag predictions with Q-margins below 1.5 for human oversight.
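
The two thresholds suggested above could be combined into a simple gating rule. This is a hypothetical sketch of that suggestion, not a component described in the source; the function name, the returned labels, and the order in which the checks are applied are all assumptions.

```python
CONFIDENCE_THRESHOLD = 0.60  # softmax confidence below this: ask for clarification
Q_MARGIN_THRESHOLD = 1.5     # Q-margin below this: flag for human oversight


def oversight_action(confidence, q_margin):
    """Decide how to handle one gaze prediction given its confidence signals."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "request_clarification"
    if q_margin < Q_MARGIN_THRESHOLD:
        return "flag_for_review"
    return "execute"


# Mean values reported for correct and incorrect predictions.
typical_correct = oversight_action(confidence=0.752, q_margin=2.906)
typical_incorrect = oversight_action(confidence=0.562, q_margin=0.867)
```

At the reported means, a typical correct prediction passes both checks and a typical incorrect one is caught by the confidence check. The standard deviations (0.221 and 0.318) indicate considerable overlap between the two groups, so any fixed threshold would still let some errors through and flag some correct predictions.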

Observed Real-World Behaviours

The experimental protocol progressed through increasingly complex scenarios. Across this progression, RLBAM demonstrated several key capabilities:

  • Immediate gaze redirection: Upon person entry, the robot redirected gaze within an average of 2.3 steps, consistent with simulation performance
  • Appropriate disengagement: When humans adopted passive stances (attending to objects rather than the robot), the agent appropriately shifted attention
  • Seamless departure recovery: Following person departure, the agent redirected to the next most salient stimulus rather than fixating on empty space
  • Dynamic priority adjustment: In multiparty scenarios, smooth tracking of gesturing persons and rapid attention switching between simultaneously present individuals confirmed sim-to-real generalisation without catch-up saccades or fixation loss

Sim-to-Real Gap Decomposition

The 4.9 percentage point performance gap (100% simulation to 95.1% real-world) can be decomposed into identifiable sources:

Sensor noise (Kinect): 2.5 pp
Distribution shift: 1.5 pp
Policy limitations: 0.9 pp

Critically, performance does not degrade with scenario complexity: three-experimenter states (96.4%) exceed single-experimenter states (93.8%), indicating robust policy generalisation rather than memorisation.

Figure 3: Training dynamics and behavioural analysis showing habituation-guided exploration, reward progression, and convergence patterns across 10,000 episodes per run (54 independent experiments).

Impact & Future Directions

This work provides the first systematic evidence that bio-inspired habituation mechanisms are architecture-dependent rather than universally beneficial — a finding with broad implications for the intersection of cognitive science and reinforcement learning. By rigorously demonstrating that the same biological principle can enhance, leave unchanged, or degrade performance depending on the underlying learning architecture, this research challenges the common assumption that bio-inspired mechanisms are generically advantageous.

Broader Implications

  • Social Robotics: Delivers a deployable, real-time (30 Hz) gaze control system achieving 95.1% accuracy in unstructured multiparty environments, advancing natural human-robot interaction in care, education, and entertainment domains
  • Cognitive Science: Provides computational validation of biological habituation phenomena (stimulus specificity, spontaneous recovery, dishabituation) while revealing that not all learning systems benefit equally — mirroring observations in biological neural circuits
  • Reinforcement Learning: Offers a principled methodology for integrating bio-inspired exploration mechanisms with RL, including diagnostic tools (reset frequency analysis, step-count distributions) to predict compatibility before deployment
  • Sim-to-Real Transfer: Demonstrates that policies trained with bio-inspired exploration transfer robustly (an accuracy drop of only 4.9 percentage points), with a decomposition framework attributing the gap to sensor noise (2.5 pp), distribution shift (1.5 pp), and policy limitations (0.9 pp)

Comparison with Prior Work

RLBAM advances over the current landscape of learning-based gaze systems:

  • vs. RASA (76.9% dyadic accuracy): RLBAM achieves 95.1% in more challenging multiparty scenarios, demonstrating the advantage of RL over rule-assisted approaches
  • vs. Multi-party systems (97% effectiveness): The headline figure for these systems masks critical limitations in scalability, temporal modelling, and adaptability; RLBAM achieves comparable accuracy with genuine online adaptation capability
  • vs. Rule-Based Controller (94.2% success): RLBAM's learned policy (95.1%) exceeds the deterministic baseline whilst handling ambiguous scenarios requiring temporal reasoning and adaptation to novel situations
  • vs. Policy gradient / Actor-critic methods: These require large amounts of training data with limited transferability across social contexts; RLBAM trains in ~45 minutes and transfers robustly from simulation
  • vs. Transformer-based attention prediction: Whilst achieving high accuracy on fixation patterns, these supervised approaches lack the adaptive online learning capability essential for personalised HRI

Limitations

  • Scalability: The current system supports up to 6 simultaneous persons, constrained by Kinect sensor detection capacity. DQL's neural network handles state space growth more gracefully than tabular methods, but scenarios exceeding 6 participants would require architectural extensions such as hierarchical attention or graph-based state representations
  • Social context: The reward function encodes Western-centric gaze norms where direct eye contact and proximity signal engagement. However, gaze norms vary significantly across cultures — in some East Asian cultures, sustained direct gaze may be perceived as confrontational; in certain Middle Eastern cultures, gender-differentiated gaze patterns are socially expected. Adapting to diverse cultural contexts would require culture-specific reward function parameterisation
  • Generalisation beyond triadic interaction: Whilst real-world validation involved up to 3 experimenters, the architecture accommodates up to 6 persons. Validation with larger groups, unstructured environments, and naturalistic (non-scripted) interactions remains essential future work
  • Fixed habituation threshold: The fixed τ = 10 steps works well for single-objective learning but becomes pathological for multi-objective methods. Raising the threshold to 30-50 steps, or making it adaptive, might salvage the approach for MOL

Future Research Directions

  • Adaptive threshold selection: Adjusting τ based on task complexity, with thresholds of 30-50 steps being more appropriate for multi-objective scenarios
  • Continual policy adaptation: Through lifelong learning or meta-reinforcement learning, enabling the system to personalise attention strategies for individual users over extended deployments
  • Semantic scene parsing integration: Adding affective state recognition to enrich the state representation beyond activity and proximity features
  • Cross-cultural reward functions: Potentially learned through observation of culturally situated interactions
  • Large-scale longitudinal studies: Open-world environments to assess both generalisability and social acceptability of the robot's gaze behaviours across diverse demographic groups

Paper Access & Resources


IEEE Transactions on Cognitive and Developmental Systems (TCDS)