In this work, we explore the interplay between text and visual attention mechanisms in a robot reinforcement learning setting, where robotic tasks are conveyed through natural language instructions. Specifically, we propose a novel approach aimed at enhancing robot task learning and execution by leveraging an integrated multimodal attention model that associates task-relevant environmental features with related words in the natural language mission text. We illustrate the overall framework architecture along with the learning process, emphasizing the interaction between textual and visual feature-based attention mechanisms. The method is trained in MiniGrid environments using the Proximal Policy Optimization algorithm, and its performance is evaluated by comparing the proposed architecture with a baseline that lacks attentional mechanisms. Experimental results demonstrate the efficacy of the approach and highlight its potential for behavior transparency.

Combined Text-Visual Attention Models for Robot Task Learning and Execution / Rauso, Giuseppe; Caccavale, Riccardo; Finzi, Alberto. - LNAI 15450 (2025), pp. 228-240. (23rd International Conference of the Italian Association for Artificial Intelligence, AIxIA 2024, Italy, 2024) [10.1007/978-3-031-80607-0_18].

Combined Text-Visual Attention Models for Robot Task Learning and Execution

Rauso, Giuseppe; Caccavale, Riccardo; Finzi, Alberto
2025

ISBN 9783031806063
ISBN 9783031806070

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/996851