In this work, we explore the interplay between text and visual attention mechanisms in a robot reinforcement learning setting, where robotic tasks are conveyed through natural language instructions. Specifically, we propose a novel approach aimed at enhancing robot task learning and execution by leveraging an integrated multimodal attention model that associates task-relevant environmental features with related words in the natural language mission text. We illustrate the overall framework architecture along with the learning process, emphasizing the interaction between textual and visual feature-based attention mechanisms. The method is trained in MiniGrid environments using the Proximal Policy Optimization algorithm, and its performance is evaluated by comparing the proposed architecture with a baseline that lacks attentional mechanisms. Experimental results demonstrate the efficacy of the approach and highlight its potential for behavior transparency.
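As a rough illustration of what associating mission words with task-relevant environmental features can look like, the following is a minimal sketch under strong assumptions, not the architecture described in the paper. The function name `text_visual_attention`, the hand-made three-dimensional embeddings, and the four-cell toy scene are all hypothetical; in a trained agent the embeddings would be learned (for example alongside a PPO policy), and the attention weights would feed the policy network rather than being inspected directly.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_visual_attention(word_vecs, cell_vecs):
    """Hypothetical text-visual attention: returns (word_weights, cell_weights).

    Scaled dot-product similarities are computed between every word and
    every grid cell; each word is then weighted by its best-matching cell,
    and each cell by the word-weighted sum of its similarities.
    """
    scale = math.sqrt(len(word_vecs[0]))
    scores = [[dot(w, c) / scale for c in cell_vecs] for w in word_vecs]

    # Text attention: words that match something in the scene get more weight.
    word_weights = softmax([max(row) for row in scores])

    # Visual attention: cells that match the attended words get more weight.
    cell_scores = [
        sum(word_weights[i] * scores[i][j] for i in range(len(word_vecs)))
        for j in range(len(cell_vecs))
    ]
    cell_weights = softmax(cell_scores)
    return word_weights, cell_weights


# Toy embeddings (assumed, not learned): dimensions are [is_red, is_ball, unused].
words = ["go", "red", "ball"]
word_vecs = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

cells = ["empty", "red ball", "blue ball", "red key"]
cell_vecs = [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

word_weights, cell_weights = text_visual_attention(word_vecs, cell_vecs)
most_attended_cell = cells[cell_weights.index(max(cell_weights))]
print(dict(zip(words, word_weights)))
print(dict(zip(cells, cell_weights)))
print("most attended cell:", most_attended_cell)
```

In this toy scene the cell containing the red ball receives the largest visual weight because it matches both attended words, and the word "go" receives the smallest text weight because it matches no cell. Inspecting such weights is one way the behavior transparency mentioned in the abstract could be examined.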
Combined Text-Visual Attention Models for Robot Task Learning and Execution / Rauso, Giuseppe; Caccavale, Riccardo; Finzi, Alberto. - 15450 LNAI (2025), pp. 228-240. (23rd International Conference of the Italian Association for Artificial Intelligence, AIxIA 2024, Italy, 2024) [10.1007/978-3-031-80607-0_18].
Combined Text-Visual Attention Models for Robot Task Learning and Execution
Rauso, Giuseppe; Caccavale, Riccardo; Finzi, Alberto
2025


