Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation / Favali, F., Schmuck, V., Villani, V., Celiktutan, O. (Springer Proceedings in Advanced Robotics). - In: Springer Proceedings in Advanced Robotics. - Gewerbestrasse 11, Cham, CH-6330, Switzerland : Springer International Publishing AG, 2025. - ISBN 9783031816871. - pp. 30-44 [10.1007/978-3-031-81688-8_3]
Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation
Favali, Filippo; Villani, Valeria
2025
Abstract
A wide array of new real-world applications using social robots and virtual agents are driving humans towards closer connections with these artificial systems. Consequently, nonverbal human-robot interactions have become a major research focus, aiming for more versatile and natural exchanges and communication. In this work, we utilize a diffusion model to generate fine-grained, highly natural motions, coupled with a latent gesture representation obtained via a Vector Quantized Variational Auto-Encoder (VQVAE) architecture. This approach addresses the well-known limitations of training and inference time. As a result, we achieved up to a 5-fold increase in generation speed. In addition, we conducted a subjective evaluation which demonstrated that, despite using discrete gesture representations, the quality of the generated nonverbal behavior has been preserved.
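
The abstract does not describe the architecture in detail, so the following is only a generic illustration of what a discrete (vector-quantized) latent representation means, not the authors' implementation. It shows the nearest-codebook lookup that defines vector quantization: each continuous latent vector is replaced by the index of its closest codebook entry. The codebook values, the latent vectors, and the function names `quantize` and `dequantize` are all hypothetical, chosen for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(indices: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete index."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy data: a 3-entry codebook and 4 two-dimensional latents.
# In a trained VQ-VAE the codebook is learned; here it is fixed by hand.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.4], [0.6, 0.7]]

indices = quantize(latents, codebook)          # discrete representation
reconstructed = dequantize(indices, codebook)  # quantized latents
print(indices)
```

In this toy setting the sequence of latents is reduced to a short list of integer indices. How the paper's diffusion model consumes such a representation is not stated in the abstract and is left out of the sketch.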

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International license (CC BY 4.0), unless otherwise indicated.