Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation

Favali, Filippo; Schmuck, Viktor; Villani, Valeria; Celiktutan, Oya

doi:10.1007/978-3-031-81688-8_3

A wide array of new real-world applications of social robots and virtual agents is bringing humans into closer contact with these artificial systems. Consequently, nonverbal human-robot interaction has become a major research focus, with the aim of more versatile and natural communication. In this work, we use a diffusion model to generate fine-grained, highly natural motions, coupled with a latent gesture representation obtained via a Vector Quantized Variational Auto-Encoder (VQVAE) architecture. This approach addresses the well-known limitations of diffusion models in training and inference time. As a result, we achieve up to a 5-fold increase in generation speed. In addition, a subjective evaluation shows that, despite the use of discrete gesture representations, the quality of the generated nonverbal behavior is preserved.
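
The discrete gesture representation mentioned in the abstract rests on vector quantization, the step in which a VQ-VAE replaces each continuous latent vector with its nearest entry in a learned codebook. The following is a minimal sketch of that lookup only, not the authors' implementation: the codebook values, the 2-D latent size, and the function names `quantize` and `dequantize` are hypothetical and chosen purely for illustration.

```python
def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    indices = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook vector
        distances = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def dequantize(indices, codebook):
    """Map discrete indices back to their codebook vectors."""
    return [codebook[i] for i in indices]


# Hypothetical toy codebook with three 2-D entries (a trained model learns these)
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical continuous latents, e.g. one per short gesture segment
latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(latents, codebook)
reconstructed = dequantize(tokens, codebook)
print(tokens)
```

In this toy setting each latent collapses to a single integer token, which illustrates why a quantized representation is more compact than the raw motion data it stands for.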

Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation / Favali, F., Schmuck, V., Villani, V., Celiktutan, O. (SPRINGER PROCEEDINGS IN ADVANCED ROBOTICS). - In: Springer Proceedings in Advanced Robotics. - GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG, 2025. - ISBN 9783031816871. - pp. 30-44 [10.1007/978-3-031-81688-8_3]

Publication year: 2025
DOI: https://dx.doi.org/10.1007/978-3-031-81688-8_3
WoS ID: WOS:001456030700003
Series: SPRINGER PROCEEDINGS IN ADVANCED ROBOTICS
Volume title: Springer Proceedings in Advanced Robotics
Volume ISBN: 9783031816871; 9783031816888
Publisher: SPRINGER INTERNATIONAL PUBLISHING AG
Citation: Towards More Expressive Human-Robot Interactions: Combining Latent Representations and Diffusion Models for Co-speech Gesture Generation / Favali, F., Schmuck, V., Villani, V., Celiktutan, O. (SPRINGER PROCEEDINGS IN ADVANCED ROBOTICS). - In: Springer Proceedings in Advanced Robotics. - GEWERBESTRASSE 11, CHAM, CH-6330, SWITZERLAND : SPRINGER INTERNATIONAL PUBLISHING AG, 2025. - ISBN 9783031816871. - pp. 30-44 [10.1007/978-3-031-81688-8_3]
All authors: Favali, Filippo; Schmuck, Viktor; Villani, Valeria; Celiktutan, Oya
Type: Book chapter/Essay

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.