Tiny Inference-Time Scaling with Latent Verifiers

Bucciarelli, Davide; Turri, Evelyn; Baraldi, Lorenzo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita

Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.

Tiny Inference-Time Scaling with Latent Verifiers / Bucciarelli, D., Turri, E., Baraldi, L., Cornia, M., Baraldi, L., Cucchiara, R.. - (2026). (IEEE/CVF Conference on Computer Vision and Pattern Recognition Denver (CO), United States June 3-7, 2026).

Tiny Inference-Time Scaling with Latent Verifiers

Davide Bucciarelli;Evelyn Turri;Lorenzo Baraldi;Marcella Cornia;Lorenzo Baraldi;Rita Cucchiara

2026

Abstract

Inference-time scaling has emerged as an effective way to improve generative models at test time by using a verifier to score and select candidate outputs. A common choice is to employ Multimodal Large Language Models (MLLMs) as verifiers, which can improve performance but introduce substantial inference-time cost. Indeed, diffusion pipelines operate in an autoencoder latent space to reduce computation, yet MLLM verifiers still require decoding candidates to pixel space and re-encoding them into the visual embedding space, leading to redundant and costly operations. In this work, we propose Verifier on Hidden States (VHS), a verifier that operates directly on intermediate hidden representations of Diffusion Transformer (DiT) single-step generators. VHS analyzes generator features without decoding to pixel space, thereby reducing the per-candidate verification cost while improving or matching the performance of MLLM-based competitors. We show that, under tiny inference budgets with only a small number of candidates per prompt, VHS enables more efficient inference-time scaling reducing joint generation-and-verification time by 63.3%, compute FLOPs by 51% and VRAM usage by 14.5% with respect to a standard MLLM verifier, achieving a +2.7% improvement on GenEval at the same inference-time budget.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2026
			
	Titolo del Convegno
	
				IEEE/CVF Conference on Computer Vision and Pattern Recognition
			
	Luogo del Convegno
	
				Denver (CO), United States
			
	Data del Convegno
	
				June 3-7, 2026
			
	Tutti gli autori
	
						Bucciarelli, Davide; Turri, Evelyn; Baraldi, Lorenzo; Cornia, Marcella; Baraldi, Lorenzo; Cucchiara, Rita
					
	Citazione
	
				Tiny Inference-Time Scaling with Latent Verifiers / Bucciarelli, D., Turri, E., Baraldi, L., Cornia, M., Baraldi, L., Cucchiara, R.. - (2026). (IEEE/CVF Conference on Computer Vision and Pattern Recognition Denver (CO), United States June 3-7, 2026).
			
	Tipologia
	
				Relazione in Atti di Convegno

File in questo prodotto:

File	Dimensione	Formato
2026_CVPR_VHS.pdf Open access Tipologia: AAM - Versione dell'autore revisionata e accettata per la pubblicazione Dimensione 7.28 MB Formato Adobe PDF Visualizza/Apri	7.28 MB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I metadati presenti in IRIS UNIMORE sono rilasciati con licenza Creative Commons CC0 1.0 Universal, mentre i file delle pubblicazioni sono rilasciati con licenza Attribuzione 4.0 Internazionale (CC BY 4.0), salvo diversa indicazione.
In caso di violazione di copyright, contattare Supporto Iris