Speech-aided facial video super resolution with accurate lip motion and enhanced frequency details

Abstract

Despite recent breakthroughs in face hallucination, video face hallucination remains a challenging task due to the difficulty of maintaining consistency across video frames. The temporal dimension of video makes it hard to learn facial motion and preserve color uniformity throughout the sequence. To address these challenges, we propose a novel video face hallucination network based on audio-visual cross-modality support. The framework excels at learning fine spatiotemporal motion patterns by leveraging the correlation between the movement of facial structures and the associated speech signal. Another significant challenge common to face hallucination is blurriness around key facial regions such as the mouth and lips. These regions exhibit high spatial displacement, which makes their recovery from low-resolution images particularly difficult. The proposed approach explicitly defines a lip reading loss to learn the fine-grained motion in these facial regions. Furthermore, during training, GANs are prone to overfitting to narrow frequency bands, which results in missing hard-to-synthesize frequencies. As a remedy, we introduce a frequency-based loss function that compels the model to capture salient frequency features. Visual and quantitative comparisons with state-of-the-art methods demonstrate significant improvements in visual quality as well as higher coherence across successive generated frames.
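
The abstract does not give the exact formulations of the lip reading and frequency losses. As a rough illustration only, the sketch below shows one common way such losses are built in PyTorch; the lip_reader network, the tensor shapes, and the choice of L1 distances are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only (PyTorch assumed); not the paper's exact losses.
import torch
import torch.nn.functional as F

def frequency_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Penalize per-frequency error between generated (sr) and ground-truth
    (hr) frames of shape (batch, channels, height, width), so that
    hard-to-synthesize frequency bands contribute to the objective."""
    sr_spec = torch.fft.fft2(sr, norm="ortho")  # complex 2-D spectrum
    hr_spec = torch.fft.fft2(hr, norm="ortho")
    # |a - b| on complex tensors penalizes both magnitude and phase errors.
    return (sr_spec - hr_spec).abs().mean()

def lip_reading_loss(sr_mouth: torch.Tensor, hr_mouth: torch.Tensor,
                     lip_reader: torch.nn.Module) -> torch.Tensor:
    """Feature-matching loss on mouth-region crops, assuming a frozen,
    pretrained lip-reading network `lip_reader` (hypothetical here)."""
    with torch.no_grad():
        target_feats = lip_reader(hr_mouth)  # reference lip-reader features
    return F.l1_loss(lip_reader(sr_mouth), target_feats)
```

Under these assumptions, such terms would typically be weighted and added to the usual adversarial and reconstruction objectives, e.g. total = l_adv + l_rec + lambda_freq * frequency_loss(sr, hr) + lambda_lip * lip_reading_loss(sr_mouth, hr_mouth, lip_reader).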

Publication Date

2025-01-01

Publication Title

Machine Vision and Applications

Volume

36

Issue

4

ISSN

0932-8092

Embargo Period

2026-05-09

Keywords

Cross-modality, Face hallucination, Fourier transform, Generative adversarial networks, Speech recognition

