GDPO-Listener: Expressive Interactive Head Generation via Auto-Regressive Flow Matching and Group reward-Decoupled Policy Optimization

1University of Southern California, Institute for Creative Technologies

Abstract

Generating realistic 3D head motion for dyadic interactions is a significant challenge in virtual human synthesis. While recent methods achieve impressive results with speaking heads, they frequently suffer from the "regression-to-the-mean" problem in listener motions, collapsing into static faces, and lack the parameter space needed for complex nonverbal motions. In this paper, we propose GDPO-Listener, a novel framework for highly expressive speaking and listening motion generation. First, we introduce an Auto-Regressive Flow Matching architecture that enables stable supervised learning. Second, to overcome kinematic stillness, we apply Group reward-Decoupled Policy Optimization (GDPO). By isolating reward normalization across distinct FLAME parameter groups, GDPO explicitly incentivizes high-variance, expressive generations. Finally, we enable explicit semantic text control for customizable responses. Extensive evaluations on the Seamless Interaction and DualTalk datasets demonstrate superior performance over existing baselines in long-term kinematic variance, visual expressivity, and semantic controllability.
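Conceptually, GDPO replaces a single pooled reward baseline with per-group normalization, so the advantage signal for, say, head pose is not diluted by rewards computed on expression coefficients. Below is a minimal PyTorch sketch of this idea, assuming scalar per-group rewards over a group of sampled rollouts; the group names and function are illustrative, not the paper's implementation:

```python
import torch

# Hypothetical FLAME parameter groups; the actual grouping is an assumption.
FLAME_GROUPS = ["expression", "jaw_pose", "head_pose", "eyelids"]

def gdpo_advantages(rewards: dict[str, torch.Tensor], eps: float = 1e-6) -> dict[str, torch.Tensor]:
    """Normalize each group's rewards over the rollout group independently,
    so no single parameter group dominates the shared advantage signal."""
    advantages = {}
    for name in FLAME_GROUPS:
        r = rewards[name]  # shape: (num_rollouts,)
        advantages[name] = (r - r.mean()) / (r.std() + eps)
    return advantages
```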

Overview of GDPO-Listener

Our framework generates expressive speaking and listening head reactions from multimodal dyadic conversational inputs. Built on an expanded FLAME parameter space, it naturally supports eye blinking and head nodding. It further enables explicit semantic text control to ensure contextually appropriate responses, and maintains stable dynamics during long-sequence inference.
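For concreteness, one illustrative layout of such an expanded per-frame parameterization is sketched below; the dimensions and the dedicated eyelid channels are assumptions, not the paper's exact specification:

```python
from dataclasses import dataclass
import torch

@dataclass
class FlameFrame:
    expression: torch.Tensor  # e.g. (100,) expression blendshape coefficients
    jaw_pose:   torch.Tensor  # (3,) axis-angle jaw rotation for lip articulation
    head_pose:  torch.Tensor  # (3,) axis-angle head rotation, enabling nodding
    eyelids:    torch.Tensor  # (2,) per-eye closure values, enabling blinks
```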

GDPO-Listener Architecture

Our framework is trained in two stages. (a) Supervised Learning. Multimodal inputs are encoded as prefix conditions, and an Auto-Regressive Flow Matching model iteratively predicts actor motion latents from noise and history via ODE sampling. (b) Reinforcement Learning. We then post-train the policy model with GDPO, computing fine-grained, decoupled rewards for distinct FLAME parameter groups under SDE sampling to explicitly optimize expressiveness.
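At inference, each auto-regressive step integrates the learned velocity field from noise to the next motion-latent chunk. A minimal sketch of this ODE rollout, assuming (batch, time, latent) tensors; the model signature is illustrative, not the authors' code:

```python
import torch

@torch.no_grad()
def sample_next_chunk(model, history, cond, chunk_len: int = 8, num_steps: int = 10):
    """One auto-regressive step: Euler-integrate the learned velocity field
    from Gaussian noise to a motion-latent chunk, conditioned on the
    multimodal prefix `cond` and previously generated `history`."""
    B, _, D = history.shape
    x = torch.randn(B, chunk_len, D, device=history.device)  # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((B,), i * dt, device=x.device)        # flow time in [0, 1)
        v = model(x, t, history=history, cond=cond)          # predicted velocity
        x = x + dt * v                                       # Euler ODE step
    return x  # append to history before generating the next chunk
```

During GDPO post-training, sampling instead follows an SDE, which injects noise at each step so that rollouts are stochastic and the decoupled rewards can discriminate between them.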

Visualization Comparisons

Qualitative comparisons show that competing methods produce low-expressivity speaking motions and static listening behavior, whereas our method achieves better lip sync and highly expressive reactions.

Advanced Generation Capabilities

(Top) Semantic text explicitly controls emotional states. (Middle) Our method sustains dynamic reactions over long sequences, avoiding the static decay seen in baselines. (Bottom) Classifier-free guidance (CFG) scaling smoothly modulates expressiveness intensity without retraining.
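The CFG control amounts to extrapolating between conditional and unconditional predictions at inference time, so a single scale factor acts as an expressiveness knob. A minimal sketch, with an illustrative model signature:

```python
def guided_velocity(model, x, t, cond, scale: float):
    """Classifier-free guidance: extrapolate from the unconditional toward
    the conditional velocity; a larger `scale` strengthens the condition's
    effect without any retraining."""
    v_uncond = model(x, t, cond=None)  # unconditional prediction
    v_cond = model(x, t, cond=cond)    # text/emotion-conditioned prediction
    return v_uncond + scale * (v_cond - v_uncond)
```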

Benchmark Results