Unbiased Missing-modality Multimodal Learning

International Conference on Computer Vision, ICCV 2025

¹University of Electronic Science and Technology of China

²School of Computer Science, Peking University

³Ubiquitous Intelligence and Trusted Services Key Laboratory of Sichuan Province

Abstract

Recovering missing modalities in multimodal learning has recently been approached using diffusion models to synthesize absent data conditioned on available modalities. However, existing methods often suffer from modality generation bias: while certain modalities are generated with high fidelity, others—such as video—remain challenging due to intrinsic modality gaps, leading to imbalanced training. To address this issue, we propose MD²N (Multi-stage Duplex Diffusion Network), a novel framework for unbiased missing-modality recovery. MD²N introduces a modality transfer module within a duplex diffusion architecture, enabling bidirectional generation between available and missing modalities through three stages: (1) global structure generation, (2) modality transfer, and (3) local cross-modal refinement. By training with duplex diffusion, both available and missing modalities generate each other in an intersecting manner, effectively achieving a balanced generation state. Extensive experiments demonstrate that MD²N significantly outperforms existing state-of-the-art methods, achieving up to a 4% improvement over IMDer on the CMU-MOSEI dataset.
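As a rough illustration of the duplex idea (not the paper's exact formulation), the two generation directions can be read as a pair of conditional denoising objectives optimized jointly, so that neither direction dominates training:

```latex
% Illustrative duplex objective; x^a = available modality, x^m = missing modality,
% x_t = noised sample at timestep t, \epsilon_\theta / \epsilon_\phi = the two denoisers.
\mathcal{L}_{\mathrm{duplex}}
  = \underbrace{\mathbb{E}_{t,\epsilon}\bigl\lVert \epsilon - \epsilon_\theta(x^m_t,\, t \mid x^a) \bigr\rVert^2}_{\text{available}\,\to\,\text{missing}}
  + \underbrace{\mathbb{E}_{t,\epsilon}\bigl\lVert \epsilon - \epsilon_\phi(x^a_t,\, t \mid x^m) \bigr\rVert^2}_{\text{missing}\,\to\,\text{available}}
```

All symbols above are illustrative; the paper's actual losses and conditioning scheme may differ.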

Background

Multimodal learning enables machines to understand complex real-world information by integrating data from multiple modalities such as text, audio, and video. However, real-world applications often suffer from missing modalities due to sensor failures or communication issues—posing serious challenges to model robustness and generalization. Existing solutions attempt to recover missing data using generative models like diffusion networks, but they often suffer from modality generation bias—some modalities (e.g., text) are much easier to reconstruct than others (e.g., video). This imbalance leads to inconsistent performance across different missing-modality scenarios. To address this, our research introduces a novel multi-stage duplex diffusion framework that systematically reconstructs missing modalities in three steps: global structure generation, bidirectional modality transfer, and local detail refinement. This design improves recovery quality, balances cross-modal generation, and enables more robust multimodal learning in real-world settings.

Motivation

In real-world multimodal applications, data from various modalities (e.g., text, audio, video) are often incomplete due to device failures, communication drops, or privacy constraints. This leads to two key challenges:

• Modality Missing Imbalance
Some modalities are more prone to loss or corruption. For instance, video data is often unavailable due to bandwidth or privacy concerns, while text or audio is more easily preserved. However, most existing models assume all modalities are available during both training and inference, resulting in poor robustness when this assumption breaks.

• Modality Generation Bias
Existing generative models struggle to recover missing modalities uniformly. For example, generating text from video is easier and more accurate than generating video from text. This asymmetry leads to biased recovery performance, where models perform well in some missing-modality scenarios but fail drastically in others.

Method

The paper introduces a novel Multi-stage Duplex Diffusion Network (MD²N) to address modality generation bias in missing-modality recovery. The key innovation lies in a three-stage framework: global structure generation, modality transfer, and local cross-modal refinement. By leveraging duplex diffusion models, the approach enables the available and missing modalities to generate each other in an intersecting manner, fostering balanced training and overcoming the modality gap. A modality transfer module, active only within a specific timestep interval, facilitates bidirectional knowledge transfer, enforcing semantic alignment and learning modality-invariant representations. The framework dynamically regulates noise using a timestep-based variance function, maintaining global coherence while enhancing local detail.
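As a concrete, simplified illustration of how a timestep-gated transfer module and a timestep-based variance function could interact, the PyTorch sketch below shows a single duplex denoising step. The module structure, the transfer interval, and the update rule are assumptions made for this sketch, not the authors' implementation.

```python
# Minimal sketch of a duplex denoising step with a timestep-gated transfer
# module. Shapes, schedules, and the update rule are illustrative only.
import torch
import torch.nn as nn


class TransferModule(nn.Module):
    """Exchanges hidden states between the two diffusion branches."""

    def __init__(self, dim: int):
        super().__init__()
        self.a2m = nn.Linear(dim, dim)  # available -> missing direction
        self.m2a = nn.Linear(dim, dim)  # missing  -> available direction

    def forward(self, h_avail: torch.Tensor, h_miss: torch.Tensor):
        # Cross-inject features so each branch sees the other's structure.
        return h_miss + self.a2m(h_avail), h_avail + self.m2a(h_miss)


def timestep_variance(t: int, T: int, sigma_max: float = 1.0,
                      sigma_min: float = 0.05) -> float:
    """Toy variance function: strong noise early (global structure),
    weak noise late (local refinement)."""
    return sigma_min + (sigma_max - sigma_min) * (t / T)


def duplex_step(denoise_a2m, denoise_m2a, transfer, x_miss, x_avail,
                t: int, T: int, t_lo: float = 0.3, t_hi: float = 0.7):
    """One reverse step for both generation directions.

    The transfer module is applied only inside [t_lo*T, t_hi*T], i.e. during
    the modality-transfer stage; outside that window each branch denoises
    independently (global generation early, local refinement late).
    """
    h_miss = denoise_a2m(x_miss, t, x_avail)    # hypothetical denoiser call
    h_avail = denoise_m2a(x_avail, t, x_miss)

    if t_lo * T <= t <= t_hi * T:               # stage 2: modality transfer
        h_miss, h_avail = transfer(h_avail, h_miss)

    sigma = timestep_variance(t, T)
    # Schematic update: subtract the predicted noise, then re-inject noise
    # scaled by the timestep-based variance.
    x_miss = x_miss - sigma * h_miss + sigma * torch.randn_like(x_miss)
    x_avail = x_avail - sigma * h_avail + sigma * torch.randn_like(x_avail)
    return x_miss, x_avail
```

Under this reading, large variances at early timesteps shape the global structure, the mid-interval transfer window exchanges cross-modal information, and small variances at late timesteps leave room for local refinement.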

MD²N decomposes recovery into stages that first establish a robust global structure, then progressively integrate conditional information for cross-modal alignment, and finally refine fine-grained features. This multi-stage design allows the model to handle the heterogeneity of different modalities effectively. Experimental results show that MD²N outperforms state-of-the-art methods, achieving up to a 4% improvement over IMDer on the CMU-MOSEI dataset, demonstrating its effectiveness in unbiased missing-modality recovery.
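To make the end-to-end use concrete, the snippet below sketches how a recovered modality could feed a downstream predictor in a CMU-MOSEI-style setup with video missing; `md2n_recover` and `fusion_model` are hypothetical stand-ins for the recovery network and a multimodal classifier, not APIs from the paper's code.

```python
# Hypothetical downstream usage: recover the missing modality, then fuse it
# with the observed ones exactly as if it had been observed.
import torch


def predict_with_missing_video(text_feat: torch.Tensor,
                               audio_feat: torch.Tensor,
                               md2n_recover,   # recovery network (stand-in)
                               fusion_model):  # multimodal classifier (stand-in)
    # Three-stage recovery conditioned on whatever is available.
    video_hat = md2n_recover(available={"text": text_feat, "audio": audio_feat},
                             missing="video")
    # The downstream task (e.g. sentiment prediction) never sees a gap.
    return fusion_model(text=text_feat, audio=audio_feat, video=video_hat)
```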

MD²N

BibTeX

@inproceedings{dai2025unbiased,
      title     = {Unbiased Missing-modality Multimodal Learning},
      author    = {Raiting Dai and Chenxi Li and Yandong Yan and Liso Mo and Ke Qin and Tao He},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      year      = {2025}
}