Background
Multimodal learning enables machines to understand complex real-world information by integrating data from multiple modalities, such as text, audio, and video. In practice, however, one or more modalities are often missing due to sensor failures or communication issues, which poses serious challenges to model robustness and generalization. Existing solutions attempt to recover the missing data with generative models such as diffusion networks, but they often exhibit modality generation bias: some modalities (e.g., text) are much easier to reconstruct than others (e.g., video). This imbalance leads to inconsistent performance across different missing-modality scenarios. To address this, our research introduces a multi-stage duplex diffusion framework that reconstructs missing modalities in three steps: global structure generation, bidirectional modality transfer, and local detail refinement. This design improves recovery quality, balances cross-modal generation, and enables more robust multimodal learning in real-world settings.
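As a rough illustration of how such a three-stage pipeline could be wired together, the following is a minimal PyTorch sketch. Everything in it is an assumption made for clarity rather than the framework's actual implementation: the shared 64-dimensional latent space, the StageDenoiser and recover_missing_modality names, the simplified DDPM sampler, and the particular way the "duplex" stage round-trips between the observed and missing modalities. The text above specifies only the three steps, not their internals.

```python
# Illustrative sketch only: module names, latent dimensions, and the simplified
# DDPM sampler below are assumptions, not the framework's actual implementation.
import torch
import torch.nn as nn


class StageDenoiser(nn.Module):
    """Placeholder conditional denoiser used by each diffusion stage."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x, cond, t):
        # Concatenate the noisy latent, conditioning features, and timestep.
        t = t.expand(x.size(0), 1)
        return self.net(torch.cat([x, cond, t], dim=-1))


@torch.no_grad()
def ddpm_sample(denoiser, cond, dim, steps=50):
    """Standard (simplified) DDPM ancestral sampling loop."""
    x = torch.randn(cond.size(0), dim)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    for i in reversed(range(steps)):
        t = torch.full((1, 1), i / steps)
        eps = denoiser(x, cond, t)  # predicted noise
        alpha = 1.0 - betas[i]
        x = (x - betas[i] / (1.0 - alphas_bar[i]).sqrt() * eps) / alpha.sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x


def recover_missing_modality(present, denoisers, dim=64):
    """Three-stage recovery: global structure -> duplex transfer -> refinement."""
    # Stage 1: generate a coarse global structure for the missing modality,
    # conditioned on the modality that is actually observed.
    coarse = ddpm_sample(denoisers["global"], present, dim)
    # Stage 2: duplex (bidirectional) transfer -- an assumed interpretation:
    # map the coarse estimate back to the observed modality's space, then
    # forward again with both views fused, so the two directions constrain
    # each other and reduce per-modality generation bias.
    back = ddpm_sample(denoisers["reverse"], coarse, dim)
    transferred = ddpm_sample(
        denoisers["forward"], torch.cat([present, back], dim=-1), dim
    )
    # Stage 3: local detail refinement of the transferred latent.
    return ddpm_sample(
        denoisers["refine"], torch.cat([present, transferred], dim=-1), dim
    )


# Example usage with randomly initialised (untrained) stage denoisers.
denoisers = {
    "global": StageDenoiser(64, 64),
    "reverse": StageDenoiser(64, 64),
    "forward": StageDenoiser(64, 128),
    "refine": StageDenoiser(64, 128),
}
recovered = recover_missing_modality(torch.randn(8, 64), denoisers)
print(recovered.shape)  # torch.Size([8, 64])
```

Training of the stage denoisers is omitted here, so the example only demonstrates how the three stages would be chained at inference time; with untrained networks the recovered latent is, of course, noise.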