An important paradigm in 3D object detection is the use of multiple modalities to enhance accuracy in both normal and challenging conditions, particularly for long-tail scenarios. Toward this goal, recent studies have explored two directions of adaptive approaches: MoE-based adaptive fusion, which struggles with uncertainties arising from distinct object configurations, and late fusion for output-level adaptive fusion, which relies on separate detection pipelines and limits comprehensive understanding. In this work, we introduce Cocoon, an object- and feature-level uncertainty-aware fusion framework. The key innovation lies in uncertainty quantification for heterogeneous representations, enabling fair comparison across modalities through the introduction of a feature aligner and a learnable surrogate ground truth, termed feature impression. We also define a training objective that ensures the relationship between the aligned features and the feature impression provides a valid metric for uncertainty quantification. Cocoon consistently outperforms existing static and adaptive methods in both normal and challenging conditions, including those with natural and artificial corruptions. Furthermore, we demonstrate the validity and efficacy of our uncertainty metric across diverse datasets.
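To make the training objective above more concrete, below is a minimal PyTorch sketch of a joint optimization in which the feature aligner and the feature impression (FI) are trained together so that each FI gravitates toward the geometric median of the aligned features. The module names, dimensions, and the exact loss form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FeatureAligner(nn.Module):
    """Projects a modality-specific per-object feature into a common space."""
    def __init__(self, in_dim: int, common_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, common_dim), nn.ReLU(),
                                  nn.Linear(common_dim, common_dim))

    def forward(self, x):
        return self.proj(x)

# Two modalities (e.g., camera and LiDAR) with different feature sizes (assumed).
cam_aligner, lidar_aligner = FeatureAligner(256, 128), FeatureAligner(64, 128)
# A single learnable feature impression (surrogate ground truth) for simplicity.
fi = nn.Parameter(torch.zeros(128))

params = list(cam_aligner.parameters()) + list(lidar_aligner.parameters()) + [fi]
opt = torch.optim.Adam(params, lr=1e-3)

# Dummy per-object features standing in for real calibration/training data.
cam_feats, lidar_feats = torch.randn(32, 256), torch.randn(32, 64)
for _ in range(100):
    aligned = torch.cat([cam_aligner(cam_feats), lidar_aligner(lidar_feats)], dim=0)
    # Mean Euclidean distance to the FI: minimizing this over `fi` drives it
    # toward the geometric median of the aggregated aligned features.
    loss = torch.linalg.norm(aligned - fi, dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```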
Figure 1: Cocoon Online Procedure (left) and Example Results (right). Cocoon operates on top of base model components. In the feature aligner, per-object features (⬤, ▲) are aligned or projected into a common representation space. Next, uncertainty quantification is performed for each pair of features (⬤, ▲). These uncertainties are converted into weights (α and β) for adaptive fusion, which either amplify or attenuate the contribution of each modality’s original feature (⬤, ▲) to the fused feature. The resulting fused feature is then used in the main decoder of the base model.
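The adaptive-fusion step in the caption can be sketched roughly as follows: per-object uncertainties are mapped to weights α and β that scale each modality's original feature before fusion. The softmax-over-negative-uncertainty mapping and the temperature below are assumptions for illustration; the paper defines its own conversion.

```python
import torch

def uncertainty_to_weights(u_cam: torch.Tensor, u_lidar: torch.Tensor,
                           temperature: float = 1.0):
    """Lower uncertainty -> larger weight; weights sum to 1 per object."""
    logits = torch.stack([-u_cam, -u_lidar], dim=-1) / temperature
    w = torch.softmax(logits, dim=-1)
    return w[..., 0], w[..., 1]  # alpha (camera), beta (LiDAR)

# Dummy per-object features and uncertainties for N objects.
N, D = 8, 128
f_cam, f_lidar = torch.randn(N, D), torch.randn(N, D)
u_cam, u_lidar = torch.rand(N), torch.rand(N)

alpha, beta = uncertainty_to_weights(u_cam, u_lidar)
fused = alpha.unsqueeze(-1) * f_cam + beta.unsqueeze(-1) * f_lidar
# `fused` would then be passed to the base model's main decoder.
```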
Figure 2: Prior Work vs. Cocoon. In the offline stage with calibration data, Feature CP identifies the surrogate ground truth (⬤) for each feature (⬤) through iterative search; each ⬤ is derived using the real ground-truth label in the output space and the decoder g (serving as a classifier). In a multi-modal setting, however, each feature lacks a modality-specific g. To resolve this, Cocoon jointly trains the feature aligner h (which projects heterogeneous features (⬤, ▲) into a common representation space) and the surrogate ground truth (⬤, termed feature impression, FI), using our proposed training objective, which drives each FI toward the geometric median of the aggregated features and thereby enables valid uncertainty quantification. In both cases, the nonconformity scores (i.e., distances) are collected to form a calibration set, which serves as the criterion for gauging the uncertainty of online test inputs. In the online stage with test data, whereas Feature CP iteratively searches for ⬤, Cocoon saves time by projecting input features via the feature aligner h and reusing the pre-trained ⬤.
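As a rough illustration of the calibration flow described above, the sketch below collects nonconformity scores (distances between aligned calibration features and a pre-trained FI) offline and, online, ranks a test feature's distance against that calibration set to obtain a per-object uncertainty. The rank-based score and the dummy tensors are assumptions for illustration, not the paper's exact procedure.

```python
import torch

def nonconformity(aligned_feat: torch.Tensor, fi: torch.Tensor) -> torch.Tensor:
    """Distance between aligned features and their feature impression."""
    return torch.linalg.norm(aligned_feat - fi, dim=-1)

# --- Offline: build the calibration set (aligner h and FI already trained). ---
calib_aligned = torch.randn(500, 128)   # stands in for h(calibration features)
fi = torch.randn(128)                   # stands in for the pre-trained FI
calib_scores, _ = torch.sort(nonconformity(calib_aligned, fi))

# --- Online: score test features against the calibration set. ---
def uncertainty(test_aligned: torch.Tensor) -> torch.Tensor:
    """Fraction of calibration scores below the test score (higher = more uncertain)."""
    s = nonconformity(test_aligned, fi)
    return torch.searchsorted(calib_scores, s).float() / len(calib_scores)

test_aligned = torch.randn(8, 128)      # stands in for h(test features) of 8 objects
u = uncertainty(test_aligned)           # per-object uncertainty in [0, 1]
```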
Figure 3: Qualitative Comparison on Challenging Objects.
@inproceedings{cho2025cocoon,
title={Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion},
author={Cho, Minkyoung and Cao, Yulong and Sun, Jiachen and Zhang, Qingzhao and Pavone, Marco and Park, Jeong Joon and Yang, Heng and Mao, Zhuoqing},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025}
}