VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos

CVPR 2023

Australian National University

Abstract

We propose VisFusion, a visibility-aware online 3D scene reconstruction approach for posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods, which aggregate features for each voxel from the input views without considering visibility, we improve feature fusion by explicitly inferring each voxel's visibility from a similarity matrix computed over its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline that includes a volume sparsification process. Unlike prior methods, which sparsify voxels globally with a fixed occupancy threshold, we perform sparsification on the local feature volume along each visual ray, preserving at least one voxel per ray to retain finer details. The sparse local volume is then fused with a global one for online reconstruction. We further predict the TSDF in a coarse-to-fine manner by learning its residuals across scales, leading to better TSDF predictions. Experimental results on benchmarks show that our method achieves superior performance and recovers more scene details.
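To illustrate the fusion idea, the sketch below fuses a voxel's per-view features using weights derived from a pairwise similarity matrix. This is a hypothetical simplification: the paper learns visibility from the similarity matrix with a network, whereas here we simply score each view by its average agreement with the other views (assumed heuristic, not the actual model).

```python
import numpy as np

def visibility_weighted_fusion(view_feats):
    """Fuse one voxel's per-view features using visibility-like weights
    derived from a pairwise similarity matrix (heuristic stand-in for
    the learned visibility prediction in the paper).

    view_feats: (V, C) array, one feature vector per camera view
    returns:    (C,) fused feature
    """
    # L2-normalise so dot products give cosine similarities
    norms = np.linalg.norm(view_feats, axis=1, keepdims=True) + 1e-8
    f = view_feats / norms
    # Pairwise similarity matrix between the voxel's projected features
    sim = f @ f.T  # (V, V)
    # A view that agrees with the others likely sees the voxel:
    # score each view by its mean similarity to the other views
    num_views = f.shape[0]
    score = (sim.sum(axis=1) - 1.0) / max(num_views - 1, 1)
    # Softmax over views -> visibility weights
    w = np.exp(score - score.max())
    w /= w.sum()
    return (w[:, None] * view_feats).sum(axis=0)
```

Views whose features disagree with the consensus (e.g. because the voxel is occluded in them) receive low weight, so they contribute less to the fused feature.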

Video

Method

Given a fragment of a video, we first construct 3D feature volumes at different resolutions. The feature of each voxel is obtained by projecting the voxel back into every camera view. The features from different camera views are then fused via the predicted visibility (local feature fusion). We then extract the local occupancy and TSDF from the fused features, followed by a ray-based sparsification process that removes empty voxels. The sparse local feature volume is finally fused into the global volume via a GRU and used to produce the final TSDF. The global features and final TSDF at each coarse level are further upsampled and fed to the next level for refinement.
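The ray-based sparsification step can be sketched as follows. This is a minimal illustration, not the paper's implementation: we assume occupancy scores are arranged as one row per visual ray, apply a (hypothetical) occupancy threshold, and additionally keep each ray's highest-scoring voxel so that no ray is emptied entirely.

```python
import numpy as np

def ray_based_sparsify(occ, thresh=0.5):
    """Sparsify an occupancy grid per visual ray (illustrative sketch).

    occ:    (R, D) occupancy scores, R rays with D voxels each
    thresh: occupancy cutoff (assumed value, not from the paper)
    returns: (R, D) boolean mask with at least one kept voxel per ray
    """
    # Keep voxels that pass the occupancy threshold
    keep = occ > thresh
    # Guarantee at least one voxel per ray: always keep the ray's argmax,
    # so low-confidence rays still retain their most likely surface voxel
    best = occ.argmax(axis=1)
    keep[np.arange(occ.shape[0]), best] = True
    return keep
```

A purely global threshold would discard every voxel on a ray whose scores are all low, losing thin structures; keeping the per-ray maximum is what preserves those fine details.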

Reconstructions on ScanNet v2

BibTeX

@inproceedings{gao2023visfusion,
  title     = {VisFusion: Visibility-aware Online 3D Scene Reconstruction from Videos},
  author    = {Gao, Huiyu and Mao, Wei and Liu, Miaomiao},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {17317--17326},
  year      = {2023}
}

Acknowledgements

This research was supported in part by the Australian Research Council DECRA Fellowship (DE180100628) and ARC Discovery Grant (DP200102274).