This paper tackles the problem of novel-view audio-visual synthesis along an arbitrary trajectory in an indoor scene, given audio-video recordings from other known trajectories of the scene. Existing methods often overlook the effect of room geometry, particularly wall occlusion of sound propagation, making them less accurate in multi-room environments. In this work, we propose a new approach called Scene Occlusion-aware Acoustic Field (SOAF) for accurate sound generation. Our approach derives a prior for the sound-energy field using distance-aware parametric sound-propagation modelling and then transforms it based on scene transmittance learned from the input video. We extract features from the local acoustic field centred around the receiver using a Fibonacci Sphere and generate binaural audio for novel views with a direction-aware attention mechanism. Extensive experiments on the real dataset RWAVS and the synthetic dataset SoundSpaces demonstrate that our method outperforms previous state-of-the-art techniques in audio generation.
Overview. Starting with calibrated video, we reconstruct the scene using NeRF and build the global acoustic field. For novel-view audio synthesis, the distribution of the local acoustic field F_ac, combined with visual features F_vis and the receiver's position p_rc, is fed into the network to predict the acoustic mixture mask m_m and the left and right difference masks m_ld and m_rd, as sketched below. Our key contributions lie in the global and local acoustic fields for occlusion modelling and the direction-aware attention mechanism for binaural channel distinction.
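The mask-prediction step can be pictured with a short PyTorch sketch. The feature dimensions, MLP width, and the rule for combining the masks are all assumptions for illustration; only the inputs F_ac, F_vis, p_rc and the outputs m_m, m_ld, m_rd come from the description above.

```python
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Maps local acoustic features, visual features, and the receiver
    position to a mixture mask and left/right difference masks."""
    def __init__(self, ac_dim=128, vis_dim=128, n_freq=257):
        super().__init__()
        in_dim = ac_dim + vis_dim + 3            # F_ac + F_vis + 3-D p_rc
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 3 * n_freq),          # m_m, m_ld, m_rd per frequency bin
        )

    def forward(self, f_ac, f_vis, p_rc):
        x = torch.cat([f_ac, f_vis, p_rc], dim=-1)
        m_m, m_ld, m_rd = self.mlp(x).chunk(3, dim=-1)
        return m_m, m_ld, m_rd

# The masks modulate the source audio's magnitude spectrogram, e.g.
# left = mono_mag * m_m * (1 + m_ld), right = mono_mag * m_m * (1 + m_rd)
# (this combination rule is an assumption, not the paper's exact one).
```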
Global acoustic field. (A) The global acoustic field describes the distribution of energy absorption (M), acoustic transmittance (T), and sound energy (E) along the target ray: M decreases with increasing propagation distance, T drops as energy decays at each sound-wave collision, and E is the combination of M and T. (B) Sound waves propagate through walls along the target ray, with collisions marked by yellow crosses. Subfigures (A) and (B) illustrate the distribution of M and T in the global acoustic field. (C, D) Compared with a distance-aware acoustic field (C), our occlusion-aware acoustic field (D) accounts for wall occlusion during sound propagation, enabling more accurate sound generation in multi-room environments.
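To make the "E is the combination of M and T" idea concrete, here is a toy numeric sketch. The inverse-square decay for M and the fixed per-wall factor standing in for the transmittance learned from the NeRF density are both assumptions, not the paper's learned quantities.

```python
import numpy as np

def sound_energy(distance, n_wall_hits, wall_factor=0.3, eps=1e-6):
    """Toy occlusion-aware energy along a source-to-receiver ray."""
    M = 1.0 / (distance + eps) ** 2      # M: decays with propagation distance
    T = wall_factor ** n_wall_hits       # T: drops at each wall collision
    return M * T                         # E: combination of M and T

# Same distance, different occlusion: one wall cuts the energy to 30%.
print(sound_energy(4.0, 0))  # unoccluded -> ~0.0625
print(sound_energy(4.0, 1))  # one wall   -> ~0.0188
```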
Local acoustic field. The local acoustic field shows the distribution of sound energy around the receiver. A Fibonacci Sphere centred on the receiver collects sound energy from the global acoustic field, with colors from red to blue indicating high to low sound energy.
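For reference, the standard golden-angle construction of a Fibonacci Sphere distributes directions nearly uniformly over the unit sphere; `global_field` in the pseudo-call is a hypothetical handle to the learned global acoustic field, not an API from the paper.

```python
import numpy as np

def fibonacci_sphere(n=256):
    """n near-uniform unit directions via the golden angle."""
    i = np.arange(n)
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    y = 1.0 - 2.0 * (i + 0.5) / n          # heights from ~1 down to ~-1
    r = np.sqrt(1.0 - y * y)               # circle radius at each height
    theta = golden_angle * i
    return np.stack([r * np.cos(theta), y, r * np.sin(theta)], axis=-1)

# Query the global acoustic field along each direction from the receiver
# position p_rc to form the local acoustic field (pseudo-call):
# energies = np.array([global_field(p_rc, d) for d in fibonacci_sphere()])
```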
Direction-aware attention mechanism. Binaural features are generated by a direction-aware attention mechanism. (A) Predefined left-right attention. (B) Local acoustic field distributions of two receivers: Receiver 1, in the hallway and close to the sound source, receives higher energy; Receiver 2, in the kitchen, farther away and occluded, receives lower energy. The energy comparison follows the color bar in sub-figure (C). (C) Direction-aware attention mechanism: binaural features describing the spatial and directional sound characteristics are generated by combining the left-right attention with the local acoustic field distribution.
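A hedged sketch of how predefined left-right attention and the local acoustic field might combine into binaural features; the cosine-based weighting and the element-wise modulation are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def binaural_features(dirs, energies, head_right):
    """Pool per-direction energies into left/right channel features.

    dirs: (n, 3) Fibonacci-sphere directions; energies: (n,) local field
    samples; head_right: (3,) unit vector toward the receiver's right ear.
    """
    alignment = dirs @ head_right            # >0: right side, <0: left side
    w_right = np.clip(alignment, 0.0, None)  # predefined right attention
    w_left = np.clip(-alignment, 0.0, None)  # predefined left attention
    # Modulate by the local acoustic field so occluded or distant
    # directions contribute little to either channel.
    return w_left * energies, w_right * energies
```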
We synthesize demo videos and audio in real-world indoor scenes from the RWAVS dataset. Left displays the synthesized video with associated audio following an arbitrary trajectory. Middle shows the global acoustic field distribution, accounting for sound propagation distance and multi-room occlusion, with colors from yellow to blue representing high to low sound energy. Right depicts the local acoustic field around the receiver, with colors from red to blue indicating high to low sound energy. NOTICE: please wear headphones when watching the videos!
@article{gao2024soaf,
  title   = {SOAF: Scene Occlusion-aware Neural Acoustic Field},
  author  = {Gao, Huiyu and Ma, Jiahao and Ahmedt-Aristizabal, David and Nguyen, Chuong and Liu, Miaomiao},
  journal = {arXiv preprint},
  year    = {2024}
}