Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues

1The University of Hong Kong, 2King Abdullah University of Science and Technology
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025 (Oral)

*Corresponding author

Abstract

Extracting high-fidelity RGBD information from two-dimensional (2D) images is essential for various visual computing applications. Stereo imaging, as a reliable passive imaging technique for obtaining three-dimensional (3D) scene information, has benefited greatly from deep learning advancements. However, existing stereo depth estimation algorithms struggle to perceive high-frequency information and resolve high-resolution depth maps in realistic camera settings with large depth variations. These algorithms commonly neglect the hardware parameter configuration, limiting the potential for achieving optimal solutions solely through software-based design strategies. This work presents a hardware-software co-designed RGBD imaging framework that leverages both stereo and focus cues to reconstruct texture-rich color images along with detailed depth maps over a wide depth range. A pair of rank-2 parameterized diffractive optical elements (DOEs) is employed to optically encode perpendicular, complementary information during stereo acquisition. Additionally, we employ an IGEV-UNet-fused neural network tailored to the proposed rank-2 encoding for stereo matching and image reconstruction. Through prototyping a stereo camera with customized DOEs, our deep stereo imaging paradigm demonstrates superior performance over existing monocular and stereo imaging systems, achieving a 2.96 dB gain in image PSNR and improved depth accuracy in high-frequency details across distances from 0.67 to 8 meters.

Method

Our end-to-end learned stereo imaging pipeline consists of an accurate differentiable image formation model, an advanced stereo matching algorithm, and a CNN-based RGBD reconstruction network. The model leverages a rank-2 parameterization to efficiently represent and optimize the two DOEs (bottom-left) positioned at the lenses' apertures. This encoding of the stereo measurements promotes the interaction and complementarity of the scene information acquired by the left and right imaging channels. Resolved RGBD imaging results from real captures with our stereo camera prototype are presented (right-most).
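The page does not spell out the rank-2 parameterization, but the term conventionally means a 2D surface expressed as a sum of two outer products of 1D vectors, so an N×N DOE height map is described by only O(N) learnable parameters. The sketch below is a minimal, hypothetical illustration of that idea in NumPy; the vector names and shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def rank2_height_map(u, v):
    """Hypothetical rank-2 DOE height map: sum of two outer products.

    u, v: arrays of shape (2, N) holding the 1D factor vectors.
    Returns an (N, N) surface whose matrix rank is at most 2, so only
    4*N scalars parameterize the full 2D profile.
    """
    return u[0][:, None] * v[0][None, :] + u[1][:, None] * v[1][None, :]

# Toy example with random factors (a real design would learn u, v
# end-to-end through a differentiable image formation model).
rng = np.random.default_rng(0)
N = 64
u = rng.standard_normal((2, N))
v = rng.standard_normal((2, N))
H = rank2_height_map(u, v)

print(H.shape)                      # (64, 64)
print(np.linalg.matrix_rank(H))     # at most 2
```

Because the surface factorizes into 1D profiles, gradients with respect to the few thousand parameters are cheap, which is what makes joint hardware-software optimization tractable.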

Prototype

The top-left 3D model presents our stereo camera prototype, which comprises optimized diffractive optical elements (DOEs) positioned at the aperture planes, lens groups, and sensors connected via adapters. The top-right images showcase the fabricated DOE and its assembly process. The bottom two rows depict the point spread function (PSF) distributions for the left and right imaging channels across various depths. Notably, the PSFs from the two channels exhibit complementary features along the vertical direction while preserving overall morphological consistency, and their axial distributions retain as much high-frequency information as possible.

Results

BibTeX


@inproceedings{liu2024learned,
  title={Learned binocular-encoding optics for RGBD imaging using joint stereo and focus cues (to be updated)},
  author={Liu, Yuhui and Ou, Liangxun and Fu, Qiang and Amata, Hadi and Heidrich, Wolfgang and Peng, Yifan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025},
}