Learned Binocular-Encoding Optics for RGBD Imaging Using Joint Stereo and Focus Cues

1The University of Hong Kong, 2King Abdullah University of Science and Technology
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025 (Oral)

*Corresponding author

Abstract

Extracting high-fidelity RGBD information from two-dimensional (2D) images is essential for various visual computing applications. Stereo imaging, as a reliable passive imaging technique for obtaining three-dimensional (3D) scene information, has benefited greatly from deep learning advancements. However, existing stereo depth estimation algorithms struggle to perceive high-frequency information and resolve high-resolution depth maps in realistic camera settings with large depth variations. These algorithms commonly neglect the hardware parameter configuration, limiting the potential for achieving optimal solutions solely through software-based design strategies. This work presents a hardware-software co-designed RGBD imaging framework that leverages both stereo and focus cues to reconstruct texture-rich color images along with detailed depth maps over a wide depth range. A pair of rank-2 parameterized diffractive optical elements (DOEs) is employed to optically encode perpendicular, complementary information during stereo acquisition. Additionally, we employ an IGEV-UNet-fused neural network tailored to the proposed rank-2 encoding for stereo matching and image reconstruction. Through prototyping a stereo camera with customized DOEs, our deep stereo imaging paradigm demonstrates superior performance over existing monocular and stereo imaging systems, achieving a 2.96 dB gain in image PSNR and improved depth accuracy in high-frequency details across distances from 0.67 to 8 meters.

Method

Our end-to-end learned stereo imaging pipeline consists of an accurate differentiable image formation model, an advanced stereo matching algorithm, and a CNN-based RGBD reconstruction network. The model leverages a rank-2 parameterization to efficiently represent and optimize the two DOEs (bottom-left) positioned at the lenses' apertures. This encoding of the stereo measurements promotes the interaction and complementarity of the scene information acquired by the left and right imaging channels. Resolved RGBD imaging results from real captures with our stereo camera prototype are presented (right-most).
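The page does not spell out the rank-2 parameterization, but the term conventionally means a 2D surface expressed as a sum of two outer products of 1D vectors, so an N×N DOE height map is described by only O(N) learnable parameters. The sketch below is a minimal, hypothetical illustration of that idea in NumPy; the vector names and shapes are assumptions, not the authors' implementation.

```python
import numpy as np

def rank2_height_map(u, v):
    """Hypothetical rank-2 DOE height map: sum of two outer products.

    u, v: arrays of shape (2, N) holding the 1D factor vectors.
    Returns an (N, N) surface whose matrix rank is at most 2, so only
    4*N scalars parameterize the full 2D profile.
    """
    return u[0][:, None] * v[0][None, :] + u[1][:, None] * v[1][None, :]

# Toy example with random factors (a real design would learn u, v
# end-to-end through a differentiable image formation model).
rng = np.random.default_rng(0)
N = 64
u = rng.standard_normal((2, N))
v = rng.standard_normal((2, N))
H = rank2_height_map(u, v)

print(H.shape)                      # (64, 64)
print(np.linalg.matrix_rank(H))     # at most 2
```

Because the surface factorizes into 1D profiles, gradients with respect to the few thousand parameters are cheap, which is what makes joint hardware-software optimization tractable.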

Prototype

The top-left 3D model presents our stereo camera prototype, which comprises optimized diffractive optical elements (DOEs) positioned at the aperture planes, lens groups, and sensors connected via adapters. The top-right images showcase the fabricated DOE and its assembly process. The bottom two rows depict the point spread function (PSF) distributions for the left and right imaging channels across various depths. Notably, the PSFs from the two channels exhibit complementary features along the vertical direction while preserving overall morphological consistency, and their axial distributions retain as much high-frequency information as possible.

Results

BibTeX


@inproceedings{liu2024learned,
  title={Learned binocular-encoding optics for RGBD imaging using joint stereo and focus cues (to be updated)},
  author={Liu, Yuhui and Ou, Liangxun and Fu, Qiang and Amata, Hadi and Heidrich, Wolfgang and Peng, Yifan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025},
}