RayMap3R: Inference-Time RayMap
for Dynamic 3D Reconstruction

1University of Illinois Chicago 2Cisco Research

TL;DR A training-free dual-branch scheme that exploits RayMap's static-scene bias for dynamic-aware streaming 3D reconstruction at inference time.

RayMap3R teaser: streaming 3D reconstruction comparison

Streaming 3D Reconstruction for Dynamic Scenes. Existing streaming methods such as CUT3R and TTT3R can suffer from camera drift caused by moving objects. RayMap3R identifies and suppresses dynamic regions at inference time without additional training or external models, producing more stable trajectories and geometrically faithful reconstruction.

Abstract

Streaming feed-forward 3D reconstruction enables real-time joint estimation of scene geometry and camera poses from RGB images. However, without explicit dynamic reasoning, streaming models can be affected by moving objects, causing artifacts and drift. In this work, we propose RayMap3R, a training-free streaming framework for dynamic scene reconstruction. We observe that RayMap-based predictions exhibit a static-scene bias, providing an internal cue for dynamic identification. Based on this observation, we construct a dual-branch inference scheme that identifies dynamic regions by contrasting RayMap and image predictions, suppressing their interference during memory updates. We further introduce reset metric alignment and state-aware smoothing to preserve metric consistency and stabilize predicted trajectories. Our method achieves leading performance among streaming approaches on dynamic scene reconstruction across multiple benchmarks.

Static-Scene Bias of RayMap

We observe that when only camera rays (RayMap) are provided as input, without the actual image, the model tends to reconstruct only the static background and ignore dynamic objects. This bias provides a built-in signal for dynamic identification.

Static bias across Sintel and DAVIS

The RayMap branch reconstructs primarily static structure, while the main branch captures the full scene including dynamic objects. Their per-pixel depth discrepancy aligns well with the ground-truth dynamic mask.

Extended dynamic identification across datasets

The same pattern holds across diverse scenes including animals, vehicles, and pedestrians, showing that the static-scene bias is not limited to specific object types.

We exploit this bias through a dual-branch inference scheme that contrasts the main branch (image + RayMap) with the RayMap-only branch. The per-pixel depth discrepancy yields a dynamic map that serves as a proxy for dynamic content identification.
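The dual-branch contrast reduces to a simple per-pixel computation. The sketch below (our own illustrative code, not the paper's implementation; the normalization and threshold `tau` are assumptions) shows how a dynamic map could be derived from the two branches' depth predictions:

```python
import numpy as np

def dynamic_map(depth_main, depth_ray, tau=0.1, eps=1e-6):
    """Per-pixel relative depth discrepancy between the two branches.

    depth_main: depth from the main branch (image + RayMap).
    depth_ray:  depth from the RayMap-only branch (static-biased).
    Returns a continuous dynamic map in [0, 1] and a binary mask.
    """
    # Relative discrepancy: large where the main branch sees geometry
    # (e.g. a moving object) that the static-biased branch does not.
    rel = np.abs(depth_main - depth_ray) / (depth_ray + eps)
    # Normalize to [0, 1]; eps keeps an all-static frame well-defined.
    dyn = np.clip(rel / (rel.max() + eps), 0.0, 1.0)
    mask = dyn > tau
    return dyn, mask
```

In practice a soft map (rather than the hard mask) is what would feed the downstream weighting, since it degrades gracefully where the discrepancy is ambiguous.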

Dual-branch concept and IoU analysis

Left: Dual-branch contrast: the depth difference reveals dynamic regions. Right: Dynamic mask IoU versus ground-truth dynamic ratio across 108 sequences from MPI Sintel, DAVIS 2017, and TUM RGB-D (Spearman ρ = 0.77). Scenes with higher dynamic content yield stronger detection signals.
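The IoU-versus-dynamic-ratio analysis above can be reproduced with plain NumPy. This is a minimal sketch (helper names are ours; the rank step does not handle ties, which the reported ρ likely does):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Double argsort converts values to 0-based ranks (no tie handling).
    """
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]
```

Per sequence, one would average `mask_iou` over frames and correlate it against the ground-truth dynamic-pixel ratio across all sequences.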

Method

RayMap3R pipeline overview

Pipeline Overview. At each timestep, the main branch predicts depth and pose from both image and RayMap features, while the RayMap branch queries the same frozen state using only camera-ray tokens. The per-pixel depth discrepancy between branches is projected onto state tokens via cross-attention to form staticness weights, which gate memory updates (s_t = s_{t−1} + α_t ⊙ Δs_t). Reset metric alignment and state-aware smoothing further stabilize the output trajectory.
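The gated update s_t = s_{t−1} + α_t ⊙ Δs_t is a per-token elementwise operation. A minimal sketch, assuming α_t is derived as one minus a per-token dynamicness weight (our simplification of the cross-attention projection):

```python
import numpy as np

def gated_state_update(s_prev, delta_s, dyn_weights):
    """Staticness-gated memory update: s_t = s_{t-1} + alpha_t * delta_s_t.

    s_prev:      previous state tokens, shape (num_tokens, dim).
    delta_s:     proposed state update from the current frame.
    dyn_weights: per-token dynamicness in [0, 1] (broadcast over dim).
    """
    # alpha_t = 1 - dynamicness: tokens attending to dynamic regions
    # contribute less, so moving objects do not corrupt the memory.
    alpha = 1.0 - np.clip(dyn_weights, 0.0, 1.0)
    return s_prev + alpha[:, None] * delta_s
```

With `dyn_weights` near 1 a token's update is suppressed entirely, which is what prevents corrupted geometry from accumulating in the state.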

Results

Qualitative Comparison

We compare 3D reconstruction quality with CUT3R and TTT3R on dynamic DAVIS sequences. RayMap3R produces more coherent point clouds with fewer ghosting artifacts and reduced camera drift, as suppressing dynamic content prevents corrupted geometry from accumulating in the memory state.

Qualitative comparison on DAVIS

From top to bottom: sequences with dynamic animals, human motion, and a moving boat. RayMap3R preserves fine-grained details such as legible text on the boat surface, where baselines produce collapsed or blurred geometry.

Dynamic Identification Visualization

Frame-by-frame visualization of the dual-branch contrast on the DAVIS longboard sequence. The dynamic map maintains spatial consistency across consecutive frames without explicit temporal smoothing, tracking moving subjects as their position and apparent scale change.

Quantitative Evaluation

We evaluate RayMap3R on camera pose estimation, video depth estimation, and 3D reconstruction across multiple benchmarks. Among streaming (online) methods, RayMap3R achieves the lowest ATE on all three pose benchmarks and the lowest Abs Rel on KITTI and Bonn, while maintaining real-time efficiency and constant memory usage.
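For reference, the two headline metrics are standard: ATE is the RMSE of camera positions after similarity alignment (Umeyama, 1991), and Abs Rel is the mean relative depth error. A self-contained NumPy sketch of both (our own code; evaluation protocols may differ in alignment and masking details):

```python
import numpy as np

def umeyama_align(src, dst):
    """Similarity transform (s, R, t) minimizing ||dst - (s R src + t)||."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0  # guard against reflections
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / xs.var(0).sum()
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(pred, gt):
    """ATE: RMSE of translation error after similarity alignment.

    pred, gt: (N, 3) arrays of camera positions along a trajectory.
    """
    s, R, t = umeyama_align(pred, gt)
    aligned = (s * (R @ pred.T)).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

def abs_rel(d_pred, d_gt, mask=None):
    """Abs Rel: mean |d_pred - d_gt| / d_gt over valid pixels."""
    if mask is None:
        mask = d_gt > 0
    return np.mean(np.abs(d_pred[mask] - d_gt[mask]) / d_gt[mask])
```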

Camera pose estimation

Camera pose estimation (ATE, RPE) on Sintel, TUM-dynamics, and ScanNet.

Video depth estimation

Video depth estimation on KITTI, Bonn, and Sintel.

3D reconstruction

3D reconstruction quality (Acc, Comp, NC, Chamfer) on 7-Scenes.

BibTeX

@article{wang2026raymap3r,
  title   = {RayMap3R: Inference-Time RayMap for Dynamic 3D Reconstruction},
  author  = {Wang, Feiran and Shang, Zezhou and Liu, Gaowen and Yan, Yan},
  year    = {2026}
}