STORM: Spatio-Temporal Reconstruction Model for Large-Scale Outdoor Scenes

Authors anonymized







Abstract

We present STORM, a spatio-temporal reconstruction model designed to reconstruct in-the-wild dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods rely heavily on dense observations across space and time and on strong motion supervision; they therefore suffer from lengthy optimization, limited generalization to novel views or scenes, and degraded quality caused by noisy pseudo-labels. To bridge this gap, STORM introduces a data-driven Transformer architecture that jointly infers 3D scenes and their dynamics in a single forward pass. A key design of our scene representation is to aggregate the 3D Gaussians and their motions predicted from all frames, which are then transformed to the target timestep for a more complete (i.e., “amodal”) reconstruction at any given time from any viewpoint. As an emergent property, STORM automatically captures dynamic instances and produces high-quality masks for them using only the reconstruction loss. Extensive experiments show that STORM accurately reconstructs dynamic scenes and outperforms both per-scene optimization (+3.7 PSNR) and feed-forward approaches (+1.5 PSNR), while reconstructing large-scale outdoor scenes in just 200 ms and rendering in real time. Beyond reconstruction, we qualitatively showcase four additional applications of our model, highlighting the potential of self-supervised learning for advancing dynamic scene understanding. Our code and model will be released.

Method overview (figure)

TL;DR: STORM predicts 3D Gaussians and their motions from sparse observations in a feed-forward manner, outperforming existing methods in speed, accuracy, and generalization while enabling real-time rendering and additional applications.




STORM Feed-forward Reconstruction Examples

STORM reconstructs 3D representations and scene motions in a feed-forward manner. For each example, we present the input frames (Context RGB), the reconstructed RGB, depth maps, predicted scene flows, and motion segmentation. Ground truth scene flows are included for qualitative comparison, though they are not used for supervision.
Note that STORM sometimes predicts scene motions that are not annotated in the ground truth (e.g., in the first example).

Novel View Synthesis Results

We show novel view synthesis results of STORM. These novel views are rendered directly from the 3D scene representation and the scene flows predicted by STORM.


4D Visualization

Point trajectories Visualization

We visualize the trajectories of dynamic Gaussians by chaining per-frame scene flows. At each frame t, we use the predicted scene flow to transform the Gaussians to their estimated positions in the next frame (t+1). For every Gaussian in frame t+1, we identify its nearest transformed Gaussian from frame t and connect them to form a trajectory segment. This process is applied recursively across all frames to construct the complete trajectories. The color of each trajectory is determined by applying PCA to the motion assignment weights: the (N, 16) weights are projected to (N, 3), and the three principal components are used as RGB values. Note that estimating point trajectories is not the primary objective of STORM; the trajectory visualizations are provided solely as qualitative illustrations.
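For concreteness, below is a minimal sketch of the chaining and coloring steps described above. The tensor names, the PCA routine, and the color normalization are our own assumptions for illustration, not the released STORM code.

import torch

def chain_trajectories(positions, flows):
    # positions: list of (N_t, 3) Gaussian centers, one tensor per frame.
    # flows:     list of (N_t, 3) predicted scene flows (frame t -> t+1).
    # Returns, for each frame transition, an index tensor mapping every
    # Gaussian in frame t+1 to its nearest flow-transformed Gaussian in frame t.
    links = []
    for t in range(len(positions) - 1):
        warped = positions[t] + flows[t]               # advect frame t into frame t+1
        dists = torch.cdist(positions[t + 1], warped)  # (N_{t+1}, N_t) pairwise distances
        links.append(dists.argmin(dim=1))              # nearest transformed Gaussian
    return links

def motion_colors(assign_weights):
    # assign_weights: (N, 16) motion assignment weights; project to (N, 3) RGB via PCA.
    centered = assign_weights - assign_weights.mean(dim=0, keepdim=True)
    _, _, V = torch.pca_lowrank(centered, q=3)
    rgb = centered @ V                                 # first three principal components
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-8)
    return rgb                                         # normalized to [0, 1] for display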

* Currently, these demos are presented as video recordings due to time constraints. We plan to make them interactive in the future.

Latent-STORM Feed-forward Reconstruction Examples

Latent-STORM operates in latent space: it first renders an 8x downsampled feature image and then upsamples it to an RGB-D output with deconvolution layers.
For each example, we show the input frames (Context RGB), the predicted RGB, depth maps, scene flows, and motion segmentation.
Note that the predicted depth maps, flows, opacity maps, and motion segmentation masks are all in the downsampled space, i.e., they are rasterized in the 8x downsampled space.
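As a rough illustration of the decoding step described above, a deconvolution-based decoder could look like the sketch below. The channel counts, layer choices, and the RGB-D output split are illustrative assumptions, not the exact Latent-STORM architecture.

import torch.nn as nn

class LatentToRGBD(nn.Module):
    # Upsamples an 8x downsampled feature image to full-resolution RGB-D.
    def __init__(self, feat_dim=32, hidden=64):
        super().__init__()
        self.up = nn.Sequential(
            # Three transposed convolutions, each doubling resolution: 8x total.
            nn.ConvTranspose2d(feat_dim, hidden, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(hidden, hidden, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(hidden, 4, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, feat):          # feat: (B, feat_dim, H/8, W/8)
        out = self.up(feat)           # (B, 4, H, W)
        rgb = out[:, :3].sigmoid()    # RGB in [0, 1]
        depth = out[:, 3:]            # one depth channel
        return rgb, depth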


Human Modeling with Latent-STORM

We present a side-by-side comparison of Latent-STORM (left) and STORM (right) for human motion modeling.
Modeling leg motion in pixel space is extremely challenging. By operating in latent space and using an additional latent decoder, Latent-STORM reconstructs humans more faithfully.
*Footnote: We found that training Latent-STORM with our default perceptual loss weight led to strong artifacts from the perceptual loss and caused flickering in the human region. To reduce these artifacts, we post-trained the model with a lower perceptual loss weight for an additional 40k iterations. We also oversampled scenes containing humans in the training set to improve human modeling. The results shown here are from the post-trained model, while the numerical results in the paper are from the default model.


Editing with STORM

We show different editing results with STORM. All dynamic instances here are selected by picking their corresponding motion tokens, without the need for bounding boxes.
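To make the token-based selection concrete, the following is a hypothetical sketch of how the Gaussians belonging to one dynamic instance could be picked out through their motion assignment weights; the function names and the thresholding scheme are assumptions for illustration only.

import torch

def select_instance(assign_weights, token_idx, threshold=0.5):
    # assign_weights: (N, num_tokens) soft motion assignments per Gaussian.
    # Gaussians whose weight on the chosen motion token exceeds the threshold
    # are treated as belonging to that dynamic instance.
    return assign_weights[:, token_idx] > threshold    # boolean (N,) mask

def remove_instance(gaussians, assign_weights, token_idx):
    # Example edit: delete a dynamic instance by dropping its Gaussians.
    # gaussians is assumed to be a dict of per-Gaussian tensors (means, scales, ...).
    keep = ~select_instance(assign_weights, token_idx)
    return {k: v[keep] for k, v in gaussians.items()}

Under the same assumptions, duplicating or relocating an instance would follow the same pattern of transforming the selected subset of Gaussians before rendering.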


Limitations

STORM occasionally struggles to account for lighting effects caused by water droplets on the camera lens and predicts noisy velocities in textureless regions, such as roads.

Latent-STORM is sensitive to the perceptual loss weight: higher weights can introduce artifacts, while lower weights may smooth the output and reduce detail. We will explore improved decoder and loss designs to address this in future work.

STORM is the very first model to reconstruct dynamic scenes in a feed-forward manner, but it is not perfect; the two examples above illustrate its current limitations. We believe addressing these limitations is an interesting direction for future research, and we hope our approach and results encourage further efforts to enhance feed-forward dynamic scene reconstruction models.