We present STORM, a spatio-temporal reconstruction model that reconstructs in-the-wild dynamic outdoor scenes from sparse observations. Existing dynamic reconstruction methods rely heavily on dense observations across space and time and on strong motion supervision, and therefore suffer from lengthy optimization, limited generalization to novel views or scenes, and degraded quality caused by noisy pseudo-labels. To bridge this gap, STORM introduces a data-driven Transformer architecture that jointly infers 3D scenes and their dynamics in a single forward pass. A key design of our scene representation is to aggregate the 3D Gaussians and their motion predicted from all frames, which are then transformed to the target timestep to yield a more complete (i.e., “amodal”) reconstruction at any given time from any viewpoint. As an emergent property, STORM automatically captures dynamic instances and produces high-quality masks for them using only the reconstruction loss. Extensive experiments show that STORM accurately reconstructs dynamic scenes and outperforms per-scene optimization methods (+3.7 PSNR) and feed-forward approaches (+1.5 PSNR); it reconstructs large-scale outdoor scenes within 200 ms and renders in real time. Beyond reconstruction, we qualitatively demonstrate four additional applications of our model, highlighting the potential of self-supervised learning for advancing dynamic scene understanding. Our code and model will be released.
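To make the aggregation-and-warping idea concrete, the sketch below illustrates one way to gather Gaussians predicted from all context frames and transform their centers to a query timestep. It is a minimal stand-in, not the paper's implementation: the abstract does not specify the motion parameterization, so we assume a simple per-Gaussian constant-velocity translation, and the names `warp_gaussians`, `xyz`, `vel`, and `t` are illustrative.

```python
import torch


def warp_gaussians(positions, velocities, src_times, tgt_time):
    """Translate Gaussian centers from their source frames to a target timestep.

    Assumes a constant per-Gaussian velocity as a simple stand-in for the
    learned motion predicted by the model.
      positions:  (N, 3) Gaussian centers aggregated from all frames
      velocities: (N, 3) predicted 3D motion per Gaussian
      src_times:  (N,)   timestep at which each Gaussian was predicted
      tgt_time:   scalar query timestep
    """
    dt = (tgt_time - src_times).unsqueeze(-1)  # (N, 1) time offsets
    return positions + velocities * dt         # (N, 3) warped centers


# Toy aggregation over three context frames (random tensors standing in for
# the Transformer's per-frame predictions).
frames = [
    {"xyz": torch.randn(100, 3), "vel": torch.randn(100, 3), "t": float(t)}
    for t in range(3)
]
positions = torch.cat([f["xyz"] for f in frames])
velocities = torch.cat([f["vel"] for f in frames])
src_times = torch.cat([torch.full((f["xyz"].shape[0],), f["t"]) for f in frames])

# Warp the aggregated set to an arbitrary query time before rasterizing it
# with a standard Gaussian-splatting renderer from the chosen viewpoint.
warped = warp_gaussians(positions, velocities, src_times, tgt_time=1.5)
```

Aggregating Gaussians from every frame before warping is what makes the reconstruction "amodal": regions occluded at the query time can still be covered by Gaussians predicted from other frames.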