Apple research generates realistic long-term motion with a 64x temporally compressed embedding

Apple ML Research has published a method for generating long-term motion by operating in a compressed motion embedding space rather than in pixel space. The authors describe the approach on their research page as “orders of magnitude more efficient” than full video synthesis for exploring multiple possible motion futures.

The paper is authored by Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, and Bjorn Ommer, with the first three listed as equal contributors affiliated with CompVis at LMU, Germany and the Munich Center for Machine Learning.

The core approach

The method works in two stages. First, a motion embedding is learned from large-scale trajectory data obtained from tracker models. The embedding uses a temporal compression factor of 64x, meaning sequences 64 frames long in the original trajectory space correspond to a single embedding vector. The authors say this compression is what enables efficiency gains in generation.
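As a rough illustration of what a 64x temporal compression factor implies for sequence length, the sketch below counts how many latent steps would cover a trajectory of a given number of frames. The function name `latent_length` and the choice to pad the final partial chunk are assumptions made for illustration, not details taken from the paper.

```python
import math

# Temporal compression factor reported for the motion embedding.
COMPRESSION = 64


def latent_length(num_frames: int, factor: int = COMPRESSION) -> int:
    """Latent steps needed to cover num_frames.

    Assumption: a trailing partial chunk is padded up to a full chunk,
    so the count is rounded up.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return math.ceil(num_frames / factor)


if __name__ == "__main__":
    for frames in (64, 256, 1000):
        print(frames, "frames ->", latent_length(frames), "latent step(s)")
```

Under these assumptions, a generator working in the embedding space handles a sequence roughly 64 times shorter than the frame-level trajectory it represents.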

Second, a conditional flow-matching model is trained to generate motion latents within this compressed space, conditioned on text prompts or on spatial inputs that the paper calls “spatial pokes”, which specify where motion should occur in the scene.
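The article does not spell out the paper's exact training objective. As a minimal sketch, assuming the common straight-path formulation of flow matching, the code below shows the general idea: a model learns to predict the velocity that carries a noise sample to a data latent, and sampling integrates that velocity from t=0 to t=1. Every name here (`model`, `cond`, `flow_matching_loss`, `euler_sample`) is hypothetical, and latents are plain Python lists to keep the example dependency-free.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 to data latent x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model, x1, cond, rng):
    """Mean squared error between predicted and target velocity, one sample.

    `model(x_t, t, cond)` is a hypothetical callable returning a velocity;
    `cond` stands in for a text or spatial-poke condition.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time drawn from [0, 1)
    xt = interpolate(x0, x1, t)
    pred = model(xt, t, cond)
    tgt = target_velocity(x0, x1)
    return sum((p - g) ** 2 for p, g in zip(pred, tgt)) / len(x1)


def euler_sample(model, cond, dim, steps, rng):
    """Integrate dx/dt = model(x, t, cond) from noise at t=0 to t=1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Calling `euler_sample` repeatedly with different random states yields different latents for the same condition, which is the sense in which such a model can propose several possible motion futures.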

Why not use video models

The paper frames direct video synthesis as “prohibitively inefficient” for exploring motion possibilities, noting that video computation is spent largely on appearance rather than motion dynamics.

By abstracting away appearance and working in a pure motion embedding space, the approach can generate many possible motion trajectories quickly. The paper reports that the resulting motion distributions outperform those of both contemporary video generation models and specialized task-specific motion generation approaches.

Efficiency as a design goal

The 64x temporal compression factor is the design choice behind the efficiency claim. The authors describe the result as modeling “scene dynamics orders of magnitude more efficiently” than operating directly on video. According to the paper, generating motion in this compressed space rather than in pixel space makes it tractable to compare multiple plausible motion futures, which full video synthesis does not.

The paper was published in April 2026. The research page does not mention source code or model weights.