Apple Research Unveils Long-Term Motion Embeddings for Motion Generation
Apple Machine Learning Research, working alongside researchers at LMU Munich's CompVis group, has published a new method for generating realistic scene motion without relying on costly full video synthesis. The approach, called ZipMo, compresses motion information by a factor of 64 and produces long, coherent motion sequences from either text descriptions or simple spatial cues. The paper has been accepted at CVPR 2026.
Why Motion Embeddings Beat Video Synthesis
Current video generation systems already understand how scenes evolve over time. But testing alternative outcomes for any given scene carries a massive computational cost, since every alternative requires generating an entirely new video pixel by pixel.
The team behind ZipMo sidesteps that bottleneck by separating motion from appearance. Rather than synthesizing full video frames, their system learns a compact latent representation that encodes how objects and surfaces move over extended time horizons. This representation is built from dense point-tracking data collected at scale using off-the-shelf tracking tools, then compressed into a dense latent grid.
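To make the data layout concrete, here is a minimal, hypothetical sketch (not the authors' code) of what dense point-track data from an off-the-shelf tracker typically looks like before compression. The 64x compression factor comes from the paper; the array shapes and names are illustrative.

```python
import numpy as np

# Hypothetical example: dense point tracks as produced by an off-the-shelf tracker.
# For T frames and N tracked points, each track stores an (x, y) position per frame.
T, N = 256, 1024                      # illustrative sequence length and track count
tracks = np.random.rand(T, N, 2)      # (frames, points, xy) -- stand-in for real tracker output
visibility = np.ones((T, N), bool)    # per-frame visibility/occlusion flags

# The paper compresses motion temporally by a factor of 64, so a 256-frame
# clip would be summarized by only 256 / 64 = 4 latent time steps.
compression_factor = 64
latent_steps = T // compression_factor
print(latent_steps)  # 4
```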
The result is a motion-only workspace where generating and comparing different possible futures costs a fraction of what video synthesis demands.
How the Two-Stage System Works
ZipMo operates in two stages. First, sparse tracker trajectories and a single start frame are encoded into a compressed latent motion grid that achieves a 64x temporal compression factor. Despite that aggressive compression, the system can reconstruct dense motion predictions at any spatial query point.
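A rough interface sketch of that first stage, assuming a PyTorch-style encoder and decoder: the module names, layer choices, and dimensions below are hypothetical and only meant to illustrate 64x temporal compression plus decoding at arbitrary spatial query points, not to reproduce the actual ZipMo architecture.

```python
import torch
import torch.nn as nn

class MotionAutoencoderSketch(nn.Module):
    """Illustrative only: compresses track sequences in time and decodes
    motion at arbitrary (x, y) query points. Not the ZipMo architecture."""

    def __init__(self, latent_dim=256):
        super().__init__()
        # Strided 1D convolutions over time give the 64x temporal compression
        # (64 = 4 * 4 * 4, one factor of 4 per layer in this toy setup).
        self.temporal_encoder = nn.Sequential(
            nn.Conv1d(3, latent_dim, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=4, stride=4),
        )
        # Decoder maps (latent grid, query point) -> motion prediction at that point.
        self.query_decoder = nn.Sequential(
            nn.Linear(latent_dim + 2, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, 2),  # predicted (dx, dy) displacement per latent step
        )

    def encode(self, tracks):
        # tracks: (batch, T, 3) with per-frame (x, y, visibility) for one track.
        return self.temporal_encoder(tracks.transpose(1, 2))  # (batch, latent_dim, T/64)

    def decode(self, latent_grid, query_xy):
        # latent_grid: (batch, latent_dim, T_latent); query_xy: (batch, 2).
        # Broadcast the query across latent time steps and predict motion at each.
        b, d, t = latent_grid.shape
        q = query_xy.unsqueeze(-1).expand(b, 2, t)        # (batch, 2, T_latent)
        feats = torch.cat([latent_grid, q], dim=1)        # (batch, d+2, T_latent)
        return self.query_decoder(feats.transpose(1, 2))  # (batch, T_latent, 2)


model = MotionAutoencoderSketch()
tracks = torch.randn(8, 256, 3)                 # toy batch: 8 tracks, 256 frames
latent = model.encode(tracks)                   # (8, 256, 4): 64x shorter in time
motion = model.decode(latent, torch.rand(8, 2)) # motion at an arbitrary query point
print(latent.shape, motion.shape)
```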
Second, a conditional flow-matching model generates new motion sequences directly inside this latent space. Users can condition generation on text prompts describing a desired action or on spatial "pokes" that specify where specific points should start and end. The model then fills in plausible trajectories that satisfy those goals.
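For readers unfamiliar with flow matching, the second stage can be pictured as integrating a learned velocity field from noise to a motion latent, steered by a conditioning vector. The toy sampler below illustrates that idea under stated assumptions; the conditioning encoder, dimensions, and step count are hypothetical, not ZipMo's.

```python
import torch
import torch.nn as nn

class LatentMotionFlowSketch(nn.Module):
    """Toy conditional flow-matching sampler over a motion latent space.
    Shapes, conditioning, and step counts are illustrative, not ZipMo's."""

    def __init__(self, latent_dim=256, cond_dim=128):
        super().__init__()
        # Velocity field v(z_t, t, condition); the condition could come from a text
        # embedding or from "poke" start/end coordinates, projected to cond_dim.
        self.velocity = nn.Sequential(
            nn.Linear(latent_dim + 1 + cond_dim, 512), nn.GELU(),
            nn.Linear(512, latent_dim),
        )
        self.latent_dim = latent_dim

    @torch.no_grad()
    def sample(self, cond, steps=32):
        # Start from Gaussian noise and integrate the learned velocity field
        # from t=0 to t=1 with simple Euler steps (flow-matching style sampling).
        z = torch.randn(cond.shape[0], self.latent_dim)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((cond.shape[0], 1), i * dt)
            z = z + dt * self.velocity(torch.cat([z, t, cond], dim=-1))
        return z  # generated motion latent, to be decoded into trajectories


# Hypothetical usage: the condition vector would encode a text prompt or poke
# coordinates (where points should start and end); here it is random noise.
sampler = LatentMotionFlowSketch()
cond = torch.randn(4, 128)
motion_latent = sampler.sample(cond)
print(motion_latent.shape)  # torch.Size([4, 256])
```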
One surprising finding from the team's experiments: stronger temporal compression did not just save compute. It actually improved the quality of generated motions and produced a more semantically structured latent space, making the system both faster and better.
Outperforming Video Models and Robot Baselines
According to the project page, ZipMo outperformed state-of-the-art video models, including Wan and Veo 3, in goal-conditioned motion generation. The performance gap widened further when the researchers compared methods under equal wall-clock time budgets, since ZipMo generates motion far faster than full video synthesis pipelines.
The researchers also applied their motion embeddings to robotic action prediction using the LIBERO benchmark. A small policy head mapped generated motions into 7D robot actions. Under both the ATM and Tra-MoE evaluation protocols, ZipMo outperformed existing baselines across tasks, suggesting the learned motion space encodes information useful for planning and physical scene understanding.
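As a rough picture of what such a policy head might look like: a small network that takes a motion embedding and emits a 7-dimensional action (for example, an end-effector delta plus a gripper command). The sketch below is a hypothetical stand-in, not the head used in the paper's evaluation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a small policy head mapping a generated motion
# embedding to a 7D robot action; dimensions and structure are illustrative.
class MotionPolicyHeadSketch(nn.Module):
    def __init__(self, motion_dim=256, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(motion_dim, 128), nn.GELU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, motion_latent):
        return self.head(motion_latent)


policy = MotionPolicyHeadSketch()
action = policy(torch.randn(1, 256))  # one generated motion embedding in, one action out
print(action.shape)                   # torch.Size([1, 7])
```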
The research team includes Nick Stracke, Kolja Bauer, and Stefan Andreas Baumann from CompVis at LMU Munich, alongside Apple researchers Miguel Angel Bautista and Josh Susskind, with Björn Ommer supervising. Code and model weights are publicly available on GitHub and Hugging Face.
For robotics, animation, and AR/VR applications, this kind of motion-first approach could reshape how systems plan and predict physical behavior, without paying the computational price of rendering every pixel along the way.