Researchers at Berkeley AI Research have published a blog post describing GRASP, a gradient-based planner for learned dynamics models that targets the failure modes that make long-horizon planning fragile in practice. The work, conducted with collaborators including Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar, proposes three modifications to standard gradient-based trajectory optimization: lifting the trajectory into virtual states to enable parallelization, adding stochasticity to iterates for exploration, and reshaping gradients to avoid brittle signals through high-dimensional vision encoders.
The post opens by framing the core tension: learned world models are becoming capable of predicting long sequences of future observations in visual spaces, but the ability to plan through these models does not follow automatically from their predictive accuracy. As the post states, “having a powerful predictive model is not the same as being able to use it effectively for control/learning/planning.”
Why long-horizon planning breaks
The post identifies three distinct failure modes that compound as planning horizons grow.
The first is the exploding and vanishing gradient problem. When a sequence of actions is optimized by differentiating through a world model applied repeatedly to itself, the gradient involves a product of per-step Jacobians, so its magnitude grows or shrinks exponentially with the horizon length. For early actions in the sequence, this means either vanishingly small gradients (the action receives no effective learning signal) or exploding gradients (numerically unstable updates). The post notes that this is structurally identical to the backpropagation-through-time problem familiar from recurrent network training.
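A toy calculation makes the exponential scaling concrete. The model below is an illustration of ours, not the GRASP code: a repeated scalar linear world model s_{t+1} = w * s_t + u_t, where the chain rule gives d s_T / d u_0 = w**(T-1), one Jacobian factor per rollout step.

```python
# Toy illustration (not the GRASP code): the gradient of the final state with
# respect to the first action through a repeated linear world model
# s_{t+1} = w * s_t + u_t is a product of per-step Jacobians, w**(T-1),
# so the signal for early actions shrinks or blows up exponentially in T.
def first_action_grad(w: float, horizon: int) -> float:
    grad = 1.0
    for _ in range(horizon - 1):
        grad *= w  # one factor of the state Jacobian per rollout step
    return grad

for w in (0.5, 1.5):
    print(w, first_action_grad(w, horizon=40))
# w = 0.5 gives ~1.8e-12 (vanishing); w = 1.5 gives ~7.4e6 (exploding)
```

A nonlinear world model behaves the same way locally: whether gradients vanish or explode depends on whether the per-step Jacobian singular values sit below or above one.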
The second failure mode is a landscape problem. Short-horizon planning can often succeed with a greedy strategy — move toward the goal at each step. But as the post describes, longer tasks “are more likely to require non-greedy behavior: going around a wall, repositioning before pushing, backing up to take a better path.” The distance to goal along the optimal path is non-monotonic, and the resulting loss landscape contains bad local minima that greedy gradient descent gets stuck in. Simultaneously, the dimensionality of the search space grows linearly with the horizon.
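The local-minimum failure can be reproduced in one dimension. The cost function below is our own toy construction, not the post's objective: a quadratic "distance to goal" at x = 3 with a Gaussian bump in the way, standing in for an obstacle that a greedy planner must temporarily move away from the goal to pass.

```python
import math

# Toy illustration of the landscape failure (not the GRASP objective):
# a quadratic cost-to-goal at x = 3 plus a bump at x = 1 standing in for an
# obstacle. Greedy gradient descent from x = 0 stalls at a local minimum on
# the near side of the bump instead of reaching the goal.
def cost(x):
    return (x - 3.0) ** 2 + 20.0 * math.exp(-4.0 * (x - 1.0) ** 2)

def grad(x):
    return 2.0 * (x - 3.0) - 160.0 * (x - 1.0) * math.exp(-4.0 * (x - 1.0) ** 2)

x = 0.0
for _ in range(2000):
    x -= 0.01 * grad(x)   # greedy descent: always move downhill
print(x)  # stalls near x ~ 0.1, well short of the goal at x = 3
```

Escaping this minimum requires accepting a temporary increase in cost, which is exactly the non-greedy behavior the post describes.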
The third failure mode is specific to deep learning-based world models that operate in high-dimensional learned latent spaces: gradients flowing through vision encoders are particularly brittle and can produce misleading signals when used to update actions.
The GRASP solution: lifting, stochasticity, and reshaping
GRASP addresses the exploding gradient and local minima problems together through a technique called lifting, also described in planning literature as collocation. Instead of rolling out the trajectory sequentially and differentiating through the entire chain, the planner treats intermediate states as free variables to be jointly optimized alongside actions. The dynamics constraint — that each state must equal the world model applied to the previous state and action — becomes a soft penalty rather than a hard constraint enforced by sequential rollout.
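A minimal sketch of a lifted objective of this general shape, assuming a generic world model f(s, a); the function names and the penalty weight `lam` are our assumptions, not taken from the post. The key structural point is that the states are decision variables optimized jointly with the actions, and the dynamics appear only as a squared-defect penalty.

```python
import numpy as np

# Sketch of a collocation-style lifted objective (an assumption for
# illustration, not the GRASP code). States s_1..s_T are free decision
# variables optimized jointly with actions a_0..a_{T-1}; the dynamics
# constraint s_{t+1} = f(s_t, a_t) is relaxed to a soft penalty.
def lifted_objective(states, actions, s0, goal, f, lam=10.0):
    total = np.sum((states[-1] - goal) ** 2)      # task cost on final state
    prev = s0
    for s_next, a in zip(states, actions):
        defect = s_next - f(prev, a)              # dynamics violation at this step
        total += lam * np.sum(defect ** 2)        # soft constraint, not a rollout
        prev = s_next
    return total

# Example with a trivial linear world model (also an assumption):
f = lambda s, a: 0.9 * s + a
s0, goal = np.zeros(2), np.ones(2)
states = [np.zeros(2) for _ in range(5)]
actions = [np.zeros(2) for _ in range(5)]
print(lifted_objective(states, actions, s0, goal, f))  # 2.0: task cost only
```

Because no world-model output is ever fed back into another world-model call, gradients with respect to each variable involve at most one Jacobian of f, rather than a horizon-length product.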
The post describes two immediate benefits of this formulation. First, because each world model evaluation now depends only on local variables (adjacent state-action pairs), all time steps can be computed in parallel rather than sequentially. Second, the optimization landscape changes fundamentally: the lifted objective shares the same global minimizers as the original rollout objective, but its local behavior is different and more tractable.
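The parallelism claim follows directly from the locality of each defect term. In the sketch below (a toy batched linear world model, our assumption for illustration), all horizon steps evaluate in a single vectorized call, since each defect depends only on its own adjacent state-action pair.

```python
import numpy as np

# Sketch of why lifting parallelizes: every defect s_{t+1} - f(s_t, a_t)
# depends only on adjacent variables, so a batched world model can evaluate
# all T time steps in one call instead of a sequential rollout. The batched
# linear model here is an assumption for illustration.
def batched_f(s_batch, a_batch):
    return 0.9 * s_batch + a_batch   # applied to every time step at once

T, d = 8, 3
rng = np.random.default_rng(0)
states = rng.normal(size=(T + 1, d))   # s_0 .. s_T, all free variables
actions = rng.normal(size=(T, d))

# One vectorized evaluation over the whole horizon:
defects = states[1:] - batched_f(states[:-1], actions)
print(defects.shape)  # (8, 3): every time step computed in parallel
```

With a neural world model the same structure lets the per-step evaluations be batched on an accelerator, turning horizon length from a sequential depth into a batch dimension.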
The stochasticity component adds noise directly to the state iterates during optimization, providing a form of exploration that helps escape bad local minima without requiring discrete search. The gradient reshaping component addresses the vision model brittleness by separating the gradient signals through states from the gradient signals through actions, providing cleaner updates for action variables.
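The noisy-iterate idea can be sketched as a Langevin-style update on the state variables; the post does not give the exact noise schedule, so the step size `lr`, noise scale `sigma`, and function name below are our assumptions.

```python
import numpy as np

# Hedged sketch of stochastic state iterates (the exact schedule in GRASP is
# not specified in the post; `lr` and `sigma` are our assumptions). Noise is
# added directly to the *state* variables during optimization, giving
# Langevin-style exploration that can hop out of poor local minima.
def noisy_state_step(states, state_grads, lr=0.05, sigma=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=states.shape)
    return states - lr * state_grads + sigma * noise

states = np.zeros((5, 2))        # 5 lifted states in a 2-d latent space
grads = np.ones((5, 2))
states = noisy_state_step(states, grads, rng=np.random.default_rng(1))
print(states.shape)  # (5, 2)
```

With `sigma=0` this reduces to plain gradient descent; the noise term is what distinguishes the stochastic iterates from the deterministic baseline.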
Demonstrated on visual control tasks
The post includes demonstrations on two tasks: BallNav and Push-T, both shown as animated GIFs in the blog. These are described as visual control problems where the world model operates in pixel or latent visual space. The post frames these as concrete evidence that the approach is practical, not merely theoretically motivated.
The broader claim is that GRASP makes gradient-based planning “much more robust” for long horizons specifically — the regime where prior methods break down. The post argues that as world models scale in capability, the planning bottleneck will become the limiting factor, and addressing it now is more tractable while models are still advancing than it will be once they plateau.
For robotics, dialogue systems, and any domain where the environment is represented as a learned visual model, the fragility of long-horizon gradient planning has been a practical constraint. GRASP’s approach — addressing the optimization geometry directly rather than switching to sampling-based methods — keeps gradient-based planning on the table for horizons where it was previously unreliable.