Researchers from Microsoft Research, Korea University, and the University of Wisconsin-Madison have released GroundedPlanBench, a benchmark for evaluating whether vision-language models can simultaneously plan robot actions and determine where in an image those actions should occur. According to the Microsoft Research blog post, the work accompanies a paper titled “Spatially Grounded Long-Horizon Task Planning in the Wild.”

The research was supported by an IITP grant (No. RS-2025-25439490) funded by the Korea government.

The problem

The post describes a structural weakness in how vision-language models are currently used to plan robot actions. Most systems split the problem into two stages: a VLM generates a plan in natural language, and a separate model translates it into executable actions. The post says this approach “often breaks down for long, complex tasks because natural-language plans can be ambiguous or even hallucinated when specifying actions and locations.” Because the two stages are handled independently, errors in planning can propagate into grounding.

The post provides a concrete example of this failure. When a decoupled system using Qwen3-VL-4B was asked to handle a scene containing four napkins, the planning model referred to “napkin on the table” for all four grasp actions. The downstream grounding model, Embodied-R1, resolved all four references to the same napkin. GPT-5.2 produced more descriptive phrases such as “top-left napkin” or “upper-center napkin,” but the post says these were “still too imprecise for the model to reliably distinguish between them and were again grounded to the same object.”
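The napkin failure can be reproduced in miniature. The sketch below is a toy illustration, not the behavior of Embodied-R1 or any real grounding model: `toy_ground` is a hypothetical stand-in that returns the first object whose label appears in the phrase, and the scene coordinates are invented. It shows only why four identical phrases cannot separate four objects, while a plan that carries its own boxes can.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels

# Invented scene: four napkins at different image locations.
SCENE: List[Dict] = [
    {"label": "napkin", "box": (40, 60, 120, 140)},
    {"label": "napkin", "box": (200, 60, 280, 140)},
    {"label": "napkin", "box": (40, 220, 120, 300)},
    {"label": "napkin", "box": (200, 220, 280, 300)},
]


def toy_ground(phrase: str, scene: List[Dict]) -> Box:
    """Hypothetical grounder: first object whose label occurs in the phrase."""
    for obj in scene:
        if obj["label"] in phrase:
            return obj["box"]
    raise LookupError(f"nothing in the scene matches {phrase!r}")


# Decoupled pipeline: the planner emits text, the grounder resolves it later.
decoupled_plan = ["grasp napkin on the table"] * 4
decoupled_boxes = [toy_ground(step, SCENE) for step in decoupled_plan]

# Grounded plan: each step already names the image region it acts on.
grounded_plan = [("grasp", obj["box"]) for obj in SCENE]

print(len(set(decoupled_boxes)), "distinct target(s) in the decoupled plan")
print(len({box for _, box in grounded_plan}), "distinct target(s) in the grounded plan")
```

A real grounding model is far more capable than this lookup, but the post's point is the same: if the text does not distinguish the objects, the grounder has nothing to distinguish them with.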

GroundedPlanBench

The benchmark is built from 308 robot manipulation scenes drawn from the Distributed Robot Interaction Dataset (DROID), a collection of recordings of robots performing tasks. Experts reviewed each scene and defined tasks across two instruction styles: explicit (e.g., “put a spoon on the white plate”) and implicit (e.g., “tidy up the table”).

Each task is broken into four basic actions — grasp, place, open, and close — each tied to a specific location in the image. Grasp, open, and close actions are linked to a bounding box around the target object; place actions are linked to a box indicating where the object should be placed.
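One way to picture a grounded plan is as a list of steps, each pairing an action type with a box. The following is a minimal sketch under stated assumptions: the class names, the field names, the pixel coordinates, and the (x_min, y_min, x_max, y_max) box convention are all illustrative and are not taken from the benchmark's actual file format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The four action types named in the post.
ACTIONS = ("grasp", "place", "open", "close")

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class GroundedAction:
    action: str
    box: Box

    def __post_init__(self) -> None:
        # Reject action types outside the four basic actions.
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        # Reject boxes with zero or negative width or height.
        x0, y0, x1, y1 = self.box
        if not (x0 < x1 and y0 < y1):
            raise ValueError(f"degenerate box: {self.box}")


@dataclass(frozen=True)
class GroundedPlan:
    instruction: str
    steps: List[GroundedAction]


# Hypothetical plan for the explicit instruction quoted in the article.
plan = GroundedPlan(
    instruction="put a spoon on the white plate",
    steps=[
        GroundedAction("grasp", (120, 200, 180, 240)),  # box around the spoon
        GroundedAction("place", (300, 180, 420, 300)),  # box where it should go
    ],
)
print(len(plan.steps), "steps")
```

Keeping the box on every step, rather than a textual object description, is what removes the need for a separate grounding stage.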

The post reports that GroundedPlanBench contains 1,009 tasks in total, divided by length: 345 tasks require 1–4 actions, 381 require 5–8, and 283 require 9–26.
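The length split can be checked with a few lines of arithmetic. In this sketch the bucket boundaries and counts come from the post, while the function name and the bucket labels are illustrative choices.

```python
def length_bucket(num_actions: int) -> str:
    """Map a task's action count to the length bucket used in the post."""
    if 1 <= num_actions <= 4:
        return "1-4"
    if 5 <= num_actions <= 8:
        return "5-8"
    if 9 <= num_actions <= 26:
        return "9-26"
    raise ValueError(f"action count outside the reported range: {num_actions}")


# Task counts per bucket as reported in the post.
REPORTED_COUNTS = {"1-4": 345, "5-8": 381, "9-26": 283}

total = sum(REPORTED_COUNTS.values())
shares = {bucket: 100 * count / total for bucket, count in REPORTED_COUNTS.items()}

print("total tasks:", total)
for bucket, share in shares.items():
    print(f"{bucket} actions: {share:.1f}% of tasks")
```

The three buckets sum to the reported 1,009 tasks, split roughly 34%, 38%, and 28%, so nearly two thirds of the benchmark requires five or more actions.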

V2GP

The Video-to-Spatially Grounded Planning (V2GP) framework generates training data for joint planning and grounding from existing robot demonstration videos. As the post describes it, the system first uses recorded gripper signals to detect the moments when the robot interacts with objects. It then generates a text description of the manipulated object with a multimodal language model and tracks that object across the video using Meta’s SAM3 for open-vocabulary segmentation. Finally, it identifies the object’s location at the moment it is grasped and the location where it is placed, and constructs a grounded plan from those results.
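The first and last steps of that pipeline can be sketched in a few lines. Everything specific here is an assumption made for illustration: the post does not give the gripper signal's format, so the code treats it as one value per frame in [0, 1] with values at or above a 0.5 threshold meaning "closed", and `boxes_per_frame` is a hypothetical stand-in for the tracker's per-frame output for a single object. The captioning and SAM3 tracking steps are not reproduced.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels


def find_interactions(gripper: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (grasp_frame, release_frame) pairs from a per-frame gripper signal.

    Assumption: values >= threshold mean the gripper is closed.
    A grasp still held when the clip ends is dropped.
    """
    segments = []
    grasp_frame = None
    for i, value in enumerate(gripper):
        closed = value >= threshold
        if closed and grasp_frame is None:
            grasp_frame = i  # gripper just closed: interaction starts
        elif not closed and grasp_frame is not None:
            segments.append((grasp_frame, i))  # gripper reopened: interaction ends
            grasp_frame = None
    return segments


def build_steps(segments: List[Tuple[int, int]], boxes_per_frame: List[Box]) -> List[Tuple[str, Box]]:
    """Pair each interaction with the tracked box at grasp time and at release time."""
    steps = []
    for grasp_frame, release_frame in segments:
        steps.append(("grasp", boxes_per_frame[grasp_frame]))
        steps.append(("place", boxes_per_frame[release_frame]))
    return steps


# Invented ten-frame clip: the gripper closes twice.
signal = [0.0, 0.1, 0.9, 1.0, 1.0, 0.2, 0.0, 0.8, 0.9, 0.1]
# Invented tracker output: the object's box drifts right by 10 pixels per frame.
boxes = [(10 * i, 50, 10 * i + 40, 90) for i in range(len(signal))]

segments = find_interactions(signal)
steps = build_steps(segments, boxes)
print(segments)
print(steps)
```

A fixed threshold is the simplest possible event detector; real gripper signals are noisy, so a production pipeline would likely need smoothing or hysteresis, which this sketch omits.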

The process yielded roughly 43,000 grounded plans: 34,646 with 1–4 actions, 4,368 with 5–8, and 4,448 with 9–26. (The three counts sum to 43,462.)

Evaluation results

According to the post, the evaluation uses Qwen3-VL, a vision-language model that processes text, images, and video, as the base model. It was first evaluated on GroundedPlanBench without task-specific training, alongside proprietary models, and then fine-tuned on V2GP training data and compared against the decoupled approach.

The post reports that training Qwen3-VL-4B and Qwen3-VL-32B with V2GP “led to significant improvements in grounded planning” compared to the decoupled baseline. Multi-step planning and handling of implicit instructions were described as challenging for all models tested.

The fine-tuned model was also tested in real-world robot experiments. The post says grounded planning “improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations.”

Implications

The post frames joint planning and grounding as a path to more reliable robot manipulation. Current models, it says, still struggle with longer multi-step tasks and implicit instructions. The post identifies a direction it describes as promising: combining grounded planning with world models that allow robots to predict action outcomes before executing them. This, the post says, “could allow robots to decide what to do, where to act, and what will happen next.”

GroundedPlanBench is available on GitHub.