Google DeepMind has released Gemini Robotics-ER 1.6, an upgrade to its reasoning-first model for physical agents. The announcement describes improvements over both Gemini Robotics-ER 1.5 and Gemini 3.0 Flash in spatial and physical reasoning, specifically pointing, counting, and success detection. It also introduces a new capability, instrument reading, developed in collaboration with Boston Dynamics.

The model is available to developers via the Gemini API and Google AI Studio, with a developer Colab providing examples of configuration and prompting for embodied reasoning tasks.

Pointing as a foundation for spatial reasoning

The announcement gives considerable weight to pointing — the model’s ability to identify specific locations or objects in a scene — as a foundational capability for robotics. Pointing in this context is not simply object detection; the post describes it as supporting multiple reasoning modes including spatial reasoning (precision detection and counting), relational logic (comparing objects, defining from-to relationships), motion reasoning (mapping trajectories and identifying grasp points), and constraint compliance (identifying objects meeting specified conditions, like “every object small enough to fit inside the blue cup”).

According to the post, Gemini Robotics-ER 1.6 can use points as intermediate reasoning steps toward more complex tasks — for example, using points to count items in an image or to identify salient positions that inform mathematical operations for metric estimation.
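To make the idea of points as intermediate steps concrete, here is a minimal sketch of how downstream code might count objects and estimate a distance from a list of points. It assumes the model returns JSON entries of the form `{"point": [y, x], "label": ...}` with coordinates normalized to 0-1000; the announcement does not specify this format, so treat it as an assumption. The sample response, the helper names, and the `mm_per_pixel` calibration are all hypothetical.

```python
import json
import math
from collections import Counter


def parse_points(response_text, width, height):
    """Convert assumed normalized [y, x] points into (label, x_px, y_px) tuples."""
    points = []
    for item in json.loads(response_text):
        y_norm, x_norm = item["point"]
        points.append((item["label"], x_norm * width / 1000, y_norm * height / 1000))
    return points


def count_by_label(points):
    """Count how many points were returned for each label."""
    return Counter(label for label, _, _ in points)


def pixel_distance(a, b):
    """Euclidean distance in pixels between two (label, x_px, y_px) tuples."""
    return math.hypot(a[1] - b[1], a[2] - b[2])


# Hypothetical model output for a 1000x800 image (not real model data).
sample = (
    '[{"point": [500, 100], "label": "screw"},'
    ' {"point": [500, 400], "label": "screw"},'
    ' {"point": [200, 700], "label": "washer"}]'
)

points = parse_points(sample, width=1000, height=800)
counts = count_by_label(points)

# Assumed calibration: each pixel covers 0.5 mm at the working distance.
mm_per_pixel = 0.5
gap_mm = pixel_distance(points[0], points[1]) * mm_per_pixel
```

Counting falls out of the point list directly, while the metric estimate depends entirely on the assumed calibration; a real setup would derive that scale from camera intrinsics or a reference object in the scene.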

Success detection and multi-view understanding

The post identifies success detection — determining when a task is finished — as “a cornerstone of autonomy.” For a robot to operate without constant human supervision, it needs to know whether an action succeeded, whether to retry, or whether to move to the next step in a sequence. The announcement says this requires sophisticated perception combined with broad world knowledge to handle factors like occlusions, poor lighting, and ambiguous instructions.
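The retry-or-advance decision described above can be sketched as a small control loop. This is an illustrative sketch, not DeepMind's implementation: `execute` and `check_success` are hypothetical callables standing in for a robot controller and a model-backed success check, and the stub at the bottom simulates a grasp that only succeeds on the second attempt.

```python
def run_plan(steps, execute, check_success, max_attempts=3):
    """Run steps in order; retry a step until the checker reports success.

    Returns a log of (step, attempts_used, status). Stops at the first step
    that still fails after max_attempts so a human can be alerted.
    """
    log = []
    for step in steps:
        for attempt in range(1, max_attempts + 1):
            execute(step)
            if check_success(step):
                log.append((step, attempt, "done"))
                break
        else:
            # No attempt succeeded: record the failure and abort the plan.
            log.append((step, max_attempts, "failed"))
            break
    return log


# Stub environment: each step succeeds once it has been tried enough times.
attempts_needed = {"open drawer": 1, "grasp cup": 2, "place cup": 1}
attempts_made = {}


def fake_execute(step):
    attempts_made[step] = attempts_made.get(step, 0) + 1


def fake_check_success(step):
    return attempts_made[step] >= attempts_needed[step]


log = run_plan(list(attempts_needed), fake_execute, fake_check_success)
```

In a real system the checker would send the latest camera frames to the model along with a description of the intended outcome; the loop structure stays the same.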

Multi-view reasoning compounds the difficulty. Most production robotics setups use multiple cameras; the post mentions overhead and wrist-mounted feeds as a typical configuration. According to the post, Gemini Robotics-ER 1.6 improves on its predecessors at handling multiple camera streams and the relationships between them, including in dynamic or occluded environments.

Instrument reading: a capability developed with Boston Dynamics

The new instrument reading capability is described as emerging from a specific real-world need surfaced by Boston Dynamics. Industrial facilities require constant monitoring of gauges, thermometers, pressure instruments, and chemical sight glasses. The Spot robot can visit instruments throughout a facility and capture images of them; the question is whether the vision-language model can accurately interpret what it sees.

Gauges often have text describing units, multiple needles corresponding to different decimal places, and ambiguous tick marks. Sight glasses require estimating liquid fill levels while accounting for camera perspective distortion. The post describes how the model approaches this using what it calls “agentic vision”: a combination of visual reasoning and code execution. The model zooms into an image to read small details, uses pointing and code execution to estimate proportions and intervals, and then applies world knowledge to interpret the result.
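The "estimate proportions and intervals" step can be illustrated with a small calculation for a single-needle dial. The sketch below assumes four pixel points have already been located (the dial center, the minimum and maximum tick marks, and the needle tip) and that the scale is linear and runs clockwise; none of these details come from the announcement, and the function names are hypothetical. It ignores perspective distortion, which the post notes is a real complication.

```python
import math

TWO_PI = 2 * math.pi


def clockwise_angle(center, point):
    """Angle of `point` around `center` in image coordinates, in [0, 2*pi).

    Image y grows downward, so this angle increases clockwise on screen.
    """
    cx, cy = center
    x, y = point
    return math.atan2(y - cy, x - cx) % TWO_PI


def read_gauge(center, min_tick, max_tick, needle_tip, min_value, max_value):
    """Linearly interpolate the needle position between the end ticks."""
    start = clockwise_angle(center, min_tick)
    sweep = (clockwise_angle(center, max_tick) - start) % TWO_PI
    offset = (clockwise_angle(center, needle_tip) - start) % TWO_PI
    if sweep == 0:
        raise ValueError("min and max ticks lie at the same angle")
    fraction = offset / sweep
    if fraction > 1:
        raise ValueError("needle lies outside the marked scale")
    return min_value + fraction * (max_value - min_value)


# Hypothetical dial: 0 at lower-left, 10 at lower-right, needle pointing up.
reading = read_gauge(
    center=(100, 100),
    min_tick=(50, 150),
    max_tick=(150, 150),
    needle_tip=(100, 40),
    min_value=0.0,
    max_value=10.0,
)
```

With the needle pointing straight up on a 270-degree scale, it sits halfway through the sweep, so the reading comes out at about 5. Units and multi-needle dials would still need the world-knowledge step the post describes.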

Marco da Silva, Vice President and General Manager of Spot at Boston Dynamics, said: “Capabilities like instrument reading and more reliable task reasoning will enable Spot to see, understand, and react to real-world challenges completely autonomously.”

Safety improvements

The announcement states that Gemini Robotics-ER 1.6 is the lab’s safest robotics model to date. According to the post, it complies better with Gemini safety policies on adversarial spatial reasoning tasks than any previous generation. The post also notes “substantially improved capacity to adhere to physical safety constraints”: specifically, better decisions about which objects can be safely manipulated under physical or material constraints such as “don’t handle liquids” or “don’t pick up objects heavier than 20kg.”
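For illustration only, the two example constraints quoted above can be written as an explicit rule check that a downstream planner might run alongside the model's own judgment. This is not how the model works internally; the `SceneObject` fields, the example scene, and the idea that mass and liquid content are available as structured data are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    name: str
    mass_kg: float          # assumed estimate, e.g. from perception or inventory
    contains_liquid: bool   # assumed flag, e.g. from the model's description


def safe_to_manipulate(obj, max_mass_kg=20.0, allow_liquids=False):
    """Apply the two example constraints: a mass limit and a no-liquids rule."""
    if obj.mass_kg > max_mass_kg:
        return False
    if obj.contains_liquid and not allow_liquids:
        return False
    return True


# Hypothetical scene contents.
scene = [
    SceneObject("wrench", 1.2, False),
    SceneObject("coolant jug", 4.0, True),
    SceneObject("engine block", 85.0, False),
]

allowed = [obj.name for obj in scene if safe_to_manipulate(obj)]
```

A hard-coded check like this only works when the relevant properties are known in advance; the capability the post describes is inferring them from what the robot sees.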

DeepMind also tested the model against real-life injury reports, using text and video scenarios. The post reports that Gemini Robotics-ER models improve over baseline Gemini 3.0 Flash performance by 6% on text injury-risk perception and 10% on video injury-risk perception.