Ai2 releases WildDet3D, an open model for 3D object detection from a single image

The Allen Institute for AI released WildDet3D, an open model for monocular 3D object detection. Given a single RGB image, the model predicts 3D bounding boxes, estimating an object’s position, size, and orientation in metric coordinates, and accepts text queries, point prompts, and 2D bounding boxes as prompts. The Ai2 post describes the model, the dataset released alongside it, and benchmark results. An April 21 update added training code, updated inference code, and data preparation instructions.

What WildDet3D does

Most vision systems can identify objects in an image. Recovering where those objects sit in three dimensions (how far away, how large, and in what orientation) from a single photograph is substantially harder, particularly when the system must work across arbitrary object categories and camera types without fine-tuning.
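To make "position, size, and orientation" concrete, the sketch below expands one metric 3D box into its eight corner points. It is an illustration only, not code from the WildDet3D release: the function name `box_corners`, the camera-axis convention (x right, y down, z forward), and the simplification to a single yaw angle are all assumptions, and the post does not say how the model parametrizes rotation.

```python
import math

def box_corners(center, size, yaw):
    """Return the 8 corners of a 3D box (illustrative parametrization).

    center: (x, y, z) in meters, camera coordinates (x right, y down, z forward).
    size:   (width, height, length) in meters along the box's local x, y, z axes.
    yaw:    rotation about the vertical (y) axis, in radians.
    """
    cx, cy, cz = center
    w, h, l = size
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                # Corner in the box's own frame.
                x, y, z = sx * w, sy * h, sz * l
                # Rotate about the y axis, then translate to the box center.
                corners.append((cx + c * x + s * z, cy + y, cz - s * x + c * z))
    return corners

# A 2 m x 1 m x 4 m box centered 5 m in front of the camera, unrotated.
corners = box_corners((0.0, 0.0, 5.0), (2.0, 1.0, 4.0), 0.0)
```

With zero yaw the corners span 3 m to 7 m in depth; a quarter turn swaps the box's footprint so that its 4 m side lies across the image instead.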

WildDet3D accepts three prompt modalities. Category-name prompts let a user query by object type; the model finds every instance in the scene and localizes each in 3D. Point prompts allow interactive selection by clicking on an object. Box prompts let a user supply a 2D bounding box and have the model infer the full 3D extent. The post notes the model can handle inputs from phone cameras, wide-angle action cameras, and robotic camera feeds without fine-tuning. When additional geometric signals are available, such as sparse depth or readings from LiDAR or time-of-flight sensors, the model incorporates them to improve predictions.
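One way to picture the three prompt modalities is as three small input types. The sketch below is hypothetical and is not the WildDet3D API: the class names `TextPrompt`, `PointPrompt`, and `BoxPrompt`, the pixel-coordinate convention, and the `validate` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical prompt types for illustration; not the WildDet3D API.

@dataclass(frozen=True)
class TextPrompt:
    category: str  # e.g. "chair"; asks for every instance in the scene

@dataclass(frozen=True)
class PointPrompt:
    x: float  # pixel column of the click
    y: float  # pixel row of the click

@dataclass(frozen=True)
class BoxPrompt:
    x1: float  # left edge, pixels
    y1: float  # top edge, pixels
    x2: float  # right edge, pixels
    y2: float  # bottom edge, pixels

Prompt = Union[TextPrompt, PointPrompt, BoxPrompt]

def validate(prompt: Prompt, width: int, height: int) -> bool:
    """Check that a prompt is well formed for an image of the given size."""
    if isinstance(prompt, TextPrompt):
        return bool(prompt.category.strip())
    if isinstance(prompt, PointPrompt):
        return 0 <= prompt.x < width and 0 <= prompt.y < height
    if isinstance(prompt, BoxPrompt):
        return (0 <= prompt.x1 < prompt.x2 <= width
                and 0 <= prompt.y1 < prompt.y2 <= height)
    return False
```

The practical difference is in what comes back: a text prompt can yield many boxes, while a point or box prompt selects a single object.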

The post describes three internal components. A 2D detector built on the SAM3 vision backbone accepts the three prompt types and identifies objects in the image. A separate geometry backend — a frozen DINOv2 encoder with a trainable depth decoder — estimates per-pixel depth and produces geometry-aware features. These two branches run in parallel. A 3D detection head fuses the 2D detections with the depth features through cross-attention, lifting 2D evidence into full 3D bounding box predictions.
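The fusion step can be illustrated with plain scaled dot-product cross-attention, in which each detected object's feature vector acts as a query over the depth-feature tokens. This is a minimal sketch under stated assumptions, not the WildDet3D head: it uses a single head, no learned projections, and hypothetical function names, and the post does not describe the head at this level of detail.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: one feature vector per 2D detection.
    keys, values: one vector each per depth-feature token.
    Returns one fused vector per query.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this detection to every depth token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the depth-token values.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# Two depth tokens carrying depths of 1 m and 5 m; the query matches the first key.
fused = cross_attend([[10.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [5.0]])
```

In this toy case the fused value lands close to 1.0 because the query aligns with the first token; a query with no preference would average the two.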

The geometry backend is described as modular: it is decoupled from the detection backbone so that different depth models can be swapped in without rearchitecting the system. The backend uses a ray-aware decoder that incorporates camera geometry directly using spherical harmonic encodings of camera ray directions, removing the need for a separate camera calibration branch.
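As a rough illustration of what a spherical harmonic encoding of a camera ray looks like, the sketch below computes the unit ray through a pixel and evaluates the real spherical harmonics up to degree 2 on it. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the choice of degree 2, and the function names are assumptions; the post does not specify the camera model or the harmonic degree WildDet3D uses, and wide-angle lenses would generally need a different projection model.

```python
import math

def ray_direction(u, v, fx, fy, cx, cy):
    """Unit ray through pixel (u, v) for a pinhole camera (an assumption here)."""
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    n = math.sqrt(x * x + y * y + z * z)
    return (x / n, y / n, z / n)

def sh_encode(direction):
    """Real spherical harmonics of a unit vector up to degree 2 (9 values)."""
    x, y, z = direction
    c0 = 0.5 * math.sqrt(1.0 / math.pi)        # degree 0
    c1 = math.sqrt(3.0 / (4.0 * math.pi))      # degree 1
    c2 = 0.5 * math.sqrt(15.0 / math.pi)       # degree 2, cross terms
    c20 = 0.25 * math.sqrt(5.0 / math.pi)      # degree 2, zonal term
    return [
        c0,
        c1 * y, c1 * z, c1 * x,
        c2 * x * y, c2 * y * z, c20 * (3.0 * z * z - 1.0), c2 * x * z,
        0.5 * c2 * (x * x - y * y),
    ]

# Encoding for the pixel at the principal point of a 640x480 image.
encoding = sh_encode(ray_direction(320.0, 240.0, 500.0, 500.0, 320.0, 240.0))
```

Because the encoding is a function of ray direction rather than pixel position, the same pixel maps to different features under different focal lengths, which is one plausible reading of how camera geometry can be supplied without a separate calibration branch.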

Data

Alongside the model, Ai2 released WildDet3D-Data: over one million images with 3.7 million verified 3D annotations spanning more than 13,000 object categories, including over 100,000 human-annotated images. The dataset was built by generating candidate 3D boxes for objects in existing large-scale 2D detection datasets — including COCO, LVIS, Objects365, and V3Det — using five complementary 3D estimation methods, then refining candidates through vision-language model selection and human review.

The post states that training on this dataset enables WildDet3D to generalize beyond narrow benchmark taxonomies and perform across 700-plus object categories in the wild.

Benchmark results

On Omni3D — a standard evaluation spanning six indoor and outdoor datasets across 50 categories — WildDet3D reaches 34.2 AP with text prompts, a 5.8-point improvement over the previous best (3D-MOOD), and 36.4 AP with oracle box prompts, surpassing DetAny3D by 2.0 points. The post notes this is achieved with 12 training epochs, compared to 80–120 for prior methods.

When sparse depth is provided at test time, performance increases further: 41.6 AP with text prompts and 45.8 AP with oracle box prompts, with the largest gains on indoor datasets where depth sensors are common.

For zero-shot generalization, the post reports an Open Detection Score (ODS) of 40.3 on Argoverse 2, an autonomous driving dataset with 26 categories, which it describes as nearly doubling the previous best of 23.8, and an ODS of 48.9 on ScanNet, an indoor dataset with 18 categories.
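The figures quoted in this section imply a few numbers that are not stated outright. The short calculation below derives them from the quoted values only; the variable names are illustrative, and the derived baselines are arithmetic consequences of the reported margins rather than figures taken from the post.

```python
# Figures quoted from the post (Omni3D AP, Argoverse 2 ODS).
text_ap, text_margin = 34.2, 5.8        # text prompts, margin over 3D-MOOD
box_ap, box_margin = 36.4, 2.0          # oracle box prompts, margin over DetAny3D
text_ap_depth, box_ap_depth = 41.6, 45.8
av2_ods, av2_previous_best = 40.3, 23.8

# Baseline scores implied by the reported margins.
implied_3dmood_ap = round(text_ap - text_margin, 1)      # 28.4
implied_detany3d_ap = round(box_ap - box_margin, 1)      # 34.4

# Gain from supplying sparse depth at test time.
depth_gain_text = round(text_ap_depth - text_ap, 1)      # 7.4
depth_gain_box = round(box_ap_depth - box_ap, 1)         # 9.4

# Ratio of the Argoverse 2 score to the previous best.
av2_ratio = round(av2_ods / av2_previous_best, 2)        # 1.69
```

On these numbers, sparse depth adds 7.4 AP with text prompts and 9.4 AP with oracle boxes, and the Argoverse 2 result is about 1.69 times the previous best.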

Applications and iOS demo

The post describes several downstream uses. Because WildDet3D accepts a 2D bounding box from any upstream detector and lifts it into 3D frame by frame, the post says it can provide continuous 3D localization across a video stream without training on tracking data. Paired with a vision-language model such as Molmo 2, WildDet3D can serve as a spatial reasoning layer in larger pipelines. The post also describes pairing it with smart glasses for persistent spatial awareness, noting that the full model currently requires server-side compute or further optimization for real-time on-device use.
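The frame-by-frame idea can be sketched with a purely geometric stand-in: take each frame's 2D box, pair it with a depth value, and back-project the box center through pinhole intrinsics. This is not how WildDet3D lifts boxes (the model predicts a full 3D box from learned features); the function names, the intrinsics, and the per-frame depth values are assumptions used only to show that each frame is handled independently, with no tracking state carried over.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) at a given depth (meters) to a 3D point in camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

def lift_stream(frames, fx, fy, cx, cy):
    """Lift one 2D box per frame to a 3D point, independently for each frame.

    frames: iterable of ((x1, y1, x2, y2), depth_m), where the box would come
            from any upstream 2D detector and depth_m is a depth estimate.
    """
    positions = []
    for (x1, y1, x2, y2), depth_m in frames:
        # Use the box center as the pixel to back-project.
        u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        positions.append(backproject(u, v, depth_m, fx, fy, cx, cy))
    return positions

# Two frames: the object moves right in the image and farther from the camera.
frames = [((300, 220, 340, 260), 2.0), ((400, 220, 440, 260), 4.0)]
positions = lift_stream(frames, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Stringing the per-frame outputs together gives a sequence of 3D positions over time, which is the sense in which per-frame lifting can stand in for tracking without any tracking-specific training.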

Ai2 released an iOS demo app that uses live camera input and LiDAR depth to render 3D bounding boxes as AR overlays in real time, as well as an interactive web demo. Training code, updated inference code, and data preparation instructions were added in the April 21 update.