Databricks' own OfficeQA benchmark shows GPT-5.5 scoring 52.63% on full-agent enterprise eval, up from GPT-5.4's 36.10%

Databricks published benchmark results for GPT-5.5 on its internal OfficeQA evaluation, finding substantial gains over GPT-5.4 on document-heavy analytical tasks, according to a post by Hanlin Tang, Ahmed Bilal, Arnav Singhvi, Ivan Zhou, and Harish Gaur.

The post also confirms the Databricks-OpenAI partnership on GPT-5.5 and describes Codex, OpenAI’s coding agent, as now running on the model.

OfficeQA results

Databricks describes OfficeQA as built from 89,000 pages of US Treasury Bulletins. The benchmark tests a model’s ability to retrieve information across documents, interpret complex tables, and perform precise calculations grounded in real enterprise data.

The evaluation ran in two configurations. The first, OfficeQA Pro LLM with Oracle PDF and Web Search, is described by Databricks as testing “the ceiling of what the model can do when retrieval is already handled.” Here GPT-5.5 scored 64.66%, up from GPT-5.4’s 57.14%, a gain of 7.52 percentage points. Databricks characterises this as a roughly 13% improvement, which matches the relative increase in score, and as a new result for this benchmark.

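The second configuration, OfficeQA Pro Agent Harness, requires the model to find documents, parse them, and compute answers independently using the Codex agent harness. GPT-5.5 scored 52.63% in this configuration, compared to GPT-5.4’s 36.10%, a gain of 16.53 percentage points. Databricks describes the gap as “a 46% reduction in errors, showing that GPT-5.5’s gains aren’t just theoretical; they hold up in realistic, end-to-end enterprise workflows.” The reported scores correspond to a relative score increase of about 46%; if error rate is taken as 100% minus the score, the drop from 63.90% to 47.37% is closer to a 26% reduction. The available excerpt does not say how the 46% figure was calculated.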
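The percentages reported for the two configurations can be checked with a few lines of arithmetic. The following is a minimal sketch, not Databricks' methodology: the function name `compare` is hypothetical, and treating the error rate as 100 minus the score is an assumption the post does not state.

```python
def compare(old_pct: float, new_pct: float) -> dict:
    """Compare two benchmark scores given in percent."""
    # Assumption: error rate is simply 100 minus the reported score.
    old_err = 100.0 - old_pct
    new_err = 100.0 - new_pct
    return {
        # Absolute difference in percentage points.
        "point_gain": round(new_pct - old_pct, 2),
        # Increase relative to the old score.
        "relative_gain_pct": round((new_pct - old_pct) / old_pct * 100, 1),
        # Decrease in error rate relative to the old error rate.
        "error_reduction_pct": round((old_err - new_err) / old_err * 100, 1),
    }


# Scores as reported in the Databricks post.
oracle = compare(57.14, 64.66)  # LLM with Oracle PDF and Web Search
agent = compare(36.10, 52.63)   # Agent Harness

print("oracle:", oracle)
print("agent: ", agent)
```

Keeping the three measures side by side makes it easier to see which definition a quoted percentage refers to, since a point gain, a relative gain, and an error reduction can differ widely for the same pair of scores.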

What Databricks says GPT-5.5 is suited for

Databricks describes GPT-5.5 as “OpenAI’s strongest frontier model for agentic work in enterprise, complex document reasoning, and long-horizon coding agents.” That characterisation is Databricks’ framing, not an independent assessment.

According to the post, GPT-5.5 understands intent more quickly, moves through knowledge work tasks (finding information, using tools, checking output) without requiring step-by-step management, and recovers from ambiguity in multi-part tasks.

Codex is described as “now powered by GPT-5.5, with stronger reasoning and execution capabilities for developer workflows.”

Availability

The Databricks post states GPT-5.5 is “coming soon to Databricks,” which is in tension with the companion post from the same date stating it is “available now.” The available excerpt does not resolve this discrepancy. No pricing or throughput commitments for GPT-5.5 on Databricks are included in the available excerpt.

OfficeQA is Databricks’ own benchmark. The post does not cite independent validation of the results or disclose the conditions under which GPT-5.4 baselines were measured.