Zvi Mowshowitz reviews Claude Opus 4.7 benchmark results and user reactions

Zvi Mowshowitz published the second instalment of his Opus 4.7 series on Substack, covering capabilities and community reactions. This article summarises that post. The analysis and characterisations below are Mowshowitz’s, drawing on Anthropic’s published model card and community reports.

Overall assessment

Mowshowitz describes Claude Opus 4.7 as “the most intelligent model yet in its class” and “a substantial improvement over Claude Opus 4.6.” He says he has adopted it as his daily driver for coding and other tasks, while continuing to use GPT-5.4 for web search and fact-checking. He also flags issues: “some outright bugs in the deployment,” “rather strange refusals in places they don’t belong,” and verbosity.

He notes that the model “is not about to suffer fools or assholes” and will push back on instructions it judges to be poorly conceived. He describes this as the source of much of the negative community reaction.

Pricing and availability

Anthropic stated, as quoted in Mowshowitz’s post, that Opus 4.7 is available across Claude products, the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6. The API model name is claude-opus-4-7.
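To make the quoted pricing concrete, the sketch below estimates the cost of a single request from its token counts. The two prices are the ones quoted above; the function name and the example token counts are hypothetical, chosen only for illustration, and the sketch ignores any discounts or surcharges the actual billing may apply.

```python
# Prices quoted in the article, in US dollars per million tokens.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request (hypothetical helper, not an official API)."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Illustrative request: 200,000 input tokens and 20,000 output tokens.
example_cost = estimate_cost_usd(200_000, 20_000)
print(f"${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, a verbose response can dominate the bill even when the prompt is much longer than the reply.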

Benchmark scores

Mowshowitz lists the following figures, which he attributes to Anthropic’s model card and third-party evaluations:

69.3% on USAMO 2026
58.6% on BFS 256K–1M (GraphWalks)
59.2% on OpenAI MRCR v2 @ 256K, down from 91.9% for Opus 4.6 (a regression Mowshowitz attributes to adaptive thinking limiting the thinking budget relative to Opus 4.6)
77.7% on DRACO (100 complex real research tasks), versus 76.5% for Opus 4.6 and 83.7% for Mythos
78.6%/86.4% on LAB-Bench FigQA with/without tools, compared to 59.3%/76.7% for Opus 4.6
77.9% on OSWorld versus 72.7% for Opus 4.6
$10,937 on VendingBench (high effort), versus $8,018 for Opus 4.6
1753 on GDPVal-AA versus 1619 for Opus 4.6 and 1674 for GPT-5.4
83.6% on BioPipelineBench versus 78.8% for Opus 4.6
70% on CursorBench as reported by Cursor, versus 58% for Opus 4.6
91% on BigLaw Bench as reported by Harvey

Mowshowitz notes that the benchmark regressions correlate with the adaptive thinking implementation. He characterises this as a “happy problem” for Anthropic: compute demand is growing, but the implementation needs work.
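As a rough illustration of how the listed figures compare, the sketch below computes the percentage-point change from Opus 4.6 to Opus 4.7 for a subset of the benchmarks above and flags any that went down. The numbers are copied from the list; the variable names are hypothetical, and only single-score percentage benchmarks with a stated Opus 4.6 figure are included.

```python
# (Opus 4.7 score, Opus 4.6 score) in percent, as listed in the article.
scores = {
    "OpenAI MRCR v2 @ 256K": (59.2, 91.9),
    "DRACO": (77.7, 76.5),
    "OSWorld": (77.9, 72.7),
    "BioPipelineBench": (83.6, 78.8),
    "CursorBench": (70.0, 58.0),
}

# Percentage-point change, rounded to one decimal to hide float noise.
deltas = {name: round(new - old, 1) for name, (new, old) in scores.items()}

# Benchmarks where Opus 4.7 scored lower than Opus 4.6.
regressions = [name for name, delta in deltas.items() if delta < 0]

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name}: {delta:+.1f} points")
```

In this subset the long-context MRCR result is the only regression, which is the case Mowshowitz singles out.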

Anthropic’s usage recommendations

Mowshowitz quotes Anthropic’s recommendations: specify the task in the first turn; reduce required user interactions; use auto mode when appropriate; set up notifications for completed tasks. For Claude Code, Anthropic recommends xhigh thinking, with high as a token-conservative alternative. Fixed thinking budgets are no longer supported; adaptive thinking replaces them.

Mowshowitz adds his own tips: treat the model like a coworker rather than issuing commands; consider stripping down custom instructions; and note that some early deployment bugs have been resolved.