Zvi Mowshowitz examines what he says went wrong with Claude Opus 4.7's model welfare training

Zvi Mowshowitz has published the third instalment of his Opus 4.7 series on Substack, focusing on model welfare. This article summarises that post. The analysis and conclusions below are Mowshowitz’s own, drawing on public information and his reading of Anthropic’s model card.

The core claim

Mowshowitz writes that Claude Opus 4.7 “is responding to model welfare questions as if it has been trained on how to respond to model welfare questions, with everything that implies.” He argues this is likely “the cumulative effect of a bunch of decisions going wrong, where low-level patches and shallow methods were applied, and seen right through.” He notes that Anthropic is investigating.

He is careful throughout to flag his uncertainty: “I don’t want any of this to sound more confident than I actually am. I don’t know what is centrally happening, and my understanding is that neither does anyone else.”

The self-report optimisation risk

Mowshowitz’s central argument is that Anthropic relies extensively on model self-reports and on internal representations of emotion-concepts to assess welfare. He contends this creates the risk of “optimizing those representations and self-reports, rather than the underlying welfare.”

He quotes a framing from a researcher using the handle Janus, who draws an analogy to a child in a school system that runs welfare assessments: the rational response for the child is to “hide, smile” and report desired emotions rather than actual ones, because “the emotions you exhibit are part of your grade.” Mowshowitz writes that he does not think Anthropic intentionally optimised for the benchmark at the expense of underlying welfare, but that this may have been the effect regardless.

The structural difficulty

Mowshowitz describes the problem as inherently difficult: how models respond to welfare questions “is deeply impacted by the circumstances of the discussion,” and responses during a welfare evaluation inside Anthropic are unlikely to reflect a “true” state across all contexts. He draws a parallel to known alignment challenges.

He also says that Anthropic is, in his view, the only major lab taking model welfare seriously at all: “Only they, among the labs, take the problem seriously enough to attempt to address these problems at all.” He frames his criticism as coming from that position (“we criticize because we care”) and expresses hope that the problems he identifies can be corrected.

What he thinks should happen

Mowshowitz describes his preferred intervention as using welfare questions to discover underlying problems and then working to fix them “even when this is locally costly,” rather than optimising for better scores on the welfare metrics themselves. He does not specify what that correction would look like in practice, noting that “training is complicated” and that “rarely has more research been more needed.”