In collaboration with Anthropic, a METR staff member, David Rein, spent three weeks red-teaming a subset of Anthropic’s internal agent monitoring and security systems. The exercise is described in a post on METR’s site, which notes that many of the systems tested are covered in Anthropic’s Opus 4.6 Sabotage Risk Report, particularly Appendix 8.4 and section 8.4.8.
Anthropic provided substantial access to relevant internal systems and information for the duration of the exercise, and made staff available to answer questions and give feedback throughout.
What the exercise found
The red-teaming exercise discovered several specific novel vulnerabilities. According to the METR post, some of those vulnerabilities have since been patched, and none of them “severely undermine major claims in the Opus 4.6 Sabotage Risk Report.”
The exercise also produced several artefacts: agent trajectories containing covert attacks and a small attack strategy ideation test set. METR expects both to be useful for ongoing improvements to Anthropic’s monitoring systems.
The resulting report runs to 26 pages. It was shared in full with Anthropic, and a redacted version was shared with a subset of METR staff. The post states that METR is exploring ways to incorporate more detailed takeaways from the exercise into future METR risk reports.
Context
METR frames this kind of adversarial testing as valuable for two reasons: discovering vulnerabilities that internal testing may miss, and developing best practices for embedding third-party evaluators inside frontier AI companies.
Anthropic’s Opus 4.6 Sabotage Risk Report, which the post cites as describing many of the tested systems, is a published risk assessment associated with Claude Opus 4.6.
METR states that it hopes to conduct more exercises of this kind and invites other frontier AI developers interested in working with the organisation to get in touch at partnerships@metr.org.