In Import AI 452, Jack Clark covers a Lyptus Research paper studying how well AI systems perform on cyberoffense tasks, spanning models from 2019 to the present.

The cyberoffense scaling law

According to Clark’s coverage, across frontier models released since 2019, the doubling time for cyberoffense capability is 9.8 months. Restricting the fit to models released since 2024, the doubling time shortens to 5.7 months.
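To make the two doubling times easier to compare, the short sketch below converts each into an implied growth multiple per year. It assumes that the quantity that doubles is the task-length horizon (the human-expert time of tasks a model solves at 50% success) and that growth is smoothly exponential. The function name `growth_factor` and the constants are illustrative and not taken from the paper.

```python
def growth_factor(months_elapsed: float, doubling_time_months: float) -> float:
    """Multiplicative growth after `months_elapsed`, assuming a fixed doubling time."""
    return 2.0 ** (months_elapsed / doubling_time_months)


# Doubling times as reported in the coverage above, in months.
DOUBLING_SINCE_2019 = 9.8
DOUBLING_SINCE_2024 = 5.7

# Implied growth over 12 months under each fitted trend (an assumption:
# the trend is treated as a clean exponential).
annual_2019 = growth_factor(12, DOUBLING_SINCE_2019)
annual_2024 = growth_factor(12, DOUBLING_SINCE_2024)

print(f"since-2019 trend: x{annual_2019:.2f} per year")
print(f"since-2024 trend: x{annual_2024:.2f} per year")
```

Under these assumptions the since-2019 trend works out to roughly 2.3x per year and the since-2024 trend to roughly 4.3x per year, which is one way to read how much steeper the recent fit is.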

Clark reports that the most recent frontier models in the study, GPT-5.3 Codex and Opus 4.6, sit above both fitted trendlines, “achieving 50% success on tasks taking human experts 3.1h and 3.2h respectively.” The best current models, according to Clark’s summary of the paper, “achieve 50% success on tasks that take human experts 3.2h, roughly half a working day of professional offensive security work.”

The benchmarks in the study span CyBashBench, NL2Bash, InterCode CTF, NYUCTF, CyBench, CVEBench, and CyberGym. The researchers also built a new dataset of 291 tasks with completion transcripts and time estimates calibrated by ten offensive cybersecurity professionals. Models evaluated run from GPT-2 in 2019 through GPT-5.3 Codex, Opus 4.6, GLM-5, and Sonnet 4.6 in 2026.

On open-weight models, Clark reports the study’s finding that GLM-5 “lags the closed-source frontier by 5.7 months, suggesting that frontier offensive-cyber capability may diffuse into open-weight form on relatively short timelines.”
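As a back-of-envelope reading of that lag, a gap measured in months can be translated into a gap in task-length horizon. The sketch below is not a figure from the paper or from Clark. It assumes that the 5.7-month lag can be combined with the since-2024 doubling time of 5.7 months and with the 3.2h frontier horizon quoted earlier, and that growth is a clean exponential. The function name `horizon_behind_frontier` is hypothetical.

```python
def horizon_behind_frontier(
    frontier_hours: float, lag_months: float, doubling_time_months: float
) -> float:
    """Horizon of a model trailing the frontier by `lag_months`,
    assuming the frontier horizon doubles every `doubling_time_months`."""
    return frontier_hours / 2.0 ** (lag_months / doubling_time_months)


# Inputs quoted in the coverage; combining them this way is an assumption.
FRONTIER_HOURS = 3.2        # best 50%-success horizon reported
LAG_MONTHS = 5.7            # reported GLM-5 lag behind the closed frontier
DOUBLING_TIME_MONTHS = 5.7  # since-2024 doubling time

implied = horizon_behind_frontier(FRONTIER_HOURS, LAG_MONTHS, DOUBLING_TIME_MONTHS)
print(f"implied open-weight horizon: about {implied:.1f}h")
```

Because the lag equals one doubling time under these assumptions, the implied horizon is about half the frontier figure. The study's own estimate for GLM-5 may differ, since the lag could have been computed against a different trendline.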

Clark frames the dual-use concern as follows: “AI that is especially good at helping you find vulnerabilities in code for defensive purposes can easily be repurposed for offensive purposes. The most challenging part of AI is that it is an ‘everything machine.’”