Mollick surveys agentic AI capability gains and early organisational experiments

This article summarises analysis from One Useful Thing by Ethan Mollick. The observations and assessments described are Mollick’s own.

Ethan Mollick, writing in One Useful Thing, argues that AI has entered an agentic phase — one where systems can be given hours of human work and return useful results in minutes — and reviews recent benchmark data, an early organisational experiment, and a week in February he uses to illustrate what near-term disruption may look like.

Benchmark trajectory

Mollick surveys four evaluations he describes as representative of AI capability progress. On the Google-Proof Q&A benchmark — a test of knowledge where, he writes, graduate students using Google score around 34% outside their field and roughly 70% inside it — the best AI systems now score 94%. On GDPval, where industry experts judge AI against experienced human performance on complex tasks, the latest AI systems reach or exceed parity with top-performing humans 82% of the time. Mollick also cites Humanity’s Last Exam (a set of questions written by college professors requiring considerable expertise) and a puzzle benchmark. He writes that each shows “a similar rapid gain in ability with few signs of slowdown, at least until they reach the top possible score on the test.”

Mollick acknowledges that “all of these tests have their own flaws” and that AI remains capable of high-level performance on some tasks while failing at others. He adds that companies are still early in adopting AI and that, as of the time of writing, “remarkably little has changed in most organisations.”

Mollick further cites the METR Long Tasks graph, which attempts to measure how much human work an AI can complete autonomously with some reliability. He notes that the graph has attracted critics and that METR itself has pointed to potential issues, but states that most graphs of AI ability show a similar curve.

StrongDM’s software factory

Mollick describes a public announcement from a three-person team at StrongDM, a security software company, who said they built what they called a Software Factory. Under the factory’s rules, Mollick reports, code must not be written by humans and code must not be reviewed by humans. Each human engineer is expected to spend the equivalent of their salary on AI tokens — at least $1,000 per day, Mollick writes.

According to Mollick’s account, the factory takes human-written product roadmaps and converts them into software using coding agents, while testing agents evaluate the software in a simulated customer environment that the testing agents themselves build as needed. The agents loop back and forth until results satisfy the AI. Humans review the finished product before it ships.
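
As a rough illustration of the loop described above, the following Python sketch pairs a coding step with a testing step and repeats until the tester is satisfied. It is not StrongDM's implementation: every name in it (run_factory, stub_coder, stub_tester, Evaluation) is hypothetical, the agents are plain stand-in functions rather than AI models, and the max_rounds cap is an assumption that does not appear in Mollick's account.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evaluation:
    """Verdict from the (hypothetical) testing agent."""
    passed: bool
    feedback: str


def run_factory(
    roadmap: str,
    coding_agent: Callable[[str, List[str]], str],
    testing_agent: Callable[[str, str], Evaluation],
    max_rounds: int = 10,  # assumed safety cap, not from the source
) -> str:
    """Alternate coding and testing until the tester accepts a candidate."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        # The coding agent sees the roadmap plus all feedback so far.
        candidate = coding_agent(roadmap, feedback)
        result = testing_agent(roadmap, candidate)
        if result.passed:
            # In the account, humans review the product at this point.
            return candidate
        feedback.append(result.feedback)
    raise RuntimeError("agents did not converge within max_rounds")


# Stand-in agents so the sketch runs without any model or API.
def stub_coder(roadmap: str, feedback: List[str]) -> str:
    return f"build of '{roadmap}' (revision {len(feedback) + 1})"


def stub_tester(roadmap: str, candidate: str) -> Evaluation:
    # Pretend the simulated customer environment accepts the third revision.
    ok = "revision 3" in candidate
    return Evaluation(passed=ok, feedback="" if ok else "simulated check failed")


if __name__ == "__main__":
    print(run_factory("add audit log", stub_coder, stub_tester))
```

The sketch keeps the human step outside the loop, matching the description that people supply the roadmap at the start and review the product at the end, while the back-and-forth in between involves only the agents.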

Mollick notes that the StrongDM team shared details publicly and invited outside observers, including Simon Willison and Dan Shapiro, whose accounts he links. He says the particular details of the factory “matter less than the fact that such radical experimentation into how we work is now not only possible, but likely necessary.”

A week in February

Mollick describes a single week in February as an illustration of near-term disruption patterns. On February 22, a financial firm named Citrini Research published what Mollick characterises as a fictional scenario about AI adoption destroying established businesses by 2028. He says it “struck a nerve on Wall Street, leading to major stock market price shifts.” On February 26, financial services company Block announced layoffs of 40% of its staff, citing AI; Mollick writes that “it is likely that the role of AI was greatly exaggerated, and AI was merely used as cover for large-scale layoffs.” On February 27, a public conflict broke out between the Pentagon and Anthropic over who should control the rules governing how the government could use Claude.

Mollick describes all three cases as “not what they first appeared to be” and says the week illustrates what he expects the near future to feel like: sudden reversals, contested claims, and rapid shifts in how AI capability is perceived.

This piece is based solely on Mollick’s account in One Useful Thing.