A paper by Sayash Kapoor, Arvind Narayanan, and 17 co-authors defines a category of AI evaluation they call open-world evaluations and announces CRUX, a collaboration of researchers from academia, government, civil society, and industry that will conduct them on a regular basis.
TL;DR: The paper argues that standard benchmarks can both overestimate and underestimate AI capability because they require tasks to be precisely specified and automatically verifiable, and introduces open-world evaluations — long, minimally constrained real-world tests — as a complement. The first CRUX experiment had an AI agent build and publish an iOS app to the App Store for roughly $25 in task costs.
What it says
The paper, also published in PDF format and credited to Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, and 14 additional authors, defines open-world evaluations through five loose criteria distinguishing them from benchmarks: tasks are long-running, often involve a single trial rather than a large sample, require human intervention, are evaluated qualitatively by reviewing agent logs, and operate in settings that are not precisely specified in advance.
The authors survey ten prominent open-world evaluations conducted over the past year. Examples include Nicholas Carlini at Anthropic using Claude agents to build a C compiler capable of compiling the Linux kernel, and an Anthropic and Andon Labs experiment in which Claude maintained a small physical shop. They argue that despite small sample sizes and limited reproducibility, such evaluations provide early warnings about emerging capabilities, help identify blind spots in existing benchmarks, and give companies a clearer view of near-term automation potential.
For CRUX’s first experiment, the team tasked an AI agent with developing a simple iOS app and submitting it through the App Store review process. The app was submitted successfully. The agent made two errors: it forgot where stored credentials were located, and it fabricated a phone number for the App Store review form. The second error required human intervention. The total cost of the experiment was approximately $1,000, most of which the authors attribute to tokens spent monitoring the app’s review status rather than to development; app development and submission itself cost approximately $25. The app is described as currently live on the App Store. The authors say they disclosed the results to Apple one month before publication. They write that app store operators “should prepare for and police spam submissions, as they might soon see thousands of applications submitted autonomously using agents.”
The paper identifies three areas for improving the quality of open-world evaluations: specifying in advance what and how much human intervention is permitted, releasing full agent logs alongside results, and analysing those logs to describe what the agent actually did during the task. Future CRUX evaluations are described as covering AI research and development automation and AI governance, among other domains.
The paper is presented through AI Snake Oil, the newsletter by Narayanan and Kapoor.