Google Cloud researchers have published a new agent memory framework called ReasoningBank, presented at ICLR 2026, that stores high-level reasoning strategies distilled from both successful and failed task trajectories. The key claim in the research post is that existing memory approaches either record exhaustive action logs or learn only from successes; both fail to extract the transferable, higher-order reasoning patterns that let an agent improve over time.

The problem being solved is persistent: agents deployed in long-running, real-world roles tend to repeat the same strategic mistakes because they lack a mechanism to consolidate what those mistakes actually teach. Trajectory memory stores what happened; workflow memory summarizes what worked. Neither captures the reasoning that should have been applied.

How ReasoningBank structures memory

Each memory item in ReasoningBank has three fields: a title that concisely identifies the core strategy, a brief description, and a content field holding the distilled reasoning steps, decision rationales, or operational insights. That structure is intentional — the post describes it as storing “tactical foresight” rather than procedural records.
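The three-field structure can be pictured as a small record type. This is an illustrative sketch only: the field names mirror the post's description, but the class itself and the example values are assumptions, not code from the release.

```python
# Minimal sketch of a ReasoningBank memory item as the post describes it:
# a title naming the strategy, a brief description, and distilled content.
# The class and example values are hypothetical.
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # concisely identifies the core strategy
    description: str  # brief summary of when the strategy applies
    content: str      # distilled reasoning steps, rationales, or insights

item = MemoryItem(
    title="Verify page state before pagination",
    description="Guardrail distilled from an infinite-scroll failure",
    content="Check the current page identifier before clicking 'Load More' "
            "to avoid re-fetching the same results.",
)
```

Keeping the title and description separate from the full content lets retrieval match on a compact summary while the agent's context receives the detailed reasoning.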

The memory operates in a continuous closed loop. Before taking action, the agent retrieves relevant items from ReasoningBank and includes them in context. After interacting with the environment, an LLM-as-a-judge component self-assesses the resulting trajectory and extracts either success insights or failure reflections. Those are then distilled into new memory items and appended to the bank.
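The loop can be sketched end to end in a few lines. This is a toy, self-contained scaffold under stated assumptions: the helper names (`retrieve`, `distill`, `memory_loop`) and the keyword-overlap retrieval are stand-ins, not the framework's actual components.

```python
# Toy sketch of the closed loop: retrieve -> act -> self-judge -> distill -> append.
# All names and heuristics here are illustrative, not from the release.

def retrieve(bank, query, k=3):
    # Rank stored items by naive keyword overlap with the task
    # (a stand-in for whatever retriever the framework uses).
    words = set(query.lower().split())
    ranked = sorted(bank, key=lambda m: -len(words & set(m["content"].lower().split())))
    return ranked[:k]

def distill(trajectory, success):
    # Turn a judged trajectory into a new memory item: a success
    # insight or a failure reflection (here, a labeled one-liner).
    kind = "success insight" if success else "failure reflection"
    return [{"title": f"{kind}: {trajectory['task']}",
             "content": trajectory["summary"]}]

def memory_loop(task, bank, run_agent, judge):
    guidance = retrieve(bank, query=task)      # 1. recall relevant strategies
    trajectory = run_agent(task, guidance)     # 2. act with memory in context
    success = judge(trajectory)                # 3. LLM-as-a-judge verdict
    bank.extend(distill(trajectory, success))  # 4. grow the bank
    return trajectory

# Toy agent and judge, just to show the bank growing after one episode.
bank = []
run = lambda task, g: {"task": task, "summary": "verified page id before loading more"}
ok = lambda traj: "verified" in traj["summary"]
memory_loop("paginate search results", bank, run, ok)
```

The important structural point is step 4: both verdicts feed the bank, so failures are distilled rather than discarded.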

The post notes that the self-judgment does not need to be perfectly accurate — the framework is described as “quite robust against judgment noise.” That robustness matters in practice, since reliable self-assessment in open-ended agentic tasks is not guaranteed.

The failure path is what distinguishes this from prior work. Rather than discarding unsuccessful trajectories, ReasoningBank actively mines them for counterfactual signals and converts mistakes into preventative guardrails. The post gives a concrete example: instead of learning a rule like “click the ‘Load More’ button,” an agent learning from a failure might instead encode “always verify the current page identifier first to avoid infinite scroll traps before attempting to load more results.” The shift from procedural to preventative is the meaningful part.

Memory-aware test-time scaling

The paper also introduces MaTTS — memory-aware test-time scaling — which ties ReasoningBank to compute scaling at inference time. Standard test-time scaling methods run multiple trajectories but treat the final answer as the only useful output, discarding the exploration. MaTTS treats that exploration as learning material and feeds it into ReasoningBank.

Two scaling modes are described. Parallel scaling has the agent generate multiple distinct trajectories for the same query under memory guidance, then uses self-contrast between successful and poorly-reasoned trajectories to distill more robust strategies. Sequential scaling has the agent iteratively refine reasoning within a single trajectory, with ReasoningBank capturing intermediate insights from the trial-and-error process.
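The parallel mode can be sketched as a sampling-plus-contrast scaffold. Assumptions are labeled in the comments: the function name, the toy sampler, and the success/failure split as the "self-contrast" input are all illustrative; the real system would prompt an LLM with both groups to distill a strategy.

```python
# Illustrative sketch of MaTTS parallel scaling: sample k trajectories for
# the same query, then separate judged successes from failures so a
# distiller can contrast them. Names and heuristics are assumptions.

def parallel_scale(task, sample_trajectory, judge, k=5):
    # Sample k independent trajectories for the same query under memory guidance.
    trajectories = [sample_trajectory(task) for _ in range(k)]
    # Self-contrast input: the two judged groups a distillation prompt would compare.
    successes = [t for t in trajectories if judge(t)]
    failures = [t for t in trajectories if not judge(t)]
    return {"task": task, "successes": successes, "failures": failures}

# Toy sampler and judge, just to exercise the scaffold with fixed outcomes.
outcomes = iter([True, False, True, True, False])
sample = lambda task: {"task": task, "ok": next(outcomes)}
judge = lambda t: t["ok"]
result = parallel_scale("find dataset page", sample, judge, k=5)
```

The evaluation's parallel setting used k=5, which is why the sketch defaults to that value.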

The relationship is designed to be mutually reinforcing: better memory guides exploration toward more promising strategies, and richer exploration generates higher-quality learning signals that improve the memory further.

Benchmark results with Gemini 2.5 Flash

The evaluation used Gemini 2.5 Flash with a ReAct prompting strategy as the foundation, and compared four configurations: a memory-free baseline (Vanilla ReAct), Synapse (trajectory memory), Agent Workflow Memory (AWM, workflow memory), and ReasoningBank.

On WebArena, ReasoningBank without scaling outperformed the memory-free baseline by 8.3%. On SWE-Bench-Verified, the improvement was 4.6%. Beyond accuracy, the framework also reduced the number of steps required: on SWE-Bench-Verified, ReasoningBank cut almost three execution steps per task relative to the memory-free baseline. The efficiency gain follows from the agent having access to prior decision rationales, which reduces aimless exploration.

Adding MaTTS with parallel scaling at a factor of k=5 produced further improvements: an additional 3% success-rate gain over ReasoningBank alone on WebArena, along with 0.4 fewer steps per task.

The post also describes an emergent behavior observed during evaluation. Early in a web-browsing run, the agent’s curated rules resembled simple checklists — “Look for page links.” As the agent accumulated more experience, those simple rules evolved into memories with compositional and preventative logic structures: “Cross-reference tasks continuously with active page filters to ensure retrieved datasets aren’t paginated prematurely.” That shift from checklist to strategy is presented as evidence that the framework is distilling something substantively more useful than its raw inputs.

The paper is available from Google Research, and code is available on GitHub. The broader implication the post draws is that memory-driven experience scaling represents a new dimension for agent improvement — distinct from model scale or architecture changes, and available at deployment time rather than requiring retraining.