Google DeepMind's Decoupled DiLoCo trains LLMs across data centers on standard internet bandwidth

Google DeepMind has published a blog post and paper introducing Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture designed to run large language model training across geographically separated data centers using ordinary internet-scale bandwidth rather than custom high-speed interconnects.

The core problem being solved is synchronization. Conventional large-scale training requires chips in near-perfect lockstep, which works when they are physically co-located but becomes a significant bottleneck across distant facilities. Decoupled DiLoCo splits training across separate “islands” of compute, called learner units, that communicate asynchronously. A failure in one island does not stall the others — training continues, and the failed unit reintegrates when it comes back online.

How it builds on prior work

The announcement traces Decoupled DiLoCo’s lineage to two earlier Google systems. Pathways introduced a distributed AI infrastructure based on asynchronous data flow. DiLoCo, a predecessor method, reduced the bandwidth required between distributed sites. Decoupled DiLoCo, according to the post, “brings those ideas together” by layering asynchronous learner-unit training on top of Pathways.

Rather than requiring frequent synchronization that blocks progress, the system incorporates required communication into longer computation windows. According to the post, this removes the blocking bottleneck in which one part of the system must wait on another, which it identifies as the reason previous data-parallel methods did not scale to global distances.
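
The low-communication pattern can be illustrated with a toy sketch. It is not DeepMind's implementation: the scalar "model," the quadratic loss, the function names (inner_steps, outer_round, train) and all hyperparameters are assumptions made for illustration, and the rounds here run one after another rather than truly asynchronously. Each learner takes many local steps before sharing a single parameter delta, and a learner that is down for a round is skipped rather than waited on.

```python
# Toy sketch of local-update training with infrequent communication.
# Everything here is hypothetical: a one-parameter "model" and a quadratic loss.

def inner_steps(w, target, steps, lr):
    """Run `steps` local SGD steps on the loss 0.5 * (w - target) ** 2."""
    for _ in range(steps):
        grad = w - target
        w -= lr * grad
    return w


def outer_round(global_w, targets, alive, inner=50, inner_lr=0.1, outer_lr=0.7):
    """One communication round; only learners marked alive contribute."""
    deltas = []
    for target, up in zip(targets, alive):
        if not up:
            continue  # a failed learner is skipped, not waited on
        local_w = inner_steps(global_w, target, inner, inner_lr)
        deltas.append(global_w - local_w)  # one delta per round, not per step
    if not deltas:
        return global_w  # nobody reported; keep the current parameters
    avg_delta = sum(deltas) / len(deltas)
    return global_w - outer_lr * avg_delta


def train(targets, rounds, failed_round=None, failed_learner=None):
    """Run `rounds` outer rounds, optionally dropping one learner for one round."""
    w = 0.0
    for r in range(rounds):
        alive = [
            not (r == failed_round and i == failed_learner)
            for i in range(len(targets))
        ]
        w = outer_round(w, targets, alive)
    return w


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]  # each learner's local optimum (made up)
    print(train(data, rounds=30))
    print(train(data, rounds=30, failed_round=3, failed_learner=0))
```

With the four hypothetical learners pulling toward 1.0, 2.0, 3.0 and 4.0, the shared parameter settles near their mean of 2.5 whether or not one learner misses a round.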

What the tests showed

DeepMind reports training a 12-billion-parameter model across four separate US regions using 2–5 Gbps of wide-area networking. The post notes that this bandwidth range is “relatively achievable using existing internet connectivity between datacenter facilities, rather than requiring new custom network infrastructure.” According to the announcement, training completed more than 20 times faster than with conventional synchronization methods.
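
A back-of-the-envelope calculation gives a sense of scale for those bandwidth figures. The sketch below assumes 2 bytes per parameter and that one full, uncompressed copy of the model's parameters crosses the link per exchange; the post states neither, and the function name transfer_seconds is hypothetical. Protocol overhead is ignored.

```python
# Rough estimate only: bytes_per_param=2 and "one full copy per exchange"
# are assumptions for illustration, not details from the announcement.

def transfer_seconds(num_params, bytes_per_param, gbps):
    """Seconds to move one copy of the parameters over a link of `gbps` Gbit/s."""
    bits = num_params * bytes_per_param * 8
    return bits / (gbps * 1e9)


if __name__ == "__main__":
    for gbps in (2, 5):
        secs = transfer_seconds(12e9, 2, gbps)
        print(f"{gbps} Gbps: {secs:.1f} s per full copy")
```

Under these assumptions a single exchange takes roughly 38 to 96 seconds, a cost that is far easier to absorb when it is spread across a long computation window than when every training step has to wait for it.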

To stress-test fault tolerance, the team used “chaos engineering,” deliberately introducing hardware failures during live training runs. The announcement states that Decoupled DiLoCo continued training after the loss of entire learner units and reintegrated them when they came back online. In a simulated environment of 1.2 million chips with high failure rates, the system maintained 88% goodput (the share of time spent on useful training work) compared with 27% for conventional data-parallel training.
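
A deliberately simplified model suggests why decoupling helps goodput as systems grow. It assumes each unit is independently available a fixed fraction of the time, that lockstep training makes progress only when every unit is up, and that decoupled units make progress whenever they are individually up. Restart costs and every other real-world effect are ignored. The function names and the numbers are hypothetical, and the model is not meant to reproduce DeepMind's simulation or its 88% and 27% figures.

```python
# Toy availability model; not DeepMind's simulation.

def lockstep_goodput(availability, num_units):
    """Fraction of time all units are up at once (independent failures assumed)."""
    return availability ** num_units


def decoupled_goodput(availability):
    """Average fraction of units doing useful work when none waits on another."""
    return availability


if __name__ == "__main__":
    a, n = 0.9, 4  # made-up values: 90% availability, four units
    print(f"lockstep:  {lockstep_goodput(a, n):.4f}")
    print(f"decoupled: {decoupled_goodput(a):.4f}")
```

In this toy model, four units at 90% availability give about 66% goodput in lockstep versus 90% when decoupled, and the lockstep figure keeps shrinking as more units are added.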

Tests with Gemma 4 models showed 64.1% average accuracy for Decoupled DiLoCo versus 64.4% for conventional training, a gap of 0.3 percentage points. According to the post, the architecture does not require a quality tradeoff for the resilience it provides.

Mixed hardware generations

One aspect the announcement highlights beyond fault tolerance is the ability to mix hardware generations in a single training run. The post specifically mentions TPU v6e and TPU v5p running together. According to DeepMind, chips from different generations running at different speeds still matched the ML performance of single-chip-type training runs.

The post frames this as turning “stranded resources into useful capacity” — idle compute at any location, on any compatible hardware, becomes available for training jobs. The practical implication: new hardware does not arrive everywhere simultaneously, so the ability to train across generations can alleviate recurring logistical and capacity bottlenecks.

The work was done by a team spanning Google DeepMind and Google Research; contributors listed include Arthur Douillard, Keith Rush, Yani Donchev, Zachary Charles, Ayush Dubey, Blake Woodworth, Ionel Gog, Josef Dean, Nova Fallen, and Zachary Garrett, among others.