Decoupled DiLoCo: Resilient Distributed Training Across Data Centers with Low Bandwidth

✍️ OpenClawRadar · 📅 Published: April 27, 2026 · 🔗 Source

Google DeepMind published a paper on Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture that splits compute into separate "learner units" that communicate asynchronously. This allows large models to be trained across geographically distributed data centers with much lower bandwidth requirements than conventional synchronous training.

Key Details

  • Builds on two prior advances: Pathways (asynchronous data flow system) and DiLoCo (reduced bandwidth between data centers).
  • Training is split across decoupled learner units — independent compute islands. A chip failure in one unit doesn't interrupt the others. The system is self-healing: after losing an entire learner unit to hardware failure, training continues and the unit is seamlessly reintegrated once it recovers.
  • Validated with chaos engineering: the team injected artificial hardware failures during training runs. Decoupled DiLoCo maintained high "goodput" (useful training time), while the goodput of conventional methods dropped sharply under failure.
  • Trained a 12-billion-parameter model across four separate U.S. regions using 2-5 Gbps wide-area networking, a level achievable with existing internet connectivity between data centers.
  • Achieved the same benchmarked ML performance (tested with Gemma 4 models) as conventional training approaches.
  • Reported to be more than 20× faster than conventional synchronization methods, because communication is overlapped with computation and so avoids blocking bottlenecks.
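
To give a rough sense of what a 2-5 Gbps link means for a 12-billion-parameter model, the back-of-the-envelope sketch below estimates how long one full exchange of the weights would take. The 2 bytes per parameter (a bf16 payload) and the function name `transfer_seconds` are assumptions made for illustration, not details from the paper; a real system may compress or shard what it sends.

```python
def transfer_seconds(num_params: float, bytes_per_param: float, link_gbps: float) -> float:
    """Seconds needed to send one copy of the parameters over the link."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)


PARAMS = 12e9        # 12 billion parameters (figure from the article)
BYTES_PER_PARAM = 2  # assumption: bf16 payload, not stated in the source

for gbps in (2, 5):
    secs = transfer_seconds(PARAMS, BYTES_PER_PARAM, gbps)
    print(f"{gbps} Gbps: {secs:.1f} s per full exchange")
```

Under these assumptions a full exchange takes roughly 40 to 100 seconds, which is only tolerable if it happens infrequently and overlaps with computation rather than blocking every training step.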

Architecture Overview

The system incorporates communication into longer computation periods instead of requiring synchronous all-reduce across all chips. This avoids "blocking" where one part of the system must wait for another. The result is resilient training that can tap unused compute anywhere, turning stranded resources into useful capacity.
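
The idea of many local steps followed by an occasional merge, with failed units simply skipped, can be illustrated with a toy single-process simulation. This is a sketch under stated assumptions, not the paper's algorithm: the functions `inner_steps` and `outer_round`, the least-squares toy loss, plain SGD, and simple delta averaging are all stand-ins chosen for brevity, and the loop runs sequentially, so it does not model real asynchrony or communication overlap.

```python
import random


def inner_steps(params, data, steps, lr):
    """Run `steps` local SGD steps on a toy loss 0.5 * (p - target)^2."""
    p = list(params)
    for _ in range(steps):
        target = random.choice(data)
        # Gradient of the toy loss with respect to p_i is (p_i - target_i).
        p = [pi - lr * (pi - ti) for pi, ti in zip(p, target)]
    return p


def outer_round(global_params, shards, alive, inner=20, inner_lr=0.1, outer_lr=1.0):
    """One round: each live learner unit trains locally, then deltas are averaged."""
    deltas = []
    for shard, up in zip(shards, alive):
        if not up:
            continue  # a failed unit is skipped; the others are not blocked
        local = inner_steps(global_params, shard, inner, inner_lr)
        deltas.append([l - g for l, g in zip(local, global_params)])
    if not deltas:
        return global_params  # no unit reported, so nothing to merge
    n = len(global_params)
    avg = [sum(d[i] for d in deltas) / len(deltas) for i in range(n)]
    return [g + outer_lr * a for g, a in zip(global_params, avg)]


random.seed(0)
# Four hypothetical learner units, each holding noisy samples around (1.0, -2.0).
shards = [
    [(1.0 + random.uniform(-0.1, 0.1), -2.0 + random.uniform(-0.1, 0.1)) for _ in range(8)]
    for _ in range(4)
]
params = [0.0, 0.0]
for rnd in range(10):
    # Unit 2 is "down" for rounds 3 to 5, then rejoins from the current global state.
    alive = [True, True, rnd not in (3, 4, 5), True]
    params = outer_round(params, shards, alive)
print(params)
```

In this toy, the merged parameters still settle near the shared target even though one unit misses three rounds, and the returning unit needs no special handling because every round starts from the current global parameters.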

Who It's For

Teams training large language models or other frontier models across multiple data centers who need fault tolerance without sacrificing performance or requiring custom network infrastructure.

📖 Read the full source: HN AI Agents
