Decoupled DiLoCo: Resilient Distributed Training Across Data Centers with Low Bandwidth

Google DeepMind published a paper on Decoupled DiLoCo (Distributed Low-Communication), a distributed training architecture that decouples compute into separate "learner units" that communicate asynchronously. This allows training large models across geographically distributed data centers with much lower bandwidth requirements than traditional synchronized approaches.
Key Details
- Builds on two prior advances: Pathways (an asynchronous dataflow system) and DiLoCo (which reduced the bandwidth needed between data centers).
- Training is split across decoupled learner units — independent compute islands. A chip failure in one unit doesn't interrupt the others. The system is self-healing: after losing an entire learner unit to hardware failure, training continues and the unit is seamlessly reintegrated once it recovers.
- Validated with chaos engineering: artificial hardware failures were injected during training runs. Decoupled DiLoCo maintained high "goodput" (useful training time), while the goodput of conventional methods dropped sharply under failure.
- Trained a 12-billion-parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking, a level achievable with existing internet connectivity between data centers.
- Matched the benchmark performance of conventional training approaches (tested with Gemma 4 models).
- Reported to be more than 20× faster than conventional synchronization methods, because communication is overlapped with computation instead of blocking it.
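A back-of-the-envelope calculation helps put the 2-5 Gbps figure in context. The sketch below estimates how long it would take to send one full copy of a 12-billion-parameter model over such a link. The 2 bytes per parameter (bf16) and the function name `transfer_seconds` are assumptions made for illustration; the source does not say what is exchanged between learner units or at what precision.

```python
def transfer_seconds(num_params: float, bytes_per_param: float, link_gbps: float) -> float:
    """Seconds needed to send one full copy of the parameters over a link."""
    bits = num_params * bytes_per_param * 8  # payload size in bits
    return bits / (link_gbps * 1e9)          # link speed in bits per second


# Assumed: 12e9 parameters at 2 bytes each (bf16), over the reported 2-5 Gbps range.
for gbps in (2, 5):
    secs = transfer_seconds(12e9, 2, gbps)
    print(f"{gbps} Gbps: {secs:.1f} s per full exchange")
```

Under these assumptions a single full exchange takes roughly 38 to 96 seconds. Synchronizing at that cost on every training step would leave accelerators waiting on the network, which is the situation that infrequent, overlapped communication is meant to avoid.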
Architecture Overview
Rather than requiring a synchronous all-reduce across all chips, the system folds communication into longer stretches of computation. This avoids blocking, where one part of the system must wait for another. The result is resilient training that can use idle compute wherever it is available, turning stranded resources into useful capacity.
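The general idea of long local computation with occasional exchange, and of continuing when a unit is missing, can be illustrated with a toy example. The sketch below is not the paper's algorithm: it is a simplified, round-based version of local updates with periodic averaging, in which each hypothetical "learner unit" runs several gradient steps on its own before the results are merged, and any unit marked as failed in a round is skipped. The quadratic loss, the function names, and the failure schedule are all assumptions for illustration.

```python
from typing import Dict, List, Set


def local_steps(params: List[float], target: List[float], steps: int, lr: float) -> List[float]:
    """Run `steps` of gradient descent on the toy loss 0.5 * ||params - target||^2."""
    p = list(params)
    for _ in range(steps):
        p = [x - lr * (x - t) for x, t in zip(p, target)]
    return p


def train(
    targets: List[List[float]],
    rounds: int,
    inner_steps: int,
    lr: float,
    failed_in_round: Dict[int, Set[int]],
) -> List[float]:
    """Merge local updates once per round, skipping units that are down."""
    global_params = [0.0] * len(targets[0])
    for r in range(rounds):
        down = failed_in_round.get(r, set())
        deltas = []
        for unit, target in enumerate(targets):
            if unit in down:
                continue  # a failed unit contributes nothing this round
            # Each healthy unit starts from the current global parameters,
            # so a recovered unit rejoins without any special handling.
            local = local_steps(global_params, target, inner_steps, lr)
            deltas.append([l - g for l, g in zip(local, global_params)])
        if not deltas:
            continue  # every unit is down: keep the parameters and wait
        avg = [sum(col) / len(deltas) for col in zip(*deltas)]
        global_params = [g + a for g, a in zip(global_params, avg)]
    return global_params


# Four toy units, each with its own data (here, a different target value).
targets = [[1.0], [2.0], [3.0], [4.0]]
healthy = train(targets, rounds=20, inner_steps=10, lr=0.1, failed_in_round={})
faulty = train(targets, rounds=20, inner_steps=10, lr=0.1,
               failed_in_round={3: {2}, 4: {2}, 5: {2}})
print(healthy, faulty)
```

In this toy setting, units exchange data once per round instead of once per step, and losing unit 2 for three rounds only shifts the merged parameters temporarily; both runs end close to the same value once the unit rejoins. A real system would also overlap the exchange with ongoing computation, which this sketch leaves out.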
Who It's For
Teams training large language models or other frontier models across multiple data centers who need fault tolerance without sacrificing performance or requiring custom network infrastructure.
📖 Read the full source: HN AI Agents