Bug Hunt: WireGuard Crashes and MTU Mismatch in GKE

Lovable's infrastructure team debugged a cluster-wide networking issue on Google Kubernetes Engine (GKE) that caused intermittent connection failures. Using an AI agent to scan ClickHouse logs, they discovered that the anetd pods (Google's Cilium implementation) had each crashed about 120 times over six days, nearly once per hour. Crash dumps revealed a concurrent map-access panic in Google's WireGuard integration code, not in WireGuard itself.
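The "nearly once per hour" figure follows from simple arithmetic on the crash count. The sketch below shows that calculation, plus one way crash events might be tallied per pod. The log rows and the "panic" marker are hypothetical, since the write-up does not show the actual log format or the query the agent ran.

```python
from collections import Counter


def crashes_per_pod(log_rows):
    """Count crash events per pod from (pod_name, message) rows."""
    counts = Counter()
    for pod, message in log_rows:
        # Hypothetical marker: the real crash log text is not given in the write-up.
        if "panic" in message.lower():
            counts[pod] += 1
    return counts


def crashes_per_hour(crash_count, days):
    """Average crash rate per hour over a window given in days."""
    return crash_count / (days * 24)


# Made-up sample rows, for illustration only.
rows = [
    ("anetd-a", "fatal panic in agent"),
    ("anetd-a", "agent started"),
    ("anetd-b", "PANIC: restarting"),
    ("anetd-a", "panic again"),
]

print(crashes_per_pod(rows))
print(round(crashes_per_hour(120, 6), 2))  # 0.83
```

With 120 crashes over six days (144 hours), the average works out to roughly 0.83 crashes per pod per hour.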
First Fix: Disable Transparent Encryption
Google support recommended disabling node-to-node encryption to bypass the WireGuard bug. The team applied the change and restarted all anetd pods. The crashes stopped, and for about four hours all was quiet. Then users started seeing random connection failures to Valkey (their in-memory data store).
Second Bug: MTU Mismatch
Engineer Erik used tcpdump and Wireshark to capture packets. The smoking gun was an ICMP error: "Destination unreachable (Fragmentation needed)". Here's the cause:
- With WireGuard enabled, cluster MTU was set to 1420 bytes (accounting for WireGuard's 80-byte encapsulation overhead).
- After disabling WireGuard, configs should have reverted to standard 1500 bytes, but some nodes weren't restarted — they still used the old 1420 MTU.
- Valkey connections crossing nodes with mismatched MTUs failed intermittently.
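The MTU arithmetic above can be sketched in a few lines. This is a deliberately simplified model, not how the kernel or anetd decides anything: it assumes a packet gets through only if it fits the smaller of the two MTUs involved. The function names are illustrative. Under that assumption, small packets pass in both configurations, which is one plausible reason the failures looked intermittent.

```python
ETHERNET_MTU = 1500        # standard MTU named in the write-up
WIREGUARD_OVERHEAD = 80    # encapsulation overhead named in the write-up


def wireguard_mtu(link_mtu=ETHERNET_MTU, overhead=WIREGUARD_OVERHEAD):
    """MTU left for pod traffic once WireGuard encapsulation is subtracted."""
    return link_mtu - overhead


def fits(packet_size, sender_mtu, receiver_mtu):
    """Simplified model: a packet gets through only if it fits the smaller MTU."""
    return packet_size <= min(sender_mtu, receiver_mtu)


stale = wireguard_mtu()   # node that kept the old WireGuard-era setting
fresh = ETHERNET_MTU      # node that picked up the new setting

for size in (512, 1420, 1421, 1500):
    print(size, fits(size, fresh, stale))
```

In this model, packets of up to 1420 bytes pass between a fresh node and a stale one, while anything larger does not fit.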
Resolution
The fix was a rolling restart of all nodes, so that every node picked up the same MTU configuration. This eliminated the fragmentation errors and restored stability.
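A consistency check of this kind can be expressed as a small audit over per-node MTU values. The sketch below assumes those values have already been collected into a dictionary. The write-up does not say how the team verified the rollout, and the node names here are made up.

```python
def find_mtu_outliers(node_mtus, expected=1500):
    """Return the nodes whose MTU differs from the expected value."""
    return {node: mtu for node, mtu in node_mtus.items() if mtu != expected}


# Hypothetical snapshot: one node still carries the WireGuard-era MTU.
observed = {"node-a": 1500, "node-b": 1420, "node-c": 1500}

print(find_mtu_outliers(observed))  # {'node-b': 1420}
```

An empty result would indicate that every node in the snapshot reports the expected MTU.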
Key Takeaways
- The first bug was in Google's anetd integration of WireGuard: a concurrency bug in map access. It's specific to GKE's implementation.
- Disabling encryption bypassed the panic but introduced an MTU mismatch that needed a full node rollout.
- AI agents helped surface the anetd crash pattern quickly from millions of log lines.
📖 Read the full source: HN AI Agents