GPUDirect RDMA on macOS: Zero-Copy Metal Buffer RDMA via libibverbs

A follow-up to the TinyGPU investigation reports that Apple's RDMA implementation supports zero-copy memory sharing with Metal GPU buffers, and that hidden symbols in Apple's libibverbs build point to possible GPUDirect RDMA support. Both behaviors are undocumented and were previously unknown.

Key Findings

The developer tested ibv_reg_mr() with various memory types on a 4-node Mac cluster (3x M3 Ultra + M5 Max MacBook Pro, ~1.5TB unified memory, Thunderbolt 5). Results:

malloc() — FAIL (unexpected; works on Linux)
posix_memalign() — FAIL (unexpected)
mmap(MAP_ANON) — PASS (expected)
IOSurfaceGetBaseAddress() — PASS (no documentation)
MTLBuffer.contents (Metal shared) — PASS (no documentation)

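These results suggest that Apple's RDMA stack validates how a region is mapped in the VM system, not what physically backs it. Heap allocations from malloc() and posix_memalign() are rejected, while directly VM-mapped memory (mmap, IOSurface, Metal shared buffers) is accepted. This is a key difference from Linux, where ibv_reg_mr() accepts ordinary heap memory.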
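The passing case can be sketched in C. This is a minimal sketch, not the developer's test harness: the helper names alloc_vm_region, is_page_aligned, and try_register are hypothetical, and the verbs call is compiled only when USE_VERBS is defined, on the assumption that Apple's libibverbs exposes the standard ibv_reg_mr() signature and header layout used on Linux.

```c
#define _DEFAULT_SOURCE /* glibc: keep MAP_ANON visible under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifdef USE_VERBS
#include <infiniband/verbs.h> /* assumption: same header path as on Linux */
#endif

/* Hypothetical helper: allocate an anonymous VM mapping, the plain
 * allocation type that registered successfully in the results above.
 * Returns NULL on failure. */
static void *alloc_vm_region(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if addr sits on a page boundary, as mmap results do. */
static int is_page_aligned(const void *addr)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (uintptr_t)addr % page == 0;
}

#ifdef USE_VERBS
/* Hypothetical helper: returns 1 if the region registers, 0 otherwise.
 * pd must come from ibv_alloc_pd() on an already opened device. */
static int try_register(struct ibv_pd *pd, void *addr, size_t len)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL)
        return 0;
    ibv_dereg_mr(mr); /* only probing, so release immediately */
    return 1;
}
#endif
```

Running try_register() against a malloc() pointer and against an alloc_vm_region() pointer on the same protection domain would reproduce the FAIL/PASS split listed above, if the reported behavior holds on the machine under test.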

Zero-Copy Proven

A single 64MB mmap buffer was registered three ways: as an RDMA memory region, as a Metal GPU buffer, and as an IOSurface. All three registrations succeeded, with the same lkey=0x101 reported, confirming that the GPU and the network path can share the same memory without an intermediate copy.
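Sharing one buffer this way depends on the allocation meeting every consumer's constraints at once. The source does not say which Metal call was used. Assuming it was the no-copy buffer constructor (newBufferWithBytesNoCopy), which Apple documents as requiring a page-aligned pointer and a length that yields a page-aligned region, a pre-flight check might look like the sketch below. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Round len up to the next multiple of the system page size. */
static size_t round_up_to_page(size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (len + page - 1) / page * page;
}

/* Hypothetical pre-flight check: returns 1 if (addr, len) is page-aligned
 * at both ends, which is the assumed precondition for wrapping the region
 * in a GPU buffer without copying it. */
static int is_no_copy_candidate(const void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return addr != NULL && len != 0 &&
           (uintptr_t)addr % page == 0 && len % page == 0;
}
```

A 64MB region obtained from mmap passes this check on both 4KB and 16KB page systems, since mmap returns page-aligned addresses and 64MB is a multiple of either page size.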

Hidden GPUDirect RDMA Symbols

Analysis of Apple's libibverbs.dylib via nm -a revealed undocumented symbols, including ibv_reg_dmabuf_mr. On Linux, this call registers a dma-buf-backed memory region and is the mechanism that enables GPUDirect RDMA. Its presence suggests Apple may already have implemented the kernel-level plumbing, even though the API is not publicly exposed.
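The same check can be made at runtime instead of with nm. The sketch below uses the standard dlopen()/dlsym() calls; the helper name probe_symbol is hypothetical, and the library path to pass in is an assumption, since the source gives only the file name libibverbs.dylib.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Returns 1 if `symbol` resolves in the library at `path`, 0 if the
 * library loads but lacks the symbol, and -1 if the library cannot be
 * loaded. A NULL path searches the running program and its dependencies. */
static int probe_symbol(const char *path, const char *symbol)
{
    void *handle = dlopen(path, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL)
        return -1;
    int found = dlsym(handle, symbol) != NULL;
    dlclose(handle);
    return found;
}
```

A call such as probe_symbol("libibverbs.dylib", "ibv_reg_dmabuf_mr") would report whether the symbol is exported. A successful lookup shows only that the symbol exists, not that the call is functional on macOS.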

Blackwell eGPU Status

The RTX PRO 5000 Blackwell 72GB in a Razer Core X V2 is detected (PCIe link up, x4 @ 16 GT/s, 80 Gb/s TB5), and TinyGPU's DriverKit extension loads. However, NVIDIA's GSP firmware then fails to initialize with RuntimeError: RPC call 4097 failed with result 101. Decoding the NOCAT error reveals FBFLCN UNRECOGNIZED_CLIENT: the GPU's memory fabric does not recognize the PCIe peer through TB5. This is a known issue (tinygrad#15843), and AMD GPUs work fine in the same setup. The developer is asking the tinygrad team to collaborate on fixing GSP firmware init over TB5.

Who This Is For

Developers working on macOS GPU compute, RDMA, or eGPU infrastructure, especially those interested in zero-copy data paths for distributed inference or training.

📖 Read the full source: r/LocalLLaMA