Disaggregated Inference (Part 2): Designing the RoCE Fabric. Why L3 to the Host Wins

Manish@githedgehog.com (Manish Vachharajani) — Wed, 10 Jun 2026 00:52:36 GMT

For most of the last few years, when network engineers talked about RoCE, they meant the inside of an AI training cluster: the lossless, high-bandwidth back-end that stitches thousands of GPUs together, where synchronous all-reduce is the performance-limiting step and the network is the bottleneck. That association is now incomplete. Modern AI inference stacks split inference into two stages, prefill and decode, to maximize performance and reduce cost and power. As a result, the RoCE back-end has quietly been promoted from plumbing for training runs to the critical interconnect for production inference.
