For most of the last few years, when network engineers talked about RoCE, they were talking about the inside of an AI training cluster — the lossless, high-bandwidth back-end stitching thousands of GPUs together, with synchronous all-reduce as the performance-limiting step and the network as the bottleneck. That association is now incomplete. Modern AI inference stacks split the inference computation into two stages to maximize performance and reduce cost and power — and the RoCE back-end has quietly been promoted from plumbing for training runs to the critical interconnect for inference in production.
This post covers the state of the art in RoCE network design for disaggregated inference — in particular, why modern RoCE fabrics are all Layer 3 designs rather than legacy Layer 2 designs. For the market context — why NVIDIA bought Groq, why Cerebras’s IPO was an inference story, and why the industry converged on this architecture — see Part 1: “Disaggregated Inference: Why the AI Network Is the Product.”
Classically, LLM inference ran as a monolith: on a single GPU if the model fit, or across GPUs in a single scale-up domain if it didn’t. But LLM inference has two phases with sharply different hardware appetites. Prefill ingests the prompt and builds the KV cache — it is compute-bound and wants FLOPS. Decode generates tokens one at a time against that cache — it is memory-bandwidth-bound and wants fast data movement, not raw matrix throughput.
The fix the industry has converged on is to split the pools. Prefill runs on one set of devices sized for FLOPS; decode runs on a different set sized for memory bandwidth and concurrent data load. vLLM, Mooncake, SGLang, NVIDIA Dynamo, and the llm-d project all ship disaggregated serving as a first-class deployment mode, and operators including Meta and Perplexity run it in production with published goodput improvements over the colocated design.
The split matters for accelerator economics. Prefill still wants GPU-class FLOPS — today only NVIDIA GPUs (Hopper, Blackwell) realistically clear that bar at frontier scale. Decode opens the door to alternative accelerators like Cerebras wafer-scale engines and SambaNova RDUs, both more cost- and power-efficient for token generation. They don’t displace the GPU on prefill; both are credible on decode.
The disaggregation thesis, then: increase performance and lower cost by having GPUs do prefill, alternative accelerators do decode, and a fast network in the middle move the KV cache between them. That last part is where this post really starts.
Whatever the prefill GPU computes, the decode accelerator has to consume before it can generate the first token. The KV cache for a single in-flight request on a 70B-class model at long context runs to many gigabytes; multiply by hundreds or thousands of concurrent sessions on a busy decode pool and the working set is enormous and constantly churning. If the prefill-to-decode handoff costs more time than splitting the two phases saves, disaggregation has made things worse.
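To put a rough number on "many gigabytes", the sketch below estimates the KV cache for one request. The model shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128, 16-bit values) is an assumption chosen to resemble a Llama-style 70B model, and the function name is illustrative. Real figures shift with quantization and attention layout.

```python
def kv_cache_bytes(seq_len, layers=80, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for one request.

    The factor of 2 covers the separate K and V tensors kept per layer.
    Defaults are assumed values resembling a Llama-style 70B model with GQA.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * seq_len


if __name__ == "__main__":
    for ctx in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(ctx) / 2**30
        print(f"context {ctx:>7} tokens: {gib:6.2f} GiB per request")
```

Under these assumptions each token costs 320 KiB of KV state, so a 32K-token prompt carries about 10 GiB that has to reach the decode side before the first token can be generated.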
The design constraint on the network is therefore precise: move multi-gigabyte tensors between accelerators on different hosts, at line rate, with predictable sub-millisecond tail latency, and without consuming host CPU. There is no shared bus across vendor boundaries — a GPU and a Cerebras WSE don’t share NVLink, a PCIe root complex, or a memory domain. The only realistic substrate that connects them today is Ethernet.
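A quick way to see why line rate matters is to compute the ideal serialization time for a KV cache on a single link. This is a back-of-the-envelope sketch: the 10 GiB payload and the 0.95 efficiency factor are assumptions, not measurements, and the function name is illustrative.

```python
def transfer_seconds(payload_bytes, link_gbps, efficiency=0.95):
    """Ideal time to push a payload through one link.

    efficiency is an assumed fraction of line rate left after headers
    and protocol overhead; it is not a measured value.
    """
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return payload_bytes * 8 / usable_bits_per_second


if __name__ == "__main__":
    payload = 10 * 2**30  # assumed 10 GiB KV cache for one request
    for gbps in (100, 400, 800):
        ms = transfer_seconds(payload, gbps) * 1e3
        print(f"{gbps} Gb/s: {ms:.0f} ms")
```

Even on a single 400 Gb/s link the assumed payload takes on the order of a couple hundred milliseconds, all of it on the critical path to the first token. Any time the fabric spends below line rate is added directly to that figure.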
You cannot move that much data, that fast, through a kernel networking stack. The CPU and OS are simply not in the budget. RDMA solves this: a NIC reads or writes the memory of a device on another host without involving either host’s CPU. With GPUDirect RDMA (and the peer-to-peer equivalents alternative-accelerator vendors implement), the NIC writes directly into the accelerator’s HBM or SRAM. The OS networking stack is never in the data path; the payload never touches host DRAM.
In production AI fabrics, RDMA runs over Ethernet using RoCEv2 — RDMA over Converged Ethernet, version 2. Of the many design choices behind RoCEv2, exactly one matters for the rest of this post: RoCEv2 packets are UDP/IP packets. They are routable. They can traverse a Layer 3 network, hop across leaf and spine switches, and be load-balanced across many equal-cost paths between any two endpoints. The “v2” is what turns “send the KV cache from the prefill pool to the decode pool” into a single one-sided RDMA write across a routed fabric — and it is what lets any accelerator with a standards-compliant RoCEv2 NIC plug into the same fabric a GPU plugs into. That is the lever the open AI ecosystem is pulling on.
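To make "RoCEv2 packets are UDP/IP packets" concrete, here is a minimal sketch that packs the two headers a switch or capture tool would see after the IP header: an ordinary UDP header with destination port 4791, followed by the 12-byte InfiniBand Base Transport Header (BTH). The helper names are illustrative. A real RDMA WRITE also carries an extended header with the remote address, the payload, and a trailing ICRC, all omitted here.

```python
import struct

ROCEV2_UDP_DPORT = 4791    # IANA-assigned UDP destination port for RoCEv2
RC_RDMA_WRITE_ONLY = 0x0A  # BTH opcode for a single-packet reliable-connection WRITE


def pack_udp_header(src_port, payload_len):
    """8-byte UDP header; the checksum is left at 0 in this sketch."""
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_DPORT, 8 + payload_len, 0)


def pack_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """12-byte Base Transport Header carried inside the UDP payload."""
    flags = 0  # solicited event, migration state, pad count, version all zero
    qp_word = dest_qp & 0xFFFFFF  # top 8 bits reserved, low 24 bits destination QP
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)  # ack-request bit + 24-bit PSN
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


if __name__ == "__main__":
    bth = pack_bth(RC_RDMA_WRITE_ONLY, dest_qp=0x12, psn=100, ack_req=True)
    udp = pack_udp_header(src_port=49152, payload_len=len(bth))
    print(len(udp), len(bth), udp.hex(), bth.hex())
```

Because the outer headers are plain UDP/IP, a leaf or spine can route and hash these packets like any other flow. Senders typically vary the UDP source port per connection so that hashing has some entropy to work with.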
RoCE imposes a tradeoff in exchange: it expects near-lossless behavior. Drops trigger go-back-N retransmits that wreck tail latency. So the fabric needs Priority Flow Control (PFC) to pause specific traffic classes before buffers overflow, ECN/DCQCN so endpoints back off before drops happen, and load balancing intelligent enough not to collapse multi-gigabyte elephant flows onto a single hashed path.
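The "back off before drops happen" half of that sentence can be sketched as the sender-side reaction in DCQCN. This is a deliberately simplified model based on the published DCQCN update rules: it covers only the rate cut on a congestion notification packet (CNP) and the decay of the congestion estimate, and leaves out the recovery and increase stages. The class and method names are illustrative and the gain value is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN sender state (rate decrease and alpha decay only)."""

    rate_gbps: float     # current sending rate
    target_gbps: float   # rate the sender would later recover toward
    alpha: float = 1.0   # running congestion estimate in [0, 1]
    g: float = 1 / 256   # assumed averaging gain

    def on_cnp(self):
        """A congestion notification arrived: remember the rate, then cut it."""
        self.target_gbps = self.rate_gbps
        self.rate_gbps *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self):
        """No notification during the timer window: decay the estimate."""
        self.alpha *= 1 - self.g


if __name__ == "__main__":
    sender = DcqcnSender(rate_gbps=400.0, target_gbps=400.0)
    sender.on_cnp()
    print(sender.rate_gbps, sender.target_gbps, sender.alpha)
```

The point of the sketch is the ordering: ECN marks at the switch lead to notifications, which cut the sender's rate well before a queue overflows. PFC remains the backstop for the cases where that reaction is too slow.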
The east/west network carrying the KV cache between prefill and decode is, in essentially every modern deployment, a leaf/spine Clos. Every host attaches to a leaf, every leaf connects to every spine, there are no direct host-to-host or leaf-to-leaf links, and traffic between hosts on different leaves goes leaf → spine → leaf. Built without oversubscription, the Clos delivers non-blocking bisection bandwidth (multiple Tbps of uplink per leaf), inherent multipath, and a clean failure model: losing one link or one spine costs a fraction of bisection bandwidth rather than triggering a fabric-wide re-convergence event.
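The capacity properties of a two-tier Clos reduce to simple arithmetic, sketched below for a single leaf. The sketch assumes one uplink from the leaf to each spine. The function name and the port counts and speeds in the example are hypothetical.

```python
def clos_leaf_summary(spines, uplink_gbps, host_ports, host_gbps):
    """Basic capacity figures for one leaf in a two-tier leaf/spine Clos.

    Assumes exactly one uplink from the leaf to each spine.
    """
    uplink_capacity = spines * uplink_gbps
    downlink_capacity = host_ports * host_gbps
    return {
        # One equal-cost leaf -> spine -> leaf path per spine.
        "ecmp_paths": spines,
        "uplink_gbps": uplink_capacity,
        # 1.0 or less means the leaf is non-blocking toward the spines.
        "oversubscription": downlink_capacity / uplink_capacity,
        # Share of this leaf's uplink capacity lost when one spine fails.
        "loss_per_spine_failure": 1 / spines,
    }


if __name__ == "__main__":
    # Hypothetical leaf: 8 spines, 800G uplinks, 16 host ports at 400G.
    print(clos_leaf_summary(spines=8, uplink_gbps=800, host_ports=16, host_gbps=400))
```

With these example numbers the leaf has 6.4 Tb/s of uplink, an oversubscription ratio of 1.0, and loses one eighth of its uplink capacity when a spine fails.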
(One topology nuance worth flagging and then setting aside: rails-only designs — where each accelerator NIC sits on a dedicated rail and there is no inter-rail connectivity in the fabric itself — create real challenges for the all-to-all KV traffic disaggregated inference produces, since any prefill device may need to ship into any decode device’s memory and a rails-only fabric has no path between rails. Rail-optimized fabrics that do carry inter-rail traffic are fine. We’ll leave rails-only, and other niche topologies, for another post.)
The question worth spending time on is therefore not the topology — Clos is preferred whenever possible — but the control plane that runs on top of it.
This is not an AI-specific story. Hyperscale operators went L3 to the rack years ago — Meta, Google, and Microsoft published the playbook in the 2010s — and the reasoning is general. AI inference inherits the design rather than driving it. The reasons are worth restating, because they explain why even the most vertically integrated AI networking vendor (NVIDIA’s Spectrum-X) and every published hyperscale AI fabric use the same shape.
Broadcast and BUM. L2 floods every unknown destination out of every port in the VLAN. ARP, ND, and unknown-unicast traffic scale with the number of endpoints, and the broadcast domain is the failure domain — a misbehaving host or a transient loop saturates the whole thing. L3 contains failures to routing adjacencies: a flapping host produces route changes, not fabric-wide flooding.
Spanning Tree is the wrong tool for a Clos. L2 cannot deal well with the multipath found in Clos topologies. STP and RSTP work by disabling links to break loops — but you spent your money on multipath specifically to use all eight, sixteen, or more parallel paths between every leaf and every spine, and spanning tree throws that away. RSTP re-convergence, even at its best, is far too slow when a single event stalls every in-flight transfer on the fabric. TRILL, SPB, and proprietary “fabric path” variants exist as workarounds, but they reinvent routing inside L2 — if you need routing semantics, just route.
MAC tables don’t summarize. L2 forwarding state is per-MAC, and MAC addresses are flat: the value of an address says nothing about where it sits in the network, so there is no equivalent of a /24 and no way to aggregate. Merchant-silicon MAC tables are finite, typically tens to low hundreds of thousands of entries on leaf ASICs, even Tomahawk-class parts (Broadcom’s Tomahawk 5 is the most popular non-NVIDIA ASIC for AI RoCE networks), and a flat L2 fabric of any meaningful size has no way to relieve the pressure. Routes summarize natively: a /24 covers 256 addresses in one FIB entry, a /16 covers 65,536. Hierarchical address aggregation is the only mechanism that has ever scaled addressing in any large network.
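Summarization is easy to demonstrate with Python's standard `ipaddress` module. The sketch below collapses the host routes behind one hypothetical leaf into the fewest covering prefixes; the addresses and the function name are made up for illustration.

```python
import ipaddress


def summarize(routes):
    """Collapse contiguous prefixes into the fewest covering prefixes."""
    networks = [ipaddress.ip_network(r) for r in routes]
    return list(ipaddress.collapse_addresses(networks))


if __name__ == "__main__":
    # 256 host routes behind one hypothetical leaf: 10.0.1.0 through 10.0.1.255.
    rack = [f"10.0.1.{i}/32" for i in range(256)]
    print(len(rack), "host routes ->", summarize(rack))
```

The 256 host routes collapse to the single prefix 10.0.1.0/24. There is no equivalent operation for 256 MAC addresses, because nothing about their values ties them to a common location.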
Multipath needs a routing protocol. L2 has no native concept of path cost, congestion, or alternates beyond ECMP-on-hash where supported. L3 with BGP in a Clos gives true multipath, fast deterministic reconvergence, and a control-plane substrate that adaptive and flowlet-aware schemes can build on. Routing was designed to reason about paths; switching was designed to reason about a single tree. When you have many paths and need all of them, routing is the right primitive.
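Static hash-based ECMP, the baseline that adaptive and flowlet-aware schemes improve on, can be sketched in a few lines. CRC32 stands in here for the switch's hash function, which is vendor-specific, so treat this as an illustration of the behavior rather than any real ASIC's algorithm. The addresses and the function name are hypothetical.

```python
import zlib


def ecmp_path(src_ip, dst_ip, src_port, dst_port, n_paths, proto=17):
    """Pick an uplink index from a hash of the flow 5-tuple.

    CRC32 is a stand-in for the vendor-specific hash a switch ASIC uses.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths


if __name__ == "__main__":
    flow = ("10.0.1.5", "10.0.9.7", 49152, 4791)
    # Every packet of one flow hashes identically, so it stays on one uplink.
    print({ecmp_path(*flow, n_paths=8) for _ in range(1000)})
    # Varying the UDP source port is what spreads traffic across uplinks.
    ports = range(49152, 49152 + 64)
    print(sorted({ecmp_path("10.0.1.5", "10.0.9.7", p, 4791, 8) for p in ports}))
```

The first set always contains a single uplink index: a multi-gigabyte KV transfer that rides one 5-tuple is pinned to one path no matter how many equal-cost paths exist. That is the limitation path-aware load balancing on an L3 fabric is meant to remove.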
The result: routed L3 Clos networks with BGP underlays have been the consensus for hyperscale fabrics for over a decade. Meta’s RoCE clusters are L3. Google’s AI fabrics are L3. NVIDIA Spectrum-X is L3. The AI back-end inherits this design wholesale.
A routed L3 underlay is great for the fabric but inconvenient for workloads that expect L2 adjacency: VM-mobility designs, storage protocols that assume broadcast, tenants that want their own VLAN semantics, brownfield orchestration that does host discovery via ARP. The standard answer in enterprise and multi-tenant DC fabrics is EVPN-VXLAN. VXLAN encapsulates customer Ethernet inside UDP/IP, so the underlay sees only routed traffic and keeps the full benefit of ECMP, fast convergence, and no spanning tree. MP-BGP EVPN is the control plane that distributes MAC and MAC+IP reachability between leaves as Type-2 routes, replacing data-plane MAC learning with route advertisement. ARP gets proxied at the ingress leaf, unknown unicast can be suppressed, and bridge domains can be stretched across the fabric.
At enterprise DC scale — low thousands of hosts, modest tenant counts, moderate churn — this works well. The underlay is clean, the overlay gives applications the L2 semantics they expect, and the operational model is familiar.
EVPN-with-L2-overlay is a fine tool. It is not the right tool for the AI back-end, and the problems compound at AI scale in two specific places that matter for RoCE.
Ingress replication makes broadcast worse, not better. A stretched bridge domain still has BUM events — broadcast, unknown-unicast, multicast — even after EVPN’s ARP suppression and control-plane MAC learning. Those events must be delivered to every VTEP in the EVI, and the standard mechanism is ingress (head-end) replication: the ingress VTEP makes one unicast copy per remote VTEP and sends them all on the underlay. With a dozen leaves in an L2 overlay, fine. At AI-fabric scale — hundreds of leaves carrying tens of thousands of accelerator NICs — one BUM packet becomes hundreds of unicast packets, every event, all sourced from the same ingress leaf. The amplification factor is the number of participating VTEPs. The underlay-multicast alternative (PIM with one tree per EVI) trades replication amplification for a hard operational problem and is not in common use. The amplified traffic lands on exactly the leaf-to-spine links the KV cache is trying to use.
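The amplification is simple to quantify. The sketch below counts the unicast copies and underlay bandwidth one ingress leaf generates for BUM traffic in a single EVI. It ignores VXLAN encapsulation overhead, and the frame rate and frame size in the example are assumptions picked for illustration.

```python
def ingress_replication_load(vteps, bum_frames_per_sec, frame_bytes):
    """Underlay load one ingress leaf generates for BUM traffic in one EVI.

    With head-end replication the ingress VTEP sends one unicast copy
    to every other VTEP participating in the EVI.
    Returns (copies per frame, packets per second, bits per second).
    """
    copies_per_frame = vteps - 1
    packets_per_sec = copies_per_frame * bum_frames_per_sec
    bits_per_sec = packets_per_sec * frame_bytes * 8
    return copies_per_frame, packets_per_sec, bits_per_sec


if __name__ == "__main__":
    # Assumed: 100 BUM frames per second of 128 bytes each at one ingress leaf.
    for vteps in (12, 128, 512):
        copies, pps, bps = ingress_replication_load(vteps, 100, 128)
        print(f"{vteps:>4} VTEPs: {copies} copies/frame, {pps} pkt/s, {bps / 1e6:.1f} Mb/s")
```

The load grows linearly with the number of VTEPs in the EVI, and every copy leaves through the same leaf's uplinks.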
MAC tables still don’t summarize, and the multiplier is worse. In a stretched-L2 EVPN design any host can in principle appear behind any leaf, so every leaf ends up holding roughly the union of MACs in the bridge domain. Aggregate hardware MAC state across the fabric scales as O(endpoints × switches) — every accelerator NIC is an entry, and every switch carries a copy. An AI fabric with tens of thousands of accelerator NICs blows past typical ASIC ceilings. When a MAC table overflows, the silicon does one of two things, both catastrophic for RoCE: it drops frames whose destination isn’t installed (go-back-N retransmits, tail latency destroyed), or it falls back to flooding (turning every unknown-unicast frame into another amplified BUM event — see above).
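A rough model of that scaling, under the worst-case assumption that every leaf learns every MAC in the bridge domain, looks like this. The endpoint count, leaf count, and per-leaf table capacity in the example are hypothetical. Check the data sheet for the actual ASIC before drawing conclusions.

```python
def mac_state(endpoints, leaves, per_leaf_capacity):
    """Compare stretched-L2 MAC state against an assumed leaf table size."""
    per_leaf_entries = endpoints         # worst case: each leaf learns every MAC
    fabric_entries = endpoints * leaves  # O(endpoints x switches) across the fabric
    return {
        "per_leaf_entries": per_leaf_entries,
        "fabric_entries": fabric_entries,
        "utilization": per_leaf_entries / per_leaf_capacity,
        "overflows": per_leaf_entries > per_leaf_capacity,
    }


if __name__ == "__main__":
    # Hypothetical fabric: 40,000 accelerator NICs, 300 leaves, 32K-entry MAC tables.
    print(mac_state(endpoints=40_000, leaves=300, per_leaf_capacity=32_768))
```

In this example every leaf is over its assumed table size, and the fabric as a whole holds 12 million MAC entries for 40,000 endpoints.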
The structural point is that you cannot summarize MACs. Every endpoint is an independent piece of hardware state in every leaf, forever. As long as the design demands L2 adjacency, you don’t get to use the one mechanism that has historically scaled addressing: hierarchical summarization.
The design that matches AI back-end requirements is the one the hyperscalers have been running for years. The underlay is a pure routed Clos all the way to the host: every link is L3 (/31 per link is standard; /32 plus default route also works), every accelerator host or NIC is its own routed endpoint advertising a /32 via BGP (BGP unnumbered is the common configuration), ECMP runs everywhere, and route summarization at aggregation boundaries keeps ASIC table pressure bounded. There is no broadcast domain to storm, no spanning tree, no MAC mobility race, no flat-MAC scaling ceiling — and a real path-aware control plane underneath the adaptive-routing schemes elephant KV flows need.
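For the numbered variant of this design, carving /31s for the leaf-to-spine links is mechanical. The sketch below does it with the standard `ipaddress` module. The pool, the device names, and the function name are hypothetical. With BGP unnumbered the sessions run over IPv6 link-local addresses instead, so this allocation step is not needed.

```python
import ipaddress


def allocate_p2p_links(pool, leaves, spines):
    """Assign one /31 from pool to every leaf-spine link.

    Returns {(leaf, spine): (leaf_addr, spine_addr)}.
    """
    subnets = ipaddress.ip_network(pool).subnets(new_prefix=31)
    plan = {}
    for leaf in leaves:
        for spine in spines:
            net = next(subnets)  # raises StopIteration if the pool is too small
            leaf_addr, spine_addr = list(net)  # a /31 holds exactly two addresses
            plan[(leaf, spine)] = (leaf_addr, spine_addr)
    return plan


if __name__ == "__main__":
    plan = allocate_p2p_links("10.255.0.0/24", ["leaf1", "leaf2"], ["spine1", "spine2"])
    for (leaf, spine), (a, b) in plan.items():
        print(f"{leaf} {a}/31 <-> {spine} {b}/31")
```

A single /24 pool yields 128 point-to-point links, which is one reason /31 addressing is the usual choice when links are numbered at all.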
Multitenancy doesn’t require giving any of that up. L3 VXLAN/EVPN — per-tenant VRFs distributed as EVPN Type-5 IP-prefix routes over L3 VNIs — gives operators the multi-tenant overlay they actually want while preserving every property the L3 underlay was chosen for. Tenants are VRFs, not bridge domains. The control plane distributes summarizable IP prefixes, not flat MACs. There is no bridge domain to broadcast in, so no ingress replication. Hardware table pressure tracks routes (which aggregate) rather than endpoints (which don’t). The thing to avoid is specifically L2 overlay on the AI back-end — EVPN Type-2 MAC routes spreading a flat MAC space across the fabric. Used as an L3 routed multitenancy overlay, EVPN is exactly where it should be.
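The "tenants are VRFs" model can be illustrated with a toy per-tenant routing table. This is a conceptual sketch only: the class, the VNI numbers, and the addresses are all hypothetical, and a real switch implements the lookup in hardware rather than with a linear scan.

```python
import ipaddress


class VrfTable:
    """Toy per-tenant routing table with longest-prefix-match lookup."""

    def __init__(self):
        self.routes = {}  # ip_network -> next hop (for example, a remote VTEP)

    def add(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, address):
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        return self.routes[max(matches, key=lambda net: net.prefixlen)]


if __name__ == "__main__":
    vrfs = {50001: VrfTable(), 50002: VrfTable()}  # keyed by hypothetical L3 VNI
    vrfs[50001].add("10.1.0.0/24", "192.0.2.11")
    vrfs[50002].add("10.1.0.0/24", "192.0.2.22")   # same prefix, different tenant
    print(vrfs[50001].lookup("10.1.0.37"), vrfs[50002].lookup("10.1.0.37"))
```

Two tenants can reuse the same prefix without conflict because each lookup happens inside its own VRF, and one prefix entry stands in for every host behind it. Neither property is available when the shared state is a flat MAC table.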
The recipe: L3 to the host as the underlay; L3 VXLAN/EVPN (Type-5 routes, L3 VNIs, per-tenant VRFs) as the multi-tenant overlay; no L2 on the AI back-end. That is the fabric the KV cache should live on.
RoCE’s job description has changed. It is no longer just the inside of a training pod; it is the connective tissue of disaggregated inference — the substrate that lets a GPU prefill pool hand its KV cache to a non-GPU decode pool at HBM-to-HBM speed.
The fabric underneath, however, is the same Clos design hyperscale data centers have been running for over a decade. AI doesn’t change the L3 decision, because Clos networks went L3 for general scaling reasons that long predate frontier-model inference; it just makes the consequences of getting it wrong more expensive. Broadcast scaling, spanning tree’s incompatibility with multipath, unsummarizable MAC tables, and the absence of real adaptive routing in L2 were problems for web tiers long before they were problems for inference pools.
The L3 fabric is also the substrate for the next round of optimization: MRC (Multipath Reliable Connection), released through OCP in May 2026 by OpenAI with AMD, Broadcom, Intel, Microsoft, and NVIDIA, builds on routed multipath fabrics to push elephant-flow performance well past what hash-based ECMP alone can deliver — exactly the headroom KV transfers will use.
L3 to the host, with L3 EVPN for multitenancy when you need it, is the design that keeps the inference fabric open to whichever accelerator wins the next round on decode — and it is the design Hedgehog makes easy, multitenancy and all.