Disaggregated Inference Network | Lossless RoCE Ethernet

The Open RoCE Fabric Built for Disaggregated Inference

Disaggregated inference splits prefill and decode across separate accelerator pools — and the network in the middle becomes the product.

Hedgehog delivers a lossless RoCEv2 Ethernet fabric that moves the KV cache between pools at line rate, with automated PFC and ECN, on open hardware you own. Full RDMA performance.


Disaggregated Inference: read our detailed two-part blog

When the network drops packets, your GPUs wait

RoCE expects near-lossless behavior. The moment buffers overflow, drops trigger go-back-N retransmits that wreck tail latency — and every stalled transfer idles hardware you paid a fortune for.
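
To make the cost of a single drop concrete, here is a minimal Python sketch of go-back-N accounting. It is a simplified model, not a description of any particular NIC: the function name, the fixed in-flight window, and the rule that each listed sequence number triggers exactly one rewind are all illustrative assumptions.

```python
def go_back_n_packets_sent(total_packets, dropped, inflight_window):
    """Count packets put on the wire under a simplified go-back-N model.

    Assumptions (illustrative only): the sender transmits in bursts of
    `inflight_window` packets, and each sequence number in `dropped`
    triggers exactly one rewind to that sequence number.
    """
    sent = 0
    next_seq = 0
    pending_drops = set(dropped)
    while next_seq < total_packets:
        window_end = min(next_seq + inflight_window, total_packets)
        # Transmit the whole window before the loss is noticed.
        sent += window_end - next_seq
        lost = sorted(s for s in pending_drops if next_seq <= s < window_end)
        if lost:
            first_lost = lost[0]
            pending_drops.discard(first_lost)
            # The receiver discards everything after the gap, so the
            # sender rewinds and resends from the lost packet onward.
            next_seq = first_lost
        else:
            next_seq = window_end
    return sent


clean = go_back_n_packets_sent(1000, [], 100)
one_drop = go_back_n_packets_sent(1000, [0], 100)
print(one_drop - clean)  # extra packets caused by one early drop
```

In this model, one lost packet at the start of a 100-packet window costs a full extra window of retransmissions. The larger the in-flight window, the more a single drop wastes.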

Most teams hit one of two walls: InfiniBand delivers the performance but locks you into a single vendor's stack and supply chain, or a do-it-yourself Ethernet build leaves you hand-tuning PFC, ECN, and multipath across dozens of switches — slow, brittle, and easy to get catastrophically wrong.

There's a third option: open Ethernet that's engineered to be lossless out of the box.

In disaggregated inference, the network is the product

For years, RoCE meant the inside of a training cluster — the lossless back-end stitching thousands of GPUs together. Disaggregated inference changes that. Inference splits into two phases with very different appetites: prefill ingests the prompt and is compute-bound; decode generates tokens and is memory-bandwidth-bound. The industry — vLLM, NVIDIA Dynamo, SGLang, llm-d, and operators like Meta and Perplexity in production — has converged on running them on separate pools of hardware.

That makes the network the critical path of disaggregated inference. The KV cache — multi-gigabyte tensors — has to move from the prefill pool to the decode pool at line rate, with predictable sub-millisecond tail latency, without consuming host CPU. If that handoff is slower than the decode it feeds, disaggregation has made inference worse, not better. Get the RoCE fabric right and the economics of disaggregated inference work. Get it wrong and you've spent more to go slower.
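
A back-of-the-envelope sketch shows why line rate matters for the handoff. The model shape below (80 layers, 8 KV heads, head dimension 128, 2-byte elements), the 32,768-token prompt, and the 400 Gb/s link are hypothetical inputs chosen for illustration, not figures from Hedgehog or any specific deployment. The result is bulk transfer time for the whole cache, which is a separate quantity from per-packet tail latency.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_element):
    """Size of the KV cache for one sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_element * tokens


def transfer_seconds(num_bytes, link_gbps, efficiency=1.0):
    """Time to move num_bytes over a link rated in gigabits per second.

    `efficiency` is the assumed fraction of line rate actually achieved.
    """
    bytes_per_second = link_gbps * 1e9 / 8 * efficiency
    return num_bytes / bytes_per_second


# Hypothetical model and prompt, for illustration only.
size = kv_cache_bytes(tokens=32768, layers=80, kv_heads=8,
                      head_dim=128, bytes_per_element=2)
print(size / 2**30, "GiB")
print(transfer_seconds(size, link_gbps=400), "s at full line rate")
print(transfer_seconds(size, link_gbps=400, efficiency=0.5), "s at half line rate")
```

Under these assumptions the cache is 10 GiB. Halving the achieved throughput doubles the time the decode pool spends waiting, which is the handoff cost described above.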

A lossless RoCEv2 fabric, automated end to end

Hedgehog deploys a validated, lossless RoCE fabric without the per-switch CLI grind — and keeps it that way through continuous reconciliation.

RoCEv2, automated. Validated configurations pushed to every switch, declaratively, from one API.
Congestion control built in. PFC plus ECN/DCQCN tuned for AI traffic, with load balancing intelligent enough not to collapse multi-gigabyte elephant flows onto a single hashed path.
L3 to the host. A routed Clos with BGP and ECMP everywhere — no broadcast storms, no spanning tree, no flat-MAC scaling ceiling. It's the same design hyperscalers and NVIDIA Spectrum-X have run for years; Hedgehog just makes it easy.
Multitenancy without compromise. Per-tenant isolation via L3 EVPN Type-5 routes and VRFs — summarizable, scalable, and enforced in the data plane. Not stretched L2.
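
The load-balancing point in the list above can be made concrete with a short Python sketch. It assumes uniform hashing, and it uses CRC32 as a stand-in for a switch ASIC's hash function, which in practice is vendor-specific. The function names and addresses are hypothetical. The only protocol fact relied on is that RoCEv2 uses UDP destination port 4791.

```python
import zlib
from math import perm

ROCEV2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCEv2


def ecmp_path(src_ip, dst_ip, src_port, num_paths,
              dst_port=ROCEV2_UDP_PORT, proto=17):
    """Pick an uplink by hashing the 5-tuple.

    CRC32 is only a stand-in here; real switches use their own hash.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_paths


def collision_probability(num_flows, num_paths):
    """Probability that at least two flows share a path, assuming
    each flow is hashed uniformly and independently."""
    if num_flows > num_paths:
        return 1.0
    return 1.0 - perm(num_paths, num_flows) / num_paths ** num_flows


# Four elephant flows across eight equal-cost paths.
print(collision_probability(4, 8))
print(ecmp_path("10.0.0.1", "10.0.1.1", 49152, num_paths=8))
```

With only four large flows on eight paths, the chance that at least two land on the same path is already above one half under this model. Static hashing alone therefore struggles with a small number of multi-gigabyte flows.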


Because it's routable Ethernet, anything standards-compliant plugs in

RoCEv2 packets are UDP/IP — routable across a Layer 3 fabric. That single fact is the open AI ecosystem's biggest lever: any accelerator with a standards-compliant RoCEv2 NIC plugs into the same fabric a GPU plugs into. Keep prefill on NVIDIA GPUs where they dominate, and stay free to choose the most cost- and power-efficient option on decode — Cerebras, SambaNova, or whatever wins the next round.

No proprietary interconnect gets to dictate your accelerator roadmap. And because the fabric is routed multipath, it's ready for the next round of optimization — like Multipath Reliable Connection (MRC), released through OCP in 2026 — that pushes elephant-flow performance well past hash-based ECMP.

Declare intent. The fabric does the rest.

Operators declare what they need (VPCs, peerings, QoS) as Kubernetes custom resources, defined by Custom Resource Definitions (CRDs). The Hedgehog Fabric Controller translates that intent into exact per-switch RoCEv2, QoS, and L3 configuration, and a lightweight agent on each switch continuously reconciles the live network against your declared state.
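
The reconciliation idea can be sketched in a few lines of Python. This is a generic illustration of a declarative reconcile step, not Hedgehog's implementation: the function names and the setting keys (`pfc_priority`, `ecn_enabled`, `spanning_tree`) are hypothetical, and a real agent would apply changes to the switch rather than to an in-memory dictionary.

```python
def plan_reconcile(desired, live):
    """Return (to_set, to_remove): the changes needed to make one
    switch's `live` settings match its `desired` settings."""
    to_set = {k: v for k, v in desired.items()
              if k not in live or live[k] != v}
    to_remove = sorted(k for k in live if k not in desired)
    return to_set, to_remove


def reconcile(desired, live):
    """Apply the plan to a copy of the live state and return it."""
    to_set, to_remove = plan_reconcile(desired, live)
    updated = dict(live)
    updated.update(to_set)
    for key in to_remove:
        del updated[key]
    return updated


# Hypothetical settings for one switch.
desired = {"pfc_priority": 3, "ecn_enabled": True}
live = {"pfc_priority": 3, "ecn_enabled": False, "spanning_tree": True}

print(plan_reconcile(desired, live))  # drift detected
live = reconcile(desired, live)
print(plan_reconcile(desired, live))  # nothing left to change
```

The step is idempotent: once the live state matches the declared state, the plan is empty, so running it continuously is safe and only acts on drift.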

Deep, real-time telemetry streams natively into Grafana and Prometheus, exposing queue depths and microbursts so you can prove the fabric is lossless — not hope it is.