A dedicated network segment designed to support the extreme performance demands of AI training and inference workloads. To handle the intensive GPU-to-GPU communication these workloads generate, back-end networks combine RDMA transported over RoCEv2 with flow- and congestion-control mechanisms such as PFC and ECN.
Back-end networks form the critical data fabric that enables distributed AI training and inference at scale. They are engineered for the unique demands of AI workloads, particularly the intensive east-west traffic generated by collective operations such as all-reduce during multi-GPU training. Because those collectives are synchronous, a single congested or lossy link can stall every GPU in the job, so the network must sustain consistently high bandwidth and low latency to keep training on schedule and inference responsive.
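The scale of that east-west traffic is easy to estimate. The sketch below is a back-of-the-envelope calculation in Python using the standard ring all-reduce formula (each GPU sends roughly 2·(N−1)/N times the gradient size per step); the gradient size, GPU count, and link speed are illustrative assumptions, not figures from any particular deployment.

```python
# Rough estimate of per-GPU network traffic for a ring all-reduce, the
# collective that dominates east-west traffic in data-parallel training.
# All numbers below are illustrative assumptions, not measurements.

def ring_allreduce_bytes_per_gpu(gradient_bytes: float, num_gpus: int) -> float:
    """Each GPU sends (and receives) about 2*(N-1)/N times the gradient size."""
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes

def transfer_seconds(bytes_on_wire: float, link_gbps: float) -> float:
    """Ideal transfer time on a single link, ignoring latency and protocol overhead."""
    return bytes_on_wire / (link_gbps * 1e9 / 8)

if __name__ == "__main__":
    gradient_bytes = 10e9   # e.g. ~5B parameters in fp16 (assumption)
    num_gpus = 64           # GPUs participating in the all-reduce (assumption)
    link_gbps = 400         # per-GPU back-end link speed (assumption)

    volume = ring_allreduce_bytes_per_gpu(gradient_bytes, num_gpus)
    print(f"per-GPU traffic per step: {volume / 1e9:.1f} GB")
    print(f"ideal transfer time:      {transfer_seconds(volume, link_gbps) * 1e3:.0f} ms")
```

Even under these idealized assumptions, every training step moves tens of gigabytes per GPU across the fabric, which is why the back-end network's bandwidth directly bounds step time.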
Modern back-end networks use RDMA (Remote Direct Memory Access) over Converged Ethernet version 2 (RoCEv2) to enable direct GPU-to-GPU memory transfers over standard Ethernet infrastructure. Explicit Congestion Notification (ECN) marks packets as queues build so senders throttle before buffers overflow, while Priority Flow Control (PFC) pauses upstream traffic as a last resort to prevent packet loss; together they create the lossless behavior that RoCEv2 needs to perform well. The Hedgehog fabric implements these features in a cloud-native architecture that automatically tunes the network for AI traffic, helping training jobs complete faster and keeping inference responsive.
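The division of labor between ECN and PFC can be illustrated with a toy model of a single switch queue: ECN marking begins at a relatively low queue depth so RoCEv2 senders back off early, while PFC pause fires only near the buffer limit as a lossless backstop. This is a simplified sketch, and the threshold values are arbitrary assumptions rather than recommended settings for any real switch.

```python
from dataclasses import dataclass

# Toy model of one switch egress queue, showing how ECN and PFC cooperate
# in a lossless RoCEv2 fabric. Thresholds are illustrative assumptions.

@dataclass
class LosslessQueue:
    ecn_threshold_kb: int = 150   # start ECN-marking packets above this depth
    pfc_xoff_kb: int = 700        # send PFC pause (XOFF) above this depth
    pfc_xon_kb: int = 500         # resume (XON) once drained below this depth
    depth_kb: int = 0
    paused: bool = False

    def enqueue(self, packet_kb: int) -> dict:
        self.depth_kb += packet_kb
        ecn_marked = self.depth_kb > self.ecn_threshold_kb
        if self.depth_kb > self.pfc_xoff_kb:
            self.paused = True    # PFC XOFF: upstream port stops sending
        return {"ecn_marked": ecn_marked, "pfc_paused": self.paused}

    def dequeue(self, packet_kb: int) -> None:
        self.depth_kb = max(0, self.depth_kb - packet_kb)
        if self.paused and self.depth_kb < self.pfc_xon_kb:
            self.paused = False   # PFC XON: upstream may resume

if __name__ == "__main__":
    q = LosslessQueue()
    for step in range(10):
        state = q.enqueue(100)    # 100 KB arrives each step, nothing drains
        print(f"depth={q.depth_kb} KB  {state}")
```

Running the demo shows ECN marking kick in well before the queue is deep enough to trigger a PFC pause, which is the intended ordering: congestion control slows senders first, and PFC only prevents loss if that is not enough.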
Network architects typically design back-end networks with non-blocking spine-leaf topologies, in which each leaf's uplink capacity to the spines matches its server-facing capacity, so full bandwidth is available between any pair of nodes. Combined with intelligent traffic management and hardware acceleration, this approach eliminates the bottlenecks that would otherwise throttle AI application performance. The result is a high-performance environment where distributed AI workloads can operate at their full potential.
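A leaf is non-blocking when its aggregate uplink bandwidth to the spines is at least its server-facing bandwidth, i.e. an oversubscription ratio of 1:1 or better. The short check below makes that arithmetic explicit; the port counts and speeds are hypothetical examples, not a sizing recommendation.

```python
# Quick check of whether a leaf in a spine-leaf design is non-blocking:
# downlink (server-facing) bandwidth divided by uplink bandwidth must be
# at most 1.0. Port counts and speeds below are illustrative assumptions.

def oversubscription_ratio(down_ports: int, down_gbps: int,
                           up_ports: int, up_gbps: int) -> float:
    """Downlink bandwidth over uplink bandwidth; <= 1.0 means non-blocking."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

if __name__ == "__main__":
    # Hypothetical leaf: 16 x 400G ports to GPU servers, 8 x 800G uplinks.
    ratio = oversubscription_ratio(down_ports=16, down_gbps=400,
                                   up_ports=8, up_gbps=800)
    verdict = "non-blocking" if ratio <= 1.0 else "oversubscribed"
    print(f"oversubscription {ratio:.1f}:1 -> {verdict}")
```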