Manish Vachharajani

9 min read

Disaggregated Inference (Part 2): Designing the RoCE Fabric. Why L3 to the Host Wins

For most of the last few years, when network engineers talked about RoCE, they were talking about the inside of an AI training cluster — the lossless, high-bandwidth back-end stitching thousands of GPUs together, with synchronous all-reduce as the...
