
Industry Leading AI Network Performance with Hedgehog

SemiAnalysis recently completed ClusterMAX testing on the NVIDIA B200 GPU services offered by FarmGPU and RunPod. NCCL testing of their Hedgehog AI network showed industry-leading performance. FarmGPU and RunPod offer better AI network performance than NVIDIA Israel-1, an AMD MI300X cluster running untuned RoCEv2, and even Crusoe Iceland's InfiniBand network.


SemiAnalysis is an independent analyst firm that breaks the mold on market research. They don't try to cover every industry; they focus on neoclouds. They don't schedule briefings to write a lot of subjective content. If you open your doors, they will test your GPU cluster against the ClusterMAX rating criteria and give you a ClusterMAX rating. The intent of all this is to help GPU renters make smart purchase decisions for their AI workloads.

SemiAnalysis evaluates features that GPU renters care about, such as:

  • Security
  • Lifecycle and Technical Expertise
  • Slurm and Kubernetes
  • Storage
  • NCCL/RCCL Networking Performance
  • Reliability and Service Level Agreements (SLAs)
  • Automated Active and Passive Health Checks and Monitoring
  • Consumption Models, Price Per Value, and Availability
  • Technical Partnerships

Hedgehog impacts most of these criteria, but for this report we are focusing on NCCL/RCCL Networking Performance. Hedgehog optimizes NCCL/RCCL networking performance with open source AI networking software that tunes the configuration of network switches and the SuperNICs running in GPU servers. Our software does this automatically, so you don't have to spend weeks learning the secret sauce. That means rapid time to value for your very expensive GPU cluster.
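
To give a feel for what that tuning involves, here is a minimal sketch with hypothetical parameter names (not Hedgehog's actual configuration schema): a lossless RoCEv2 fabric needs PFC, ECN, and DCQCN congestion-control settings applied consistently to every switch and every SuperNIC.

```python
# Illustrative sketch only -- hypothetical names, not Hedgehog's real config format.
# The point: RoCEv2 congestion control must be consistent fabric-wide, and doing
# that by hand across many switches and SuperNICs is where weeks of tuning go.
fabric_tuning = {
    "pfc": {"enabled": True, "lossless_priority": 3},  # lossless class for RDMA traffic
    "ecn": {"min_kbytes": 150, "max_kbytes": 1500},    # early congestion-marking thresholds
    "dcqcn": {"enabled": True},                        # NIC-side rate-based congestion control
    "mtu": 9000,                                       # jumbo frames for large RDMA messages
}

for device_group in ("leaf switches", "spine switches", "SuperNICs"):
    # In practice an automation layer renders and pushes vendor-specific config here.
    print(f"apply to {device_group}: {fabric_tuning}")
```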

GPU renters need to get lots of data in and out of GPU clusters.  That requires a high performance network.  When they run training workloads, GPUs need to share memory over the network.  That means the AI network needs to run at peak performance if you want to get optimal GPU performance.  
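
As a concrete, heavily simplified example of that memory sharing, here is a minimal PyTorch sketch using the NCCL backend. Data-parallel training issues calls like this after every backward pass, so each one rides the AI network, and network speed directly gates GPU utilization.

```python
# Minimal all-reduce sketch on PyTorch's NCCL backend.
# Launch with, e.g.: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")              # NCCL moves data GPU-to-GPU
    local_rank = int(os.environ["LOCAL_RANK"])            # set by torchrun
    torch.cuda.set_device(local_rank)

    # Stand-in for a gradient bucket; real training all-reduces tensors like this
    # across every GPU in the job, over NVLink inside the node and the scale-out
    # network between nodes.
    grads = torch.ones(64 * 1024 * 1024, device="cuda")   # 256 MB of float32

    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        print(f"all-reduce finished across {dist.get_world_size()} GPUs")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Benchmark suites like NVIDIA's nccl-tests (all_reduce_perf and friends) measure this same pattern across a range of message sizes, which is the kind of measurement behind the NCCL results above.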

Here's a snapshot of the FarmGPU AI infrastructure stack.  Hedgehog's job is to make it really easy to run an AI network while we squeeze every byte of performance out of the network equipment.  And yes, there are a lot of hardware components in the solution architecture. 

[Diagram: FarmGPU AI infrastructure stack]

FarmGPU, Celestica and Hedgehog presented these results together at OCP Global Summit.  We also talked about the challenges of getting this to work for fully automated, hyperscale operations.  It's not easy.  To make it easier for the Open Compute Project community to network like hyperscalers, we committed to contributing this solution as an OCP reference architecture.  We'll post again when this is live in the OCP Marketplace.  
