The Operations Math: How a DevOps Team Can Run an AI Cloud with Hedgehog for Less Than the Cost of a Network Operations Team

AI Cloud Business Planning Playbook Series — Part 5

Building a GPU cluster is a one-time capital event. Operating it is an ongoing payroll commitment. The engineers who keep a 1,024-GPU multi-tenant AI cloud running — the SREs paged at 3 AM when NCCL hangs, the network specialists debugging a PFC storm, the automation engineers maintaining Ansible playbooks, the remote hands swapping optics in a rack — are on payroll every month, whether the cluster is busy or idle. Getting the staffing model right has a larger long-term impact on unit economics than almost any procurement decision.

This post answers a specific question: how many people does it actually take to operate a production AI cloud, and how does that number change depending on the network architecture you chose?

The headline finding: a Hedgehog-based AI cloud can be operated by a software engineering team, with no specialized network function. This is not a theoretical claim. It is documented in the published Zipline customer case, in which Florian Berchtold, a software engineer at Zipline, runs the company's private AI training cluster using Hedgehog's Kubernetes-native declarative API with no dedicated NetOps team. The equivalent DIY operator, running Ansible-based network automation, needs 4 cluster SREs, 4 specialized NetOps engineers, 2 network automation engineers, and 2 remote hands technicians: 12 people in total. The Hedgehog operator needs 4 (2 cluster SREs and 2 remote hands technicians).

The payroll gap is $2.6M annually in Hedgehog's favor. That number matters, but it understates the real value: the incremental EBITDA a Hedgehog-based operator earns versus a DIY operator comes primarily from the operational outcomes the staffing difference enables, namely faster incident remediation, validated RoCE performance, and hardware-enforced tenant isolation. Those outcomes are captured in the reliability, performance, and security analyses. The operations analysis establishes the staffing model that makes them possible.


The Zipline Reference Case

Zipline operates a global autonomous drone delivery service. Their drones generate roughly a gigabyte of telemetry per flight, and they train AI models on that data to fly and land deliveries autonomously. Zipline built a private AI cloud on-premises rather than renting from a hyperscaler — citing significant cost efficiencies and governance advantages. Their case, presented at Networking Field Day 38 in July 2025, is the clearest public account available of what it looks like to operate an AI cloud with Hedgehog.

The published account of the presentation describes the design constraint that Florian Berchtold, Zipline's Principal Engineer, faced:

"Florian, a software engineer rather than a network engineer, sought a high-bandwidth networking solution that didn't demand extensive network CLI expertise. Hedgehog provided a Kubernetes-native, declarative API, allowing Zipline to describe their infrastructure's desired state in a familiar language, abstracting away complex networking configurations like port channels."

The Zipline cluster started as a collapsed-core design on a modest server footprint and has since grown to over 60 servers across multiple racks in a spine-leaf topology. Through both phases, the operations model has remained the same: the existing software engineering team operates the network using the same Kubernetes-style declarative configuration they use for everything else. There is no dedicated NetOps team for the AI cluster. There are no Ansible playbooks to maintain because the Hedgehog controller continuously reconciles desired state. There is no specialized RoCE tuning headcount because the reference architecture ships with the tuning already encoded.

Zipline's published outcome: a private cloud that cut infrastructure costs by 70% compared to public cloud alternatives, with data kept under their own governance. The critical operational win was not just the hardware savings — it was that the staffing model stayed the same.

This is the operational anchor for understanding what Hedgehog enables. The Zipline case demonstrates, with a named customer and a public presentation, that operating an AI cloud with Hedgehog does not require a specialized network engineering function. The headcount required is the headcount any competent DevOps team already has.


What a DIY AI Cloud Operations Team Actually Looks Like

The right comparison is not Hedgehog versus doing nothing. It is Hedgehog versus Ansible-based network automation — the dominant approach for operators who want to automate but don't want to commit to a commercial platform or build their own from scratch. Ansible is well-documented for network automation. Red Hat's official Network Automation course (DO457) covers exactly this scope: writing playbooks to configure switches, validating network state, performing compliance checks, and detecting configuration drift. Cisco's Coursera Ansible for Network Automation specialization adds Jinja2 templating, YAML data structures, and ios_config-style modules.

The Ansible-driven DIY model is legitimate and widely deployed. It is also more expensive to staff than most operators initially budget. Here is the headcount, role by role, with sourced loaded rates.

Role 1: Cluster SREs / DevOps Engineers — 4 FTE @ $300K loaded = $1.2M

These are the engineers who run the GPU cluster itself: Slurm or Kubernetes scheduling, node health monitoring, incident response, and on-call rotation. NVIDIA defines this role explicitly in its Professional Services SRE datasheet as an engineer who "helps customers manage and maintain their cluster remotely by assisting with day-to-day operational cluster management." Insight Cloud's December 2025 review of NVIDIA's Senior SRE role at DGX Cloud documents total compensation in the $250K–$400K range for the experienced version of this role.

For a 1,024-GPU multi-tenant cluster with 24×7 on-call coverage, 4 FTE is the minimum — three shifts plus weekend rotation plus PTO coverage. Loaded cost at $300K per engineer = $1.2M annually.

This role exists in both the DIY and Hedgehog cases — the SREs run the GPU cluster, not just the network. Under Hedgehog, this reduces to 2 FTE because the network operations work that would otherwise add load to the SRE team — incident triage for fabric issues, tenant provisioning, fabric debugging — is absorbed by the declarative controller. Zipline demonstrates this: the same engineers who operate the cluster also operate the network, with no extra headcount required.

Role 2: NetOps Engineers (RoCE/Spectrum-X Specialists) — 4 FTE @ $350K loaded = $1.4M

This is the role most DIY operators underestimate. Even with Ansible automating routine configuration push, substantial manual NetOps work remains:

Diagnosing RoCE congestion. Ansible cannot auto-tune PFC priority classes or ECN thresholds. When DCQCN triggers, a human reads the telemetry, formulates a hypothesis, and adjusts.

Troubleshooting NCCL hangs. Meta's published Llama 3 training paper documents the painstaking process of identifying which rank, on which node, on which switch is responsible for an All-Reduce stall. This is hours-to-days of human work per incident.

Switch failures. Ansible can re-push a configuration to a replacement switch, but it doesn't diagnose which switch is failing, isolate the affected node from the fabric, or re-rail the workload. A human does those steps.

Tenant onboarding. Adding a tenant to a multi-tenant fabric requires VLAN/VRF provisioning, ACL updates, route advertisement changes, and validation. Ansible runs the playbook; a human writes and parameterizes it each time.

PFC storm investigation. Per Cisco Live's AI networking best practices, PFC misconfigurations can cause cascading deadlocks that require human intervention to break.

For 24×7 fabric coverage on a 1,024-GPU multi-tenant cluster, the minimum NetOps headcount is 4 FTE. Loaded compensation is higher than commodity NetOps because of the RoCE/RDMA specialization premium. Per Vitex Tech's 2025 InfiniBand vs Ethernet analysis: "InfiniBand requires specialized network engineering skills that command significant salary premiums and are genuinely scarce in the market." The same applies to RoCE — PFC, ECN, DCQCN, VXLAN EVPN, deadlock prevention are equally specialized skills. Vitex documents a $120,000+ annual specialist premium for production RDMA experience on top of a senior network engineer base ($200–250K), giving a loaded rate of approximately $350K.

4 NetOps × $350K = $1.4M annually. This is the role Hedgehog eliminates entirely. Zipline operates their AI training cluster with zero specialized NetOps engineers.
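The loaded-rate arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the figures quoted in this section; treating the specialist premium as a flat add-on to the base range is an assumption, and the function and constant names are illustrative.

```python
# Sketch: reproduce the NetOps loaded-rate estimate from the figures quoted above.
# Assumption: the specialist premium is simply added to each end of the base range.

BASE_RANGE = (200_000, 250_000)  # senior network engineer base, USD
RDMA_PREMIUM = 120_000           # specialist premium, USD (quoted as "$120,000+")
NETOPS_FTE = 4                   # DIY NetOps headcount used in this post
LOADED_RATE_USED = 350_000       # approximate loaded rate used in this post


def loaded_range(base_range, premium):
    """Return (low, high) after adding the premium to both ends of the base range."""
    low, high = base_range
    return low + premium, high + premium


low, high = loaded_range(BASE_RANGE, RDMA_PREMIUM)
midpoint = (low + high) // 2
annual_cost = NETOPS_FTE * LOADED_RATE_USED

print(f"Estimated range: ${low:,} to ${high:,} (midpoint ${midpoint:,})")
print(f"Annual NetOps cost at ${LOADED_RATE_USED:,} per FTE: ${annual_cost:,}")
```

The midpoint of the resulting range sits just under the $350K figure, which is why the post rounds to that rate.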

Role 3: Network Automation Engineers — 2 FTE @ $300K loaded = $600K

Someone has to write and maintain the Ansible playbooks. Red Hat's DO457 course describes the role: "network administrators or infrastructure automation engineers who want to use network automation to centrally manage the switches, routers, and other devices in the organization's network infrastructure." The skill mix is Python + Ansible + Jinja2 + CCIE Data Center-level networking knowledge. Sustaining the playbook portfolio for a complex multi-tenant fabric on Spectrum-X or SONiC is a 2 FTE job:

  • New device onboarding (playbooks for new switch SKUs as the fabric expands)
  • Drift detection and remediation logic
  • Multi-vendor adaptation (Cisco/Arista/SONiC have different module sets)
  • Version upgrades (Ansible modules change; playbooks need maintenance)
  • Integration with monitoring (Prometheus, NetQ, Grafana)
  • GitOps workflow management (CI/CD for playbook changes)

Loaded rate: $300K (base $180–220K + Python/Ansible specialization premium + benefits). Annual cost: $600K.

This role disappears under Hedgehog because the declarative controller replaces the playbook engineering effort. Configuration is expressed as Kubernetes custom resources (instances of Custom Resource Definitions) that the controller reconciles continuously, so there are no playbooks to write, version, or debug.
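To make the contrast with playbooks concrete, here is a minimal sketch of what a declarative reconcile step does in general. It is not Hedgehog's API or implementation: the resource names, fields, and functions are hypothetical, and a real controller would talk to switches rather than compare dictionaries.

```python
# Illustrative reconcile step (hypothetical; not Hedgehog's API).
# Desired and observed state are plain dicts keyed by a made-up resource name.

def diff_state(desired, observed):
    """Return the (action, name, spec) steps needed to make observed match desired."""
    actions = []
    for name, spec in desired.items():
        if name not in observed:
            actions.append(("create", name, spec))
        elif observed[name] != spec:
            actions.append(("update", name, spec))
    for name in observed:
        if name not in desired:
            actions.append(("delete", name, None))
    return actions


def reconcile(desired, observed):
    """Apply the diff to a copy of the observed state and return the result."""
    converged = dict(observed)
    for action, name, spec in diff_state(desired, observed):
        if action == "delete":
            del converged[name]
        else:
            converged[name] = spec
    return converged


# Hypothetical tenant networks; names and fields are invented for illustration.
desired = {
    "tenant-a": {"subnet": "10.0.1.0/24", "isolated": True},
    "tenant-b": {"subnet": "10.0.2.0/24", "isolated": True},
}
observed = {
    "tenant-a": {"subnet": "10.0.1.0/24", "isolated": False},  # drifted setting
    "tenant-c": {"subnet": "10.0.3.0/24", "isolated": True},   # no longer declared
}

for action, name, _spec in diff_state(desired, observed):
    print(action, name)

# After one reconcile pass the observed state equals the declared state.
assert reconcile(desired, observed) == desired
```

The operator edits only the `desired` side. Computing and applying the difference is the controller's job, which is the work a playbook author would otherwise encode by hand.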

Role 4: Remote Hands Technicians — 2 FTE @ $150K loaded = $300K

Someone has to swap optics, run cables, and replace failed hardware. This role exists in both DIY and Hedgehog cases — declarative fabric does not replace physical access. Loaded cost reflects the premium for AI data center experience (liquid cooling, high-density 800G optics, fiber management) over commodity DC technicians. Per IEEE Spectrum's January 2026 AI Data Centers Face Skilled Worker Shortage report, AFCOM 2025 data identifies multiskilled DC operators as the top growth area for 58% of data center managers — a genuinely competitive market for AI-experienced technicians.

2 FTE provides business-hours coverage with on-call backup. $300K annually, same for both DIY and Hedgehog.


The DIY vs. Hedgehog Operations Side-by-Side

| Role | DIY FTE | Hedgehog FTE | DIY Annual Cost | Hedgehog Annual Cost |
| --- | --- | --- | --- | --- |
| Cluster SRE / DevOps | 4 | 2 | $1,200,000 | $600,000 |
| NetOps Engineer (RoCE/Spectrum-X) | 4 | 0 | $1,400,000 | $0 |
| Network Automation Engineer (Ansible) | 2 | 0 | $600,000 | $0 |
| Remote Hands Technician | 2 | 2 | $300,000 | $300,000 |
| Operations Staff Subtotal | 12 | 4 | $3,500,000 | $900,000 |

The staffing picture is stark: 8 fewer engineers, $2.6M less in annual payroll. That $2.6M is real incremental EBITDA — but it is the smallest contributor to the total EBITDA gap between a Hedgehog-based operator and a DIY operator. The larger contributions come from the operational outcomes the staffing model enables.
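The subtotals and the $2.6M gap follow directly from the per-role headcounts and loaded rates given earlier. The sketch below recomputes them; the `Role` class and `totals` function are illustrative names, and the rates are the defaults used in this post, so substitute your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    loaded_rate: int   # USD per FTE per year
    diy_fte: int
    hedgehog_fte: int


# Headcounts and loaded rates as stated in this post.
ROLES = [
    Role("Cluster SRE / DevOps", 300_000, 4, 2),
    Role("NetOps Engineer (RoCE/Spectrum-X)", 350_000, 4, 0),
    Role("Network Automation Engineer (Ansible)", 300_000, 2, 0),
    Role("Remote Hands Technician", 150_000, 2, 2),
]


def totals(roles):
    """Return (diy_fte, hedgehog_fte, diy_cost, hedgehog_cost) summed over all roles."""
    diy_fte = sum(r.diy_fte for r in roles)
    hedgehog_fte = sum(r.hedgehog_fte for r in roles)
    diy_cost = sum(r.diy_fte * r.loaded_rate for r in roles)
    hedgehog_cost = sum(r.hedgehog_fte * r.loaded_rate for r in roles)
    return diy_fte, hedgehog_fte, diy_cost, hedgehog_cost


diy_fte, hedgehog_fte, diy_cost, hedgehog_cost = totals(ROLES)
print(f"DIY:      {diy_fte} FTE, ${diy_cost:,}")
print(f"Hedgehog: {hedgehog_fte} FTE, ${hedgehog_cost:,}")
print(f"Gap:      {diy_fte - hedgehog_fte} FTE, ${diy_cost - hedgehog_cost:,}")
```

Changing a rate or headcount in `ROLES` is enough to see how sensitive the gap is to local labor costs or a different on-call model.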

So why does the operations model matter so much? Because the staffing difference is the mechanism behind the much larger value created in reliability, performance, and security.


What the Staffing Difference Actually Drives

The 12-versus-4 headcount gap is not just a payroll comparison. It is the structural cause of the operational differences that compound across every other dimension of the business.

Reliability. A DIY cluster running Ansible-based automation remediates network incidents through a human-triggered workflow: an alert fires, an on-call engineer triages, a playbook runs, the fabric recovers. From detection to resolution, this process typically takes hours. Meta's published Llama 3 paper reports more than 20 unexpected interruptions per 1,000 GPUs over a 54-day pre-training run, even on a well-operated cluster. At hours of MTTR for each network-related event, this adds up to hundreds of hours of lost GPU-time annually.

Hedgehog's declarative controller detects and remediates incidents continuously. When the observed fabric state diverges from the desired state, the controller reconciles them — no on-call rotation required. The 4-engineer NetOps team in the DIY model exists largely to do what the controller does automatically in the Hedgehog model. The staffing gap and the MTTR gap are the same fact viewed from two angles.
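The link between incident count, MTTR, and lost time is plain multiplication, so the result is very sensitive to MTTR. The sketch below shows that sensitivity. The event count of 20 per year and the MTTR values are illustrative assumptions chosen for round numbers, not measurements from any cluster or vendor.

```python
# Sketch: annual disruption as a function of incident count and MTTR.
# All inputs are assumed, illustrative values; replace them with your own data.

def disruption_hours(events_per_year, mttr_hours):
    """Hours per year during which some part of the fabric is degraded."""
    return events_per_year * mttr_hours


EVENTS_PER_YEAR = 20                      # assumed network incidents per year
MTTR_SCENARIOS_HOURS = [0.25, 2, 6, 12]   # assumed mean time to resolve, in hours

for mttr in MTTR_SCENARIOS_HOURS:
    hours = disruption_hours(EVENTS_PER_YEAR, mttr)
    print(f"MTTR {mttr:>5} h -> {hours:6.1f} disruption hours per year")
```

Under these assumptions, moving MTTR from a few hours to a fraction of an hour changes annual disruption by more than an order of magnitude, which is why the post treats remediation speed rather than payroll as the larger lever.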

Performance. Getting a GPU cluster to its rated NCCL efficiency requires correct PFC, ECN, DCQCN, and adaptive routing configuration. In a DIY deployment, the 2 network automation engineers spend significant time deriving, testing, and maintaining this tuning in Ansible playbooks. When configurations drift — which they do, on every production cluster — a NetOps engineer investigates. The result is that DIY clusters routinely operate at 60–80% of their rated NCCL efficiency.

Hedgehog ships with this tuning already encoded in the reference architecture. FarmGPU's published B200 cluster, built on Hedgehog, hit 392/400 GB/s on SemiAnalysis ClusterMAX 2.0 benchmarks — 98% efficiency. The 0-FTE network automation headcount in the Hedgehog model is not a savings on engineers; it is a reflection of the fact that the tuning doesn't need to be re-derived because Hedgehog ships it validated.
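The efficiency figures can be put on a common scale with a short calculation. This sketch uses the 392 and 400 GB/s values and the 60–80% range quoted above; the function names are illustrative.

```python
# Sketch: compare the quoted NCCL results on a common bandwidth scale.

RATED_GBPS = 400      # rated bandwidth quoted above, GB/s
MEASURED_GBPS = 392   # FarmGPU result quoted above, GB/s


def efficiency_pct(measured, rated):
    """Measured bandwidth as a percentage of rated bandwidth."""
    return 100 * measured / rated


def bandwidth_at(pct, rated):
    """Bandwidth in GB/s delivered at a given efficiency percentage."""
    return rated * pct / 100


print(f"Measured efficiency: {efficiency_pct(MEASURED_GBPS, RATED_GBPS):.0f}%")
for pct in (60, 80):
    delivered = bandwidth_at(pct, RATED_GBPS)
    shortfall = MEASURED_GBPS - delivered
    print(f"At {pct}% efficiency: {delivered:.0f} GB/s, "
          f"{shortfall:.0f} GB/s below the measured result")
```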

Security. A flat-fabric DIY architecture has no tenant isolation at the network layer. When a security incident occurs, such as a compromised tenant container attempting lateral movement, the blast radius is the entire shared fabric. The 4 NetOps engineers in the DIY model are partly there to manually contain such events after the fact. Hedgehog's VPC microsegmentation enforces per-tenant isolation in the switch silicon, eliminating the lateral movement vector rather than responding to it. The 0-FTE specialized NetOps headcount in the Hedgehog model is possible partly because far less human incident response is required when the architecture prevents the incidents.

The framing that matters: the operations headcount in the DIY case is not overhead. It is how the DIY operator pays, in ongoing payroll, for the operational quality that Hedgehog ships as part of the product.


What This Means for AI Cloud Builders

The talent shortage makes the DIY headcount harder to assemble than the dollar figure suggests. Per the Vitex 2025 analysis, RoCE/RDMA specialists are "genuinely scarce in the market" and command $120K+ annual premiums. Per Schneider Electric's February 2026 Mind the Gap analysis, "talent may become the primary barrier to scaling AI." Per Second Talent 2026, the global AI talent demand-to-supply ratio is 3.2:1. The DIY model requires 8 specialized engineers in a market where each role takes 3–6 months to fill. Hedgehog's operational model lets a customer staff to "people we can actually hire" rather than "people we theoretically need."

The Zipline case is reproducible, not exceptional. Florian Berchtold is not unusual. He is a competent software engineer using a Kubernetes-native API to manage infrastructure that would otherwise require specialized network knowledge — the same pattern that lets backend engineers run AWS workloads without being network engineers. Hedgehog brings that pattern to on-premises AI networks. The customer profile that benefits is broad: any organization with a competent software or DevOps team that wants to run AI workloads on private infrastructure for cost or governance reasons.

The operations staffing savings are real, but the larger EBITDA contribution is what the staffing model unlocks. The $2.6M payroll difference between 12 DIY engineers and 4 Hedgehog engineers is genuine incremental EBITDA. But the staffing model it represents — software engineers instead of network specialists, a declarative controller instead of an on-call rotation — is what makes the reliability, performance, and security outcomes possible without dedicated headcount to sustain them. A DIY operator can buy better reliability through more NetOps engineers. A Hedgehog operator gets better reliability because the controller that replaces those engineers never goes off-shift. That difference compounds across every dimension of the business, and its full EBITDA impact shows up in the reliability, performance, and security analyses rather than the operations line alone.


Model Your Own Scenario

Every AI cloud is different. Cluster size, target utilization, on-call requirements, and local labor market conditions all affect the operations cost model. The Hedgehog AI Cloud Business Planning Playbook (available at hedgehog.cloud/playbook) lets you model operations staffing as one of seven dimensions of AI cloud economics (design, procurement, time-to-GPU-value, operations, performance, reliability, and security) at any cluster size from 64 to 8,192 GPUs.

The model is available as both a web-based wizard and a downloadable Excel workbook with every formula visible and every assumption editable. If your loaded compensation rates, on-call requirements, or headcount assumptions differ from the defaults used here, the model is built to reflect your actual situation.


Sources

  • Hedgehog and Zipline (2025). How Zipline Uses Hedgehog for AI Training. Networking Field Day 38, Silicon Valley, July 9, 2025. Florian Berchtold (Zipline) and Marc Austin (Hedgehog). Zipline's AI training cluster across 60+ servers in spine-leaf topology, operated by software engineering team without specialized NetOps.
  • techarena.ai (March 2026). How Hedgehog Brings Hyperscaler Agility to Any AI Infrastructure. Zipline cut infrastructure costs by 70% versus public cloud; operational staffing model (not just hardware) cited as the key win.
  • NVIDIA (2022). Professional Services SRE Datasheet. Defines the SRE role for GPU cluster operations: "DevOps engineer who helps customers manage and maintain their cluster remotely."
  • Insight Cloud (December 2025). NVIDIA Site Reliability Engineer — Complete Role Review. NVIDIA Senior SRE at DGX Cloud total compensation $250–400K; 10+ years experience required.
  • Red Hat (2024). Network Automation with Red Hat Ansible Automation Platform (DO457). Curriculum scope for Ansible-based network automation: playbook authoring, drift detection, multi-vendor adaptation.
  • Cisco / Coursera (2024). Ansible for Network Automation Specialization. Network automation engineering fundamentals; prerequisites include routing/switching proficiency and Python basics.
  • Vitex Tech (2025). InfiniBand vs Ethernet for AI Clusters: Effective GPU Networks in 2025. "$120,000+ annually in staff premium" for production RDMA/RoCE experience; talent "genuinely scarce in the market"; 3–6 month ramp without existing experience.
  • Schneider Electric (February 2026). Mind the Gap: Bridging AI Talent Shortages in Data Centers. 51% of operators struggled to find qualified candidates in 2024; "talent may become the primary barrier to scaling AI."
  • Second Talent (2026). Global AI Talent Shortage Statistics. 1.6M open AI positions globally; 518K qualified candidates; 3.2:1 demand-to-supply ratio.
  • IEEE Spectrum (January 2026). AI Data Centers Face Skilled Worker Shortage. AFCOM 2025 State of the Data Center: 58% of managers identify multiskilled DC operators as top growth area.
  • Hedgehog (2025). AI Network Product Page. "Hedgehog offers a cloud user experience that is familiar to any cloud operations team. No specialized network engineers required."
  • Meta AI (2024). The Llama 3 Herd of Models. Documents 20+ unexpected interruptions per 1,000 GPUs during 54-day pre-training run; network issues primary cause.
  • FarmGPU (2025). Building an AI Cluster: Our 17-Day Crash Course in Open Networking. 392/400 GB/s NCCL efficiency on B200 cluster built with Hedgehog; SemiAnalysis ClusterMAX 2.0 benchmark.