Kubernetes for AI Workloads: When Perfect Scheduling Meets Imperfect Networks

Picture this: your organization just invested $100 million in the world's most sophisticated Kubernetes scheduler. It's running NVIDIA's latest KAI Scheduler with topology-aware placement, gang scheduling, and ML-powered prediction algorithms that can orchestrate 100,000 NVIDIA Blackwell B200 GPUs with surgical precision. The scheduler is so advanced it can predict exactly which pods should land on which nodes to minimize communication latency, balance workloads perfectly, and ensure every GPU stays busy.

Yet your training jobs are still crawling. Your $50,000-per-GPU Blackwell accelerators may be spending 70% of their time waiting. The bottleneck isn't your scheduler - it's the network underneath it that can't handle what your perfect orchestration is asking it to do.

This disconnect hit me at the last North American KubeCon. Session after session showcased genuinely impressive advances in Kubernetes scheduling for AI workloads. The technical depth was remarkable - custom schedulers promising to solve GPU idleness, ML algorithms predicting optimal pod placement, sophisticated queue management ensuring fairness and efficiency. But as I sat through these presentations, I kept wondering if these professionals who were working so hard to maximize scheduling efficacy understood that their efforts could be completely undermined without the right network infrastructure. 

It's not that the presenters didn't understand networking; it's that the AI infrastructure landscape has become so complex that even experts focus intensely on their specific domain while the critical interdependencies get overlooked. And honestly, who can blame them? The scope of knowledge required to truly understand everything from data pipelines to model architectures to network fabrics to scheduler algorithms is frankly overwhelming.

But here's what I learned: no amount of scheduling sophistication can overcome fundamental infrastructure limitations. It's like building the world's most advanced air traffic control system for an airport with a single runway that's constantly under construction. Many GPU vendors strongly encourage customers to invest in advanced, AI-optimized network infrastructure, but the immense cost and complexity frequently lead to stalled projects and underperforming outcomes that still leave GPUs sitting idle for substantial stretches of time.

In my role at Hedgehog, I work to identify and incorporate the latest networking technologies used by the majority of hyperscalers in their AI infrastructure deployments - and ensure that Hedgehog's Open Network Fabric can deliver them with radical simplicity and at a dramatically reduced price, often 50-80% less than most AI-optimized networking solutions.

The Orchestration Paradigm Shift

To understand why this matters, we need to step back and recognize just how dramatically Kubernetes has evolved from its original design. When it launched, Kubernetes was built for a much simpler environment: relatively lightweight, stateless applications that communicated over HTTP APIs. The network requirements were predictable, workloads could tolerate occasional latency spikes, and infrastructure complexity was manageable.

Fast forward to today, and we're using Kubernetes to orchestrate something entirely different: distributed supercomputers like NVIDIA's GB200 NVL72. This system combines 36 Grace CPUs and 72 Blackwell GPUs in a rack-scale, liquid-cooled design that acts as a single, massive GPU with 1.4 exaflops of AI performance and 30TB of fast memory. These aren't traditional workloads; they're among the most complex computational systems ever built.

The economic realities have shifted dramatically as well. When a single training job represents weeks of computation across millions of dollars in hardware, every millisecond of delay translates to real money. Modern machine learning jobs require multiple components to start simultaneously and coordinate with microsecond precision. Co-location on the same node or adjacent nodes isn't just nice to have; it's essential for minimizing communication delays that can cascade into massive inefficiency.

This evolution means we're asking Kubernetes to serve as the nervous system for infrastructure that rivals the computational complexity of entire countries. And while the Kubernetes community has responded brilliantly with scheduling innovations, the gap between what schedulers can optimize and what traditional networks can deliver can result in substantial GPU idle time if not proactively addressed.

The Custom Scheduler Renaissance

We're seeing a renaissance of custom scheduler development that goes far beyond tweaking the default kube-scheduler. These are complete reimaginations of how work gets distributed across massive compute clusters, and they're genuinely impressive.

Let me walk you through what's happening in this space, because even if you're not directly implementing these schedulers, understanding their capabilities helps you make better infrastructure decisions and have more informed conversations with your teams.

NVIDIA KAI Scheduler: The Latest Evolution

NVIDIA's newly released KAI Scheduler represents the cutting edge of AI-native scheduling, supporting the entire AI lifecycle from small, interactive jobs to large training and inference workloads within the same cluster. Built on the foundation of NVIDIA Run:ai, KAI Scheduler introduces several breakthrough capabilities:

Batch Scheduling with Topology Awareness: KAI ensures all pods in a group are scheduled simultaneously or not at all, with intelligent bin-packing and spread scheduling to optimize node usage while maintaining topology requirements for GPU communication. This is crucial when you're orchestrating training jobs that require precise GPU placement to minimize NVLink hop counts and maximize communication bandwidth.
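
A minimal sketch helps make the all-or-nothing semantics concrete. The Python below admits a pod group only if every member fits at once; the node names, capacities, and packing heuristic are illustrative and far simpler than what KAI actually does.

```python
# Minimal sketch of gang (all-or-nothing) admission: either every pod in the
# group fits on the cluster simultaneously, or none of them are bound.
from typing import Dict, List

def gang_admit(pod_gpu_requests: List[int], node_free_gpus: Dict[str, int]) -> Dict[str, str] | None:
    """Return a pod->node assignment only if the whole group fits; else None."""
    free = dict(node_free_gpus)               # work on a copy; never partially commit
    assignment: Dict[str, str] = {}
    for i, gpus in enumerate(sorted(pod_gpu_requests, reverse=True)):  # bin-pack largest first
        # Prefer the node with the least remaining capacity that still fits (tight packing).
        candidates = [n for n, f in free.items() if f >= gpus]
        if not candidates:
            return None                        # one pod can't fit -> reject the entire gang
        node = min(candidates, key=lambda n: free[n])
        free[node] -= gpus
        assignment[f"pod-{i}"] = node
    return assignment

# Illustrative 4-pod gang on a 3-node pool with 8 free GPUs per node.
print(gang_admit([8, 8, 4, 4], {"node-a": 8, "node-b": 8, "node-c": 8}))
```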

Dynamic Resource Allocation (DRA): The scheduler supports vendor-specific hardware resources through Kubernetes ResourceClaims, enabling fine-grained allocation of GPUs from NVIDIA, AMD, and other vendors. This allows the same cluster to efficiently handle diverse workloads across different accelerator types without manual intervention.
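
To make the ResourceClaims idea concrete, here is a minimal sketch of what such a claim might look like, expressed as a Python dict; the resource.k8s.io API version, the allocationMode/count fields, and the gpu.example.com device class are assumptions that may not match the DRA API shipped with your Kubernetes release.

```python
# Illustrative ResourceClaim for vendor GPUs via Dynamic Resource Allocation (DRA).
# The API version and the "gpu.example.com" device class are assumptions; the DRA
# API is still evolving, so verify the shape against your Kubernetes version.
import json

resource_claim = {
    "apiVersion": "resource.k8s.io/v1beta1",   # assumed DRA API version
    "kind": "ResourceClaim",
    "metadata": {"name": "trainer-gpu-claim", "namespace": "ml-team-a"},
    "spec": {
        "devices": {
            "requests": [
                {
                    "name": "training-gpus",
                    "deviceClassName": "gpu.example.com",  # hypothetical device class
                    "allocationMode": "ExactCount",
                    "count": 8,                            # one full 8-GPU node's worth
                }
            ]
        }
    },
}

print(json.dumps(resource_claim, indent=2))
```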

Hierarchical Queue Management: KAI implements two-level queue hierarchies with customizable quotas, over-quota weights, and priority management, ensuring that different teams and projects can share massive GPU clusters fairly while maintaining isolation and predictable resource access.
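
The fair-sharing arithmetic behind such queues is easier to see in a small sketch. The snippet below gives each queue its deserved quota first, then splits leftover GPUs by over-quota weight; the queue names and numbers are invented, and this is not KAI's actual algorithm.

```python
# Sketch of two-level fair sharing: each queue first receives its deserved quota,
# then leftover GPUs are split proportionally to over-quota weights.
def share_gpus(total_gpus, queues):
    alloc = {name: min(q["quota"], q["demand"]) for name, q in queues.items()}
    leftover = total_gpus - sum(alloc.values())
    weights = {n: q["over_quota_weight"] for n, q in queues.items()
               if q["demand"] > alloc[n]}
    total_w = sum(weights.values())
    for name, w in weights.items():
        extra = int(leftover * w / total_w) if total_w else 0
        alloc[name] += min(extra, queues[name]["demand"] - alloc[name])
    return alloc

queues = {
    "research": {"quota": 64, "over_quota_weight": 2, "demand": 160},
    "prod":     {"quota": 64, "over_quota_weight": 1, "demand": 80},
}
print(share_gpus(256, queues))   # research gets the larger share of idle capacity
```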

Volcano and YuniKorn: The Open Source Alternatives

Beyond NVIDIA's commercial offerings, the open source community has developed sophisticated alternatives. Volcano excels at high-performance, large-scale distributed jobs, with built-in batch scheduling and advanced job lifecycle management, including preemption and gang scheduling. It's particularly strong for HPC-style workloads where job dependencies and precise scheduling coordination are critical.

YuniKorn takes a different approach, designed to manage workloads in both Kubernetes and Apache Hadoop YARN environments with fine-grained resource allocation and hierarchical queues. This hybrid capability is valuable for organizations that need to bridge traditional big data infrastructure with modern AI workloads.

The KubeCon Disconnect: A Natural but Dangerous Gap

The presentations at KubeCon were genuinely impressive, and I want to be clear about that. Demo after demo showed schedulers that could optimize GPU placement with machine learning algorithms, predict resource requirements with unprecedented accuracy, and coordinate complex multi-node training jobs with remarkable precision. The technical depth was extraordinary, and the audience was rightfully engaged.

But here's what struck me: in our industry's natural tendency toward specialization, we've created an unintentional blind spot. The scheduler experts know scheduling inside and out. The network experts understand fabrics and protocols deeply. But the critical intersection between these domains, the place where scheduling decisions meet network reality, often falls into a gap between teams and areas of expertise.

This isn't anyone's fault. The AI infrastructure landscape has become so complex that deep expertise in any one area is a full-time job. But the consequence is real: sophisticated schedulers are making optimal placement decisions that the underlying network infrastructure simply cannot support.

Let me give you a concrete example of why this matters. Consider the latest AMD MI325X accelerators, which deliver 1.3 petaflops of FP16 compute and up to 2.6 petaflops of FP8 compute. These accelerators are designed to work in clusters where all-reduce operations complete in 20-30 milliseconds. But if your network infrastructure can't handle the massive bursts of synchronized AI traffic, bursts that overwhelm even the standard QoS methods that worked perfectly for traditional enterprise workloads, those powerful accelerators spend more time waiting than computing.

The challenge becomes especially acute with the communication patterns that dominate AI training. During training, GPUs need frequent synchronization through all-reduce operations where multiple GPUs simultaneously send data to the same destination. This creates what's known as incast scenarios that can overwhelm network fabrics. Even though modern data center switches support line-rate forwarding, if ten 800-gigabit ports all try forwarding at line rate to a server connected to a single 800-gigabit port, the mathematics simply don't work. No amount of switch line-rate capability can solve that fundamental bandwidth mismatch.
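
A back-of-the-envelope calculation with illustrative numbers shows how quickly incast outruns a single egress port:

```python
# Incast arithmetic: N senders at line rate into one receiver port of the same speed.
senders = 10
link_gbps = 800                      # 800 GbE on every port
burst_mb_per_sender = 64             # illustrative gradient shard per sender

offered_gbps = senders * link_gbps   # 8,000 Gb/s arriving
drain_gbps = link_gbps               # only 800 Gb/s can leave toward the server
oversubscription = offered_gbps / drain_gbps

burst_bits = senders * burst_mb_per_sender * 8 * 1e6
ideal_ms = burst_bits / (offered_gbps * 1e9) * 1e3   # if the egress port were infinite
actual_ms = burst_bits / (drain_gbps * 1e9) * 1e3    # what the single port can actually drain
# The difference has to sit in switch buffers or be dropped.
print(f"{oversubscription:.0f}x oversubscribed: {ideal_ms:.2f} ms ideal vs {actual_ms:.2f} ms actual")
```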

Think of it this way: you can have the world's most intelligent traffic management system, but if your roads can't handle the traffic volume you're trying to route, all that intelligence hits a wall. The same principle applies here: sophisticated orchestration requires infrastructure that can handle what the orchestration layer is trying to achieve.

The End-to-End AI Training Lifecycle: Where Kubernetes Fits

To understand why network infrastructure matters so much for Kubernetes AI scheduling, we need to examine how modern AI training actually works from end to end. Kubernetes doesn't just schedule individual containers; it orchestrates a complex, multi-stage process that spans data ingestion, model preparation, distributed training, and artifact management.

Stage 1: Data Pipeline Orchestration

Before any training begins, Kubernetes must orchestrate massive data preprocessing pipelines. Modern foundation models train on petabytes of data that must be cleaned, tokenized, and prepared. Organizations must carefully orchestrate storage solutions that can handle this scale, with strategies to minimize data transfer latency and sophisticated scheduling for optimal performance.

The network implications are significant. Data preprocessing often involves streaming terabytes of training data from distributed storage systems to compute nodes. If Kubernetes schedules data preprocessing pods on nodes that lack sufficient network bandwidth to storage systems, the entire pipeline stalls. The scheduler might make perfect decisions from a CPU and memory perspective, but if it doesn't understand the network topology between compute and storage, those decisions become counterproductive.
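
A rough calculation, with illustrative dataset and NIC figures, shows how much the scheduler's choice of node matters here:

```python
# How long does one pass over the training data take if a preprocessing pod lands
# on a poorly connected node? Figures are illustrative; real pipelines cache and
# overlap I/O with compute, but the bandwidth gap is still the dominant term.
def hours_to_stream(dataset_tb: float, nic_gbps: float, utilization: float = 0.7) -> float:
    bits = dataset_tb * 8e12
    return bits / (nic_gbps * 1e9 * utilization) / 3600

for nic in (25, 100, 400):   # pod scheduled onto a 25G-, 100G-, or 400G-attached node
    print(f"{nic:>3} GbE node: {hours_to_stream(50, nic):5.1f} hours to stream a 50 TB dataset")
```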

Stage 2: Training Job Initialization

When a data scientist submits a large-scale training job, Kubernetes must solve an extraordinarily complex placement problem. Consider training a 70-billion parameter model on 512 NVIDIA GB200 Grace Blackwell Superchips. Each superchip contains two B200 GPUs and a Grace CPU connected by 900GB/s NVLink, and can be networked at speeds up to 800Gb/s through high-bandwidth Ethernet platforms.

The scheduler must identify available GB200 nodes, but that's just the beginning. It must also consider:

  • Network topology optimization: Placing GPUs within the same rack or adjacent racks to minimize communication latency
  • Fault domain distribution: Ensuring training job resilience by spreading critical components across failure boundaries
  • Bandwidth allocation: Reserving sufficient network capacity for the anticipated all-reduce traffic patterns

If Kubernetes places training pods optimally from a compute perspective but suboptimally from a network perspective, the training job will suffer from the network-induced performance degradation that we explored in our previous analysis of training job anatomy.
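
One way to picture the trade-off is a placement score that penalizes fabric distance while rewarding fault-domain spread. The sketch below uses invented hop counts and weights; it is meant to show the tension, not any particular scheduler's scoring function.

```python
# Toy placement score: prefer GPU nodes that are close in the fabric (few switch hops)
# but not all in one failure domain. Weights and topology are illustrative.
from itertools import combinations

def hops(a: dict, b: dict) -> int:
    if a["rack"] == b["rack"]:
        return 2          # same leaf switch
    if a["pod"] == b["pod"]:
        return 4          # leaf -> spine -> leaf
    return 6              # cross-pod path

def placement_score(nodes: list[dict], w_latency: float = 1.0, w_fault: float = 0.5) -> float:
    pairs = list(combinations(nodes, 2))
    avg_hops = sum(hops(a, b) for a, b in pairs) / len(pairs)
    racks_used = len({n["rack"] for n in nodes})
    return -w_latency * avg_hops + w_fault * racks_used   # higher is better

tight = [{"rack": "r1", "pod": "p1"}] * 4
spread = [{"rack": f"r{i}", "pod": "p1"} for i in range(4)]
# Tight rack placement wins on latency; a heavier w_fault would favor the spread placement.
print(placement_score(tight), placement_score(spread))
```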

Stage 3: Runtime Orchestration and Scaling

During training execution, Kubernetes must continuously orchestrate the dynamic requirements of distributed AI workloads. Modern AI workloads tend toward massively parallel, elastic jobs, where large numbers of short-running tasks are spawned to consume available resources.

This creates unique challenges for both the scheduler and the network. Traditional Kubernetes workloads scale up and down relatively gradually, allowing network infrastructure to adapt. AI training workloads often exhibit sudden, synchronized demands for network bandwidth during gradient synchronization phases. The scheduler must understand these patterns and ensure that pod placement doesn't create network hotspots that can overwhelm even advanced switching infrastructure.

Consider the communication pattern during a typical training iteration: for 100-200 milliseconds, the network remains relatively quiet as GPUs compute gradients locally. Then, suddenly, all GPUs simultaneously initiate all-reduce operations, creating massive synchronized traffic that will overwhelm even the most advanced high-speed data center switches unless they're specifically designed and configured with highly complex, recently developed AI-specific switching features.
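
To put rough numbers on that burst, here is an illustrative estimate of what one gradient synchronization moves per GPU under a ring all-reduce; the model size and GPU count are assumptions.

```python
# Rough size of one gradient synchronization burst. In a ring all-reduce each GPU
# sends about 2*(N-1)/N times the gradient payload; frameworks overlap much of this
# with backward compute, but the bandwidth term still dominates at scale.
params = 7e9                      # illustrative 7B-parameter model
grad_bytes = params * 2           # fp16/bf16 gradients
n_gpus = 64
per_gpu_bytes = 2 * (n_gpus - 1) / n_gpus * grad_bytes

for nic_gbps in (100, 400, 800):
    sync_ms = per_gpu_bytes * 8 / (nic_gbps * 1e9) * 1e3
    print(f"{nic_gbps:>3} Gb/s per GPU: ~{sync_ms:6.0f} ms to move one full gradient sync")
```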

Stage 4: Checkpoint and Artifact Management

Kubernetes must also orchestrate checkpointing operations, where training state is saved to persistent storage for fault tolerance and experiment management. For large models, this means writing hundreds of gigabytes of data simultaneously from thousands of GPUs to shared storage systems.

This operation creates a completely different traffic pattern from normal training communication. Instead of GPU-to-GPU communication during training, checkpointing generates massive simultaneous writes from all GPU servers to storage systems, creating severe bandwidth imbalances. Without proper network architecture that isolates storage traffic from training communication, checkpoint operations can introduce significant stalls that cascade through the entire training process, leaving expensive GPUs idle.
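
The size of a checkpoint burst is easy to underestimate. A rough sketch, with assumed precision and optimizer-state sizes:

```python
# Rough checkpoint sizing for a 70B-parameter model: bf16 weights plus fp32
# optimizer state (Adam-style momentum, variance, and master weights, ~12 B/param).
params = 70e9
checkpoint_bytes = params * 2 + params * 12
print(f"checkpoint size: ~{checkpoint_bytes / 1e12:.1f} TB")

for storage_gbps in (200, 800, 3200):        # aggregate bandwidth into storage
    seconds = checkpoint_bytes * 8 / (storage_gbps * 1e9)
    print(f"{storage_gbps:>5} Gb/s to storage: ~{seconds:5.1f} s per checkpoint if writes are synchronous")
```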

Network-First Architecture: The Foundation for Effective Scheduling

The solution isn't to abandon sophisticated scheduling; it's to ensure that scheduling innovations are built on network infrastructure designed for AI workloads. This requires understanding the fundamental architectural differences between traditional enterprise networks and AI-optimized fabrics.

Endpoint-Scheduled vs. Switch-Scheduled Fabrics

The most critical architectural decision for AI networks is the choice between endpoint-scheduled and switch-scheduled fabrics. This choice directly impacts what Kubernetes schedulers can achieve.

Switch-Scheduled Limitations: Traditional switch-scheduled fabrics, like those based on Broadcom's Jericho and Ramon chipsets, use deep buffers to centrally manage traffic flow. While VoQ-based traffic scheduling can provide proactive congestion avoidance, these systems introduce significant latency overhead as packets traverse multiple buffer stages, and centralized scheduling logic becomes a bottleneck at scale.

Endpoint-Scheduled Advantages: Endpoint-scheduled fabrics distribute scheduling intelligence to the endpoints themselves, enabling distributed coordination that can handle AI traffic patterns more effectively. This approach is central to the newly released Ultra Ethernet 1.0 specification, which provides standardized APIs for endpoint-based congestion control and flow management. Ultra Ethernet enables AI training software to coordinate directly with network endpoints, implementing application-aware traffic management that switch-based solutions simply cannot match.

This architectural difference is crucial for Kubernetes scheduling. In endpoint-scheduled fabrics, intelligent network endpoints implement application-aware traffic management that gives scheduling decisions a reliable foundation. Because the intelligence resides in the endpoints, these fabrics scale to high-radix topologies and accelerate job completion for AI/ML training.

Ultra Ethernet: The Emerging Standard

The Ultra Ethernet Consortium released the Ultra Ethernet 1.0 specification on June 11, 2025, targeting 400G/800G data rates specifically designed to compete with InfiniBand for AI and HPC networking. With 120+ consortium members and rapidly growing industry adoption, UEC represents industry consensus around Ethernet-based AI networking.

For Kubernetes, Ultra Ethernet compliance means access to standardized APIs for congestion control and flow management. Features like packet spraying and enhanced congestion control, combined with standardized integration between training and storage networks, create the foundation for future Kubernetes scheduling policies that could potentially coordinate with network behavior.

Scale and Performance at 2025 Levels

The latest network infrastructure operates at scales that were unimaginable just two years ago. Broadcom's Tomahawk 6, now shipping, delivers 102.4 Terabits per second of switching capacity and can support scale-out networks with up to 128,000 GPUs using a two-layer topology. Multiple deployments are planned with more than 100,000 XPUs using Tomahawk 6 for both scale-out and scale-up interconnect.

This scale fundamentally changes what Kubernetes can orchestrate. The chip supports endpoint-scheduled fabrics and is designed for AI clusters requiring near-100 percent network utilization, compared to traditional data center networks that typically operate at 60-70 percent utilization.
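
The 128,000-GPU figure falls out of simple radix arithmetic, assuming 200 Gb/s per GPU attachment and a non-blocking two-tier leaf/spine design:

```python
# Two-tier (leaf/spine) scale from switch radix, assuming 200 Gb/s per GPU port.
switch_tbps = 102.4
port_gbps = 200
radix = int(switch_tbps * 1000 / port_gbps)   # 512 ports per switch

leaf_down = radix // 2                        # half the leaf ports face GPUs (non-blocking)
max_leaves = radix                            # each spine port reaches one leaf
max_gpus = leaf_down * max_leaves
print(f"radix {radix}: up to {max_gpus:,} GPUs in two tiers")   # 131,072, roughly 128K
```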

The Practical Integration: Kubernetes Meets Network Reality

Understanding how sophisticated Kubernetes scheduling integrates with cutting-edge network infrastructure requires examining real-world deployment patterns. Let's look at how organizations are actually implementing these technologies together.

Topology-Aware Scheduling in Practice

Modern schedulers like KAI implement bin-packing and spread scheduling to optimize node usage while maintaining topology requirements. In practice, this means the scheduler must understand not just which nodes have available GPUs, but how those nodes are connected in the network fabric.

Consider scheduling a 1,024-GPU training job across a cluster using rail-optimized topologies. These specialized network designs require that both the hosts and associated software understand how to work with multiple parallel network planes, often leveraging NCCL or RCCL libraries to optimize communication patterns. The scheduler must:

  1. Map network topology: Understand the physical connections between switches and the bandwidth available on each link
  2. Predict traffic patterns: Anticipate the all-reduce communication patterns that will dominate during training
  3. Optimize placement: Position GPUs to minimize network hops while balancing load across the fabric

This coordination requires deep integration between Kubernetes and network infrastructure. The scheduler can't just know about CPU and memory resources; it must understand network bandwidth, latency characteristics, and congestion patterns.
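
As a small illustration of what working with multiple parallel network planes means, the sketch below maps each GPU's NIC to a rail by its local index so that same-rank peers never leave their plane; real deployments encode this in the fabric design and in NCCL/RCCL topology hints rather than in scheduler code.

```python
# Sketch: in a rail-optimized fabric, GPU i on every server attaches to rail i,
# so ranks with the same local index communicate without crossing rails.
RAILS = 8                                   # one rail per GPU position in the server

def rail_for(global_rank: int, gpus_per_node: int = RAILS) -> int:
    return global_rank % gpus_per_node      # local GPU index == rail index

def crosses_rails(rank_a: int, rank_b: int) -> bool:
    return rail_for(rank_a) != rail_for(rank_b)

# A data-parallel group of same-local-rank peers stays on a single rail.
group = [r for r in range(1024) if rail_for(r) == 3]
print(len(group), "ranks share rail 3; crosses rails:", crosses_rails(group[0], group[1]))
```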

Dynamic Resource Allocation with Network Awareness

KAI Scheduler's Dynamic Resource Allocation (DRA) supports vendor-specific hardware resources through Kubernetes ResourceClaims, enabling fine-grained allocation of different GPU types. But in practice, this also means coordinating network resources.

Different accelerator types have different communication characteristics. AMD's MI325X accelerators achieve optimal performance within 8-GPU nodes using Infinity Fabric, but scaling beyond single nodes requires high-bandwidth Ethernet networking. A network-aware scheduler must understand these differences and allocate network resources accordingly. Mixed accelerator deployments require careful bandwidth provisioning to ensure that different GPU types don't interfere with each other's communication patterns.

Queue Management and Network QoS

Hierarchical queue management with customizable quotas and priority management becomes much more complex when network resources are constrained. Traditional Kubernetes resource quotas focus on CPU, memory, and storage. AI-optimized clusters require network bandwidth quotas and priority management.

Advanced switching technologies like Broadcom's Cognitive Routing 2.0 can provide telemetry that makes it theoretically possible to feed network conditions into Kubernetes scheduling decisions. However, the practical value of real-time network condition awareness for scheduling remains questionable. Network conditions change so rapidly - often in microseconds - that by the time a scheduler receives a notification and processes it, traffic conditions have typically already shifted. More promising is the potential for schedulers to use machine learning to anticipate network traffic patterns based on how jobs are scheduled, enabling predictive workload placement that minimizes network oversubscription. Combined with network-level QoS, bandwidth quotas enable sophisticated policies where high-priority training jobs get preferential network treatment while lower-priority workloads are throttled during periods of network congestion.
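
To make the idea of network bandwidth quotas concrete, here is a toy admission check that treats fabric bandwidth like any other quota-managed resource; the queue names and numbers are invented.

```python
# Toy network-bandwidth quota: admit a job only if its queue's share of fabric
# bandwidth (in Gb/s) can absorb the job's estimated all-reduce demand.
queue_bw_quota_gbps = {"prod-training": 40_000, "research": 10_000}
queue_bw_in_use_gbps = {"prod-training": 31_000, "research": 9_500}

def admit(queue: str, estimated_gbps: float) -> bool:
    free = queue_bw_quota_gbps[queue] - queue_bw_in_use_gbps[queue]
    return estimated_gbps <= free

print(admit("prod-training", 6_000))   # True: fits under the queue's bandwidth quota
print(admit("research", 2_000))        # False: would push the queue past its quota
```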

The Workflow Revolution: From Manual to Autonomous

The integration of advanced Kubernetes scheduling with AI-optimized networks is enabling new operational workflows. Organizations are moving from manual, script-based training job management toward automated AI development pipelines.

Autonomous Training Pipeline Orchestration

Modern AI development pipelines orchestrate complex workflows that span data preprocessing, model training, evaluation, and deployment. Kubernetes seamlessly integrates with CI/CD tools and facilitates MLOps pipelines, allowing teams to automate the training, testing, and deployment of AI models.

The key insight is that having AI-optimized network infrastructure - whether endpoint-scheduled fabrics or advanced switch-scheduled systems with AI-specific features - provides a stable foundation that enables these scheduling innovations to actually deliver their promised benefits. Rather than requiring complex real-time coordination between schedulers and networks, the goal is to provide network substrates that are robust enough to handle whatever traffic patterns effective schedulers create.

Multi-Cluster Federation for Global Training

The typical enterprise today uses services from multiple cloud providers, with organizations owning tens of Kubernetes clusters. For AI workloads, this federated approach enables global training strategies where different stages of model development occur in different geographic regions based on data locality, compute availability, and network performance.

While still early-stage, forward-thinking organizations are beginning to explore network-aware federations that could coordinate training across regions while optimizing for global network conditions. Companies like Microsoft and Google are experimenting with techniques to distribute training workloads based on real-time network capacity between regions, though these implementations remain largely proprietary and experimental.

Cost Optimization Through Network Intelligence

The economics of AI training make network optimization crucial for cost control. The GB200 NVL72 provides up to a 30x performance increase and reduces cost and energy consumption by up to 25x compared to previous generation systems. But these gains depend on network efficiency.

Kubernetes schedulers with network awareness can implement sophisticated cost optimization strategies. They might schedule training jobs during periods of lower network costs, migrate workloads to regions with better network performance, or adjust training parameters based on real-time network pricing.

The Future Landscape: What's Coming in 2025 and Beyond

As we look toward the rest of 2025 and beyond, several trends will reshape how Kubernetes orchestrates AI workloads in network-optimized environments.

Ultra Ethernet Standardization

With the Ultra Ethernet 1.0 specification released on June 11, 2025, we're seeing rapid standardization of AI-optimized Ethernet capabilities. This standardization will enable Kubernetes to implement portable network-aware scheduling policies that work across different vendors and cloud providers.

The standardization also enables new integration possibilities. Kubernetes could implement standardized APIs for network resource allocation, congestion signaling, and topology discovery that work consistently across Ultra Ethernet compliant infrastructure.

Next-Generation Accelerator Integration

AMD's MI355X accelerators arriving in H2 2025 will provide 2.3 and 4.6 petaflops for FP16 and FP8 respectively, with up to 9.2 petaflops of FP4 compute. Meanwhile, NVIDIA's Blackwell architecture represents a fundamental shift to dual-die designs with 208 billion transistors and advanced precision formats.

These architectural advances will require new scheduling strategies. Kubernetes must understand the communication characteristics of different precision formats, coordinate memory coherency across dual-die designs, and optimize placement for the unique traffic patterns of next-generation accelerators.

Edge-Cloud Hybrid Orchestration

The future of AI isn't just about massive cloud-scale training clusters. Organizations increasingly need hybrid approaches that combine edge data collection with cloud-scale computation. Kubernetes will need to orchestrate training workflows that span edge devices, regional data centers, and cloud infrastructure while optimizing for variable network conditions across this hybrid landscape.

This requires network-aware scheduling that understands not just local cluster conditions, but global network topology, bandwidth availability, and latency characteristics across wide-area networks.

The Bottom Line: Network Infrastructure as the True Multiplier

The evolution of Kubernetes for AI workloads represents one of the most significant shifts in infrastructure orchestration since the original container revolution. The scheduling innovations we're seeing are genuinely remarkable, and they're solving real problems that matter enormously for AI success.

But here's the key insight I want to share: these innovations are only as effective as the network infrastructure beneath them. As AI workloads continue to push datacenter boundaries, every large-scale AI network deployment planned for 2025 will rely on Ethernet-based fabrics rather than InfiniBand. AI clusters are scaling from tens to thousands of accelerators, making the network either a bridge to success or a critical bottleneck.

The organizations that will thrive in the AI revolution aren't necessarily those with the most sophisticated schedulers or the most powerful GPUs; they're those that understand the critical interdependence between scheduling intelligence and network infrastructure. When these elements work together properly, you can actually get the network out of the way and let traffic flow between GPUs as efficiently as possible.

Your Kubernetes scheduler can be as intelligent as you want, but if your network can't handle the traffic patterns that your AI workloads generate, that intelligence can't deliver its potential. The future belongs to organizations that recognize this dependency and build their AI infrastructure accordingly.

Whether you're evaluating custom schedulers, planning GPU deployments, or designing your next-generation AI platform, understanding this scheduler-network relationship will help you make better decisions and avoid expensive surprises. It's not about having perfect knowledge across every domain; it's about understanding the key interdependencies that determine success.

The race to artificial general intelligence won't be won by the organization with the smartest scheduler alone. It will be won by those who understand that effective orchestration requires the right infrastructure foundation to orchestrate upon.

In our next article, we'll dive deep into the specific network topologies and architectures that enable these advanced Kubernetes scheduling capabilities, exploring how rail-optimized designs and endpoint-scheduled fabrics create the foundation for AI infrastructure that can actually deliver on the promise of perfect GPU utilization.

