Disaggregated Inference - Part 1: Why the AI Network Is the Product

Written by Marc Austin | Jun 10, 2026 12:34:26 AM

The Moment the Industry Noticed the Network

At Computex 2026 in Taipei, the AI infrastructure industry got its first look at disaggregated inference running in production. Vista Equity Partners and Cambium Capital launched Vector Core Compute (VC2), the first commercially available enterprise inference cloud built for disaggregated workloads, and SambaNova ran a live demo from VC2’s new Los Angeles data center during Intel’s Computex keynote. The stack combined NVIDIA B200 GPUs for prefill and prompt caching, SambaNova SN40 RDUs (Reconfigurable Dataflow Units) for decode and token generation, and Intel Xeon 6 CPUs for orchestration and tool execution. Independent measurement by Artificial Analysis found the architecture at least two to three times faster than a GPU-only stack, and Together.AI signed on as the first commercial customer the same day.

That result could not have happened without the network. The GPUs and RDUs live in separate racks. The KV cache state that prefill builds must move to the decode accelerators at near-memory speeds — not storage speeds, not ordinary Ethernet speeds. If that transfer is slow, the latency penalty of crossing silicon boundaries erases the performance gain entirely. The Hedgehog AI network is what makes low-latency KV cache sharing between heterogeneous accelerators possible, letting GPUs and RDUs from different vendors operate as a single coherent inference engine.

This post explains what disaggregated inference is, and why every major inflection point in the AI hardware market over the past eight months — NVIDIA’s $20 billion Groq deal, Anthropic’s million-TPU commitment, Cerebras’s blockbuster IPO — points to the same conclusion: the network is the product.

What Is Disaggregated Inference?

Every time a large language model responds to a prompt, the inference pipeline runs two fundamentally different computations:

Prefill (context processing). The model ingests the full input — prompt, documents, conversation history — and builds an internal representation called the KV cache. This phase is compute-intensive and highly parallelizable. GPUs, designed for massively parallel matrix operations, excel here.
Decode (token generation). The model generates output one token at a time, reusing the KV cache. Each token requires loading the full KV cache and model weights, making this phase memory-bandwidth-bound, not compute-bound. The bottleneck is how fast data moves, not how fast math is done.

Running both phases on the same GPU means using an expensive, massively parallel compute engine to do a job that is fundamentally a memory-access problem. The result is GPUs sitting partially idle during decode, throttled by their own memory bandwidth rather than their compute capacity.
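A back-of-the-envelope roofline calculation shows why the two phases land on opposite sides of the compute/memory divide. The sketch below is illustrative only: the 70-billion-parameter model, the 4,096-token prompt, and the accelerator's peak compute and memory bandwidth are assumed round numbers, not figures for any product named in this post. The cost model is also deliberately crude, counting only weight reads and ignoring KV cache and activation traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator; both numbers are assumptions."""
    peak_flops: float     # peak compute, FLOP/s
    mem_bandwidth: float  # memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Arithmetic intensity (FLOP/byte) above which compute is the limit.
        return self.peak_flops / self.mem_bandwidth


def phase_intensity(params: float, tokens_per_pass: int,
                    bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Assumes about 2 FLOPs per parameter per token, and that the weights
    are read from memory once per pass.
    """
    flops = 2.0 * params * tokens_per_pass
    bytes_moved = params * bytes_per_param
    return flops / bytes_moved


def limiting_resource(intensity: float, chip: Accelerator) -> str:
    return "compute" if intensity >= chip.ridge_point else "memory bandwidth"


chip = Accelerator(peak_flops=1e15, mem_bandwidth=4e12)  # ridge = 250 FLOP/byte

prefill = phase_intensity(params=70e9, tokens_per_pass=4096)  # whole prompt at once
decode = phase_intensity(params=70e9, tokens_per_pass=1)      # one token per pass

print("prefill is limited by", limiting_resource(prefill, chip))
print("decode is limited by", limiting_resource(decode, chip))
```

Under these assumptions prefill performs thousands of operations per byte of weights it reads, while single-stream decode performs about one, far below the chip's ridge point. Batching many decode streams raises that figure, but each stream also brings its own KV cache reads, which this sketch leaves out.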

Disaggregated inference solves this by routing each phase to the hardware best suited for it. GPUs handle prefill. A different class of accelerator — optimized for memory bandwidth, deterministic execution, and token streaming — handles decode. The two communicate over a high-speed fabric, sharing KV cache state across silicon boundaries.
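The routing idea can be sketched as a toy planner. Everything here is hypothetical: the class names, the worker names, and the round-robin policy are assumptions made for illustration, and none of it reflects the API of NVIDIA Dynamo, Hedgehog, or any other product. A real scheduler would also weigh load, cache locality, and fabric topology.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import cycle
from typing import Dict, Iterator, List, Optional, Tuple


class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


@dataclass
class Request:
    request_id: str
    kv_cache_owner: Optional[str] = None  # worker currently holding the KV cache


class PhaseRouter:
    """Round-robin over one worker pool per phase (hypothetical policy)."""

    def __init__(self, prefill_pool: List[str], decode_pool: List[str]) -> None:
        self._pools: Dict[Phase, Iterator[str]] = {
            Phase.PREFILL: cycle(prefill_pool),
            Phase.DECODE: cycle(decode_pool),
        }

    def pick(self, phase: Phase) -> str:
        return next(self._pools[phase])


def plan(request: Request, router: PhaseRouter) -> List[Tuple[str, str]]:
    """Return the ordered steps for one request: prefill, KV handoff, decode."""
    steps: List[Tuple[str, str]] = []

    prefill_worker = router.pick(Phase.PREFILL)
    request.kv_cache_owner = prefill_worker  # prefill builds the KV cache here
    steps.append(("prefill", prefill_worker))

    decode_worker = router.pick(Phase.DECODE)
    # The handoff over the fabric is the step whose cost decides
    # whether splitting the phases was worth it.
    steps.append(("kv_transfer", f"{prefill_worker}->{decode_worker}"))
    request.kv_cache_owner = decode_worker
    steps.append(("decode", decode_worker))
    return steps


router = PhaseRouter(prefill_pool=["gpu-0", "gpu-1"], decode_pool=["decode-0"])
first = plan(Request("req-1"), router)
second = plan(Request("req-2"), router)
print(first)
print(second)
```

The point of the sketch is the middle step: every request crosses the fabric once between phases, so the cost of that handoff is on the critical path of every response.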

Why the Network Is the Critical Enabler

Disaggregation only works if the KV cache can move between systems fast enough that the cross-rack latency penalty is smaller than the decode performance gain. The KV cache for a long-context inference job can run to tens of gigabytes, multiplied across thousands of concurrent sessions. The fabric connecting prefill and decode pools must deliver high bandwidth, predictable ultra-low latency, and lossless transport at scale. This is not a software problem. It is a networking problem — and it is the problem Hedgehog’s AI networking platform is built to solve.
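To put rough numbers on that condition, the sketch below estimates KV cache size for one session and the time to move it at a given line rate, then checks the break-even test. The model shape (80 layers, 8 KV heads, head dimension 128, 16-bit values), the 128K-token context, the link speeds, and the decode timings are all assumptions chosen for illustration, not measurements of any system described here.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of one session's KV cache: a K and a V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


def transfer_seconds(num_bytes: int, link_gbps: float,
                     efficiency: float = 1.0) -> float:
    """Time to move num_bytes at link_gbps, scaled by an assumed efficiency."""
    return (num_bytes * 8) / (link_gbps * 1e9 * efficiency)


def disaggregation_wins(transfer_s: float, colocated_decode_s: float,
                        disaggregated_decode_s: float) -> bool:
    """True when the handoff costs less than the decode time it saves."""
    return transfer_s + disaggregated_decode_s < colocated_decode_s


# Hypothetical model shape and context length.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=131_072)
print(f"KV cache per session: {size / 2**30:.0f} GiB")

# Hypothetical decode times for the same response, in seconds.
colocated_s, disaggregated_s = 6.0, 2.5

for gbps in (10, 400):
    t = transfer_seconds(size, link_gbps=gbps)
    verdict = disaggregation_wins(t, colocated_s, disaggregated_s)
    print(f"{gbps} Gb/s link: transfer {t:.2f} s, disaggregation wins: {verdict}")
```

With these assumed numbers the same cache takes over half a minute to move at 10 Gb/s and under a second at 400 Gb/s, which is the difference between a net loss and a net gain. Real systems can narrow the gap further by streaming the cache layer by layer rather than moving it in one piece.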

🔑 Core Insight

Disaggregated inference turns AI inference from a single-vendor GPU problem into a multi-vendor heterogeneous computing problem. And heterogeneous computing at scale is, fundamentally, a networking problem.

Why NVIDIA Paid ~$20 Billion for Groq

On December 24, 2025, NVIDIA announced it would acquire Groq’s assets for about $20 billion in cash — the largest deal in its history, structured as a non-exclusive licensing agreement, with founder Jonathan Ross and other senior leaders joining NVIDIA. The move puzzled some observers: NVIDIA already dominated AI chips, and Groq’s SRAM-based LPU needed dozens of racks to hold a single large model. The answer is disaggregated inference.

Groq’s Language Processing Unit (LPU) was purpose-built for ultra-low-latency token generation: a deterministic, statically compiled execution model and massive on-chip SRAM delivered decode speeds GPU-based systems couldn’t match. As a standalone platform, the LPU was uneconomical because of its rack count. Once it was paired with NVIDIA GPUs in a disaggregated architecture, the economics flipped completely, although NVIDIA’s split is not the classic prefill/decode handoff. Attention stays on the Rubin GPUs, where the KV cache lives in HBM; the LPUs accelerate the feed-forward layers of every decode token.

According to Ross’s own account, the two companies had quietly worked together for nearly a year on disaggregating inference across GPUs and LPUs. After Groq demonstrated the prototype, Jensen Huang called three days later to propose working more closely together — and the deal was signed three weeks after that.

At GTC 2026, NVIDIA unveiled Groq LPX chips integrated with the Vera Rubin platform through NVIDIA Dynamo. Prefill and attention run on Rubin GPUs; the feed-forward portion of each decode token is offloaded to Groq LPUs. Because the two processor types cooperate on every token, the LPX and Rubin racks are linked by a custom, tightly coupled scale-up interconnect with memory-class bandwidth — a different tier of network from the scale-out fabric. Huang described the result as unifying "two processors of extreme differences, one for high throughput, one for low latency." The combined system delivers up to 35x more throughput per watt at the highest interactivity levels.

💡 Business Logic of the Groq Deal

NVIDIA didn’t buy Groq because it needed a chip company. It bought Groq because disaggregated inference is the future of AI deployment — and Groq had the best decode accelerator on the market. The $20 billion was the cost of owning the decode layer of every NVIDIA inference stack going forward.

GTC 2026: The Inference Inflection

GTC 2026 (March 16, San Jose) was the first NVIDIA keynote in years where training hardware was not the headline. Huang said NVIDIA sees $1 trillion in orders for Blackwell and Vera Rubin through 2027, driven by inference workloads overtaking training. The technical centerpiece was Dynamo, NVIDIA’s orchestration layer for disaggregated workloads: it dynamically routes prefill and attention to Vera Rubin GPU nodes, offloads the feed-forward portion of decode to Groq LPX nodes, and manages KV cache transfer between prefill and decode pools across the Spectrum-X fabric.

Huang framed the data center as an “AI factory” whose product is tokens — and autonomous agents as the ultimate disaggregated and distributed computing model, with the language model, tool runner, orchestrator, and memory store running on different hardware across the data center. At scale, that architecture is only coherent if the network connecting those components is fast enough, and AI-optimized enough, to make the boundaries invisible.

Why Jensen Talks More About Spectrum-X and Less About InfiniBand

After announcing its acquisition of Mellanox in 2019 and closing the deal in 2020, NVIDIA used InfiniBand to dominate AI training infrastructure: its lossless transport and ultra-low latency made it the default for tightly coupled training clusters. InfiniBand’s defining traits (a proprietary protocol, tight coupling, a closed ecosystem) are advantages when every GPU comes from the same vendor, is owned by the same operator, and runs the same workload. They become liabilities in inference, where the hardware is heterogeneous (GPUs, LPUs, RDUs, CPUs), the operators are diverse, and the network must connect equipment from multiple vendors without proprietary lock-in.

Spectrum-X is NVIDIA’s AI-optimized Ethernet platform, delivering InfiniBand-class behavior (lossless transport, adaptive routing, in-network telemetry) over standards-based Ethernet. At GTC Washington, D.C. in October 2025, Huang was blunt: “Spectrum-X Ethernet is hardly Ethernet. Spectrum-X Ethernet is designed for AI performance.”

Disaggregated inference needs two tiers of network. Inside a Rubin/LPX pod, the per-token attention/feed-forward split runs over a custom, tightly coupled scale-up interconnect; that link is its own tier, with memory-class bandwidth requirements no general-purpose fabric can meet. But at data-center scale, disaggregation needs a second tier: the scale-out fabric that moves KV cache between prefill and decode pools, feeds context-memory storage like NVIDIA’s CMX platform, and connects heterogeneous pods across vendors. That tier is AI-optimized Ethernet, and it is structurally necessary to NVIDIA’s disaggregation story.

The market is confirming the shift. Meta and Oracle adopted Spectrum-X for large-scale AI networks in October 2025. The Ultra Ethernet Consortium finalized UEC 1.0 in mid-2025. 650 Group estimates Ethernet will carry roughly 91% of AI workloads by 2029, and Dell’Oro moved its projected Ethernet-over-InfiniBand crossover in AI back-end networks up to 2027. InfiniBand remains important for tightly coupled training clusters, but the inference-dominated world Huang described at GTC 2026 runs on AI-optimized Ethernet.

Why Anthropic — and Maybe Meta — Are Betting on TPUs

In October 2025, Anthropic announced the largest infrastructure commitment in its history: access to up to one million Google Cloud TPUs, bringing well over a gigawatt of capacity online in 2026, in a deal worth tens of billions of dollars. Anthropic cited the “strong price-performance and efficiency” its teams have seen with TPUs for several years. A month later, reports surfaced that Meta was in talks to rent TPUs through Google Cloud as early as 2026 and deploy them in its own data centers by 2027.

The economics are real: Google’s sixth-generation Trillium TPU delivers 4.7x the peak compute per chip and is over 67% more energy-efficient than its predecessor, TPU v5e. But the deeper logic is architectural fit. TPUs were designed from the ground up for the matrix operations that dominate LLM inference — the same purpose-built philosophy behind SambaNova RDUs and Groq LPUs. When you know exactly what workload you’re running, purpose-built silicon wins on cost per token.

Anthropic’s multi-platform strategy makes the intent explicit: Claude models run across Google TPUs, Amazon Trainium, and NVIDIA GPUs, each platform assigned to the workloads it handles most efficiently. This is disaggregation at the cloud level: routing AI workloads to purpose-built silicon rather than defaulting to general-purpose GPUs for everything.

📊 The TPU Economics Signal

When Anthropic — building some of the world’s most demanding AI models — commits tens of billions of dollars to TPUs on price-performance grounds, it validates the core premise of disaggregated inference: purpose-built silicon at each layer of the AI stack beats general-purpose GPUs doing everything.

Cerebras: The First Big Inference IPO

Cerebras Systems priced its IPO at $185 per share on May 13, 2026, raising $5.55 billion, the largest U.S. tech IPO since Uber’s 2019 debut. The stock closed its first Nasdaq session up 68% under the ticker CBRS.

The Cerebras story is, at its core, a disaggregated inference story. Its Wafer Scale Engine 3 — 4 trillion transistors and roughly 900,000 AI cores on a single dinner-plate-sized die — eliminates the inter-chip communication latency that accumulates in multi-GPU configurations. In disaggregation terms, Cerebras is a decode-optimized accelerator: no chip boundaries means no inter-chip bandwidth bottleneck, delivering token generation speeds GPU clusters cannot match.

The commercial validation came in January 2026, when OpenAI signed a deal worth more than $10 billion for 750 megawatts of Cerebras low-latency compute through 2028 — a production infrastructure commitment from the organization running the world’s most demanding inference workloads. PitchBook’s Dimitri Zabelin describes Cerebras as “positioned at the center of two converging tailwinds: the sovereign AI buildout and the incoming inference tsunami.”

The Pattern Across Every Example

NVIDIA + Groq. Anthropic + TPUs. Cerebras + OpenAI. SambaNova + B200s. In every case the insight is the same: AI inference is too expensive and too latency-sensitive to run efficiently on one type of chip. The systems that win route each phase of inference to the silicon best suited for it — and connect those silicon types over a network fast enough to make the boundaries invisible. That network is what Hedgehog builds.

Next: the engineering

Part 2 covers how to actually build that fabric — why every production RoCE network is an L3 design, what breaks when you stretch L2 across an AI back-end, and where EVPN fits: “Designing the RoCE Fabric for Disaggregated Inference: Why L3 to the Host Wins.”
