AI datacenters demand a fundamentally new approach: the raw performance and scale of an AI workload, managed with the fully automated efficiency of a hyperscaler cloud. Hedgehog delivers an open, declarative, and cloud-native network fabric that empowers your teams to move at the speed of software, not the CLI.
How the Hedgehog Fabric Works
Standard enterprise networks rely on fragile, element-by-element CLI configuration. Hedgehog replaces this with a continuous, intent-based software architecture built entirely on Kubernetes.
- The Fabric Controller: At the heart of the Hedgehog architecture is a dedicated Control Node running Kubernetes and the Hedgehog Fabric Controller.
- Declarative Intent via CRDs: Operators interact natively with the Kubernetes API, declaring network intents such as VPCs, peerings, and Gateways as custom resources whose schemas are set by Custom Resource Definitions (CRDs).
- Continuous Reconciliation: The Fabric Controller translates these intents into exact per-switch configurations. A lightweight Hedgehog Agent running on each switch continuously reconciles the declared state against the physical hardware, ensuring your network always matches your intent.
- Gateway Nodes: Dedicated, high-performance Gateway nodes sit at the edge of the fabric, providing stateful NAT, firewalling, and seamless external connectivity to the rest of your network.
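The reconciliation pattern described in the list above can be sketched in a few lines. This is a minimal illustration of the general idea, not the Hedgehog implementation: the `Switch` class, the `render_config` and `reconcile` functions, and the shape of the intent dictionary are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical stand-in for one switch's running configuration."""
    name: str
    running: dict = field(default_factory=dict)  # VPC name -> VLAN ID


def render_config(intent: dict, switch_name: str) -> dict:
    """Translate fabric-wide intent into the config one switch should hold.

    Assumption: each VPC in the intent lists the switches it attaches to.
    """
    return {
        vpc: spec["vlan"]
        for vpc, spec in intent.items()
        if switch_name in spec["switches"]
    }


def reconcile(switch: Switch, desired: dict) -> list[str]:
    """Bring the switch to the desired state and report what changed."""
    actions = []
    # Add or correct anything that is missing or has drifted.
    for vpc, vlan in desired.items():
        if switch.running.get(vpc) != vlan:
            actions.append(f"set {vpc} vlan {vlan}")
            switch.running[vpc] = vlan
    # Remove anything present on the switch but absent from the intent.
    for vpc in list(switch.running):
        if vpc not in desired:
            actions.append(f"remove {vpc}")
            del switch.running[vpc]
    return actions


# Illustrative intent and a switch whose state has drifted from it.
intent = {
    "vpc-1": {"vlan": 1001, "switches": ["leaf-1", "leaf-2"]},
    "vpc-2": {"vlan": 1002, "switches": ["leaf-2"]},
}
leaf1 = Switch("leaf-1", running={"vpc-1": 999, "stale-vpc": 42})

first_pass = reconcile(leaf1, render_config(intent, leaf1.name))
second_pass = reconcile(leaf1, render_config(intent, leaf1.name))
print(first_pass)   # the drifted VLAN is corrected and the stale entry removed
print(second_pass)  # nothing left to do once state matches intent
```

The second pass returning no actions is the property that matters: running the loop repeatedly is safe, so it can run continuously and only act when the hardware and the declared state disagree.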
Key Design Principles
We built Hedgehog on the following core tenets to bridge the gap between bare-metal AI performance and cloud-native agility.
- Freedom to Choose: Choose your switches and your preferred NOS, and change them freely down the line. No layer is a hostage. Where the software is ours, the source is open, allowing you to verify exactly what the system is meant to do rather than just trusting that it does. This commitment to choice keeps us honest twice over: on price, because you can always leave, and on behavior, because you can always look.
- Manage Outcomes, Not Network Gear: Declare what you need, like a VPC or a service insertion, and the fabric handles the underlying routing, addressing, and per-switch state. We deliver true cloud abstraction: opinionated cloud constructs instead of protocol soup. This works because our software is rigorously tested and open-source. When something breaks, you can see what the fabric intended versus what it did.
- Best-in-Class, Not Built-in-House: We don't rebuild infrastructure that already works. Our control plane is a Kubernetes API; our observability relies on the LGTM stack (Loki, Grafana, Tempo, Mimir). By integrating with the tools and skills your team already possesses, we focus our engineering purely on the network instead of reinventing inferior versions of everything around it.
- Secure from the Ground Up: Multi-tenancy is an assumption, not an add-on. Tenant isolation lives in the same declarative model as everything else and is enforced in the data plane. Workloads, models, and data remain separated by architectural design, never by a fragile config you just hope someone set correctly.
- Hyperscale Engineering for the Rest of Us: Hyperscalers run networks like software (version-controlled, pipelined, and deeply observed) with massive teams. We built that operating model directly into our product, minus the headcount. We lead with an API because real operations run on Infrastructure-as-Code, not on clicking through GUI screens at 2 AM. Built for Day Two, not the keynote.
- Two Switches or Two Thousand: The exact same system runs a two-switch lab and a two-thousand-switch cluster. Scale up to massive build-outs without control-plane bottlenecks, or start small and grow into production on the same fabric. Zero re-platforming required on the way up.
Hedgehog Solves the Hardest AI Networking Challenges
Secure Multi-Tenancy
The Challenge: Securely sharing expensive GPU clusters across different internal teams or external clients without data cross-talk.
The Solution: Hedgehog brings hyperscaler-grade logical isolation directly to bare metal. Operators can instantly spin up fully isolated Virtual Private Clouds (VPCs) with strict boundary enforcement, allowing you to partition and monetize your AI services securely.
- Provides the core cloud abstractions, such as VPCs and peerings, that modern teams need
- Enforces strict multi-tenant isolation across physical clusters
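The isolation model above amounts to default deny between tenants, with explicit peering as the only exception. The sketch below illustrates that rule using Python's standard `ipaddress` module. The VPC names, subnets, the `PEERINGS` set, and the `allowed` function are hypothetical and do not reflect the actual Hedgehog API or its enforcement mechanism.

```python
from ipaddress import ip_address, ip_network

# Hypothetical intent: the names, subnets, and peering list are illustrative.
VPCS = {
    "team-a": ip_network("10.1.0.0/16"),
    "team-b": ip_network("10.2.0.0/16"),
    "shared-storage": ip_network("10.9.0.0/16"),
}
PEERINGS = {frozenset({"team-a", "shared-storage"})}


def vpc_of(addr: str):
    """Return the name of the VPC that contains addr, or None."""
    ip = ip_address(addr)
    for name, net in VPCS.items():
        if ip in net:
            return name
    return None


def allowed(src: str, dst: str) -> bool:
    """Default deny: traffic passes only inside one VPC or across a declared peering."""
    a, b = vpc_of(src), vpc_of(dst)
    if a is None or b is None:
        return False
    return a == b or frozenset({a, b}) in PEERINGS


print(allowed("10.1.0.5", "10.1.3.7"))  # same VPC
print(allowed("10.1.0.5", "10.9.0.1"))  # peered VPCs
print(allowed("10.1.0.5", "10.2.0.1"))  # separate tenants, no peering
```

Because the peering set is the only way across a tenant boundary, reviewing who can reach whom reduces to reading that one declared list.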
Network Performance
The Challenge: AI workloads demand two distinct network profiles: massive, lossless bandwidth to synchronize distributed training jobs without stalling GPUs, and predictable, ultra-low latency for high-concurrency inference serving.
The Solution: Hedgehog delivers an automated underlay and overlay network dynamically optimized for these unique traffic flows. By deploying validated configurations on open hardware, our fabric eliminates dropped packets to slash training time-to-completion, while ensuring the high-speed, reliable data delivery required to maximize your inference tokens per second.
- Lossless, high-throughput underlay and overlay automation
- Permanent hardware independence and vendor choice
Network Availability
The Challenge: Brittle configurations and manual updates lead to downtime and broken training runs.
The Solution: The continuous reconciliation of our Fabric Agents guarantees that the network state remains exactly as intended. If a state drifts or a link fails, the fabric automatically reroutes and heals without manual intervention.
- Native management via the Kubernetes API and CRDs
- Empowers platform teams to control networking within existing workflows
Lifecycle Management
The Challenge: Racking, provisioning, and updating network hardware manually takes months and requires specialized engineers.
The Solution: Zero Touch Lifecycle Management (ZTLM) accelerates your time to GPU value. Our software automatically discovers bare-metal hardware, provisions the OS, and pushes validated configurations the moment a switch is plugged in—taking you from rack to ready in hours.
- Automated device discovery and declarative provisioning
- Free up engineering resources with hitless lifecycle maintenance
Scales to Fit
The Challenge: Network architectures that require massive upfront over-provisioning or forklift upgrades to grow.
The Solution: Hedgehog supports highly flexible, automated spine-leaf topologies. Start with the capacity you need today and scale out your physical topology non-disruptively as your AI cluster grows.
- High-bandwidth external routing and simplified BGP peering
- Unifies distributed AI workloads without traffic choke points
Observability
The Challenge: Traditional monitoring tools sample traffic too slowly to catch the micro-bursts that stall GPU workloads.
The Solution: Deep, real-time telemetry mapped directly to cluster performance. Hedgehog exposes granular flow and queue-depth visibility, streaming natively into Prometheus and Grafana to proactively detect and resolve packet drops.
- Real-time visibility into micro-bursts and queue depths
- Full automation ensures clusters live up to their absolute potential
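To see why coarse sampling misses micro-bursts, consider a synthetic example. The numbers below are invented for illustration: a queue that sits near idle for one second except for a 5 ms burst. The threshold value and variable names are assumptions, not Hedgehog telemetry fields.

```python
# Synthetic queue-depth samples in kilobytes, one per millisecond for one second.
# Assumption: a 5 ms burst to 900 KB on an otherwise near-idle queue.
samples_kb = [10] * 1000
for ms in range(400, 405):
    samples_kb[ms] = 900

THRESHOLD_KB = 500  # hypothetical depth at which drops or pauses begin

# What a once-per-second average reports for this interval.
one_second_average = sum(samples_kb) / len(samples_kb)

# What per-millisecond visibility reveals.
peak = max(samples_kb)
burst_ms = [ms for ms, depth in enumerate(samples_kb) if depth > THRESHOLD_KB]

print(f"1 s average: {one_second_average:.2f} KB")
print(f"peak: {peak} KB, over threshold for {len(burst_ms)} ms")
```

In this example the one-second average stays far below the threshold even though the queue exceeded it for five consecutive milliseconds, which is the kind of event that fine-grained queue-depth telemetry is meant to surface.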
Firewall & NAT
The Challenge: Securing proprietary models and training data at line rate without degrading cluster performance.
The Solution: Integrated, stateful NAT and firewalling at the Gateway layer. Enforce zero-trust micro-segmentation and robust security policies directly within the fabric's flow, keeping your multi-tenant boundaries locked down.
- Policy-driven security enforcement within the fabric
- Strict tenant isolation guards against internal and external cross-talk
Data Center Interconnect
The Challenge: Distributed AI training requires bridging "AI islands" to external data lakes and public clouds.
The Solution: Unify your distributed workloads. We simplify BGP peering and Data Center Interconnect (DCI) routing, providing high-bandwidth external ingress and egress to keep your training pipelines fed without choke points.
- Predictable costs for high-volume data transfer
- Hedgehog never charges ingress or egress fees