Secure Multi-Tenancy for AI: The Hedgehog VPC Platform
Delivering hyperscaler-grade isolation, self-service provisioning, and enterprise-grade security for bare-metal AI infrastructure.
Cloud providers, AWS in particular, have set the standard for what users expect from network multi-tenancy: Virtual Private Cloud (VPC) isolation, instant self-service provisioning, and connectivity that works out of the box. As organizations deploy large AI clusters on-premises and in neoclouds, they face a critical mandate: delivering those same cloud-native guarantees on hardware they own and operate.
The Need: The AI Multi-Tenancy Mandate
The economics of modern AI have fundamentally changed how organizations think about infrastructure utilization. High-performance GPUs are incredibly expensive capital assets that depreciate rapidly. Whether you are an enterprise supporting multiple internal data science teams, or a neocloud selling compute to external customers, allowing these resources to sit idle is financially untenable.
To maximize return on investment, AI infrastructure must be actively and securely partitioned. As soon as one workload finishes, the cluster must be instantly reprovisioned for the next user to ensure continuous utilization. This relentless need for high utilization, paired with strict data sovereignty and privacy concerns, makes secure multi-tenancy the foundational requirement of any modern AI datacenter.
The Challenge: Why Standard Isolation Fails on Bare Metal
Delivering true isolation on an AI cluster is significantly harder than in traditional networks. Standard host-based isolation mechanisms simply break down under the massive performance demands of modern GPUs.
- The Virtual Machine and IOMMU Bottleneck: Traditional VMs use the CPU's IOMMU to prevent cross-tenant data leaks. However, AI workloads require high-bandwidth RDMA transfers directly between NICs and GPUs. Enabling the IOMMU forces this traffic through the CPU, crippling throughput, but disabling it eliminates secure memory isolation entirely.
- The Limitations of Containers: Containers share the host kernel, so they provide weaker isolation and are exposed to known escape vulnerabilities. They also cannot serve different AI teams that require custom OS kernels, specific GPU drivers, or conflicting Kubernetes configurations.
- The Burden of Manual Configuration: This leaves the physical network as the only viable security boundary. While switches can isolate hosts using BGP/EVPN, VXLAN, and ACLs, manually configuring these protocols across dozens of devices is slow, incredibly error-prone, and destroys the self-service cloud experience.
- The Danger of "AI Ops" for Networking: Relying on LLMs to generate complex switch configurations is a dangerous gamble. Adding one tenant alters hundreds of lines of configuration across devices, producing diffs too complex for meaningful human review. If an LLM hallucinates a single route-map entry, traffic can leak across tenant boundaries.
Multi-tenant Isolation
Robust Multi-Tenant Isolation for Secure AI Workloads
When you run multiple AI workloads on shared infrastructure, each workload needs its own private compute and storage resources. You can provide this by creating a Hedgehog VPC for each workload and then attaching resources to that VPC. A Hedgehog VPC is similar to a public cloud VPC: it provides an isolated private network with support for multiple subnets, each with a user-defined VLAN and optional DHCP service.
Subnets can be marked isolated or restricted, and permit lists allow communication between specific isolated subnets. A permit list is applied on top of the isolated flag and does not affect VPC peering.
- Isolated subnet: has no connectivity with other subnets in the same VPC unless a permit list allows it.
- Restricted subnet: all hosts in the subnet are isolated from each other.
- Permit list: a list in which each element is a set of subnets that can communicate with each other.
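The interaction of the isolated flag, the restricted flag, and permit lists can be easier to follow as a small model. The Python sketch below is illustrative only and is not the Hedgehog API: the `Subnet` and `VPC` classes, the `can_talk` method, and the subnet names are all hypothetical. It also assumes that traffic between two different subnets is blocked when either one is isolated, unless some permit-list entry contains both.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Subnet:
    """Hypothetical model of one subnet inside a VPC."""
    name: str
    isolated: bool = False    # no reachability to other subnets unless permitted
    restricted: bool = False  # hosts inside this subnet cannot reach each other


@dataclass
class VPC:
    """Hypothetical model of a VPC: named subnets plus a permit list."""
    subnets: dict[str, Subnet]
    # Each element is a set of subnet names that may communicate with each other.
    permit: list[set[str]] = field(default_factory=list)

    def can_talk(self, a: str, b: str) -> bool:
        """Return True if a host in subnet `a` may reach a host in subnet `b`."""
        if a == b:
            # Intra-subnet traffic is governed only by the restricted flag.
            return not self.subnets[a].restricted
        if not self.subnets[a].isolated and not self.subnets[b].isolated:
            # Assumption: non-isolated subnets in one VPC reach each other.
            return True
        # The permit list is applied on top of the isolated flag.
        return any(a in group and b in group for group in self.permit)


vpc = VPC(
    subnets={
        "train": Subnet("train", isolated=True),
        "storage": Subnet("storage", isolated=True),
        "mgmt": Subnet("mgmt", isolated=True, restricted=True),
    },
    permit=[{"train", "storage"}],
)

print(vpc.can_talk("train", "storage"))  # permitted pair
print(vpc.can_talk("train", "mgmt"))     # isolated, no permit entry
print(vpc.can_talk("mgmt", "mgmt"))      # restricted subnet
```

In this model the training and storage subnets can exchange traffic because one permit entry names both, while the management subnet stays unreachable from the others and its hosts cannot reach each other.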
Cloud Native API
Modern Architecture for Scalable, Automated AI Networking
The Hedgehog AI Network is built on Kubernetes, which means our API is cloud native. We built Hedgehog on principles like scalability, automation, resilience, and microservices architecture.
Designed for Cloud Infrastructure
Operates seamlessly in containerized and orchestrated platforms like Kubernetes.
Stateless
All fabric configuration, topology, and operational state is declaratively managed through Kubernetes CRDs, making the system resilient to pod restarts and enabling GitOps workflows.
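One way to see why declarative state tolerates restarts: a controller that holds nothing in memory can recompute the required changes at any time by comparing the declared objects with what is actually configured. The Python sketch below illustrates that general reconciliation idea under stated assumptions. It is not Hedgehog's controller code, and `plan_changes`, the object names, and the `vlan` field are hypothetical.

```python
def plan_changes(
    desired: dict[str, dict], observed: dict[str, dict]
) -> list[tuple[str, str]]:
    """Return (action, name) pairs that move observed state to desired state."""
    actions: list[tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Hypothetical declared state (what the stored objects say should exist).
desired = {"vpc-1": {"vlan": 1001}, "vpc-2": {"vlan": 1002}}
# Hypothetical observed state (what is currently configured).
observed = {"vpc-1": {"vlan": 1001}, "vpc-3": {"vlan": 1003}}

print(plan_changes(desired, observed))
# [('create', 'vpc-2'), ('delete', 'vpc-3')]
```

Because the plan is derived entirely from the two inputs, running it again after a restart yields the same result, and running it once the states match yields an empty plan.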
Microservices-Oriented
Built as modular, independent services that communicate via lightweight protocols (HTTP/REST, gRPC), enabling flexible deployment and scaling.
Automated and CI/CD-Friendly
Supports DevOps practices, including CI/CD pipelines, infrastructure as code (IaC), automated testing, and GitOps.
Resilient and Fault-Tolerant
Designed with failover, load balancing, and observability in mind.
Operates Like AWS VPC
Optimized for Private and Hybrid AI Clouds
Every AWS user works with a VPC from the moment they launch their first EC2 instance. VPCs exist in AWS regions and availability zones, and they provide subnets that control tenant access to EC2 instances.
Hedgehog VPCs exist in data centers for AI training, fine-tuning, or inference at the data edge. They also provide subnets that control access to GPU, compute, and storage resources.
We designed the Hedgehog VPC on the same principles as Amazon VPC, much as Microsoft, Google, and Oracle did for their own cloud services. The difference is that Hedgehog VPCs are built for private and hybrid AI cloud use cases.
Hedgehog VPC Features
- Enables secure multi-tenant isolation for compute and storage resources
- Operates like AWS VPC
- Kubernetes-native API supports cloud-native toolchain
- Infrastructure as code
- Supports GitOps
- Zero touch provisioning
- Full life cycle management
- Includes Grafana, Loki, Prometheus observability
- Models network as Kubernetes cluster
- Edge fabric for AI inferencing
- GPU fabric for AI training
- Data Center fabric for core workloads