Hands-On Fabric Diagnosis Lab
Introduction
In Module 4.1a: Systematic Troubleshooting Framework, you learned the methodology for diagnosing fabric issues:
- Hypothesis-driven investigation
- Common failure modes
- Layered diagnostic workflow (Resource check → Agent CRD → Grafana → Logs)
- Decision trees for structured diagnosis
Now it's time to put that methodology into practice.
The Lab Scenario
You'll diagnose a real connectivity failure: A VPCAttachment was created successfully — kubectl describe shows Events: <none> (expected) and agents are converged — but the server cannot communicate within the VPC.
This is a classic troubleshooting challenge: Configuration looks correct, system reports success, but it doesn't work.
Using the systematic methodology from Module 4.1a, you'll:
- Gather symptoms and form hypotheses
- Test hypotheses systematically using kubectl and Grafana
- Identify the root cause through elimination
- Document your findings and solution
By the end of this lab, you'll have practiced the complete diagnostic workflow on a realistic scenario.
Learning Objectives
By the end of this module, you will be able to:
- Apply hypothesis-driven investigation - Form and test hypotheses systematically
- Use layered diagnostic workflow - Progress from resource check to Agent CRD to Grafana
- Follow decision trees - Apply structured diagnostic paths to real scenarios
- Identify VLAN configuration issues - Diagnose VLAN conflicts and mismatches
- Document troubleshooting findings - Create clear problem statements and solutions
Prerequisites
Before starting this module, you should have:
Completed Modules:
- Module 4.1a: Systematic Troubleshooting Framework (methodology and decision trees)
- All previous courses (Courses 1-3)
Understanding:
- Hypothesis-driven investigation framework
- Common failure modes (VPC attachment, BGP, interface, configuration drift)
- Layered diagnostic workflow
- Decision trees for common scenarios
Environment:
- kubectl configured and authenticated
- Grafana access (http://YOUR_VM_IP:3000)
- Hedgehog fabric with at least one VPC deployed
Scenario
Incident Report:
A developer reports that server-07 in VPC customer-app-vpc cannot reach the gateway or other servers in the VPC. The VPCAttachment was created this morning using the standard GitOps workflow.
Initial Investigation:
You run kubectl describe vpcattachment customer-app-vpc-server-07 and see Events: <none> — expected for fabric CRDs. The resource exists. kubectl get agents shows all agents converged (APPLIEDG == CURRENTG).
But the server still has no connectivity.
Your Task:
Use systematic troubleshooting methodology to identify the root cause and document a solution.
Known Information:
- VPC: customer-app-vpc
- Subnet: frontend (10.20.10.0/24, gateway 10.20.10.1, VLAN 1025)
- Server: server-07
- Expected Connection: server-07--unbundled--leaf-04
- VPCAttachment: customer-app-vpc-server-07
- Symptom: Server cannot ping gateway 10.20.10.1
Before You Begin the Lab
The hands-on exercises in this module require the Hedgehog Virtual AI Data Center (vAIDC) — a pre-configured GCP lab environment that includes a complete Hedgehog fabric, Grafana observability dashboards, and all required services ready to use.
Ensure your vAIDC is running before proceeding. If you haven't set it up yet, complete the Accessing the Hedgehog vAIDC module first — it takes about 20 minutes and only needs to be done once.
Hands-On Lab
Lab Overview
Objective: Diagnose a VPCAttachment connectivity failure using systematic troubleshooting methodology.
Scenario: Server-07 in VPC customer-app-vpc cannot reach the gateway or other servers. The VPCAttachment was created this morning — kubectl describe shows Events: <none> (expected) and agents are converged.
Environment:
- kubectl: Already configured
- Grafana: http://YOUR_VM_IP:3000 (admin/admin)
- Server access: Available if needed
Known Information:
- VPC: customer-app-vpc
- Subnet: frontend (10.20.10.0/24, gateway 10.20.10.1, VLAN 1025)
- Server: server-07
- Connection: server-07--unbundled--leaf-04 (correct connection name)
- VPCAttachment: customer-app-vpc-server-07
Task 1: Gather Symptoms and Form Hypotheses
Estimated Time: 2 minutes
Objective: Document symptoms and generate possible causes using systematic methodology.
Step 1.1: Document Symptoms
Write down what you know (use a text file or notebook):
Symptoms:
- Expected behavior: server-07 should ping gateway 10.20.10.1
- Actual behavior: ping fails with "Destination Host Unreachable"
- Recent change: VPCAttachment created today via GitOps
- kubectl describe: Events: <none> (expected — fabric CRDs do not emit K8s events)
Timeline:
- VPCAttachment created: This morning (10:00 AM)
- Issue reported: This morning (10:15 AM)
- First check (kubectl describe): No errors visible
This documentation helps you:
- Clarify what's actually broken
- Establish a timeline
- Identify recent changes
Step 1.2: Form Hypotheses
Based on symptoms, list possible causes. Use the common failure modes from Module 4.1a:
Hypothesis List:
- Wrong connection name - VPCAttachment references incorrect switch or connection
- Wrong subnet - VPCAttachment references non-existent subnet
- VLAN not configured - Reconciliation failed, VLAN 1025 not on switch interface
- VLAN mismatch - VLAN conflict caused different VLAN to be allocated
- Interface down - leaf-04 Ethernet interface is operationally down
- Server interface misconfigured - Server enp2s1 not configured correctly
- nativeVLAN mismatch - VPCAttachment and server expect different VLAN tagging
Why these hypotheses?
- Covers configuration issues (1, 2, 4, 7)
- Covers connectivity issues (3, 5)
- Covers server-side issues (6)
Success Criteria
- ✅ Symptoms documented clearly
- ✅ At least 5 hypotheses listed
- ✅ Hypotheses cover configuration, connectivity, and server issues
Time check: You should complete this in 2 minutes or less. Don't overthink it; list the possibilities quickly, then test them systematically.
Task 2: Test Hypotheses with kubectl
Estimated Time: 3 minutes
Objective: Systematically test each hypothesis using kubectl commands.
Step 2.1: Check VPCAttachment Configuration
Test hypotheses 1 and 2: Wrong connection or wrong subnet.
# Get full VPCAttachment spec
kubectl get vpcattachment customer-app-vpc-server-07 -o yaml
# Check connection reference
kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.connection}'
# Expected: server-07--unbundled--leaf-04
# Check subnet reference
kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.subnet}'
# Expected: customer-app-vpc/frontend
Verify connection exists:
kubectl get connection server-07--unbundled--leaf-04
# Should show: Connection exists
Test Result:
- Connection reference: ✅ Correct (server-07--unbundled--leaf-04)
- Subnet reference: ✅ Correct (customer-app-vpc/frontend)
Hypotheses 1 and 2: ELIMINATED
Step 2.2: Verify Subnet Exists in VPC
Test hypothesis 2 more thoroughly:
# Check if frontend subnet exists in customer-app-vpc
kubectl get vpc customer-app-vpc -o yaml | grep -A 5 "frontend:"
# Expected output shows:
# frontend:
# subnet: 10.20.10.0/24
# gateway: 10.20.10.1
# vlan: 1025
Test Result:
- Subnet exists: ✅
- VLAN specified: 1025
Hypothesis 2: ELIMINATED (subnet exists)
Step 2.3: Check Agent CRD for leaf-04
Test hypotheses 3, 4, and 5: VLAN configuration and interface state.
Identify which interface server-07 connects to:
kubectl get connection server-07--unbundled--leaf-04 -o yaml | grep "port:"
# Expected output: leaf-04/E1/8
Check interface state in Agent CRD:
# Check if interface is up
kubectl get agent leaf-04 -o json | jq '.status.state.interfaces["E1/8"].oper'
# Expected: "up"
# Check which VLANs are configured
kubectl get agent leaf-04 -o json | jq '.status.state.interfaces["E1/8"].vlans'
# Look for: VLAN list
CRITICAL FINDING:
The Agent CRD shows:
{
  "oper": "up",
  "admin": "up",
  "vlans": [1020],
  ...
}
Interface E1/8 is up (✅) but VLAN is 1020, not 1025!
Test Result:
- Interface oper: ✅ Up (Hypothesis 5 eliminated)
- VLAN configured: ❌ VLAN 1020, expected 1025
Hypothesis 4: CONFIRMED (VLAN mismatch)
Step 2.4: Investigate Why VLAN is Wrong
Now that you've identified the mismatch, investigate the root cause:
# Check VPC configuration again
kubectl get vpc customer-app-vpc -o yaml | grep -A 10 "frontend:"
# Check all VLANs currently in use
kubectl get vpc -A -o yaml | grep "vlan:" | sort
# Look for VLAN 1025 usage
kubectl get vpc -A -o yaml | grep "1025"
Discovery:
Another VPC (existing-vpc-prod) is using VLAN 1025. When customer-app-vpc was created, the VLANNamespace automatically allocated VLAN 1020 instead due to the conflict!
Root Cause Identified:
VLAN conflict. The VPC subnet definition manually specifies VLAN 1025, but the VLANNamespace allocated VLAN 1020 because 1025 was already in use by another VPC.
Why kubectl describe showed no errors:
The controller successfully reconciled the VPCAttachment with VLAN 1020 (the allocated VLAN). No error occurred from the controller's perspective—the configuration just doesn't match the operator's expectation.
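The allocation behavior can be sketched in a few lines of shell. This is an illustrative model only, not Hedgehog's actual allocator logic: it shows why a requested VLAN can silently differ from the allocated one — if the requested ID is taken, the allocator falls back to the next free ID in the range, and reconciliation still succeeds without error. The VLAN values and range are sample data matching this lab's scenario.

```shell
# Illustrative sketch — NOT Hedgehog's real allocator.
in_use="1025"              # VLANs already held by other VPCs (here: existing-vpc-prod)
requested=1025             # VLAN manually specified in the VPC subnet
range_start=1020           # sample VLANNamespace range
range_end=1029

# Returns success (0) if the given VLAN is not in the in-use list.
is_free() {
  case " $in_use " in
    *" $1 "*) return 1 ;;  # already allocated
    *) return 0 ;;
  esac
}

if is_free "$requested"; then
  allocated=$requested
else
  # Requested ID is taken: fall back to the first free ID in the range.
  allocated=""
  v=$range_start
  while [ "$v" -le "$range_end" ]; do
    if is_free "$v"; then allocated=$v; break; fi
    v=$((v + 1))
  done
fi
echo "requested=$requested allocated=$allocated"
# → requested=1025 allocated=1020
```

Note how the fallback succeeds silently: from the allocator's point of view nothing went wrong, which is exactly why `kubectl describe` showed no errors.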
Step 2.5: Document Root Cause
Write your findings:
Root Cause:
- VLAN conflict between customer-app-vpc and existing-vpc-prod
- VPC subnet manually specified VLAN 1025
- VLANNamespace allocated VLAN 1020 instead (1025 already in use)
- Switch configured with VLAN 1020 (correct according to allocation)
- Server expects VLAN 1025 (incorrect expectation)
Evidence:
- Agent CRD shows VLAN 1020 on leaf-04/E1/8
- VPC spec shows VLAN 1025 in subnet definition
- existing-vpc-prod is using VLAN 1025
Solution:
- Update customer-app-vpc frontend subnet VLAN to 1020 (match allocation)
- OR choose a different unused VLAN for customer-app-vpc
Success Criteria
- ✅ Hypotheses tested systematically
- ✅ Root cause identified (VLAN mismatch due to conflict)
- ✅ Evidence documented
- ✅ Solution path clear
Time check: You should complete this in 3 minutes or less with practice.
Task 3: Validate with Grafana
Estimated Time: 1-2 minutes
Objective: Confirm findings using Grafana dashboards for visual validation.
Step 3.1: Check Interfaces Dashboard
- Open Grafana: http://YOUR_VM_IP:3000 (admin/admin)
- Navigate to "Hedgehog Switch Interface Counters" dashboard
- Set filters:
  - Switch: leaf-04
  - Interface: E1/8
- Observe:
- Operational State: Should show "up"
- VLANs Configured: Should show VLAN 1020
- Traffic Patterns: Should show minimal or no traffic (server can't communicate)
Expected Finding:
Grafana confirms:
- Interface is up ✅
- VLAN 1020 configured ✅
- Low or zero traffic (consistent with connectivity failure)
Step 3.2: Check Fabric Dashboard
- Navigate to "Hedgehog Fabric" dashboard
- Check BGP sessions for leaf-04
- Verify all BGP sessions are established
Expected Finding:
All BGP sessions show "established" state, confirming this is not a BGP routing issue.
Step 3.3: Correlation Check
Look at Grafana timeline:
- When was VLAN 1020 added to E1/8? (Should correlate with VPCAttachment creation time)
- Any interface state changes around that time? (Should show VLAN added, no flapping)
What Grafana Tells You:
- Visual confirmation of Agent CRD findings
- No intermittent issues (interface stable)
- VLAN was added successfully (just the wrong VLAN ID)
Success Criteria
- ✅ Grafana confirms VLAN 1020 configured (not 1025)
- ✅ Interface is up and stable
- ✅ BGP sessions healthy (not a routing issue)
Time check: 1-2 minutes for visual confirmation.
Task 4: Document Root Cause and Solution
Estimated Time: 1 minute
Objective: Write clear problem statement and solution for handoff or documentation.
Step 4.1: Problem Statement
Write a concise problem statement:
Problem Statement:
VPCAttachment customer-app-vpc-server-07 created successfully without errors,
but server-07 cannot communicate within VPC.
Root Cause:
VLAN conflict. VPC subnet specifies VLAN 1025, but switch interface configured
with VLAN 1020 due to conflict with existing-vpc-prod (already using VLAN 1025).
VLANNamespace automatically allocated VLAN 1020 instead.
Controller reconciled successfully with VLAN 1020 (no errors), but configuration
does not match operator expectation (VLAN 1025).
Impact:
- Server-07 has no connectivity within customer-app-vpc
- Application dependent on server-07 is down
Step 4.2: Solution Options
Document solution paths:
Option 1 (Recommended): Update VPC VLAN to match allocation
# Edit customer-app-vpc in Gitea
# Change frontend subnet VLAN from 1025 to 1020
# In Gitea: network-like-hyperscaler/vpcs/customer-app-vpc.yaml
spec:
  subnets:
    frontend:
      subnet: 10.20.10.0/24
      gateway: 10.20.10.1
      vlan: 1020  # Changed from 1025
# Commit change
git add vpcs/customer-app-vpc.yaml
git commit -m "Fix VLAN conflict: use allocated VLAN 1020 for customer-app-vpc frontend"
git push
# Wait for ArgoCD sync
kubectl get vpc customer-app-vpc -w
# Verify Agent convergence after ArgoCD sync
kubectl get agents # Wait for APPLIEDG == CURRENTG
Why recommended: Aligns configuration with reality (VLAN 1020 already allocated and configured).
Option 2: Choose unused VLAN
# Check available VLANs
kubectl get vpc -A -o yaml | grep "vlan:" | sort
# Identify unused VLAN (e.g., 1026)
# Edit customer-app-vpc in Gitea
# Change frontend subnet VLAN to 1026
# Commit to Gitea, wait for sync
Why consider: If VLAN 1025 has significance (e.g., organizational standard).
Option 3: Remove VLAN specification (let VLANNamespace allocate)
# Edit customer-app-vpc in Gitea
# Remove manual VLAN specification
spec:
  subnets:
    frontend:
      subnet: 10.20.10.0/24
      gateway: 10.20.10.1
      # vlan: 1025  # Remove this line
# VLANNamespace will auto-allocate next available VLAN
Why consider: Prevents future conflicts, follows GitOps best practices.
Step 4.3: Prevention
Document how to prevent this issue in the future:
Prevention:
1. Do not manually specify VLANs in VPC subnets unless required
- Let VLANNamespace auto-allocate to avoid conflicts
2. If manual VLAN required, check for conflicts first:
kubectl get vpc -A -o yaml | grep "vlan:" | sort
3. Use VLANNamespace ranges to segregate VLAN usage:
- VLANNamespace "production": 1000-1999
- VLANNamespace "development": 2000-2999
4. Verify admission webhook accepted VPC at apply time:
Check ArgoCD sync history for any webhook errors when VPC was applied
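The conflict check from step 2 above can be wrapped into a small pre-commit script. The sample data below stands in for real `kubectl get vpc -A -o yaml | grep "vlan:"` output so the sketch runs anywhere; in practice you would pipe the live kubectl output into the same pipeline.

```shell
# Pre-commit VLAN conflict check sketch (sample data in place of kubectl output).
existing='    vlan: 1020
    vlan: 1025
    vlan: 1030'
candidate=1025   # VLAN you intend to assign to a new subnet

# Extract the numeric IDs and test the candidate against them.
in_use=$(printf '%s\n' "$existing" | awk '{print $2}' | sort -n)
if printf '%s\n' "$in_use" | grep -qx "$candidate"; then
  result="conflict: VLAN $candidate already in use"
else
  result="VLAN $candidate is free"
fi
echo "$result"
# → conflict: VLAN 1025 already in use
```

Running this before committing the VPC manifest would have caught the conflict in this lab before the VLANNamespace silently allocated a different ID.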
Success Criteria
- ✅ Root cause documented clearly
- ✅ Solution options identified with pros/cons
- ✅ Next steps defined
- ✅ Prevention measures documented
Lab Summary
What You Accomplished:
You successfully diagnosed a VPCAttachment connectivity failure using systematic troubleshooting methodology:
- ✅ Gathered symptoms and formed hypotheses
- ✅ Tested hypotheses systematically using kubectl
- ✅ Identified root cause (VLAN mismatch due to conflict)
- ✅ Validated findings with Grafana
- ✅ Documented solution path and prevention measures
Key Techniques Used:
- Hypothesis-driven investigation (not random checking)
- Layered diagnostic approach (resource check → Agent CRD → Grafana)
- Evidence-based elimination (tested each hypothesis)
- Root cause identification (VLAN conflict, not symptoms)
Time to Resolution:
- Task 1: 2 minutes (symptoms and hypotheses)
- Task 2: 3 minutes (hypothesis testing)
- Task 3: 1-2 minutes (Grafana validation)
- Task 4: 1 minute (documentation)
- Total: 7-8 minutes from symptom to solution
Contrast with Random Checking:
Without systematic methodology, you might have:
- Checked controller logs (no useful info)
- Restarted the controller (no effect)
- Deleted and recreated VPCAttachment (same result)
- Checked BGP (not relevant)
- Escalated to support (with no evidence)
- Spent 30+ minutes without identifying root cause
Troubleshooting
Common Lab Challenges
Challenge: "All my hypotheses were eliminated, but the issue persists"
What this means: Your initial hypothesis list didn't include the actual root cause.
What to do:
- Review your evidence collection (Agent CRD, Grafana, logs)
- Form new hypotheses based on what you did find
- Example: If a VLAN is configured and the interface is up, maybe the VLAN ID itself is wrong
Key insight: Hypothesis-driven investigation is iterative. Eliminating hypotheses is progress—it narrows the problem space.
Challenge: "I found the root cause but don't know how to fix it"
What this means: Diagnosis succeeded, but solution implementation is unclear.
What to do:
- Consult Module 2.2 (VPC Design Patterns) for configuration guidance
- Check Module 1.3 (GitOps Workflow) for making changes
- Reference Hedgehog documentation for CRD field definitions
Key insight: Diagnosis and resolution are separate skills. This module focuses on diagnosis—Module 4.2 covers rollback and recovery.
Challenge: "kubectl commands are slow or timing out"
What this means: The Kubernetes API server may be under load, or there may be a network issue.
What to do:
- Check kubectl cluster-info and basic connectivity
- Use the --request-timeout flag to extend the timeout
- If persistent, check control node resources (CPU, memory)
Challenge: "Grafana dashboards show 'No Data'"
What this means: Telemetry may not be configured or Prometheus/Loki not accessible.
What to do:
- Check if telemetry is configured in Fabricator
- Rely on kubectl and Agent CRD for this lab
- Grafana validation is optional if telemetry is not configured
Reference: Module 3.1 (Telemetry and Prometheus) for telemetry setup.
Challenge: "I'm not sure which hypothesis to test first"
What this means: You need a prioritization strategy.
What to do: Test hypotheses in this order:
- Fastest to check: Resource existence + Agent convergence (10-30 seconds)
- Most likely: Common failure modes from Module 4.1a
- Highest impact: Issues that would affect multiple resources
Key insight: Start with quick checks, then move to detailed investigation.
Debugging the Diagnostic Process
If you're stuck, ask yourself:
Did I collect evidence from all four layers?
- Resource check + Agent convergence, Agent CRD, Grafana, logs
Am I testing hypotheses or guessing?
- Each hypothesis should have a specific test
Am I documenting what I find?
- Write down results to avoid re-checking
Have I used decision trees?
- Follow Decision Tree 3 for "VPCAttachment shows success but doesn't work"
Am I comparing expected vs. actual?
- VPC expects VLAN 1025, Agent CRD shows VLAN 1020 → mismatch
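That last comparison can be made mechanical. The two values are hard-coded here so the sketch is self-contained; in practice they come from the VPC subnet spec and the Agent CRD via the kubectl commands used in Task 2.

```shell
# Minimal expected-vs-actual check (values hard-coded for illustration).
expected_vlan=1025   # what the VPC subnet definition specifies
actual_vlan=1020     # what the Agent CRD reports on leaf-04/E1/8

if [ "$expected_vlan" -eq "$actual_vlan" ]; then
  verdict="match"
else
  verdict="MISMATCH: spec=$expected_vlan switch=$actual_vlan"
fi
echo "$verdict"
# → MISMATCH: spec=1025 switch=1020
```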
Resources
Reference Documentation
Hedgehog CRD Reference:
- VPC and VPCAttachment spec fields
- Agent CRD status fields (interfaces, bgpNeighbors, platform)
- Connection CRD structure
Observability and Diagnostics:
- Module 3.1: Telemetry and Prometheus
- Module 3.2: Grafana Dashboards
- Module 3.3: Agent CRD Deep Dive
- Module 3.4: Pre-Escalation Diagnostic Checklist
GitOps Workflow:
- Module 1.3: GitOps with Hedgehog Fabric
- Module 4.2: Rollback and Recovery (upcoming)
Quick Reference: Diagnostic Commands
Layer 1: Resource Check + Agent Convergence
# Check if resources exist (missing = admission webhook rejected)
kubectl get vpc <name>
kubectl get vpcattachment <name>
# Check Agent convergence (APPLIEDG == CURRENTG means applied)
kubectl get agents
# Describe resource (Events: <none> is expected for fabric CRDs)
kubectl describe vpcattachment <name>
Layer 2: Agent CRD
# Check agent readiness (default namespace)
kubectl get agents
# View interface state
kubectl get agent <switch> -o json | jq '.status.state.interfaces["E1/<N>"]'
# View BGP neighbors
kubectl get agent <switch> -o jsonpath='{.status.state.bgpNeighbors}' | jq
Layer 3: Grafana (http://YOUR_VM_IP:3000, admin/admin)
- Hedgehog Fabric
- Hedgehog Switch Interface Counters
- Hedgehog Fabric Logs
Layer 4: Logs
# Controller logs
kubectl logs -n fab deployment/fabric-ctrl --tail=200
# Agent logs (use label selector)
kubectl logs -n fab -l "wiring.githedgehog.com/agent=<switch>"
Decision Tree Quick Reference
Use Decision Tree 1 when:
- Server cannot communicate within VPC
- VPCAttachment exists, no errors
Use Decision Tree 2 when:
- Cross-VPC connectivity fails
- VPCPeering exists
Use Decision Tree 3 when:
- kubectl describe shows success
- Server has no connectivity
- No obvious errors
Common VLAN Issues
| Symptom | Root Cause | Solution |
|---|---|---|
| VLAN mismatch (allocated ≠ specified) | VLAN conflict | Update VPC VLAN to match allocation |
| VLAN not configured on interface | Wrong connection reference | Fix VPCAttachment connection field |
| VLAN configured but wrong ID | Manual VLAN specification conflict | Remove manual VLAN, let VLANNamespace allocate |
| nativeVLAN mismatch | VPCAttachment nativeVLAN ≠ server config | Align nativeVLAN setting with server interface |
Common BGP Issues
| Symptom | Root Cause | Solution |
|---|---|---|
| BGP state: idle | Neighbor IP unreachable | Check ExternalAttachment switch IP and neighbor IP |
| BGP state: active | ASN mismatch or config error | Verify ASN in ExternalAttachment matches external router |
| BGP established but no routes | Permit list missing subnets | Update VPCPeering or ExternalPeering permit |
| Routes filtered | Community mismatch | Check External inboundCommunity and outboundCommunity |
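A quick way to spot unhealthy sessions in saved Agent CRD output is a plain text scan. The sample JSON below is illustrative (the field names follow this lab's examples and may differ in your release); with a live fabric you would feed it the output of `kubectl get agent <switch> -o jsonpath='{.status.state.bgpNeighbors}'`.

```shell
# Scan BGP neighbor state for any session that is not "established".
# Sample data stands in for real Agent CRD output.
neighbors='{"172.30.128.10":{"state":"established"},"172.30.128.12":{"state":"idle"}}'

# Pull out every "state" field, then filter away the healthy ones.
bad=$(printf '%s' "$neighbors" | grep -o '"state":"[a-z]*"' | grep -v established)
echo "${bad:-all established}"
# → "state":"idle"
```

Any line printed other than "all established" means at least one session needs investigation per the table above.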
Escalation Criteria
When to escalate to support:
All decision tree paths exhausted
- Followed relevant decision tree to end
- Issue doesn't match known patterns
Evidence collected but root cause unclear
- Completed all 4 layers of diagnostic workflow
- Findings don't point to specific root cause
Suspected platform issue
- Agent CRD shows switch failures (PSU, temperature)
- Controller logs show internal errors
Time-sensitive production outage
- Issue blocking critical services
- Need expert assistance to resolve quickly
Before escalating, ensure you have:
- ✅ Symptoms documented
- ✅ Hypotheses tested
- ✅ Evidence collected (events, Agent CRD, Grafana, logs)
- ✅ Decision tree followed
- ✅ Relevant kubectl outputs saved
Reference Module 4.3 (Coordinating with Support) for escalation procedures.
Next Steps
Module 4.2: Rollback and Recovery
Learn how to safely undo changes when things go wrong:
- GitOps rollback procedures
- Safe deletion order for Hedgehog resources
- Handling stuck resources
- Emergency recovery patterns
Module 4.3: Coordinating with Support
Learn how to work effectively with Hedgehog support:
- Crafting effective support tickets
- Providing diagnostic evidence
- Troubleshooting with support engineers
- Post-resolution follow-up
Module 4.4: Post-Incident Review
Learn how to conduct effective post-incident reviews:
- Documenting incidents
- Root cause analysis
- Prevention measures
- Knowledge sharing
Module Assessment
1. Server-03 in VPC prod-vpc cannot reach server-04 in the same VPC. Both VPCAttachments exist and all agents are converged (APPLIEDG == CURRENTG). What is your NEXT diagnostic step using systematic methodology?
2. VPCPeering between vpc-a and vpc-b exists. kubectl describe shows no errors. Server in vpc-a can ping its own gateway but CANNOT ping server in vpc-b. Which failure mode is most likely?
3. Using Decision Tree 3, you have verified: VPCAttachment references correct connection, subnet exists in VPC, and Agent CRD shows VLAN configured on interface. According to the decision tree, what should you check NEXT?
4. Resources exist, agents are converged, and Agent CRD shows all interfaces up with VLANs configured correctly, but the server still cannot communicate. Why should you check Grafana BEFORE checking controller logs?
Hands-On Lab
Complete the hands-on lab activities above, then click below to mark the lab as complete.