Hands-On Fabric Diagnosis Lab
Introduction
In Module 4.1a: Systematic Troubleshooting Framework, you learned the methodology for diagnosing fabric issues:
- Hypothesis-driven investigation
- Common failure modes
- Layered diagnostic workflow (Events → Agent CRD → Grafana → Logs)
- Decision trees for structured diagnosis
Now it's time to put that methodology into practice.
The Lab Scenario
You'll diagnose a real connectivity failure: A VPCAttachment was created successfully with no errors in kubectl events, but the server cannot communicate within the VPC.
This is a classic troubleshooting challenge: the configuration looks correct, the system reports success, but it doesn't work.
Using the systematic methodology from Module 4.1a, you'll:
- Gather symptoms and form hypotheses
- Test hypotheses systematically using kubectl and Grafana
- Identify the root cause through elimination
- Document your findings and solution
By the end of this lab, you'll have practiced the complete diagnostic workflow on a realistic scenario.
Learning Objectives
By the end of this module, you will be able to:
- Apply hypothesis-driven investigation - Form and test hypotheses systematically
- Use layered diagnostic workflow - Progress from events to Agent CRD to Grafana
- Follow decision trees - Apply structured diagnostic paths to real scenarios
- Identify VLAN configuration issues - Diagnose VLAN conflicts and mismatches
- Document troubleshooting findings - Create clear problem statements and solutions
Prerequisites
Before starting this module, you should have:
Completed Modules:
- Module 4.1a: Systematic Troubleshooting Framework (methodology and decision trees)
- All previous courses (Courses 1-3)
Understanding:
- Hypothesis-driven investigation framework
- Common failure modes (VPC attachment, BGP, interface, configuration drift)
- Layered diagnostic workflow
- Decision trees for common scenarios
Environment:
- kubectl configured and authenticated
- Grafana access (http://localhost:3000)
- Hedgehog fabric with at least one VPC deployed
Scenario
Incident Report:
A developer reports that server-07 in VPC customer-app-vpc cannot reach the gateway or other servers in the VPC. The VPCAttachment was created this morning using the standard GitOps workflow.
Initial Investigation:
You run kubectl describe vpcattachment customer-app-vpc-server-07 and see no error events. The resource exists, the controller processed it successfully, and there are no warnings.
But the server still has no connectivity.
Your Task:
Use systematic troubleshooting methodology to identify the root cause and document a solution.
Known Information:
- VPC: customer-app-vpc
- Subnet: frontend (10.20.10.0/24, gateway 10.20.10.1, VLAN 1025)
- Server: server-07
- Expected Connection: server-07--unbundled--leaf-04
- VPCAttachment: customer-app-vpc-server-07
- Symptom: Server cannot ping gateway 10.20.10.1
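To reproduce the initial check yourself before starting the tasks, the commands below mirror the investigation described above (resource names are taken from this scenario):
# Reproduce the reported first check
kubectl describe vpcattachment customer-app-vpc-server-07
# Quick scan for recent Warning events across the cluster
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20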
Hands-On Lab
Lab Overview
Objective: Diagnose a VPCAttachment connectivity failure using systematic troubleshooting methodology.
Scenario: Server-07 in VPC customer-app-vpc cannot reach the gateway or other servers. The VPCAttachment was created this morning and kubectl describe shows no error events.
Environment:
- kubectl: Already configured
- Grafana: http://localhost:3000
- Server access: Available if needed
Known Information:
- VPC: customer-app-vpc
- Subnet: frontend (10.20.10.0/24, gateway 10.20.10.1, VLAN 1025)
- Server: server-07
- Connection: server-07--unbundled--leaf-04 (correct connection name)
- VPCAttachment: customer-app-vpc-server-07
Task 1: Gather Symptoms and Form Hypotheses
Estimated Time: 2 minutes
Objective: Document symptoms and generate possible causes using systematic methodology.
Step 1.1: Document Symptoms
Write down what you know (use a text file or notebook):
Symptoms:
- Expected behavior: server-07 should ping gateway 10.20.10.1
- Actual behavior: ping fails with "Destination Host Unreachable"
- Recent change: VPCAttachment created today via GitOps
- kubectl describe: No error events
Timeline:
- VPCAttachment created: This morning (10:00 AM)
- Issue reported: This morning (10:15 AM)
- First check (kubectl describe): No errors visible
This documentation helps you:
- Clarify what's actually broken
- Establish a timeline
- Identify recent changes
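A plain-text note is enough. As a minimal sketch (the file name and layout are illustrative, not prescribed), you could capture the symptoms like this:
cat > diagnosis-notes.txt <<'EOF'
Symptoms:
- Expected: server-07 pings gateway 10.20.10.1
- Actual: ping fails with "Destination Host Unreachable"
- Recent change: VPCAttachment created today via GitOps
- kubectl describe: no error events
Timeline: created 10:00 AM, reported 10:15 AM, first check showed no errors
EOF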
Step 1.2: Form Hypotheses
Based on symptoms, list possible causes. Use the common failure modes from Module 4.1a:
Hypothesis List:
1. Wrong connection name - VPCAttachment references incorrect switch or connection
2. Wrong subnet - VPCAttachment references non-existent subnet
3. VLAN not configured - Reconciliation failed, VLAN 1025 not on switch interface
4. VLAN mismatch - VLAN conflict caused different VLAN to be allocated
5. Interface down - leaf-04 Ethernet interface is operationally down
6. Server interface misconfigured - Server enp2s1 not configured correctly
7. nativeVLAN mismatch - VPCAttachment and server expect different VLAN tagging
Why these hypotheses?
- Covers configuration issues (1, 2, 4, 7)
- Covers connectivity issues (3, 5)
- Covers server-side issues (6)
Success Criteria
- ✅ Symptoms documented clearly
- ✅ At least 5 hypotheses listed
- ✅ Hypotheses cover configuration, connectivity, and server issues
Time check: You should complete this in 2 minutes or less. Don't overthink it: list possibilities quickly; you'll test them systematically.
Task 2: Test Hypotheses with kubectl
Estimated Time: 3 minutes
Objective: Systematically test each hypothesis using kubectl commands.
Step 2.1: Check VPCAttachment Configuration
Test hypotheses 1 and 2: Wrong connection or wrong subnet.
# Get full VPCAttachment spec
kubectl get vpcattachment customer-app-vpc-server-07 -o yaml
# Check connection reference
kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.connection}'
# Expected: server-07--unbundled--leaf-04
# Check subnet reference
kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.subnet}'
# Expected: customer-app-vpc/frontend
Verify connection exists:
kubectl get connection server-07--unbundled--leaf-04 -n fab
# Should show: Connection exists
Test Result:
- Connection reference: ✅ Correct (server-07--unbundled--leaf-04)
- Subnet reference: ✅ Correct (customer-app-vpc/frontend)
Hypotheses 1 and 2: ELIMINATED
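If you prefer a single scripted check, this sketch compares both spec fields against the values expected in this scenario (the expected strings are assumptions taken from the Known Information above):
# Compare VPCAttachment spec fields against expected values
expected_conn="server-07--unbundled--leaf-04"
expected_subnet="customer-app-vpc/frontend"
actual_conn=$(kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.connection}')
actual_subnet=$(kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.subnet}')
[ "$actual_conn" = "$expected_conn" ] && echo "connection OK" || echo "connection MISMATCH: $actual_conn"
[ "$actual_subnet" = "$expected_subnet" ] && echo "subnet OK" || echo "subnet MISMATCH: $actual_subnet"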
Step 2.2: Verify Subnet Exists in VPC
Test hypothesis 2 more thoroughly:
# Check if frontend subnet exists in customer-app-vpc
kubectl get vpc customer-app-vpc -o yaml | grep -A 5 "frontend:"
# Expected output shows:
#   frontend:
#     subnet: 10.20.10.0/24
#     gateway: 10.20.10.1
#     vlan: 1025
Test Result:
- Subnet exists: ✅
- VLAN specified: 1025
Hypothesis 2: ELIMINATED (subnet exists)
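If jq is available (it is used elsewhere in this module), a structured query is less fragile than grep for this check:
# Show the frontend subnet definition as structured output
kubectl get vpc customer-app-vpc -o json | jq '.spec.subnets.frontend'
# Expect subnet 10.20.10.0/24, gateway 10.20.10.1, vlan 1025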
Step 2.3: Check Agent CRD for leaf-04
Test hypotheses 3, 4, and 5: VLAN configuration and interface state.
Identify which interface server-07 connects to:
kubectl get connection server-07--unbundled--leaf-04 -n fab -o yaml | grep "port:"
# Expected output: leaf-04/E1/8 (which maps to Ethernet8)
Check interface state in Agent CRD:
# Check if interface is up
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8.oper}'
# Expected: up
# Check which VLANs are configured
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8.vlans}'
# Look for: VLAN list
CRITICAL FINDING:
The Agent CRD shows:
{
  "oper": "up",
  "admin": "up",
  "vlans": [1020],
  ...
}
Interface is up (✅) but VLAN is 1020, not 1025!
Test Result:
- Interface oper: ✅ Up (Hypothesis 5 eliminated)
- VLAN configured: ❌ VLAN 1020, expected 1025
Hypothesis 4: CONFIRMED (VLAN mismatch)
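To make the mismatch explicit, print the expected VLAN (from the VPC spec) next to the VLANs actually programmed on the interface; the field paths follow the commands above:
# Expected VLAN, from the VPC subnet definition
kubectl get vpc customer-app-vpc -o jsonpath='{.spec.subnets.frontend.vlan}'; echo "  <- expected (VPC spec)"
# Actual VLANs on leaf-04/Ethernet8, from the Agent CRD
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8.vlans}'; echo "  <- actual (Agent CRD)"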
Step 2.4: Investigate Why VLAN is Wrong
Now that you've identified the mismatch, investigate the root cause:
# Check VPC configuration again
kubectl get vpc customer-app-vpc -o yaml | grep -A 10 "frontend:"
# Check all VLANs currently in use
kubectl get vpc -A -o yaml | grep "vlan:" | sort
# Look for VLAN 1025 usage
kubectl get vpc -A -o yaml | grep "1025"
Discovery:
Another VPC (existing-vpc-prod) is using VLAN 1025. When customer-app-vpc was created, the VLANNamespace automatically allocated VLAN 1020 instead due to the conflict!
Root Cause Identified:
VLAN conflict. The VPC subnet definition manually specifies VLAN 1025, but the VLANNamespace allocated VLAN 1020 because 1025 was already in use by another VPC.
Why kubectl describe showed no errors:
The controller successfully reconciled the VPCAttachment with VLAN 1020 (the allocated VLAN). No error occurred from the controller's perspective—the configuration just doesn't match the operator's expectation.
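To confirm which VPC actually holds VLAN 1025, a per-subnet listing is clearer than grepping raw YAML. A sketch using jq (the output format is illustrative):
# List every VPC subnet with its VLAN to find the owner of 1025
kubectl get vpc -A -o json \
  | jq -r '.items[] | .metadata.name as $vpc | .spec.subnets | to_entries[] | "\($vpc)\t\(.key)\tVLAN \(.value.vlan)"'
# Look for which VPC(s) reference VLAN 1025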
Step 2.5: Document Root Cause
Write your findings:
Root Cause:
- VLAN conflict between customer-app-vpc and existing-vpc-prod
- VPC subnet manually specified VLAN 1025
- VLANNamespace allocated VLAN 1020 instead (1025 already in use)
- Switch configured with VLAN 1020 (correct according to allocation)
- Server expects VLAN 1025 (incorrect expectation)
Evidence:
- Agent CRD shows VLAN 1020 on leaf-04/Ethernet8
- VPC spec shows VLAN 1025 in subnet definition
- existing-vpc-prod is using VLAN 1025
Solution:
- Update customer-app-vpc frontend subnet VLAN to 1020 (match allocation)
- OR choose a different unused VLAN for customer-app-vpc
Success Criteria
- ✅ Hypotheses tested systematically
- ✅ Root cause identified (VLAN mismatch due to conflict)
- ✅ Evidence documented
- ✅ Solution path clear
Time check: You should complete this in 3 minutes or less with practice.
Task 3: Validate with Grafana
Estimated Time: 1-2 minutes
Objective: Confirm findings using Grafana dashboards for visual validation.
Step 3.1: Check Interfaces Dashboard
- Open Grafana: http://localhost:3000
- Navigate to "Hedgehog Interfaces" dashboard
- Set filters:
  - Switch: leaf-04
  - Interface: Ethernet8
- Observe:
  - Operational State: Should show "up"
  - VLANs Configured: Should show VLAN 1020
  - Traffic Patterns: Should show minimal or no traffic (server can't communicate)
Expected Finding:
Grafana confirms:
- Interface is up ✅
- VLAN 1020 configured ✅
- Low or zero traffic (consistent with connectivity failure)
Step 3.2: Check Fabric Dashboard
- Navigate to "Hedgehog Fabric" dashboard
- Check BGP sessions for leaf-04
- Verify all BGP sessions are established
Expected Finding:
All BGP sessions show "established" state, confirming this is not a BGP routing issue.
Step 3.3: Correlation Check
Look at Grafana timeline:
- When was VLAN 1020 added to Ethernet8? (Should correlate with VPCAttachment creation time)
- Any interface state changes around that time? (Should show VLAN added, no flapping)
What Grafana Tells You:
- Visual confirmation of Agent CRD findings
- No intermittent issues (interface stable)
- VLAN was added successfully (just the wrong VLAN ID)
Success Criteria
- ✅ Grafana confirms VLAN 1020 configured (not 1025)
- ✅ Interface is up and stable
- ✅ BGP sessions healthy (not a routing issue)
Time check: 1-2 minutes for visual confirmation.
Task 4: Document Root Cause and Solution
Estimated Time: 1 minute
Objective: Write clear problem statement and solution for handoff or documentation.
Step 4.1: Problem Statement
Write a concise problem statement:
Problem Statement:
VPCAttachment customer-app-vpc-server-07 created successfully without errors,
but server-07 cannot communicate within VPC.
Root Cause:
VLAN conflict. VPC subnet specifies VLAN 1025, but switch interface configured
with VLAN 1020 due to conflict with existing-vpc-prod (already using VLAN 1025).
VLANNamespace automatically allocated VLAN 1020 instead.
Controller reconciled successfully with VLAN 1020 (no errors), but configuration
does not match operator expectation (VLAN 1025).
Impact:
- Server-07 has no connectivity within customer-app-vpc
- Application dependent on server-07 is down
Step 4.2: Solution Options
Document solution paths:
Option 1 (Recommended): Update VPC VLAN to match allocation
# Edit customer-app-vpc in Gitea
# Change frontend subnet VLAN from 1025 to 1020
# In Gitea: network-like-hyperscaler/vpcs/customer-app-vpc.yaml
spec:
  subnets:
    frontend:
      subnet: 10.20.10.0/24
      gateway: 10.20.10.1
      vlan: 1020  # Changed from 1025
# Commit change
git add vpcs/customer-app-vpc.yaml
git commit -m "Fix VLAN conflict: use allocated VLAN 1020 for customer-app-vpc frontend"
git push
# Wait for ArgoCD sync
kubectl get vpc customer-app-vpc -w
# Verify with kubectl events
kubectl get events --field-selector involvedObject.name=customer-app-vpc
Why recommended: Aligns configuration with reality (VLAN 1020 already allocated and configured).
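After ArgoCD syncs the change, a quick re-check closes the loop (the ping assumes shell access to server-07, as noted in the lab environment):
# Confirm the switch interface carries the now-matching VLAN 1020
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8.vlans}'
# From server-07, retest connectivity to the gateway
ping -c 3 10.20.10.1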
Option 2: Choose unused VLAN
# Check available VLANs
kubectl get vpc -A -o yaml | grep "vlan:" | sort
# Identify unused VLAN (e.g., 1026)
# Edit customer-app-vpc in Gitea
# Change frontend subnet VLAN to 1026
# Commit to Gitea, wait for sync
Why consider: If VLAN 1025 has significance (e.g., organizational standard).
Option 3: Remove VLAN specification (let VLANNamespace allocate)
# Edit customer-app-vpc in Gitea
# Remove manual VLAN specification
spec:
  subnets:
    frontend:
      subnet: 10.20.10.0/24
      gateway: 10.20.10.1
      # vlan: 1025  # Remove this line
# VLANNamespace will auto-allocate next available VLAN
Why consider: Prevents future conflicts, follows GitOps best practices.
Step 4.3: Prevention
Document how to prevent this issue in the future:
Prevention:
1. Do not manually specify VLANs in VPC subnets unless required
   - Let VLANNamespace auto-allocate to avoid conflicts
2. If manual VLAN required, check for conflicts first:
   kubectl get vpc -A -o yaml | grep "vlan:" | sort
3. Use VLANNamespace ranges to segregate VLAN usage:
   - VLANNamespace "production": 1000-1999
   - VLANNamespace "development": 2000-2999
4. Monitor kubectl events after VPC creation:
   kubectl get events --field-selector involvedObject.name=<vpc-name>
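As a sketch of prevention step 2, the helper below refuses a VLAN that any existing VPC subnet already uses (the script name and approach are illustrative, not part of Hedgehog tooling):
#!/usr/bin/env bash
# check-vlan-free.sh <vlan-id> -- exit non-zero if the VLAN is already used by a VPC subnet
set -euo pipefail
vlan="$1"
if kubectl get vpc -A -o json \
  | jq -e --argjson v "$vlan" '[.items[].spec.subnets[].vlan] | index($v)' >/dev/null; then
  echo "VLAN $vlan is already in use by an existing VPC subnet" >&2
  exit 1
fi
echo "VLAN $vlan appears to be free"
Running ./check-vlan-free.sh 1025 before committing the customer-app-vpc change would have flagged the conflict with existing-vpc-prod.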
Success Criteria
- ✅ Root cause documented clearly
- ✅ Solution options identified with pros/cons
- ✅ Next steps defined
- ✅ Prevention measures documented
Lab Summary
What You Accomplished:
You successfully diagnosed a VPCAttachment connectivity failure using systematic troubleshooting methodology:
- ✅ Gathered symptoms and formed hypotheses
- ✅ Tested hypotheses systematically using kubectl
- ✅ Identified root cause (VLAN mismatch due to conflict)
- ✅ Validated findings with Grafana
- ✅ Documented solution path and prevention measures
Key Techniques Used:
- Hypothesis-driven investigation (not random checking)
- Layered diagnostic approach (events → Agent CRD → Grafana)
- Evidence-based elimination (tested each hypothesis)
- Root cause identification (VLAN conflict, not symptoms)
Time to Resolution:
- Task 1: 2 minutes (symptoms and hypotheses)
- Task 2: 3 minutes (hypothesis testing)
- Task 3: 1-2 minutes (Grafana validation)
- Task 4: 1 minute (documentation)
- Total: 7-8 minutes from symptom to solution
Contrast with Random Checking:
Without systematic methodology, you might have:
- Checked controller logs (no useful info)
- Restarted the controller (no effect)
- Deleted and recreated VPCAttachment (same result)
- Checked BGP (not relevant)
- Escalated to support (with no evidence)
- Spent 30+ minutes without identifying root cause
Troubleshooting
Common Lab Challenges
Challenge: "All my hypotheses were eliminated, but the issue persists"
What this means: Your initial hypothesis list didn't include the actual root cause.
What to do:
- Review your evidence collection (Agent CRD, Grafana, events)
- Form new hypotheses based on what you did find
- Example: If a VLAN is configured and the interface is up, maybe the VLAN ID itself is wrong
Key insight: Hypothesis-driven investigation is iterative. Eliminating hypotheses is progress—it narrows the problem space.
Challenge: "I found the root cause but don't know how to fix it"
What this means: Diagnosis succeeded, but solution implementation is unclear.
What to do:
- Consult Module 2.2 (VPC Design Patterns) for configuration guidance
- Check Module 1.3 (GitOps Workflow) for making changes
- Reference Hedgehog documentation for CRD field definitions
Key insight: Diagnosis and resolution are separate skills. This module focuses on diagnosis—Module 4.2 covers rollback and recovery.
Challenge: "kubectl commands are slow or timing out"
What this means: The Kubernetes API server may be under load, or there may be network issues.
What to do:
- Check kubectl cluster-info and basic connectivity
- Use the --request-timeout flag to extend the timeout (see the example after this list)
- If persistent, check control node resources (CPU, memory)
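For example, kubectl's global --request-timeout flag accepts a duration:
# Allow up to 30 seconds for a slow API server response
kubectl get agents -n fab --request-timeout=30s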
Challenge: "Grafana dashboards show 'No Data'"
What this means: Telemetry may not be configured, or Prometheus/Loki may not be accessible.
What to do:
- Check if telemetry is configured in Fabricator
- Rely on kubectl and Agent CRD for this lab
- Grafana validation is optional if telemetry is not configured
Reference: Module 3.1 (Telemetry and Prometheus) for telemetry setup.
Challenge: "I'm not sure which hypothesis to test first"
What this means: You need a prioritization strategy.
What to do: Test hypotheses in this order:
1. Fastest to check: kubectl events (10 seconds)
2. Most likely: Common failure modes from Module 4.1a
3. Highest impact: Issues that would affect multiple resources
Key insight: Start with quick checks, then move to detailed investigation.
Debugging the Diagnostic Process
If you're stuck, ask yourself:
- Did I collect evidence from all four layers? Events, Agent CRD, Grafana, logs
- Am I testing hypotheses or guessing? Each hypothesis should have a specific test
- Am I documenting what I find? Write down results to avoid re-checking
- Have I used decision trees? Follow Decision Tree 3 for "VPCAttachment shows success but doesn't work"
- Am I comparing expected vs. actual? VPC expects VLAN 1025, Agent CRD shows VLAN 1020 → mismatch
Resources
Reference Documentation
Hedgehog CRD Reference:
- VPC and VPCAttachment spec fields
- Agent CRD status fields (interfaces, bgpNeighbors, platform)
- Connection CRD structure
Observability and Diagnostics:
- Module 3.1: Telemetry and Prometheus
- Module 3.2: Grafana Dashboards
- Module 3.3: Agent CRD Deep Dive
- Module 3.4: Pre-Escalation Diagnostic Checklist
GitOps Workflow:
- Module 1.3: GitOps with Hedgehog Fabric
- Module 4.2: Rollback and Recovery (upcoming)
Quick Reference: Diagnostic Commands
Layer 1: Events
# Check for Warning events
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'
# Events for specific resource
kubectl describe vpcattachment <name>
Layer 2: Agent CRD
# Check agent readiness
kubectl get agents -n fab
# View interface state
kubectl get agent <switch> -n fab -o jsonpath='{.status.state.interfaces.<interface>}' | jq
# View BGP neighbors
kubectl get agent <switch> -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq
Layer 3: Grafana
- Fabric Dashboard: http://localhost:3000/d/fabric/hedgehog-fabric
- Interfaces Dashboard: http://localhost:3000/d/interfaces/hedgehog-interfaces
- Logs Dashboard: http://localhost:3000/d/logs/hedgehog-logs
Layer 4: Logs
# Controller logs
kubectl logs -n fab deployment/fabric-controller-manager --tail=200
# Agent logs
kubectl logs -n fab <agent-pod-name>
Decision Tree Quick Reference
Use Decision Tree 1 when:
- Server cannot communicate within VPC
- VPCAttachment exists, no errors
Use Decision Tree 2 when:
- Cross-VPC connectivity fails
- VPCPeering exists
Use Decision Tree 3 when:
- kubectl describe shows success
- Server has no connectivity
- No obvious errors
Common VLAN Issues
| Symptom | Root Cause | Solution | 
|---|---|---|
| VLAN mismatch (allocated ≠ specified) | VLAN conflict | Update VPC VLAN to match allocation | 
| VLAN not configured on interface | Wrong connection reference | Fix VPCAttachment connection field | 
| VLAN configured but wrong ID | Manual VLAN specification conflict | Remove manual VLAN, let VLANNamespace allocate | 
| nativeVLAN mismatch | VPCAttachment nativeVLAN ≠ server config | Align nativeVLAN setting with server interface | 
Common BGP Issues
| Symptom | Root Cause | Solution | 
|---|---|---|
| BGP state: idle | Neighbor IP unreachable | Check ExternalAttachment switch IP and neighbor IP | 
| BGP state: active | ASN mismatch or config error | Verify ASN in ExternalAttachment matches external router | 
| BGP established but no routes | Permit list missing subnets | Update VPCPeering or ExternalPeering permit | 
| Routes filtered | Community mismatch | Check External inboundCommunity and outboundCommunity | 
Escalation Criteria
When to escalate to support:
- All decision tree paths exhausted
  - Followed relevant decision tree to the end
  - Issue doesn't match known patterns
- Evidence collected but root cause unclear
  - Completed all 4 layers of the diagnostic workflow
  - Findings don't point to a specific root cause
- Suspected platform issue
  - Agent CRD shows switch failures (PSU, temperature)
  - Controller logs show internal errors
- Time-sensitive production outage
  - Issue blocking critical services
  - Need expert assistance to resolve quickly
Before escalating, ensure you have:
- ✅ Symptoms documented
- ✅ Hypotheses tested
- ✅ Evidence collected (events, Agent CRD, Grafana, logs)
- ✅ Decision tree followed
- ✅ Relevant kubectl outputs saved
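A minimal evidence-collection sketch before opening a ticket (file and directory names are illustrative; substitute the resources involved in your incident):
mkdir -p escalation-evidence
kubectl get events -A --sort-by='.lastTimestamp' > escalation-evidence/events.txt
kubectl get vpcattachment customer-app-vpc-server-07 -o yaml > escalation-evidence/vpcattachment.yaml
kubectl get agent leaf-04 -n fab -o yaml > escalation-evidence/agent-leaf-04.yaml
kubectl logs -n fab deployment/fabric-controller-manager --tail=500 > escalation-evidence/controller-logs.txt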
Reference Module 4.3 (Coordinating with Support) for escalation procedures.
Next Steps
Module 4.2: Rollback and Recovery
Learn how to safely undo changes when things go wrong:
- GitOps rollback procedures
- Safe deletion order for Hedgehog resources
- Handling stuck resources
- Emergency recovery patterns
Module 4.3: Coordinating with Support
Learn how to work effectively with Hedgehog support:
- Crafting effective support tickets
- Providing diagnostic evidence
- Troubleshooting with support engineers
- Post-resolution follow-up
Module 4.4: Post-Incident Review
Learn how to conduct effective post-incident reviews:
- Documenting incidents
- Root cause analysis
- Prevention measures
- Knowledge sharing
Assessment
Test your understanding of systematic troubleshooting methodology.
Question 1: Troubleshooting Methodology
Scenario: Server-03 in VPC prod-vpc cannot reach server-04 in the same VPC. You've checked kubectl events (no errors) and verified both VPCAttachments exist.
What is your NEXT diagnostic step using systematic methodology?
- A) Restart the fabric controller
- B) Check Agent CRD to verify VLANs configured on both switch interfaces
- C) Escalate to support immediately
- D) Delete and recreate both VPCAttachments
Answer & Explanation
Answer: B) Check Agent CRD to verify VLANs configured on both switch interfaces
Explanation:
Following the diagnostic workflow (Layer 1: Events → Layer 2: Agent CRD):
- You've completed Layer 1 (kubectl events) - no errors found
- Next step is Layer 2 (Agent CRD) - verify switch configuration
Why B is correct:
- Agent CRD shows actual switch interface configuration
- Reveals if VLANs are properly configured
- Tests hypothesis: "VLAN configuration issue"
- Follows systematic layered approach
Example commands:
# Identify which switches server-03 and server-04 connect to
kubectl get vpcattachment prod-vpc-server-03 -o jsonpath='{.spec.connection}'
kubectl get vpcattachment prod-vpc-server-04 -o jsonpath='{.spec.connection}'
# Check Agent CRD for those switches
kubectl get agent leaf-01 -n fab -o jsonpath='{.status.state.interfaces.Ethernet5}' | jq
kubectl get agent leaf-02 -n fab -o jsonpath='{.status.state.interfaces.Ethernet6}' | jq
# Look for: VLAN configured, interface oper=up
Why others are wrong:
A) Restart controller:
- No evidence of controller failure
- Premature action without diagnosis
- Could disrupt fabric unnecessarily
- VPCAttachments exist (controller is working)
C) Escalate to support:
- Haven't completed basic troubleshooting
- No evidence collected yet from Agent CRD or Grafana
- Should check switch state first
- Escalation should be last resort after diagnostic workflow
D) Delete and recreate VPCAttachments:
- No diagnosis performed yet
- May not fix underlying issue (e.g., VLAN conflict)
- Could make troubleshooting harder (lose state)
- Action without understanding root cause
Systematic approach: Complete evidence collection (all 4 layers) before taking corrective action.
Module Reference: Module 4.1a - Concept 3: Diagnostic Workflow (Layer 2: Agent CRD Status)
Question 2: Common Failure Modes
Scenario: You observe these symptoms:
- VPCPeering between vpc-a and vpc-b exists
- kubectl describe shows no errors
- Server in vpc-a can ping its own gateway
- Server in vpc-a CANNOT ping server in vpc-b
Which failure mode is most likely?
- A) BGP peering problem (sessions down)
- B) VPC isolation settings or permit list misconfiguration
- C) Interface errors (physical layer issue)
- D) Configuration drift (GitOps reconciliation failure)
Answer & Explanation
Answer: B) VPC isolation settings or permit list misconfiguration
Explanation:
Symptoms indicate:
- ✅ Intra-VPC connectivity works (can ping gateway)
- ❌ Cross-VPC connectivity fails
- ✅ VPCPeering resource exists
- ✅ No error events
This pattern strongly suggests a permit list issue.
Why B is correct:
VPCPeering failure modes include:
- Permit list missing subnets - VPCPeering exists but doesn't include required subnets:
  spec:
    permit:
      - vpc-a: {subnets: [frontend]}  # Missing backend subnet!
        vpc-b: {subnets: [dmz]}
- VPC isolation=true without permit - Subnet marked isolated but no permit list entry:
  # In vpc-a:
  subnets:
    backend:
      isolated: true  # Isolated from other subnets
  # But VPCPeering permit doesn't include backend → blocked
- Different IPv4Namespaces - VPCPeering requires the same namespace:
  kubectl get vpc vpc-a -o jsonpath='{.spec.ipv4Namespace}'
  # Output: production
  kubectl get vpc vpc-b -o jsonpath='{.spec.ipv4Namespace}'
  # Output: development
  # DIFFERENT! VPCPeering won't work
Diagnostic steps:
# Check permit list
kubectl get vpcpeering vpc-a--vpc-b -o yaml
# Look for:
spec:
  permit:
    - vpc-a: {subnets: [...]}
      vpc-b: {subnets: [...]}
# Verify both subnets are included
# Check VPC isolation flags
kubectl get vpc vpc-a -o jsonpath='{.spec.subnets.*.isolated}'
Why others are wrong:
A) BGP peering problem:
- Intra-VPC connectivity works, so fabric underlay BGP likely up
- Would affect more than just cross-VPC traffic
- Symptoms would include gateway unreachable
- Check with: kubectl get agent <switch> -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq
C) Interface errors:
- Would affect intra-VPC connectivity too
- Server can ping gateway (interface working)
- Would see errors in Grafana Interface Dashboard
- Symptoms: intermittent failures, packet loss
D) Configuration drift:
- VPCPeering exists (not a sync issue)
- No evidence of ArgoCD OutOfSync
- kubectl describe shows no errors
- Would see Warning events if reconciliation failed
Decision Tree: Use Decision Tree 2 (Cross-VPC Connectivity Fails) from Module 4.1a.
Module Reference: Module 4.1a - Concept 2: Common Failure Modes
Question 3: Decision Trees
Scenario: Using Decision Tree 3 ("VPCAttachment Shows Success But Doesn't Work"), you've verified:
- VPCAttachment references correct connection ✅
- Subnet exists in VPC ✅
- Agent CRD shows VLAN configured on interface ✅
According to the decision tree, what should you check NEXT?
- A) Controller logs for reconciliation errors
- B) Grafana Interface Dashboard for errors
- C) nativeVLAN setting matches server expectation
- D) Escalate immediately (all checks passed)
Answer & Explanation
Answer: C) nativeVLAN setting matches server expectation
Explanation:
Decision Tree 3 path:
VPCAttachment shows success but doesn't work
  ↓
Verify connection reference ✅ (already checked)
  ↓
Verify subnet exists ✅ (already checked)
  ↓
Check Agent CRD for VLAN ✅ (already checked)
  ↓
→ Check nativeVLAN setting ← YOU ARE HERE
  ↓
Escalate if still unresolved
Why C is correct:
nativeVLAN mismatch is a common issue:
Scenario 1: VPCAttachment expects tagged, server expects untagged
# VPCAttachment
spec:
  nativeVLAN: false  # Switch sends tagged VLAN 1010 traffic
# Server interface (expects untagged)
# Interface: enp2s1 (no VLAN subinterface)
# Result: Server sees VLAN-tagged frames, doesn't process them → no connectivity
Scenario 2: VPCAttachment expects untagged, server expects tagged
# VPCAttachment
spec:
  nativeVLAN: true  # Switch sends untagged traffic
# Server interface (expects tagged)
# Interface: enp2s1.1010 (VLAN subinterface)
# Result: Server expects VLAN tag, receives untagged → no connectivity
How to check:
# VPCAttachment nativeVLAN setting
kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.nativeVLAN}'
# Output: false (tagged) or true (untagged)
# Server interface configuration (SSH to server)
ip link show
# Look for:
# enp2s1: <BROADCAST,MULTICAST,UP,LOWER_UP>  ← Untagged interface
# enp2s1.1010: <BROADCAST,MULTICAST,UP,LOWER_UP>  ← Tagged interface
# If VPCAttachment nativeVLAN=false, server should have enp2s1.1010
# If VPCAttachment nativeVLAN=true, server should have enp2s1 (no subinterface)
Resolution:
# Option 1: Update VPCAttachment to match server
# In Gitea: change nativeVLAN setting
# Option 2: Update server interface configuration
# Add VLAN subinterface or remove it to match VPCAttachment
Why others are wrong:
A) Controller logs:
- Agent CRD shows VLAN configured (reconciliation succeeded)
- Logs won't reveal server-side configuration issue
- Controller successfully applied configuration
- No error events (controller perspective is success)
B) Grafana Interface Dashboard:
- Already verified VLAN configured via Agent CRD
- Physical layer likely working (VLAN present)
- Doesn't check nativeVLAN setting (tagged vs. untagged)
- Grafana shows interface up, VLAN configured—appears healthy
D) Escalate immediately:
- Decision tree not complete yet
- One more hypothesis to test (nativeVLAN)
- Premature escalation
- Should exhaust decision tree before escalating
Best Practice:
Always check nativeVLAN when:
- VPCAttachment exists, no errors
- VLAN configured correctly
- Interface up
- Server still has no connectivity
This is a configuration mismatch between VPCAttachment and server—easy to overlook but common in practice.
Module Reference: Module 4.1a - Concept 4: Decision Trees (Decision Tree 3)
Question 4: Diagnostic Workflow
Scenario: You're investigating a connectivity issue. You've checked kubectl events (no errors) and Agent CRD (all interfaces up, VLANs configured correctly). Server still cannot communicate.
Why should you check Grafana BEFORE checking controller logs?
- A) Grafana is faster to load than kubectl logs
- B) Grafana provides visual trends and historical context (e.g., intermittent errors over time)
- C) Controller logs are unreliable
- D) Grafana is always the first troubleshooting step
Answer & Explanation
Answer: B) Grafana provides visual trends and historical context (e.g., intermittent errors over time)
Explanation:
Diagnostic Workflow Order:
1. kubectl events - Current errors and warnings (fast check)
2. Agent CRD - Current switch state (detailed config)
3. Grafana - Historical trends and visual patterns ← YOU ARE HERE
4. Controller logs - Reconciliation details (specific events)
Why Grafana before controller logs:
Grafana reveals patterns that kubectl cannot:
Example 1: Intermittent Interface Errors
Agent CRD (right now):
- Interface Ethernet8: oper=up, ine=0, oute=0
- Looks healthy!
Grafana Interface Dashboard (last 6 hours):
- 10:00 AM: 0 errors
- 11:30 AM: Spike to 10,000 input errors
- 12:00 PM: Back to 0 errors
- Pattern: Intermittent issue, not current state problem
Without Grafana: You see current state (healthy) and miss the intermittent errors.
With Grafana: You see the spike and know to investigate physical layer or MTU issues.
Example 2: BGP Flapping
Agent CRD (right now):
- BGP neighbor 172.30.128.10: state=established
- Looks healthy!
Grafana Fabric Dashboard (last 24 hours):
- BGP session up/down 15 times
- Pattern: Flapping session, unstable routing
Without Grafana: You see current state (established) and miss the instability.
With Grafana: You see the flapping and investigate route filtering or keepalive issues.
Example 3: Traffic Patterns
Agent CRD (right now):
- Interface Ethernet8: oper=up
- Counters: inb=123456, outb=654321
Grafana Interface Dashboard (last 1 hour):
- Traffic: Zero bytes in/out for entire hour
- Pattern: Interface up but no traffic (server not sending)
Without Grafana: You see non-zero counters (historical total) and assume traffic flowing.
With Grafana: You see zero current traffic and know server-side issue.
Controller logs:
- Show reconciliation events (discrete actions)
- Don't show operational metrics over time
- Useful for understanding "why did controller make this decision?"
- Not useful for "is this interface flapping over time?"
Example controller log:
2025-10-17T10:15:00Z INFO Reconciling VPCAttachment customer-app-vpc-server-07
2025-10-17T10:15:01Z INFO VLAN 1020 configured on leaf-04/Ethernet8
2025-10-17T10:15:02Z INFO Reconciliation successful
Logs show discrete reconciliation success, not ongoing operational state.
Why others are wrong:
A) Grafana is faster:
- Not the reason for ordering
- kubectl can be just as fast (or faster)
- Speed is not the primary consideration
- Order is about information type, not speed
C) Controller logs unreliable:
- False - controller logs are critical for reconciliation debugging
- Just not useful for trending/historical analysis
- Logs are reliable but serve different purpose
- Use logs for "why did controller fail?" not "is interface flapping?"
D) Grafana always first:
- False - kubectl events should be first (fastest check for errors)
- Correct order: Events → Agent CRD → Grafana → Logs
- Grafana is Layer 3, not Layer 1
- Events catch most configuration errors immediately
When to use each tool:
| Tool | When to Use | 
|---|---|
| kubectl events | First check: Are there any error events? | 
| Agent CRD | Second: What is current switch state? | 
| Grafana | Third: Are there patterns over time? Intermittent issues? | 
| Controller logs | Fourth: Why did reconciliation fail or succeed? | 
Practical Example:
Scenario: Server reports "occasional packet loss"
- kubectl events: No errors (rules out configuration issue)
- Agent CRD: Interface up, 0 errors right now (looks healthy)
- Grafana: Shows error spike every 15 minutes for last 6 hours → Root cause visible!
- Controller logs: Not needed (issue is operational, not reconciliation)
Grafana revealed the intermittent pattern that Agent CRD (current state) missed.
Module Reference: Module 4.1a - Concept 3: Diagnostic Workflow (Layer 3: Grafana Dashboards)
Conclusion
You've completed Module 4.1b: Hands-On Fabric Diagnosis Lab!
What You Learned
Practical Application:
- Applied hypothesis-driven investigation to real scenario
- Used layered diagnostic workflow (Events → Agent CRD → Grafana)
- Followed decision trees for structured diagnosis
- Identified VLAN mismatch through systematic testing
Troubleshooting Skills:
- Form and test hypotheses systematically
- Eliminate possibilities through evidence
- Document findings clearly
- Propose multiple solution options
Time Efficiency:
- Diagnosed issue in 7-8 minutes using systematic approach
- Contrast: Random checking could take 30+ minutes without success
- Systematic methodology saves time and ensures thorough diagnosis
Key Takeaways
- Systematic approach is faster - 7-8 minutes with methodology vs. 30+ minutes randomly checking 
- Hypothesis elimination is progress - Each test narrows the problem space 
- VLAN conflicts are subtle - Controller shows success but configuration doesn't match expectation 
- Documentation enables handoff - Clear problem statement and solution options support team collaboration 
- Prevention matters - Document lessons to prevent recurrence 
Troubleshooting Mindset
As you continue operating Hedgehog fabrics:
- Stay systematic: Don't jump to conclusions based on hunches
- Test hypotheses: Verify assumptions with evidence
- Document findings: Track what you've checked to avoid re-work
- Think about "why": Understand the cause, not just the symptom
- Iterate when needed: If all hypotheses eliminated, form new ones based on evidence
Course 4 Progress
Completed:
- ✅ Module 4.1a: Systematic Troubleshooting Framework
- ✅ Module 4.1b: Hands-On Fabric Diagnosis Lab
Up Next:
- Module 4.2: Rollback and Recovery (safe undo procedures, handling stuck resources)
- Module 4.3: Coordinating with Support (effective tickets, working with engineers)
- Module 4.4: Post-Incident Review (documentation, prevention, knowledge sharing)
You're now equipped to diagnose fabric issues systematically and confidently. See you in Module 4.2!
