Systematic Troubleshooting Framework
Introduction
You've designed VPCs, attached servers, validated connectivity, and monitored fabric health. Your Hedgehog fabric is operational.
Then one morning: "Server-07 can't reach the database. The application is down."
You could panic, or you could apply systematic troubleshooting methodology. That discipline is what separates experienced operators from beginners.
Beginners randomly check things, hoping to stumble on the issue. Experts use hypothesis-driven investigation, eliminating possibilities systematically until they identify the root cause.
Building on Pre-Escalation Diagnostics
In Module 3.4, you learned the pre-escalation diagnostic checklist—collecting evidence for support tickets. That checklist taught you what to collect: events, Agent CRD status, Grafana metrics, and logs.
Module 4.1a teaches you how to diagnose issues independently using that evidence:
- Form hypotheses based on symptoms
- Test hypotheses with kubectl and Grafana
- Eliminate possibilities systematically
- Identify root cause with confidence
What You'll Learn
Troubleshooting Methodology:
- Hypothesis-driven investigation framework
- Systematic evidence collection order
- Root cause identification through elimination
Common Failure Modes:
- VPC attachment issues (incorrect subnet, connection, or VLAN)
- BGP peering problems (Agent CRD state inspection)
- Interface errors (Grafana Interface Dashboard analysis)
- Configuration drift (GitOps reconciliation failures)
Diagnostic Workflow:
- Layer 1: Kubernetes events (fastest check)
- Layer 2: Agent CRD status (detailed switch state)
- Layer 3: Grafana dashboards (visual metrics and trends)
- Layer 4: Controller logs (reconciliation details)
Decision Trees:
- Server cannot communicate in VPC
- Cross-VPC connectivity fails
- VPCAttachment shows success but doesn't work
The Scenario
You'll learn a systematic methodology for diagnosing connectivity failures. The framework you master here applies to any fabric issue you encounter.
This module focuses on the methodology itself—how to think systematically about troubleshooting. In Module 4.1b, you'll apply this methodology in a hands-on troubleshooting lab.
Learning Objectives
By the end of this module, you will be able to:
- Apply systematic troubleshooting methodology - Use hypothesis-driven investigation to diagnose fabric issues
- Identify common failure modes - Recognize VPC attachment issues, BGP problems, interface errors, and configuration drift
- Construct diagnostic workflows - Build investigation paths from events → Agent CRDs → Grafana metrics → logs
- Differentiate issue types - Distinguish between configuration, connectivity, and platform issues
- Use decision trees for diagnosis - Follow structured decision paths to identify root causes
Prerequisites
Before starting this module, you should have:
Completed Courses:
- Course 1: Foundations & Interfaces (kubectl proficiency, GitOps workflow)
- Course 2: Provisioning & Connectivity (VPC and VPCAttachment creation experience)
- Course 3: Observability & Fabric Health (Prometheus metrics, Grafana dashboards, Agent CRD inspection, diagnostic checklist)
Understanding:
- How VPCs and VPCAttachments should work (expected behavior)
- Agent CRD structure and status fields
- Grafana dashboard navigation
- kubectl events and describe commands
Environment:
- kubectl configured and authenticated
- Grafana access (http://localhost:3000)
- Hedgehog fabric with at least one VPC deployed
Scenario
Incident Report:
A developer reports that server-07 in VPC customer-app-vpc cannot reach the gateway or other servers in the VPC. The VPCAttachment was created this morning using the standard GitOps workflow.
Initial Investigation:
You run kubectl describe vpcattachment customer-app-vpc-server-07 and see no error events. The resource exists, the controller processed it successfully, and there are no warnings.
But the server still has no connectivity.
Your Task:
Use systematic troubleshooting methodology to identify the root cause and document a solution.
Known Information:
- VPC: customer-app-vpc
- Subnet: frontend (10.20.10.0/24, gateway 10.20.10.1, VLAN 1025)
- Server: server-07
- Expected Connection: server-07--unbundled--leaf-04
- VPCAttachment: customer-app-vpc-server-07
- Symptom: Server cannot ping gateway 10.20.10.1
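A quick first pass captures the scenario's initial evidence in one go. A minimal sketch, using the resource names from the incident report (adjust the namespace to your install):

```bash
# First-pass evidence collection for the server-07 incident
kubectl describe vpcattachment customer-app-vpc-server-07            # any Warning events?
kubectl get connection server-07--unbundled--leaf-04 -n fab          # does the expected connection exist?
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
```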
Core Concepts & Deep Dive
Concept 1: Troubleshooting Methodology (Hypothesis-Driven Investigation)
The Problem with Random Checking
Beginners troubleshoot by checking things randomly:
- "Let me check the controller logs"
- "Maybe it's a BGP issue"
- "Did the switch reboot?"
- "Is DNS working?"
This wastes time and overlooks patterns that a structured approach would expose. Worse, you might check dozens of things without finding the root cause, or accidentally stumble on the solution without understanding why it works.
Systematic Approach: Hypothesis-Driven Investigation
Professional troubleshooting follows a structured methodology:
Step 1: Gather Symptoms
Document what you know:
- What is not working? (Specific behavior: "server-07 cannot ping 10.20.10.1")
- What is the expected behavior? (Server should ping gateway successfully)
- When did it start? (This morning after VPCAttachment creation)
- What changed recently? (VPCAttachment created, committed to Git, synced by ArgoCD)
Step 2: Form Hypotheses
Based on symptoms, generate possible causes:
- Configuration error: VLAN mismatch, wrong connection reference, incorrect subnet
- Connectivity issue: BGP session down, interface down, physical layer problem
- Platform issue: Switch failure, resource exhaustion, controller error
Step 3: Test Hypotheses
For each hypothesis, design a specific test:
- If VLAN mismatch → Check Agent CRD interface configuration
- If BGP down → Check Agent CRD bgpNeighbors state
- If interface down → Check Grafana Interface Dashboard
Step 4: Eliminate or Confirm
Each test either:
- Eliminates hypothesis (BGP is established → not a BGP issue)
- Confirms hypothesis (VLAN 1010 expected, VLAN 1020 configured → VLAN mismatch confirmed!)
Step 5: Identify Root Cause
When one hypothesis is confirmed and others eliminated:
- Root cause identified
- Solution becomes clear
- You can document findings with confidence
Example Walkthrough
Symptoms:
- Server-07 in VPC webapp-vpc, subnet frontend
- Expected: ping 10.0.10.1 (gateway)
- Actual: "Destination Host Unreachable"
- Recent change: VPCAttachment created today
Hypotheses:
- VLAN not configured on switch interface (configuration issue)
- Interface is down (connectivity issue)
- Server network interface misconfigured (server issue)
- BGP session down (routing issue)
Tests:
Hypothesis 1: VLAN not configured
# Check Agent CRD for leaf-04 (server-07 connects to leaf-04)
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8}' | jq
# Look for: vlans field contains 1010
Result: VLAN 1010 is configured on Ethernet8 ✅ (Hypothesis 1 eliminated)
Hypothesis 2: Interface down
# Check Grafana Interface Dashboard for leaf-04/Ethernet8
# Or check Agent CRD:
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8.oper}'
# Look for: oper=up
Result: Ethernet8 oper=up ✅ (Hypothesis 2 eliminated)
Hypothesis 3: Server misconfigured
# SSH to server-07
ip link show enp2s1.1010
# Look for: interface exists and is up
Result: Server interface enp2s1.1010 is up and tagged ✅ (Hypothesis 3 eliminated)
Hypothesis 4: BGP down
# Check Agent CRD bgpNeighbors
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.bgpNeighbors.default}' | jq
# Look for: all neighbors state=established
Result: All BGP sessions established ✅ (Hypothesis 4 eliminated)
Wait—all tests passed, but still broken?
When all initial hypotheses are eliminated, revise your hypothesis list. This is where systematic methodology shines: you don't give up or escalate prematurely. You form new hypotheses based on evidence.
Revised Hypothesis: VPCAttachment references wrong connection.
Test:
kubectl get vpcattachment webapp-vpc-server-07 -o jsonpath='{.spec.connection}'
# Output: server-07--unbundled--leaf-05
Root Cause Found: VPCAttachment references leaf-05 but server-07 actually connects to leaf-04.
Solution: Update VPCAttachment connection reference in Gitea to server-07--unbundled--leaf-04, commit, wait for ArgoCD sync.
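The fix lands in Git, not in the cluster. A sketch of what the corrected manifest and commit might look like, assuming a repo layout with a vpcattachments/ directory (the path and apiVersion are illustrative and may differ in your environment; the connection and subnet fields match the spec fields used above):

```bash
cat > vpcattachments/webapp-vpc-server-07.yaml <<'EOF'
apiVersion: vpc.githedgehog.com/v1beta1   # version may vary by Fabric release
kind: VPCAttachment
metadata:
  name: webapp-vpc-server-07
spec:
  connection: server-07--unbundled--leaf-04   # was leaf-05: the root cause
  subnet: webapp-vpc/frontend
EOF
git add vpcattachments/webapp-vpc-server-07.yaml
git commit -m "Fix server-07 attachment: connects to leaf-04, not leaf-05"
git push    # ArgoCD picks up the commit and reconciles the change
```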
Why This Methodology Works
- Eliminates guesswork: You test, don't guess
- Builds evidence: Each test adds to your knowledge
- Enables explanation: You can document why the issue occurred
- Transfers to new scenarios: The methodology applies to issues you've never seen before
- Supports escalation: If you must escalate, you provide tested hypotheses and evidence
Concept 2: Common Failure Modes
Knowing common failure patterns helps you form hypotheses quickly. Here are the most frequent issues in Hedgehog fabrics:
Failure Mode 1: VPC Attachment Issues
Symptoms:
- VPCAttachment created, no error events in kubectl describe
- Server cannot communicate within VPC (cannot ping gateway or other servers)
Common Root Causes:
Wrong connection name
- VPCAttachment references incorrect Connection CRD
- Example: VPCAttachment specifies leaf-05 but server connects to leaf-04
Wrong subnet
- VPCAttachment references non-existent subnet
- Example: VPCAttachment specifies vpc-1/database but VPC only has vpc-1/frontend
VLAN conflict
- VPC subnet specifies VLAN already in use
- VLANNamespace allocates different VLAN
- Configuration and actual allocation mismatch
nativeVLAN mismatch
- VPCAttachment expects untagged VLAN (nativeVLAN: true)
- Server expects tagged VLAN (interface enp2s1.1010)
- Or vice versa
Diagnostic Path:
Check VPCAttachment spec
↓
Verify connection exists: kubectl get connection <name> -n fab
↓
Verify subnet exists in VPC: kubectl get vpc <vpc> -o yaml | grep <subnet>
↓
Check Agent CRD for switch interface: Is VLAN configured?
↓
Check Grafana Interface Dashboard: Is interface up?
↓
Check VLAN ID matches expected
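The same path expressed as runnable commands, a sketch with placeholder names in angle brackets:

```bash
kubectl get vpcattachment <name> -o jsonpath='{.spec.connection}{"\n"}{.spec.subnet}{"\n"}'
kubectl get connection <connection-name> -n fab            # referenced connection exists?
kubectl get vpc <vpc-name> -o yaml | grep -A3 '<subnet>'   # subnet defined in the VPC?
kubectl get agent <leaf> -n fab \
  -o jsonpath='{.status.state.interfaces.<port>}' | jq     # oper=up? VLAN present and correct?
```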
Resolution:
- Update VPCAttachment connection or subnet reference in Git
- Fix VPC VLAN assignment if conflict exists
- Adjust nativeVLAN setting to match server configuration
Failure Mode 2: BGP Peering Problems
Symptoms:
- Cross-VPC connectivity fails (VPCPeering exists)
- External connectivity fails (ExternalPeering exists)
- Grafana Fabric Dashboard shows BGP sessions down
Common Root Causes:
BGP session not established
- Neighbor IP unreachable
- ASN mismatch
- Interface IP misconfigured
Route not advertised
- VPCPeering permit list misconfiguration
- ExternalPeering prefix filter blocks routes
Community mismatch
- External community filter blocks routes
- Inbound/outbound community misconfigured
Diagnostic Path:
Check Grafana Fabric Dashboard: Which BGP sessions down?
↓
Check Agent CRD bgpNeighbors for affected switch
↓
Identify neighbor IP and state (idle, active, established)
↓
If state != established:
- Check ExternalAttachment neighbor IP/ASN
- Check switch interface IP configured correctly
- Check external router reachability
↓
If state = established but routes missing:
- Check VPCPeering permit list
- Check ExternalPeering prefix filters
- Check community configuration
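A sketch of the two key checks (placeholder names; this assumes neighbors under bgpNeighbors.default are keyed by peer IP, with the state field the module's Agent CRD examples use):

```bash
# Which neighbors are not established?
kubectl get agent <leaf> -n fab \
  -o jsonpath='{.status.state.bgpNeighbors.default}' \
  | jq 'to_entries[] | {neighbor: .key, state: .value.state}'
# If routes are missing despite established sessions, review the permit list
kubectl get vpcpeering <peering-name> -o yaml | grep -B1 -A5 permit
```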
Resolution:
- Fix neighbor IP or ASN in ExternalAttachment
- Update VPCPeering permit list to include required subnets
- Verify ExternalPeering prefix filters allow expected routes
- Correct community configuration in External resource
Failure Mode 3: Interface Errors
Symptoms:
- Intermittent connectivity failures
- Packet loss
- Grafana Interface Dashboard shows errors
Common Root Causes:
Physical layer issue
- Bad cable, SFP, or switch port
- Dirty fiber optic connector
MTU mismatch
- Server configured with MTU 9000
- Switch configured with MTU 1500
- Fragmentation issues
Congestion
- Traffic exceeds interface capacity
- No QoS configured
Configuration error
- Speed/duplex mismatch
- LACP misconfiguration on bundled/MCLAG connection
Diagnostic Path:
Check Grafana Interface Dashboard: Which interface has errors?
↓
Check Agent CRD interface counters: ine (input errors), oute (output errors)
↓
Check interface speed and MTU configuration
↓
Check physical layer: SFP status, cable integrity
↓
If congestion: Check traffic patterns in Grafana
↓
If LACP issue: Check server LACP configuration
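To pull the counters and MTU the path refers to, a sketch (port and server interface names are placeholders):

```bash
# Error counters and MTU as the switch sees them
kubectl get agent <leaf> -n fab \
  -o jsonpath='{.status.state.interfaces.<port>}' | jq    # inspect ine, oute, mtu, speed
# Compare against the server side
ip -d link show <server-iface>                             # check mtu and link state
```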
Resolution:
- Replace faulty cable or SFP
- Align MTU configuration across server and switch
- Implement QoS or upgrade link capacity
- Fix speed/duplex or LACP configuration
Failure Mode 4: Configuration Drift (GitOps Reconciliation Failures)
Symptoms:
- Git shows correct configuration
- kubectl shows different configuration
- ArgoCD shows "OutOfSync" status
Common Root Causes:
Manual kubectl changes
- Someone applied changes directly (bypassing Git)
- ArgoCD detects drift
ArgoCD sync disabled
- Auto-sync turned off manually
- Changes in Git not applied
Git webhook broken
- ArgoCD not notified of commits
- Manual sync required
Reconciliation error
- Controller failed to apply configuration
- Dependency missing or validation failed
Diagnostic Path:
Check ArgoCD application status: Synced or OutOfSync?
↓
Compare Git YAML to kubectl get <resource> -o yaml
↓
Check ArgoCD sync history: Last successful sync?
↓
Check controller logs for reconciliation errors
↓
Check kubectl events for VPC/VPCAttachment errors
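Drift is easiest to confirm by diffing the live object against the manifest in your Git checkout. A sketch; the file path and ArgoCD application name are assumptions:

```bash
kubectl diff -f vpcs/webapp-vpc.yaml      # non-empty output means the live object drifted from Git
argocd app get fabric-config              # look for Sync Status: Synced or OutOfSync
kubectl logs -n fab deployment/fabric-controller-manager --tail=200 | grep -i reconcile
```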
Resolution:
- Revert manual kubectl changes, re-sync from Git
- Re-enable ArgoCD auto-sync
- Fix Git webhook configuration
- Resolve dependency or validation errors in Git configuration
Concept 3: Diagnostic Workflow (Evidence Collection Order)
When facing an issue, collect evidence in this order for maximum efficiency:
Layer 1: Kubernetes Events (Fastest Check)
Why first? Events quickly reveal configuration errors, validation failures, and reconciliation issues.
# Check for recent Warning events
kubectl get events --field-selector type=Warning --sort-by='.lastTimestamp'
# Check events for specific resource
kubectl describe vpc webapp-vpc
kubectl describe vpcattachment webapp-vpc-server-07
Look for:
- VLANConflict, SubnetOverlap (configuration errors)
- DependencyMissing (VPC/Connection not found)
- ReconcileFailed (controller errors)
- ValidationFailed (spec validation errors)
What events tell you:
- Whether the controller accepted the configuration
- If validation passed
- If dependencies exist
- If reconciliation succeeded
Time investment: 10-30 seconds
When to move on: If no Warning events exist, proceed to Layer 2.
Layer 2: Agent CRD Status (Detailed Switch State)
Why second? Agent CRD is the source of truth for actual switch configuration and operational state.
# Check agent readiness
kubectl get agents -n fab
# View detailed switch state
kubectl get agent leaf-04 -n fab -o yaml
# Check BGP neighbors
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq
# Check interface state
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8}' | jq
Look for:
- BGP neighbor state: established vs idle or active
- Interface oper state: up vs down
- VLAN configuration: which VLANs are configured on which interfaces
- Interface counters: ine (input errors), oute (output errors)
- Platform health: PSU status, fan speeds, temperature
What Agent CRD tells you:
- Actual switch configuration (not just desired state)
- Operational status (up/down, established/idle)
- Performance metrics (errors, counters)
- Hardware health
Time investment: 1-2 minutes
When to move on: If Agent CRD shows expected configuration and healthy state, proceed to Layer 3.
Layer 3: Grafana Dashboards (Visual Metrics and Trends)
Why third? Grafana provides visual trends and historical context that kubectl cannot.
Fabric Dashboard:
- BGP session health (all established?)
- BGP session flapping (repeated up/down transitions)
- VPC count (as expected?)
Interfaces Dashboard:
- Interface operational state (up/down over time)
- Error rates (input errors, output errors)
- Traffic patterns (bandwidth utilization, spikes, drops)
- Packet counters (unicast, broadcast, multicast)
Logs Dashboard:
- Recent ERROR logs (switch or controller)
- Filter by switch name for targeted investigation
- Syslog patterns (BGP state changes, interface flaps)
Look for:
- Interfaces with high error rates (physical layer issues)
- BGP sessions flapping (routing instability)
- Traffic patterns indicating congestion or no traffic
- Log patterns correlating with symptom timing
What Grafana tells you:
- Historical trends (when did it start?)
- Intermittent issues (not visible in current state)
- Correlation between events (BGP flap coincides with interface errors)
Time investment: 2-3 minutes
When to move on: If Grafana confirms hypotheses or reveals new patterns, validate with logs.
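When a dashboard panel needs confirmation, you can also query the underlying datasource directly. A sketch, assuming Prometheus listens on localhost:9090; the metric name is hypothetical, so browse your datasource for the real series:

```bash
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(interface_in_errors_total{switch="leaf-04"}[5m])' \
  | jq '.data.result'
```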
Layer 4: Controller Logs (Reconciliation Details)
Why fourth? Controller logs explain why reconciliation succeeded or failed.
# Check controller logs for specific resource
kubectl logs -n fab deployment/fabric-controller-manager | grep webapp-vpc-server-07
# Check for errors
kubectl logs -n fab deployment/fabric-controller-manager --tail=200 | grep -i error
# Follow logs live
kubectl logs -n fab deployment/fabric-controller-manager -f
Look for:
- Reconciliation loops (same resource reconciling repeatedly)
- Errors during configuration application
- Validation failures
- Dependency resolution issues
What controller logs tell you:
- Why controller made specific decisions
- Why configuration wasn't applied
- Timing of reconciliation attempts
Time investment: 2-5 minutes
When to use:
- Events show ReconcileFailed but reason unclear
- Configuration should be applied but isn't
- Debugging GitOps sync issues
Summary: Layered Diagnostic Approach
| Layer | Tool | Purpose | Time | When to Use |
|---|---|---|---|---|
| 1 | kubectl events | Quick error check | 10-30s | Always first |
| 2 | Agent CRD | Switch state verification | 1-2min | After events |
| 3 | Grafana | Trends and historical context | 2-3min | For intermittent issues or validation |
| 4 | Controller logs | Reconciliation debugging | 2-5min | When reconciliation fails or behavior unclear |
Total time for full diagnostic: 5-10 minutes
This layered approach ensures you:
- Start with fastest checks (events)
- Progress to detailed state (Agent CRD)
- Validate with trends (Grafana)
- Deep-dive only when necessary (logs)
Concept 4: Decision Trees for Common Scenarios
Decision trees provide structured diagnostic paths for frequent issues. Follow the tree from top to bottom, testing at each decision point.
Decision Tree 1: Server Cannot Communicate in VPC
Server cannot ping gateway or other servers in VPC
│
├─ Check kubectl events for VPCAttachment
│ │
│ ├─ Warning events present?
│ │ ├─ YES → Read error message, fix configuration
│ │ └─ NO → Continue investigation
│
├─ Check Agent CRD for switch
│ │
│ ├─ Is interface oper=up?
│ │ ├─ NO → Physical layer issue (cable, SFP, port)
│ │ └─ YES → Continue
│ │
│ ├─ Is VLAN configured on interface?
│ │ ├─ NO → VPCAttachment wrong connection or reconciliation failed
│ │ └─ YES → Continue
│ │
│ ├─ Is VLAN ID correct?
│ │ ├─ NO → Configuration error (VLAN mismatch or conflict)
│ │ └─ YES → Continue
│
├─ Check server network interface
│ │
│ ├─ Is interface up and tagged with VLAN?
│ │ ├─ NO → Server configuration issue
│ │ └─ YES → Continue
│
├─ Check Grafana Interface Dashboard
│ │
│ ├─ High error rates on interface?
│ │ ├─ YES → Physical layer issue or MTU mismatch
│ │ └─ NO → Escalate (unexpected issue)
Example Usage:
- Server-07 cannot ping gateway 10.20.10.1
- Check events: No warnings → Continue
- Check Agent CRD leaf-04 Ethernet8: oper=up → Continue
- Check VLAN on Ethernet8: VLAN 1020 configured
- VPC expects VLAN 1025 → VLAN mismatch identified!
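Decision Tree 1 translates naturally into a short script. A sketch using the scenario's names (adjust for your fabric):

```bash
#!/usr/bin/env bash
ATTACHMENT=customer-app-vpc-server-07
LEAF=leaf-04
PORT=Ethernet8

kubectl describe vpcattachment "$ATTACHMENT" | grep -A5 Events       # step 1: Warning events?
echo "oper=$(kubectl get agent "$LEAF" -n fab \
  -o jsonpath="{.status.state.interfaces.$PORT.oper}")"              # step 2: interface up?
kubectl get agent "$LEAF" -n fab \
  -o jsonpath="{.status.state.interfaces.$PORT}" | jq                # step 3: VLAN present? correct ID?
```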
Decision Tree 2: Cross-VPC Connectivity Fails (VPCPeering Exists)
Server in VPC-1 cannot reach server in VPC-2
│
├─ Check kubectl events for VPCPeering
│ │
│ ├─ Warning events?
│ │ ├─ YES → Configuration error in permit list
│ │ └─ NO → Continue
│
├─ Verify VPCPeering permit includes both subnets
│ │
│ ├─ Permit list missing subnet?
│ │ ├─ YES → Add subnet to permit list
│ │ └─ NO → Continue
│
├─ Check both VPCs in same IPv4Namespace
│ │
│ ├─ Different namespaces?
│ │ ├─ YES → Configuration error (VPCPeering requires same namespace)
│ │ └─ NO → Continue
│
├─ Check BGP sessions in Grafana Fabric Dashboard
│ │
│ ├─ BGP sessions down?
│ │ ├─ YES → Investigate BGP issue (Agent CRD bgpNeighbors)
│ │ └─ NO → Continue
│
├─ Check VPC subnet isolation flags
│ │
│ ├─ isolated: true but no permit?
│ │ ├─ YES → Add permit list entry
│ │ └─ NO → Escalate (unexpected issue)
Example Usage:
- VPC-1 server cannot reach VPC-2 server
- Check events: No warnings → Continue
- Check VPCPeering permit list: Only includes VPC-1 frontend, missing VPC-2 backend
- Permit list incomplete → Add VPC-2 backend to permit
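The two checks that resolve most Tree 2 cases, as commands (peering and VPC names are placeholders; the ipv4Namespace field name is an assumption):

```bash
kubectl get vpcpeering <peering-name> -o yaml | grep -B1 -A6 permit   # both subnets listed?
kubectl get vpc <vpc-1> <vpc-2> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ipv4Namespace}{"\n"}{end}'
```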
Decision Tree 3: VPCAttachment Shows Success But Doesn't Work
kubectl describe shows no errors, but server has no connectivity
│
├─ Verify VPCAttachment references correct connection
│ │
│ ├─ kubectl get vpcattachment <name> -o jsonpath='{.spec.connection}'
│ │
│ ├─ Connection matches server's actual connection?
│ │ ├─ NO → Root cause found: Wrong connection reference
│ │ └─ YES → Continue
│
├─ Verify subnet exists in VPC
│ │
│ ├─ kubectl get vpc <vpc> -o yaml | grep <subnet>
│ │
│ ├─ Subnet exists?
│ │ ├─ NO → Root cause found: Subnet doesn't exist
│ │ └─ YES → Continue
│
├─ Check Agent CRD for switch
│ │
│ ├─ VLAN configured on interface?
│ │ ├─ NO → Reconciliation failed (check controller logs)
│ │ └─ YES → Continue
│
├─ Check nativeVLAN setting
│ │
│ ├─ VPCAttachment nativeVLAN matches server expectation?
│ │ ├─ NO → Root cause found: nativeVLAN mismatch
│ │ └─ YES → Escalate (unexpected issue)
Example Usage:
- VPCAttachment created, no events, but server has no connectivity
- Check connection: server-07--unbundled--leaf-04 → Matches
- Check subnet exists: customer-app-vpc/frontend → Exists
- Check Agent CRD: VLAN 1020 configured
- VPC expects VLAN 1025 → VLAN mismatch (allocation conflict)
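For the nativeVLAN branch at the end of Tree 3, a sketch of the comparison (the server interface name follows the module's earlier examples):

```bash
kubectl get vpcattachment customer-app-vpc-server-07 -o jsonpath='{.spec.nativeVLAN}{"\n"}'
# On the server: a dotted sub-interface (e.g. enp2s1.1025) means it expects tagged frames,
# which conflicts with nativeVLAN: true on the attachment
ip -d link show | grep -E 'enp2s1\.'
```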
Conclusion
You've mastered the systematic troubleshooting framework for Hedgehog fabric operations!
What You Learned
Troubleshooting Methodology:
- Hypothesis-driven investigation framework
- Evidence-based elimination of possibilities
- Root cause identification with confidence
Common Failure Modes:
- VPC attachment issues (VLAN conflicts, wrong connections)
- BGP peering problems (permit lists, communities)
- Interface errors (physical layer, MTU, congestion)
- Configuration drift (GitOps sync issues)
Diagnostic Workflow:
- Layer 1: kubectl events (fastest check)
- Layer 2: Agent CRD (detailed switch state)
- Layer 3: Grafana (trends and historical context)
- Layer 4: Controller logs (reconciliation details)
Decision Trees:
- Server cannot communicate in VPC
- Cross-VPC connectivity fails
- VPCAttachment shows success but doesn't work
Key Takeaways
Systematic approach beats random checking - Hypothesis-driven investigation saves time and ensures thorough diagnosis
Evidence collection order matters - Start fast (events), then detailed (Agent CRD), then historical (Grafana)
"Success" doesn't mean "working" - No error events doesn't guarantee correct configuration (e.g., VLAN mismatch)
Decision trees provide structure - Follow diagnostic paths for common scenarios to avoid missing steps
Root cause identification requires testing - Eliminate possibilities until one remains, don't assume
Next Steps
In Module 4.1b: Hands-On Fabric Diagnosis Lab, you'll apply this methodology in a real troubleshooting scenario. You'll:
- Use the layered diagnostic approach
- Apply decision trees to identify root cause
- Practice hypothesis-driven investigation
- Document findings and solutions
The framework you learned here is the foundation for all troubleshooting work. Master it now, and you'll diagnose issues confidently in any scenario.
You're now equipped with systematic troubleshooting methodology. See you in Module 4.1b for hands-on practice!