Pre-Support Diagnostic Checklist
Introduction
You've completed Modules 3.1-3.3, learning to:
- Query Prometheus metrics
- Interpret Grafana dashboards
- Check kubectl events and Agent CRD status
These skills enable you to self-resolve many common issues: VLAN conflicts, connection misconfigurations, subnet overlaps, and invalid CIDRs. When Grafana shows no traffic and kubectl events reveal a VLAN conflict, you know to choose a different VLAN. When a VPCAttachment fails because a connection doesn't exist, you fix the connection name. These are configuration errors—fixable with your observability toolkit.
But sometimes, despite your troubleshooting, you'll encounter issues that require support escalation:
- Controller crashes repeatedly
- Agent disconnects without clear cause
- Switch hardware failures
- VPC appears successful but traffic doesn't flow
- Performance degradation with unknown root cause
- Unexpected fabric behavior that doesn't match configuration
The Critical Question
When do I escalate versus keep troubleshooting?
And when I escalate, what information does support need?
This module answers both questions. You'll learn a systematic approach to pre-escalation diagnostics that ensures you've tried basic troubleshooting, collected comprehensive evidence, and can make confident escalation decisions. You'll practice the complete workflow: checklist execution, diagnostic collection, escalation decision-making, and support ticket writing.
The Support Philosophy
"Support is not a last resort—it's a strategic resource."
Escalating early with good diagnostics is better than spending hours on an unsolvable problem. Support teams are partners in reliability, not judges of your troubleshooting ability. The key is knowing when to escalate and providing the right data when you do.
This module teaches you to:
- Try first - Execute systematic health check to identify configuration issues
- Collect diagnostics - Gather comprehensive evidence (kubectl + Grafana)
- Escalate confidently - Make informed escalation decisions with complete data
What You'll Learn
- Pre-escalation diagnostic checklist - Systematic 6-step health check before escalation
- Evidence collection - What to gather (kubectl, logs, events, Agent status, Grafana screenshots)
- Escalation triggers - When to escalate versus self-resolve
- Support ticket best practices - Clear problem statement with relevant diagnostics
- Diagnostic package creation - Bundle diagnostics for support team
- Support boundaries - Configuration issues versus bugs
The Integration
This module brings together all Course 3 skills:
Module 3.1 (Prometheus) ──┐
Module 3.2 (Grafana)    ──┼──>  Complete
Module 3.3 (kubectl)    ──┘     Diagnostic
                                Workflow
You'll use Prometheus queries, Grafana dashboards, kubectl events, and Agent CRD status together in a systematic escalation workflow. This integrated approach is what separates novice operators from experienced ones who can confidently navigate the boundary between self-resolution and escalation.
Learning Objectives
By the end of this module, you will be able to:
- Execute diagnostic checklist - Run comprehensive 6-step health check before escalation
- Collect fabric diagnostics systematically - Gather kubectl outputs, logs, events, Agent status, and Grafana screenshots
- Identify escalation triggers - Determine when to escalate versus self-resolve
- Write effective support tickets - Document issues clearly with relevant diagnostics
- Package diagnostics - Create diagnostic bundle for support team
- Understand support boundaries - Know what support can help with versus operational configuration issues
Prerequisites
- Module 3.1 completion (Fabric Telemetry Overview)
- Module 3.2 completion (Dashboard Interpretation)
- Module 3.3 completion (Events & Status Monitoring)
- Understanding of kubectl, Grafana, and Prometheus
- Experience with VPC troubleshooting from Course 2
Scenario: Complete Pre-Escalation Diagnostic Collection
A user reports that their newly created VPC customer-app-vpc with VPCAttachment for server-07 isn't working. Server-07 cannot reach other servers in the VPC. You'll execute the full diagnostic checklist, collect evidence, and determine if this requires escalation or self-resolution.
Environment Access:
- Grafana: http://localhost:3000
- Prometheus: http://localhost:9090
- kubectl: Already configured
Task 1: Execute Pre-Escalation Checklist (3 minutes)
Objective: Systematically check fabric health using the 6-step checklist
Step 1.1: Kubernetes Resource Health
# Check VPC and VPCAttachment exist
kubectl get vpc customer-app-vpc
kubectl get vpcattachment -A | grep server-07
# Check controller pods
kubectl get pods -n fab
Look for: All pods in Running state. Pods in CrashLoopBackOff or Error → escalation trigger. If controller is crashing, escalate immediately.
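If the namespace has many pods, a quick filter makes unhealthy ones stand out. A minimal sketch (it drops rows whose STATUS column is Running or Completed, so CrashLoopBackOff, Error, and Pending remain visible):
# Show only pods whose STATUS is not Running/Completed
kubectl get pods -n fab --no-headers | grep -vE 'Running|Completed' || echo "All controller pods healthy"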
Step 1.2: Check Events
# Check VPC and VPCAttachment events
kubectl get events --field-selector involvedObject.name=customer-app-vpc
kubectl describe vpcattachment customer-app-vpc--server-07
Look for:
- ✅ Normal events: Created, VNIAllocated, SubnetsConfigured, Ready
- ❌ Warning events: ValidationFailed, VLANConflict, SubnetOverlap, DependencyMissing
Decision: Warning events with clear config errors → self-resolve. No error events but issue persists → escalate.
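If you want only the Warning events for the affected resources, field selectors can be combined; a small sketch using this lab's resource names:
# Warning events only, newest last, scoped to the VPC; then a fabric-wide pass for the server
kubectl get events --field-selector type=Warning,involvedObject.name=customer-app-vpc --sort-by='.lastTimestamp'
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | grep server-07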
Step 1.3: Agent Status
# Find switch for server-07
kubectl get connections -n fab | grep server-07
# Check agent and interface
kubectl get agent leaf-04 -n fab
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8}' | jq
Look for:
- ✅ oper: "up", low error counters
- ❌ oper: "down" or high errors → check cabling
Step 1.4: BGP Health
# Check BGP neighbors
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq
# Alternative: Query Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=bgp_neighbor_state{state!="established"}' | jq
Look for: All neighbors state: "established". If down, check spine connectivity.
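To surface only problem sessions, filter the neighbor map with jq. A sketch, assuming bgpNeighbors is a flat map keyed by peer address with a state field; if your Agent CRD version nests neighbors per VRF, add one more level to the filter:
# Show only BGP neighbors that are not established (empty output means all sessions are healthy)
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq 'with_entries(select(.value.state != "established"))'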
Step 1.5: Grafana Dashboard Review
Review dashboards at http://localhost:3000:
- Fabric Dashboard: BGP sessions, VPC count
- Interfaces Dashboard: leaf-04/Ethernet8 state, VLAN, traffic
- Platform Dashboard: CPU/memory, temperature, PSU/fans
- Logs Dashboard: ERROR logs
- Critical Resources Dashboard: ASIC resource utilization
Step 1.6: Controller and Agent Logs
# Controller logs
kubectl logs -n fab deployment/fabric-controller-manager --tail=200 | grep -i error
# If crashed, get previous logs
kubectl logs -n fab deployment/fabric-controller-manager --previous
Look for: ERROR/PANIC messages, reconciliation loops. Escalate if found.
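A quick way to triage a large log is to count repeated messages; a tight reconciliation loop shows up as the same error line appearing hundreds of times. A minimal sketch:
# Summarize the most frequent ERROR/PANIC lines (a heavily repeated line suggests a reconciliation loop)
kubectl logs -n fab deployment/fabric-controller-manager --tail=2000 | grep -iE 'error|panic' | sort | uniq -c | sort -rn | head -10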
Checklist Outcomes:
A) Self-Resolve: Clear config errors (VLAN conflict, connection not found, subnet overlap) → Fix in Gitea
B) Escalate: Controller crashes, agent issues, no errors but doesn't work, hardware failures → Collect diagnostics
Success Criteria:
- ✅ Executed all 6 steps
- ✅ Identified self-resolve vs escalation
- ✅ Documented findings
Task 2: Collect Diagnostic Evidence (2-3 minutes)
Objective: Gather complete diagnostic bundle for support
Quick Collection Script:
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTDIR="hedgehog-diagnostics-${TIMESTAMP}"
mkdir -p ${OUTDIR}
# Problem description
echo "Issue: server-07 cannot reach other servers in customer-app-vpc" > ${OUTDIR}/diagnostic-timestamp.txt
date >> ${OUTDIR}/diagnostic-timestamp.txt
# Collect CRDs
kubectl get vpc,vpcattachment,vpcpeering -A -o yaml > ${OUTDIR}/crds-vpc.yaml
kubectl get switches,servers,connections -n fab -o yaml > ${OUTDIR}/crds-wiring.yaml
kubectl get agents -n fab -o yaml > ${OUTDIR}/crds-agents.yaml
# Collect events and logs
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > ${OUTDIR}/events.log
kubectl logs -n fab deployment/fabric-controller-manager --tail=2000 > ${OUTDIR}/controller.log
kubectl logs -n fab deployment/fabric-controller-manager --previous > ${OUTDIR}/controller-previous.log 2>/dev/null || echo "No previous logs" > ${OUTDIR}/controller-previous.log
# Collect status and versions
kubectl get pods -n fab > ${OUTDIR}/pods-fab.txt
kubectl version > ${OUTDIR}/kubectl-version.txt
kubectl get agents -n fab -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.version}{"\n"}{end}' > ${OUTDIR}/agent-versions.txt
# Collect agent details for affected switch (leaf-04)
kubectl get agent leaf-04 -n fab -o yaml > ${OUTDIR}/agent-leaf-04.yaml
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces}' | jq > ${OUTDIR}/agent-leaf-04-interfaces.json
# Compress
tar czf ${OUTDIR}.tar.gz ${OUTDIR}
echo "Created: ${OUTDIR}.tar.gz"
Collect Grafana Screenshots:
- Navigate to "Hedgehog Interfaces" dashboard, filter leaf-04/Ethernet8, screenshot
- Navigate to "Hedgehog Logs" dashboard, filter leaf-04, screenshot errors
- Add to bundle:
cp grafana-*.png ${OUTDIR}/ && tar czf ${OUTDIR}.tar.gz ${OUTDIR}
Success Criteria:
- ✅ Diagnostic bundle created with CRDs, events, logs, Agent status, versions
- ✅ Grafana screenshots captured
- ✅ Bundle includes problem statement
Task 3: Determine Escalation Decision (1 minute)
Objective: Decide if issue requires support escalation
Self-Resolve (Configuration Errors):
- VLAN conflict, subnet overlap, connection not found, invalid CIDR, missing VPC → Fix in Gitea, commit
Escalate (System/Unexpected Issues):
- Controller crashes, agent disconnects, no errors but doesn't work, hardware failures, performance issues
Document Decision:
echo "Decision: [Self-Resolve/Escalate]" >> ${OUTDIR}/diagnostic-timestamp.txt
echo "Reason: [Explanation]" >> ${OUTDIR}/diagnostic-timestamp.txt
Success Criteria:
- ✅ Clear decision documented with reasoning
Task 4: Write Support Ticket (2 minutes)
Objective: Document issue clearly for support
Support Ticket Template:
# Issue Summary
**Problem:** [One-sentence description]
**Severity:** [P1-P4]
**Impacted Resources:** [VPCs, servers, switches]
**Started:** [Timestamp UTC]
**Recent Changes:** [What changed?]
# Environment
**Hedgehog Version:** Controller image, Agent versions
**Fabric Topology:** Spine/leaf count
**Kubernetes Version:** [version]
# Symptoms
**Observable:** [What you see in Grafana/kubectl/logs]
**Expected:** [What should happen]
**Actual:** [What is happening]
# Diagnostics Completed
**Checklist:** [✅ kubectl events, Agent status, BGP health, Grafana, logs, CRDs]
**Troubleshooting Attempted:** [What you tried and results]
**Findings:** [What you discovered]
# Attachments
**Diagnostic Bundle:** hedgehog-diagnostics-[timestamp].tar.gz
**Grafana Screenshots:** [filenames with descriptions]
# Impact
**Users Affected:** [number/type]
**Business Impact:** [critical/non-critical]
**Workaround:** [if any]
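For reference, a hypothetical filled-in Issue Summary for this lab's scenario (all values illustrative):
# Issue Summary
**Problem:** server-07 cannot reach other servers in customer-app-vpc even though the VPCAttachment reports Ready
**Severity:** P2
**Impacted Resources:** VPC customer-app-vpc, VPCAttachment customer-app-vpc--server-07, switch leaf-04 (Ethernet8)
**Started:** 14:30 UTC on the day of the report (illustrative)
**Recent Changes:** VPC and VPCAttachment created ~30 minutes before the report; no other fabric changes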
Save ticket:
# Add to bundle
cp support-ticket.md ${OUTDIR}/
tar czf ${OUTDIR}.tar.gz ${OUTDIR}
Key Principles:
- Be concise and specific
- Show your troubleshooting work
- Attach complete diagnostics
- Avoid speculation, stick to observations
- Professional tone
Success Criteria:
- ✅ Clear problem statement with diagnostics
- ✅ Troubleshooting documented
- ✅ Professional, specific tone
Lab Summary
What You Accomplished:
You completed the full pre-escalation diagnostic workflow:
- ✅ Executed systematic 6-step health check (Kubernetes, events, agents, BGP, Grafana, logs)
- ✅ Collected comprehensive diagnostics (CRDs, events, logs, Agent status, versions, topology)
- ✅ Made escalation decision (self-resolve vs escalate)
- ✅ Wrote effective support ticket (clear, concise, with evidence)
- ✅ Created diagnostic bundle for support (compressed, complete, well-organized)
Key Takeaways:
- Checklist ensures thoroughness - Don't skip steps, even if you think you know the issue
- Evidence collection is systematic - kubectl + Grafana + logs provide complete picture
- Escalation triggers are clear - Configuration errors = self-resolve, unexpected behavior = escalate
- Support tickets need specifics - Problem statement + diagnostics + troubleshooting attempts
- Diagnostic automation saves time - Script collection vs manual (see Concept 5)
- Early escalation with good data is better than endless troubleshooting
Real-World Application:
This workflow applies to production fabric operations:
- When on-call and encountering fabric issues
- When users report VPC connectivity problems
- When Grafana alerts fire for fabric components
- When changes don't produce expected results
The diagnostic checklist and support ticket template are directly usable in production.
Concepts & Deep Dive
Concept 1: Pre-Escalation Diagnostic Checklist
The 6-step checklist ensures systematic troubleshooting before escalation.
Why Checklists? Prevent skipped steps, reduce cognitive load, ensure consistent quality.
The 6 Steps:
- Kubernetes Resources: Verify VPCs, Agents, controller pods exist and running
- Events: Check for Warning events indicating config errors vs bugs
- Agent Status: Confirm agents connected, Ready, recent heartbeat
- BGP Health: All sessions established (use kubectl or Prometheus)
- Grafana: Check Fabric, Interfaces, Platform, Logs, Critical Resources dashboards
- Logs: Check controller/agent logs for ERROR/PANIC messages
Decision Matrix:
| Indicator | Self-Resolve | Escalate |
|---|---|---|
| Events | VLAN conflict, connection not found | No events but issue persists |
| Pods | Running | CrashLoopBackOff |
| Agents | Ready, recent heartbeat | Disconnecting repeatedly |
| BGP | Established | Flapping without cause |
| Logs | Config validation errors | PANIC, fatal errors |
Using the Checklist: Execute all steps → Document findings → Correlate data → Make decision → Act
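The kubectl portions of the checklist (steps 1-4 and 6) also lend themselves to a single wrapper pass, leaving only the Grafana review manual. A minimal sketch, assuming the fab namespace and the bgpNeighbors layout described in Step 1.4:
#!/bin/bash
# Quick health pass over checklist steps 1-4 and 6 (step 5, Grafana review, stays manual)
set -u
NS=fab
echo "== 1. Resources and controller pods =="
kubectl get vpc,vpcattachment -A
kubectl get pods -n ${NS} --no-headers | grep -vE 'Running|Completed' || echo "All pods Running"
echo "== 2. Warning events (newest last) =="
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20
echo "== 3. Agent status =="
kubectl get agents -n ${NS}
echo "== 4. BGP neighbors not established =="
for agent in $(kubectl get agents -n ${NS} -o name); do
  echo "--- ${agent}"
  kubectl get ${agent} -n ${NS} -o jsonpath='{.status.state.bgpNeighbors}' | jq 'with_entries(select(.value.state != "established"))'
done
echo "== 6. Controller ERROR/PANIC line count (last 200 lines) =="
kubectl logs -n ${NS} deployment/fabric-controller-manager --tail=200 | grep -icE 'error|panic'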
Concept 2: Evidence Collection
Complete diagnostic package includes:
- Problem Statement: What broke, when, what changed, expected vs actual
- CRDs: Full YAML of VPCs, wiring, agents, namespaces
- Events: All events sorted by timestamp, Warning events separately
- Logs: Controller logs (current + previous if crashed)
- Agent Status: Per-switch YAML with BGP, interfaces, platform health
- Versions: Kubernetes, controller, agent versions
- Topology: Switches, connections, servers
- Grafana Screenshots: Interfaces, Logs, Fabric, Platform dashboards
- Timestamp: When diagnostics collected
Package it:
tar czf hedgehog-diagnostics-$(date +%Y%m%d-%H%M%S).tar.gz hedgehog-diagnostics-*/
Don't include: Secrets/credentials (redact), unrelated resources, excessive logs, binaries
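Before attaching the bundle, verify what actually went into it and that nothing sensitive slipped through. A small sketch, assuming the OUTDIR variable from the collection script is still set:
# List the archive contents, then scan the raw files for obvious secrets before sending
tar tzf ${OUTDIR}.tar.gz
grep -ril 'password\|token\|secret' ${OUTDIR}/ || echo "No obvious secrets found"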
Concept 3: Escalation Triggers
ESCALATE When:
- Controller Issues: CrashLoopBackOff, PANIC/fatal errors, repeated restarts, reconciliation loops
- Agent Issues: Repeated disconnects (not reboot), can't connect (switch reachable), unexpected errors
- Hardware Failures: Switch not booting, kernel panics, persistent sensor errors
- Unexpected Behavior: Ready status but doesn't work, config not applied, impossible metrics, BGP flapping
- Performance Issues: Degradation without capacity issues, slow reconciliation, API unresponsive
- Data Plane: Traffic not forwarding despite correct config, unexplained packet loss, VXLAN issues
DO NOT ESCALATE (Self-Resolve):
- Config Errors: VLAN conflict, subnet overlap, invalid CIDR, connection not found, missing VPC
- Expected Warnings: Unused interfaces down, VPC deletion blocked by attachments, validation errors
- Operational Questions: How to configure, best practices, capacity → Check docs
Decision Tree:
Issue → Checklist → Clear config error? ── YES → Self-resolve
                        │ NO
                        ↓
                    Controller/Agent crash? ── YES → Escalate
                        │ NO
                        ↓
                    Hardware failure? ── YES → Escalate
                        │ NO
                        ↓
                    Config correct but not working? ── YES → Escalate
                        │ NO
                        ↓
                    Operational question? ── YES → Check docs
Rule: If events/logs clearly explain (config error) → Self-resolve. If unexpected errors or no explanation → Escalate.
Concept 4: Support Ticket Best Practices
Essential Components:
- Clear Summary: One-sentence problem statement (specific, not vague)
- Severity: P1 (critical) → P4 (question) with business impact
- Impacted Resources: Exact VPCs, servers, switches
- Timeline: When started, what changed
- Symptoms: Observable evidence (Grafana/kubectl/logs)
- Expected vs Actual: What should vs does happen
- Troubleshooting: What you tried and results
- Diagnostics: Attached bundle with contents list
- Environment: Versions (controller, agents, k8s), topology
- Impact: Users affected, business impact, workarounds
Key Principles:
- Be concise and specific
- Show your work
- Attach complete diagnostics
- Report observations, not speculation
- Professional tone
Common Mistakes to Avoid:
- ❌ Vague: "It's not working" → ✅ Specific: "Server-03 cannot reach VPC gateway 10.10.1.1"
- ❌ Missing diagnostics → ✅ Attached bundle
- ❌ No troubleshooting → ✅ Documented attempts
- ❌ Emotional language → ✅ Professional tone
Concept 5: Diagnostic Collection Automation
Automate collection for consistency and speed (30 seconds vs 5-10 minutes manual).
Basic Script Structure:
#!/bin/bash
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTDIR="hedgehog-diagnostics-${TIMESTAMP}"
mkdir -p ${OUTDIR}
# Problem description prompt
read -p "Problem description: " DESC
echo "Issue: ${DESC}" > ${OUTDIR}/diagnostic-timestamp.txt
date >> ${OUTDIR}/diagnostic-timestamp.txt
# Collect CRDs, events, logs
kubectl get vpc,vpcattachment,vpcpeering -A -o yaml > ${OUTDIR}/crds-vpc.yaml 2>&1
kubectl get switches,servers,connections -n fab -o yaml > ${OUTDIR}/crds-wiring.yaml 2>&1
kubectl get agents -n fab -o yaml > ${OUTDIR}/crds-agents.yaml 2>&1
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > ${OUTDIR}/events.log 2>&1
kubectl logs -n fab deployment/fabric-controller-manager --tail=2000 > ${OUTDIR}/controller.log 2>&1
# Versions and status
kubectl version > ${OUTDIR}/kubectl-version.txt 2>&1
kubectl get pods -n fab > ${OUTDIR}/pods-fab.txt 2>&1
kubectl get agents -n fab -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.version}{"\n"}{end}' > ${OUTDIR}/agent-versions.txt 2>&1
# Per-switch details (prompt for affected switches)
read -p "Affected switches (comma-separated): " SWITCHES
for switch in ${SWITCHES//,/ }; do
kubectl get agent ${switch} -n fab -o yaml > ${OUTDIR}/agent-${switch}.yaml 2>&1
kubectl get agent ${switch} -n fab -o jsonpath='{.status.state.interfaces}' | jq > ${OUTDIR}/agent-${switch}-interfaces.json 2>&1
done
# Compress
tar czf ${OUTDIR}.tar.gz ${OUTDIR}
echo "Created: ${OUTDIR}.tar.gz ($(du -h ${OUTDIR}.tar.gz | cut -f1))"
Benefits: Consistency, completeness, speed, error handling
Redacting Secrets:
find ${OUTDIR} -type f -exec sed -i 's/password:.*/password: REDACTED/g' {} \;
find ${OUTDIR} -type f -exec sed -i 's/token:.*/token: REDACTED/g' {} \;
Troubleshooting
Issue: Diagnostic Bundle Too Large
Symptom: Bundle exceeds upload limit (>10 MB)
Solutions:
- Reduce log tail: --tail=500 instead of --tail=2000
- Split into multiple bundles (CRDs, logs, agent status separately)
- Upload to cloud storage and share link
Issue: kubectl Commands Timing Out
Symptom: Script hangs on kubectl commands
Solutions:
- Add timeout: kubectl get vpc -A -o yaml --request-timeout=30s
- Collect only critical resources
- Increase cluster timeout in kubeconfig
Issue: Previous Controller Logs Not Available
Symptom: kubectl logs --previous fails
Cause: Controller hasn't crashed (expected)
Solution: Script handles gracefully with 2>/dev/null || echo "No previous logs"
Issue: Agent CRD Has No Status
Symptom: kubectl get agent <switch> shows status: {}
Troubleshooting:
- Check if agent pod exists: kubectl get pods -n fab | grep agent-<switch>
- Check switch registered: kubectl get switch <switch> -n fab
- Check boot logs: hhfab vlab serial <switch>
- Check fabric-boot: kubectl logs -n fab deployment/fabric-boot
Solution: Wait for registration or investigate boot failure
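To catch the moment registration completes, watch the Agent resources or poll the status field. A small sketch (substitute the affected switch name):
# Watch Agent resources appear and update; Ctrl-C once the affected switch is listed
kubectl get agents -n fab -w
# Or poll every 10s until the Agent status is populated
until kubectl get agent <switch> -n fab -o jsonpath='{.status.state}' | grep -q .; do
  echo "waiting for agent status..."; sleep 10
done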
Issue: Redacting Sensitive Information
Automated redaction:
find ${DIR} -type f -exec sed -i 's/password:.*/password: REDACTED/g' {} \;
find ${DIR} -type f -exec sed -i 's/token:.*/token: REDACTED/g' {} \;
Manual review: grep -ri "password\|token\|secret" hedgehog-diagnostics-*/
Issue: Support Requests Additional Data
Common requests:
- Time-range metrics: curl 'http://localhost:9090/api/v1/query_range?query=...'
- Switch CLI output: hhfab vlab ssh leaf-04 "show running-config"
- Detailed interface stats: kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8}'
Best practice: Keep original bundle, add supplemental files to ticket
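If support asks for time-range metrics, the Prometheus query_range API needs explicit start, end, and step parameters. A sketch using the bgp_neighbor_state metric referenced earlier in this module (adjust the window to bracket the incident; the date -d syntax assumes GNU date):
# Export one hour of BGP neighbor state at 1-minute resolution and attach it to the ticket
curl -s 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=bgp_neighbor_state' \
  --data-urlencode "start=$(date -u -d '1 hour ago' +%s)" \
  --data-urlencode "end=$(date -u +%s)" \
  --data-urlencode 'step=60s' | jq . > bgp-neighbor-state-1h.json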
Resources
Kubernetes Documentation
Hedgehog Documentation
- Hedgehog CRD Reference
- Hedgehog Observability Guide
- Hedgehog Fabric Controller Documentation
Related Modules
- Previous: Module 3.3: Events & Status Monitoring
- Pathway: Network Like a Hyperscaler
Diagnostic Collection Quick Reference
6-Step Checklist:
# 1. Kubernetes Resource Health
kubectl get vpc,vpcattachment -A
kubectl get switches,agents -n fab
kubectl get pods -n fab
# 2. Events (Last Hour)
kubectl get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp'
# 3. Switch Agent Status
kubectl get agents -n fab
kubectl get agent <switch> -n fab -o yaml
# 4. BGP Health
kubectl get agent <switch> -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq
# 5. Grafana Dashboards
# - Fabric Dashboard (BGP, VPC health)
# - Interfaces Dashboard (interface state, errors)
# - Platform Dashboard (hardware health)
# - Logs Dashboard (error logs)
# 6. Controller Logs
kubectl logs -n fab deployment/fabric-controller-manager --tail=200 | grep -i error
Diagnostic Collection:
# Use automated script
./diagnostic-collect.sh
# OR manual collection
OUTDIR="hedgehog-diagnostics-$(date +%Y%m%d-%H%M%S)"
mkdir -p ${OUTDIR}
kubectl get vpc,vpcattachment -A -o yaml > ${OUTDIR}/crds-vpc.yaml
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > ${OUTDIR}/events.log
kubectl logs -n fab deployment/fabric-controller-manager --tail=2000 > ${OUTDIR}/controller.log
kubectl get agent <switch> -n fab -o yaml > ${OUTDIR}/agent-<switch>.yaml
tar czf ${OUTDIR}.tar.gz ${OUTDIR}
Course 3 Complete! You've mastered Hedgehog fabric observability and diagnostic collection. You can now:
- Monitor fabric health proactively (Grafana dashboards)
- Troubleshoot issues systematically (kubectl + Grafana)
- Collect comprehensive diagnostics (6-step checklist)
- Make confident escalation decisions (self-resolve vs escalate)
- Write effective support tickets (clear, specific, with evidence)
Your Next Steps:
- Apply this workflow in your Hedgehog fabric operations
- Customize the diagnostic script for your environment
- Build confidence in escalation decisions
- Partner with support as a strategic resource for reliability
You are now equipped with the complete observability and escalation toolkit for Hedgehog fabric operations.