Pre-Support Diagnostic Checklist

intermediate 15 minutes

hedgehog fabric kubernetes diagnostics troubleshooting support escalation

Introduction

You've completed Modules 3.1-3.3, learning to:

Query Prometheus metrics
Interpret Grafana dashboards
Check kubectl events and Agent CRD status

These skills enable you to self-resolve many common issues: VLAN conflicts, connection misconfigurations, subnet overlaps, and invalid CIDRs. When Grafana shows no traffic and kubectl events reveal a VLAN conflict, you know to choose a different VLAN. When a VPCAttachment fails because a connection doesn't exist, you fix the connection name. These are configuration errors—fixable with your observability toolkit.

But sometimes, despite your troubleshooting, you'll encounter issues that require support escalation:

Controller crashes repeatedly
Agent disconnects without clear cause
Switch hardware failures
VPC appears successful but traffic doesn't flow
Performance degradation with unknown root cause
Unexpected fabric behavior that doesn't match configuration

The Critical Question

When do I escalate versus keep troubleshooting?

And when I escalate, what information does support need?

This module answers both questions. You'll learn a systematic approach to pre-escalation diagnostics that ensures you've tried basic troubleshooting, collected comprehensive evidence, and can make confident escalation decisions. You'll practice the complete workflow: checklist execution, diagnostic collection, escalation decision-making, and support ticket writing.

The Support Philosophy

"Support is not a last resort—it's a strategic resource."

Escalating early with good diagnostics is better than spending hours on an unsolvable problem. Support teams are partners in reliability, not judges of your troubleshooting ability. The key is knowing when to escalate and providing the right data when you do.

This module teaches you to:

Try first - Execute systematic health check to identify configuration issues
Collect diagnostics - Gather comprehensive evidence (kubectl + Grafana)
Escalate confidently - Make informed escalation decisions with complete data

What You'll Learn

Pre-escalation diagnostic checklist - Systematic 6-step health check before escalation
Evidence collection - What to gather (kubectl, logs, events, Agent status, Grafana screenshots)
Escalation triggers - When to escalate versus self-resolve
Support ticket best practices - Clear problem statement with relevant diagnostics
Diagnostic package creation - Bundle diagnostics for support team
Support boundaries - Configuration issues versus bugs

The Integration

This module brings together all Course 3 skills:

Module 3.1 (Prometheus)  ──┐
Module 3.2 (Grafana)     ──┼──> Complete
Module 3.3 (kubectl)     ──┘     Diagnostic
                                 Workflow

You'll use Prometheus queries, Grafana dashboards, kubectl events, and Agent CRD status together in a systematic escalation workflow. This integrated approach is what separates novice operators from experienced ones who can confidently navigate the boundary between self-resolution and escalation.

Learning Objectives

By the end of this module, you will be able to:

Execute diagnostic checklist - Run comprehensive 6-step health check before escalation
Collect fabric diagnostics systematically - Gather kubectl outputs, logs, events, Agent status, and Grafana screenshots
Identify escalation triggers - Determine when to escalate versus self-resolve
Write effective support tickets - Document issues clearly with relevant diagnostics
Package diagnostics - Create diagnostic bundle for support team
Understand support boundaries - Know what support can help with versus operational configuration issues

Prerequisites

Module 3.1 completion (Fabric Telemetry Overview)
Module 3.2 completion (Dashboard Interpretation)
Module 3.3 completion (Events & Status Monitoring)
Understanding of kubectl, Grafana, and Prometheus
Experience with VPC troubleshooting from Course 2

Scenario: Complete Pre-Escalation Diagnostic Collection

A user reports that their newly created VPC customer-app-vpc with VPCAttachment for server-07 isn't working. Server-07 cannot reach other servers in the VPC. You'll execute the full diagnostic checklist, collect evidence, and determine if this requires escalation or self-resolution.

Environment Access:

Grafana: http://localhost:3000
Prometheus: http://localhost:9090
kubectl: Already configured

Task 1: Execute Pre-Escalation Checklist (3 minutes)

Objective: Systematically check fabric health using the 6-step checklist

Step 1.1: Kubernetes Resource Health

# Check VPC and VPCAttachment exist
kubectl get vpc customer-app-vpc
kubectl get vpcattachment -A | grep server-07

# Check controller pods
kubectl get pods -n fab

Look for: All pods in Running state. Pods in CrashLoopBackOff or Error → escalation trigger. If controller is crashing, escalate immediately.

Step 1.2: Check Events

# Check VPC and VPCAttachment events
kubectl get events --field-selector involvedObject.name=customer-app-vpc
kubectl describe vpcattachment customer-app-vpc--server-07

Look for:

✅ Normal events: Created, VNIAllocated, SubnetsConfigured, Ready
❌ Warning events: ValidationFailed, VLANConflict, SubnetOverlap, DependencyMissing

Decision: Warning events with clear config errors → self-resolve. No error events but issue persists → escalate.

Step 1.3: Agent Status

# Find switch for server-07
kubectl get connections -n fab | grep server-07

# Check agent and interface
kubectl get agent leaf-04 -n fab
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces.Ethernet8}' | jq

Look for:

✅ oper: "up", low error counters
❌ oper: "down" or high errors → check cabling

Step 1.4: BGP Health

# Check BGP neighbors
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq

# Alternative: Query Prometheus
curl -s 'http://localhost:9090/api/v1/query?query=bgp_neighbor_state{state!="established"}' | jq

Look for: All neighbors state: "established". If down, check spine connectivity.

Step 1.5: Grafana Dashboard Review

Review dashboards at http://localhost:3000:

Fabric Dashboard: BGP sessions, VPC count
Interfaces Dashboard: leaf-04/Ethernet8 state, VLAN, traffic
Platform Dashboard: CPU/memory, temperature, PSU/fans
Logs Dashboard: ERROR logs
Critical Resources Dashboard: ASIC resource utilization

Step 1.6: Controller and Agent Logs

# Controller logs
kubectl logs -n fab deployment/fabric-controller-manager --tail=200 | grep -i error

# If crashed, get previous logs
kubectl logs -n fab deployment/fabric-controller-manager --previous

Look for: ERROR/PANIC messages, reconciliation loops. Escalate if found.

Checklist Outcomes:

A) Self-Resolve: Clear config errors (VLAN conflict, connection not found, subnet overlap) → Fix in Gitea

B) Escalate: Controller crashes, agent issues, no errors but doesn't work, hardware failures → Collect diagnostics

Success Criteria:

✅ Executed all 6 steps
✅ Identified self-resolve vs escalation
✅ Documented findings

Task 2: Collect Diagnostic Evidence (2-3 minutes)

Objective: Gather complete diagnostic bundle for support

Quick Collection Script:

TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTDIR="hedgehog-diagnostics-${TIMESTAMP}"
mkdir -p ${OUTDIR}

# Problem description
echo "Issue: server-07 cannot reach other servers in customer-app-vpc" > ${OUTDIR}/diagnostic-timestamp.txt
date >> ${OUTDIR}/diagnostic-timestamp.txt

# Collect CRDs
kubectl get vpc,vpcattachment,vpcpeering -A -o yaml > ${OUTDIR}/crds-vpc.yaml
kubectl get switches,servers,connections -n fab -o yaml > ${OUTDIR}/crds-wiring.yaml
kubectl get agents -n fab -o yaml > ${OUTDIR}/crds-agents.yaml

# Collect events and logs
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > ${OUTDIR}/events.log
kubectl logs -n fab deployment/fabric-controller-manager --tail=2000 > ${OUTDIR}/controller.log
kubectl logs -n fab deployment/fabric-controller-manager --previous > ${OUTDIR}/controller-previous.log 2>/dev/null || echo "No previous logs" > ${OUTDIR}/controller-previous.log

# Collect status and versions
kubectl get pods -n fab > ${OUTDIR}/pods-fab.txt
kubectl version > ${OUTDIR}/kubectl-version.txt
kubectl get agents -n fab -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.version}{"\n"}{end}' > ${OUTDIR}/agent-versions.txt

# Collect agent details for affected switch (leaf-04)
kubectl get agent leaf-04 -n fab -o yaml > ${OUTDIR}/agent-leaf-04.yaml
kubectl get agent leaf-04 -n fab -o jsonpath='{.status.state.interfaces}' | jq > ${OUTDIR}/agent-leaf-04-interfaces.json

# Compress
tar czf ${OUTDIR}.tar.gz ${OUTDIR}
echo "Created: ${OUTDIR}.tar.gz"

Collect Grafana Screenshots:

Navigate to "Hedgehog Interfaces" dashboard, filter leaf-04/Ethernet8, screenshot
Navigate to "Hedgehog Logs" dashboard, filter leaf-04, screenshot errors
Add to bundle: cp grafana-*.png ${OUTDIR}/ && tar czf ${OUTDIR}.tar.gz ${OUTDIR}

Success Criteria:

✅ Diagnostic bundle created with CRDs, events, logs, Agent status, versions
✅ Grafana screenshots captured
✅ Bundle includes problem statement

Task 3: Determine Escalation Decision (1 minute)

Objective: Decide if issue requires support escalation

Self-Resolve (Configuration Errors):

VLAN conflict, subnet overlap, connection not found, invalid CIDR, missing VPC → Fix in Gitea, commit

Escalate (System/Unexpected Issues):

Controller crashes, agent disconnects, no errors but doesn't work, hardware failures, performance issues

Document Decision:

echo "Decision: [Self-Resolve/Escalate]" >> ${OUTDIR}/diagnostic-timestamp.txt
echo "Reason: [Explanation]" >> ${OUTDIR}/diagnostic-timestamp.txt

Success Criteria:

✅ Clear decision documented with reasoning

Task 4: Write Support Ticket (2 minutes)

Objective: Document issue clearly for support

Support Ticket Template:

# Issue Summary
**Problem:** [One-sentence description]
**Severity:** [P1-P4]
**Impacted Resources:** [VPCs, servers, switches]
**Started:** [Timestamp UTC]
**Recent Changes:** [What changed?]

# Environment
**Hedgehog Version:** Controller image, Agent versions
**Fabric Topology:** Spine/leaf count
**Kubernetes Version:** [version]

# Symptoms
**Observable:** [What you see in Grafana/kubectl/logs]
**Expected:** [What should happen]
**Actual:** [What is happening]

# Diagnostics Completed
**Checklist:** [✅ kubectl events, Agent status, BGP health, Grafana, logs, CRDs]
**Troubleshooting Attempted:** [What you tried and results]
**Findings:** [What you discovered]

# Attachments
**Diagnostic Bundle:** hedgehog-diagnostics-[timestamp].tar.gz
**Grafana Screenshots:** [filenames with descriptions]

# Impact
**Users Affected:** [number/type]
**Business Impact:** [critical/non-critical]
**Workaround:** [if any]

Save ticket:

# Add to bundle
cp support-ticket.md ${OUTDIR}/
tar czf ${OUTDIR}.tar.gz ${OUTDIR}

Key Principles:

Be concise and specific
Show your troubleshooting work
Attach complete diagnostics
Avoid speculation, stick to observations
Professional tone

Success Criteria:

✅ Clear problem statement with diagnostics
✅ Troubleshooting documented
✅ Professional, specific tone

Lab Summary

What You Accomplished:

You completed the full pre-escalation diagnostic workflow:

✅ Executed systematic 6-step health check (Kubernetes, events, agents, BGP, Grafana, logs)
✅ Collected comprehensive diagnostics (CRDs, events, logs, Agent status, versions, topology)
✅ Made escalation decision (self-resolve vs escalate)
✅ Wrote effective support ticket (clear, concise, with evidence)
✅ Created diagnostic bundle for support (compressed, complete, well-organized)

Key Takeaways:

Checklist ensures thoroughness - Don't skip steps, even if you think you know the issue
Evidence collection is systematic - kubectl + Grafana + logs provide complete picture
Escalation triggers are clear - Configuration errors = self-resolve, unexpected behavior = escalate
Support tickets need specifics - Problem statement + diagnostics + troubleshooting attempts
Diagnostic automation saves time - Script collection vs manual (see Concept 5)
Early escalation with good data is better than endless troubleshooting

Real-World Application:

This workflow applies to production fabric operations:

When on-call and encountering fabric issues
When users report VPC connectivity problems
When Grafana alerts fire for fabric components
When changes don't produce expected results

The diagnostic checklist and support ticket template are directly usable in production.

Concepts & Deep Dive

Concept 1: Pre-Escalation Diagnostic Checklist

The 6-step checklist ensures systematic troubleshooting before escalation.

Why Checklists? Prevent skipped steps, reduce cognitive load, ensure consistent quality.

The 6 Steps:

Kubernetes Resources: Verify VPCs, Agents, controller pods exist and running
Events: Check for Warning events indicating config errors vs bugs
Agent Status: Confirm agents connected, Ready, recent heartbeat
BGP Health: All sessions established (use kubectl or Prometheus)
Grafana: Check Fabric, Interfaces, Platform, Logs, Critical Resources dashboards
Logs: Check controller/agent logs for ERROR/PANIC messages

Decision Matrix:

Indicator	Self-Resolve	Escalate
Events	VLAN conflict, connection not found	No events but issue persists
Pods	Running	CrashLoopBackOff
Agents	Ready, recent heartbeat	Disconnecting repeatedly
BGP	Established	Flapping without cause
Logs	Config validation errors	PANIC, fatal errors

Using the Checklist: Execute all steps → Document findings → Correlate data → Make decision → Act

Concept 2: Evidence Collection

Complete diagnostic package includes:

Problem Statement: What broke, when, what changed, expected vs actual
CRDs: Full YAML of VPCs, wiring, agents, namespaces
Events: All events sorted by timestamp, Warning events separately
Logs: Controller logs (current + previous if crashed)
Agent Status: Per-switch YAML with BGP, interfaces, platform health
Versions: Kubernetes, controller, agent versions
Topology: Switches, connections, servers
Grafana Screenshots: Interfaces, Logs, Fabric, Platform dashboards
Timestamp: When diagnostics collected

Package it:

tar czf hedgehog-diagnostics-$(date +%Y%m%d-%H%M%S).tar.gz hedgehog-diagnostics-*/

Don't include: Secrets/credentials (redact), unrelated resources, excessive logs, binaries

Concept 3: Escalation Triggers

ESCALATE When:

Controller Issues: CrashLoopBackOff, PANIC/fatal errors, repeated restarts, reconciliation loops
Agent Issues: Repeated disconnects (not reboot), can't connect (switch reachable), unexpected errors
Hardware Failures: Switch not booting, kernel panics, persistent sensor errors
Unexpected Behavior: Ready status but doesn't work, config not applied, impossible metrics, BGP flapping
Performance Issues: Degradation without capacity issues, slow reconciliation, API unresponsive
Data Plane: Traffic not forwarding despite correct config, unexplained packet loss, VXLAN issues

DO NOT ESCALATE (Self-Resolve):

Config Errors: VLAN conflict, subnet overlap, invalid CIDR, connection not found, missing VPC
Expected Warnings: Unused interfaces down, VPC deletion blocked by attachments, validation errors
Operational Questions: How to configure, best practices, capacity → Check docs

Decision Tree:

Issue → Checklist → Clear config error? YES → Self-resolve
                 ↓ NO
                 → Controller/Agent crash? YES → Escalate
                 ↓ NO
                 → Hardware failure? YES → Escalate
                 ↓ NO
                 → Config correct but not working? YES → Escalate
                 ↓ NO
                 → Operational question? YES → Check docs

Rule: If events/logs clearly explain (config error) → Self-resolve. If unexpected errors or no explanation → Escalate.

Concept 4: Support Ticket Best Practices

Essential Components:

Clear Summary: One-sentence problem statement (specific, not vague)
Severity: P1 (critical) → P4 (question) with business impact
Impacted Resources: Exact VPCs, servers, switches
Timeline: When started, what changed
Symptoms: Observable evidence (Grafana/kubectl/logs)
Expected vs Actual: What should vs does happen
Troubleshooting: What you tried and results
Diagnostics: Attached bundle with contents list
Environment: Versions (controller, agents, k8s), topology
Impact: Users affected, business impact, workarounds

Key Principles:

Be concise and specific
Show your work
Attach complete diagnostics
Report observations, not speculation
Professional tone

Common Mistakes to Avoid:

❌ Vague: "It's not working" → ✅ Specific: "Server-03 cannot reach VPC gateway 10.10.1.1"
❌ Missing diagnostics → ✅ Attached bundle
❌ No troubleshooting → ✅ Documented attempts
❌ Emotional language → ✅ Professional tone

Concept 5: Diagnostic Collection Automation

Automate collection for consistency and speed (30 seconds vs 5-10 minutes manual).

Basic Script Structure:

#!/bin/bash
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
OUTDIR="hedgehog-diagnostics-${TIMESTAMP}"
mkdir -p ${OUTDIR}

# Problem description prompt
read -p "Problem description: " DESC
echo "Issue: ${DESC}" > ${OUTDIR}/diagnostic-timestamp.txt
date >> ${OUTDIR}/diagnostic-timestamp.txt

# Collect CRDs, events, logs
kubectl get vpc,vpcattachment,vpcpeering -A -o yaml > ${OUTDIR}/crds-vpc.yaml 2>&1
kubectl get switches,servers,connections -n fab -o yaml > ${OUTDIR}/crds-wiring.yaml 2>&1
kubectl get agents -n fab -o yaml > ${OUTDIR}/crds-agents.yaml 2>&1
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > ${OUTDIR}/events.log 2>&1
kubectl logs -n fab deployment/fabric-controller-manager --tail=2000 > ${OUTDIR}/controller.log 2>&1

# Versions and status
kubectl version > ${OUTDIR}/kubectl-version.txt 2>&1
kubectl get pods -n fab > ${OUTDIR}/pods-fab.txt 2>&1
kubectl get agents -n fab -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.version}{"\n"}{end}' > ${OUTDIR}/agent-versions.txt 2>&1

# Per-switch details (prompt for affected switches)
read -p "Affected switches (comma-separated): " SWITCHES
for switch in ${SWITCHES//,/ }; do
    kubectl get agent ${switch} -n fab -o yaml > ${OUTDIR}/agent-${switch}.yaml 2>&1
    kubectl get agent ${switch} -n fab -o jsonpath='{.status.state.interfaces}' | jq > ${OUTDIR}/agent-${switch}-interfaces.json 2>&1
done

# Compress
tar czf ${OUTDIR}.tar.gz ${OUTDIR}
echo "Created: ${OUTDIR}.tar.gz ($(du -h ${OUTDIR}.tar.gz | cut -f1))"

Benefits: Consistency, completeness, speed, error handling

Redacting Secrets:

find ${OUTDIR} -type f -exec sed -i 's/password:.*/password: REDACTED/g' {} \;
find ${OUTDIR} -type f -exec sed -i 's/token:.*/token: REDACTED/g' {} \;

Troubleshooting

Issue: Diagnostic Bundle Too Large

Symptom: Bundle exceeds upload limit (>10 MB)

Solutions:

Reduce log tail: --tail=500 instead of --tail=2000
Split into multiple bundles (CRDs, logs, agent status separately)
Upload to cloud storage and share link

Issue: kubectl Commands Timing Out

Symptom: Script hangs on kubectl commands

Solutions:

Add timeout: kubectl get vpc -A -o yaml --request-timeout=30s
Collect only critical resources
Increase cluster timeout in kubeconfig

Issue: Previous Controller Logs Not Available

Symptom: kubectl logs --previous fails

Cause: Controller hasn't crashed (expected)

Solution: Script handles gracefully with 2>/dev/null || echo "No previous logs"

Issue: Agent CRD Has No Status

Symptom: kubectl get agent <switch> shows status: {}

Troubleshooting:

Check if agent pod exists: kubectl get pods -n fab | grep agent-<switch>
Check switch registered: kubectl get switch <switch> -n fab
Check boot logs: hhfab vlab serial <switch>
Check fabric-boot: kubectl logs -n fab deployment/fabric-boot

Solution: Wait for registration or investigate boot failure

Issue: Redacting Sensitive Information

Automated redaction:

find ${DIR} -type f -exec sed -i 's/password:.*/password: REDACTED/g' {} \;
find ${DIR} -type f -exec sed -i 's/token:.*/token: REDACTED/g' {} \;

Manual review: grep -ri "password\|token\|secret" hedgehog-diagnostics-*/

Issue: Support Requests Additional Data

Common requests:

Time-range metrics: curl 'http://localhost:9090/api/v1/query_range?query=...'
Switch CLI: hhfab vlab ssh leaf-04 "show running-config"
Detailed interface stats: kubectl get agent leaf-04 -o jsonpath='{.status.state.interfaces.Ethernet8}'

Best practice: Keep original bundle, add supplemental files to ticket

Resources

Kubernetes Documentation

Hedgehog Documentation

Hedgehog CRD Reference (CRD_REFERENCE.md in research folder)
Hedgehog Observability Guide (OBSERVABILITY.md in research folder)
Hedgehog Fabric Controller Documentation

Related Modules

Previous: Module 3.3: Events & Status Monitoring
Pathway: Network Like a Hyperscaler

Diagnostic Collection Quick Reference

6-Step Checklist:

# 1. Kubernetes Resource Health
kubectl get vpc,vpcattachment -A
kubectl get switches,agents -n fab
kubectl get pods -n fab

# 2. Events (Last Hour)
kubectl get events --all-namespaces --field-selector type=Warning --sort-by='.lastTimestamp'

# 3. Switch Agent Status
kubectl get agents -n fab
kubectl get agent <switch> -n fab -o yaml

# 4. BGP Health
kubectl get agent <switch> -n fab -o jsonpath='{.status.state.bgpNeighbors}' | jq

# 5. Grafana Dashboards
# - Fabric Dashboard (BGP, VPC health)
# - Interfaces Dashboard (interface state, errors)
# - Platform Dashboard (hardware health)
# - Logs Dashboard (error logs)

# 6. Controller Logs
kubectl logs -n fab deployment/fabric-controller-manager --tail=200 | grep -i error

Diagnostic Collection:

# Use automated script
./diagnostic-collect.sh

# OR manual collection
OUTDIR="hedgehog-diagnostics-$(date +%Y%m%d-%H%M%S)"
mkdir -p ${OUTDIR}

kubectl get vpc,vpcattachment -A -o yaml > ${OUTDIR}/crds-vpc.yaml
kubectl get events --all-namespaces --sort-by='.lastTimestamp' > ${OUTDIR}/events.log
kubectl logs -n fab deployment/fabric-controller-manager --tail=2000 > ${OUTDIR}/controller.log
kubectl get agent <switch> -n fab -o yaml > ${OUTDIR}/agent-<switch>.yaml

tar czf ${OUTDIR}.tar.gz ${OUTDIR}

Course 3 Complete! You've mastered Hedgehog fabric observability and diagnostic collection. You can now:

Monitor fabric health proactively (Grafana dashboards)
Troubleshoot issues systematically (kubectl + Grafana)
Collect comprehensive diagnostics (6-step checklist)
Make confident escalation decisions (self-resolve vs escalate)
Write effective support tickets (clear, specific, with evidence)

Your Next Steps:

Apply this workflow in your Hedgehog fabric operations
Customize the diagnostic script for your environment
Build confidence in escalation decisions
Partner with support as a strategic resource for reliability

You are now equipped with the complete observability and escalation toolkit for Hedgehog fabric operations.