MonitoringUpdated July 3, 2026
Runbook: Load Balancer VIP Availability Alert
runbookazure-monitoralertsload-balancernetworkinghealth-probestroubleshootingincident-responseservicenowinfrastructure-as-code
Runbook: Load Balancer VIP Availability Alert
Alert Details
- Metric: VIP Availability
- Threshold: <75% for 15 minutes
- Severity: 1 (Error)
Impact
Frontend service disruption. External users may receive 502/503 errors or connection timeouts.
Investigation Steps
1. Check Load Balancer Health Probes
- Azure Portal → Load Balancers → [LB Name] → Health Probes
- Verify probe configuration (port, protocol, interval)
- Check probe status (Success vs. Failed)
2. Review Backend Pool Health
- Load Balancer → Backend Pools
- Verify healthy backend instance count
- Check which instances are failing health probes
- Minimum healthy: 2 instances for production
3. Investigate Backend VM Issues
For each unhealthy backend:
-
Check VM availability (is it running?)
-
Check CPU/memory/disk metrics
-
Verify application service is running
-
Test health probe endpoint manually:
curl -v http://<backend-ip>:<probe-port>/health
4. Verify Network Connectivity
- Check NSG rules allow health probe traffic
- Verify firewall rules (source: Azure Load Balancer tag)
- Confirm backend VMs in correct subnet/VNET
5. Review Recent Changes
- Recent backend VM deployments
- Application version changes
- NSG/firewall rule modifications
- Load balancer configuration changes
Remediation
[!WARNING] Infrastructure as Code Policy All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.
Investigation Actions
- Identify which backends are unhealthy (use investigation steps above)
- Test health probe endpoint manually from healthy VM
- Check NSG rules and firewall configurations
- Document findings in ServiceNow incident
Short-Term Resolution
Open ServiceNow Incident:
- If all backends unhealthy → Epic_Azure_Infrastructure_Ops:
- Tier 3 Support will investigate common issue (NSG change, port binding)
- Application service restart coordinated with Epic - Azure (National West)
- Network configuration review
- If some backends unhealthy → Epic_Azure_Infrastructure_Ops:
- Backend pool member adjustment via Terraform
- Unhealthy instance investigation
- Health probe configuration review
- If health endpoint failing → Epic - Azure (National West):
- Application team will fix health endpoint
- Port binding verification
- Application configuration correction
Long-Term Resolution
Create GitHub Issue: Epic on Azure Ops Issues
- Engineering Team will implement permanent solutions:
- Auto-scaling for backend pool via Terraform
- Backend redundancy increase (minimum 3 instances)
- Health check retry logic in application code
- Application Gateway with WAF configuration
- Blue/green deployment implementation
- All solutions implemented through CI/CD pipeline
- Load balancer changes via Terraform, application fixes via standard deployment
Load Balancer vs. Backend Differentiation
Load Balancer Issue
- All backends report unhealthy simultaneously
- Health probe configuration incorrect
- NSG blocking health probe traffic
Backend Issue
- Individual backends failing over time
- Application crashes or hangs
- Resource exhaustion (CPU/memory/disk)
Escalation
- Network Team: Open ServiceNow incident with Network assignment group for load balancer config issues
- Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for backend VM issues or persistent health probe failures
- Epic - Azure (National West): Open ServiceNow incident if health endpoint failing
Related Alerts
- DIP Availability (removed/disabled): Was generating duplicate alerts
- Backend VM alerts (CPU, memory, availability)
Historical Context
- VIP availability <90% threshold was too sensitive
- New threshold <75% allows for backend rotation without alerting
- <75% = <3.75 min downtime in 15-min window (acceptable for health probe cycles)