MonitoringUpdated July 3, 2026

Runbook: Virtual Machine Availability Alert

runbookazure-monitoralertsvm-availabilitydisaster-recoverytroubleshootingincident-responseservicenowinfrastructure-as-code

Runbook: Virtual Machine Availability Alert

Alert Details

Metric: VM Availability
Threshold: <50% for 30 minutes

Impact

Service disruption. Users cannot access application. Potential data loss if VM crashed.

Investigation Steps

1. Check VM Status

Azure Portal → Virtual Machines → [VM Name] → Overview
Check current status: Running, Stopped, Deallocated, Failed
Review "Resource Health" section

2. Review Activity Log

Azure Portal → VM → Activity Log
Filter last 2 hours
Look for:
- Planned maintenance events
- User-initiated reboots/shutdowns
- Azure platform events (host failures)
- Auto-shutdown policies

3. Check Boot Diagnostics

Azure Portal → VM → Boot Diagnostics
Review serial console output
Look for OS-level boot failures
Check for kernel panics (Linux) or BSOD (Windows)

4. Review Resource Health

Azure Portal → VM → Resource Health
Check for known Azure platform issues
Review historical availability data

5. Correlate with Other Alerts

Check if memory alert fired before availability drop (OOM crash)
Check if disk alert fired (disk exhaustion crash)
Review network alerts (connectivity loss)

Remediation

[!WARNING] Infrastructure as Code Policy All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

Check boot diagnostics logs for root cause
Review application logs from before crash
Verify VM resource sufficiency (CPU/memory/disk)
Document VM state and crash details in ServiceNow incident

Short-Term Resolution

Open ServiceNow Incident with Epic_Azure_Infrastructure_Ops:

Tier 3 Support will review and implement resolution:
- VM restart (if stopped)
- Azure Support ticket (if platform failure)
- Snapshot creation (if recurring crash)
- Resource adjustment if undersized
For Platform Events: Monitor Azure Service Health, wait for Azure resolution
All changes coordinated with application teams

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

Engineering Team will implement permanent solutions:
- Auto-healing for scale sets via Terraform
- Multiple instance deployment with load balancer
- Application crash root cause fix
- Availability Zones configuration for higher SLA
- Azure Site Recovery for disaster recovery
All solutions implemented through CI/CD pipeline
Infrastructure changes via Terraform, application fixes via standard deployment

Differentiate Planned vs. Unplanned

Planned Maintenance (Ignore)

User-initiated reboot
Azure planned maintenance window
Patching windows (Tuesday 2-6 AM, Weekend 2-4 AM)
Auto-shutdown schedules

Unplanned Downtime (Investigate)

Crash/hang without user action
Azure platform host failure
Application-induced crash (OOM, disk full)
Network isolation

Escalation

Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for persistent availability issues or recurring crashes
Azure Support: For platform-level failures
Epic - Azure (National West): Open ServiceNow incident if crash caused by application code

Related Alerts

Low Memory (OOM killer terminates VM)
Disk IOPS/Space (disk full crashes)
CPU Spike (may precede hang/crash)

Historical Context

Common causes in OHEMR Epic environment:

Care Everywhere VMs: Memory exhaustion crashes
Planned patching windows (should be suppressed)
Epic background process crashes