Navigation
MonitoringUpdated July 3, 2026

Runbook: Virtual Machine Availability Alert

runbookazure-monitoralertsvm-availabilitydisaster-recoverytroubleshootingincident-responseservicenowinfrastructure-as-code

Runbook: Virtual Machine Availability Alert

Alert Details

  • Metric: VM Availability
  • Threshold: <50% for 30 minutes

Impact

Service disruption. Users cannot access application. Potential data loss if VM crashed.

Investigation Steps

1. Check VM Status

  • Azure Portal → Virtual Machines → [VM Name] → Overview
  • Check current status: Running, Stopped, Deallocated, Failed
  • Review "Resource Health" section

2. Review Activity Log

  • Azure Portal → VM → Activity Log
  • Filter last 2 hours
  • Look for:
    • Planned maintenance events
    • User-initiated reboots/shutdowns
    • Azure platform events (host failures)
    • Auto-shutdown policies

3. Check Boot Diagnostics

  • Azure Portal → VM → Boot Diagnostics
  • Review serial console output
  • Look for OS-level boot failures
  • Check for kernel panics (Linux) or BSOD (Windows)

4. Review Resource Health

  • Azure Portal → VM → Resource Health
  • Check for known Azure platform issues
  • Review historical availability data

5. Correlate with Other Alerts

  • Check if memory alert fired before availability drop (OOM crash)
  • Check if disk alert fired (disk exhaustion crash)
  • Review network alerts (connectivity loss)

Remediation

[!WARNING] Infrastructure as Code Policy All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

  1. Check boot diagnostics logs for root cause
  2. Review application logs from before crash
  3. Verify VM resource sufficiency (CPU/memory/disk)
  4. Document VM state and crash details in ServiceNow incident

Short-Term Resolution

Open ServiceNow Incident with Epic_Azure_Infrastructure_Ops:

  • Tier 3 Support will review and implement resolution:
    • VM restart (if stopped)
    • Azure Support ticket (if platform failure)
    • Snapshot creation (if recurring crash)
    • Resource adjustment if undersized
  • For Platform Events: Monitor Azure Service Health, wait for Azure resolution
  • All changes coordinated with application teams

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

  • Engineering Team will implement permanent solutions:
    • Auto-healing for scale sets via Terraform
    • Multiple instance deployment with load balancer
    • Application crash root cause fix
    • Availability Zones configuration for higher SLA
    • Azure Site Recovery for disaster recovery
  • All solutions implemented through CI/CD pipeline
  • Infrastructure changes via Terraform, application fixes via standard deployment

Differentiate Planned vs. Unplanned

Planned Maintenance (Ignore)

  • User-initiated reboot
  • Azure planned maintenance window
  • Patching windows (Tuesday 2-6 AM, Weekend 2-4 AM)
  • Auto-shutdown schedules

Unplanned Downtime (Investigate)

  • Crash/hang without user action
  • Azure platform host failure
  • Application-induced crash (OOM, disk full)
  • Network isolation

Escalation

  • Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for persistent availability issues or recurring crashes
  • Azure Support: For platform-level failures
  • Epic - Azure (National West): Open ServiceNow incident if crash caused by application code

Related Alerts

  • Low Memory (OOM killer terminates VM)
  • Disk IOPS/Space (disk full crashes)
  • CPU Spike (may precede hang/crash)

Historical Context

Common causes in OHEMR Epic environment:

  • Care Everywhere VMs: Memory exhaustion crashes
  • Planned patching windows (should be suppressed)
  • Epic background process crashes