Monitoring | Updated July 3, 2026

Runbook: Virtual Machine High CPU Utilization

Tags: runbook, azure-monitor, alerts, vm-performance, cpu, troubleshooting, incident-response, servicenow, infrastructure-as-code

Alert Details

  • Metric: Percentage CPU
  • Critical Threshold: >95% for 30 minutes
  • Warning Threshold: >90% for 1 hour
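
The two thresholds can be checked against raw samples with a small helper. This is a minimal sketch, not the alert rule itself: the function name `classify_cpu` is hypothetical, and it assumes one sample per minute and that "for 30 minutes" means every sample in the window exceeds the threshold. The actual Azure Monitor rule may aggregate differently.

```shell
# classify_cpu: reads one CPU percentage per line on stdin (oldest first,
# ASSUMED one sample per minute) and prints critical, warning, or ok.
classify_cpu() {
  awk '
    { s[NR] = $1 }
    # all_above(n, limit): true when the last n samples all exceed limit
    function all_above(n, limit,    i) {
      if (NR < n) return 0
      for (i = NR - n + 1; i <= NR; i++) if (s[i] <= limit) return 0
      return 1
    }
    END {
      if (all_above(30, 95)) print "critical"       # >95% for 30 minutes
      else if (all_above(60, 90)) print "warning"   # >90% for 1 hour
      else print "ok"
    }'
}
```

Feeding it a file of per-minute samples, for example `classify_cpu < samples.txt`, shows which threshold a given period would have crossed.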

Impact

Application response times may be degraded. Users may experience slow page loads or timeouts.

Investigation Steps

1. Check CPU Metrics in Azure Portal

  • Navigate to Azure Portal → Virtual Machines → [VM Name] → Metrics
  • Select "Percentage CPU" metric
  • Adjust time range to last 24 hours
  • Look for patterns: sustained vs. spike, time correlation
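
The same 24-hour view is available from the Azure CLI, which is easier to paste into an incident. The sketch below assumes the Azure CLI is installed and signed in with read access. The wrapper name `cpu_history` is hypothetical, and its argument is the VM's full resource ID. The call is read-only and changes nothing on the VM.

```shell
# cpu_history VM_RESOURCE_ID: average and maximum "Percentage CPU" in
# 5-minute buckets over the last 24 hours. Read-only query.
cpu_history() {
  az monitor metrics list \
    --resource "$1" \
    --metric "Percentage CPU" \
    --aggregation Average Maximum \
    --interval 5m \
    --offset 24h \
    --output table
}
```

A flat Average near the threshold suggests sustained load, while a low Average with a high Maximum points to short spikes.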

2. Identify CPU-Consuming Process

[!NOTE] VM Connection Methods

  • Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, SSH, Serial Console)
  • Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group
  • Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)
  • Network Access: Bastion provides browser-based access without public IP requirements

Windows:

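# Connect via RDP (Azure Portal → VM → Connect → RDP) or Bastion, then:
#   Open Task Manager → Performance → CPU
#   Switch to the Processes tab and sort by the CPU column
# OR via Azure Serial Console (VM → Serial Console), which is command-line only (no Task Manager).
# From a PowerShell prompt, list processes by accumulated CPU seconds (not current percentage):
Get-Process | Sort-Object CPU -Descending | Select-Object -First 20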

Linux:

# Connect via SSH (Azure Portal → VM → Connect → SSH)
# OR via Azure Serial Console (VM → Serial Console)
top -bn1 | head -20
# Or for more detail:
ps aux --sort=-%cpu | head -20
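
When the load is spread across many workers with the same name, no single process stands out in the per-process listing. A minimal sketch that totals CPU by command name follows. The helper name `sum_cpu_by_command` is hypothetical. Note that the `%CPU` reported by `ps` is averaged over each process's lifetime, so use `top` for the instantaneous picture.

```shell
# sum_cpu_by_command: reads "<pcpu> <command>" lines on stdin and prints the
# total CPU percentage per command name, highest first.
sum_cpu_by_command() {
  awk 'NF >= 2 { v = $1; $1 = ""; sub(/^ +/, ""); cpu[$0] += v }
       END { for (c in cpu) printf "%.1f %s\n", cpu[c], c }' |
    LC_ALL=C sort -rn
}

# Usage on a Linux VM (the "=" suffixes suppress the ps header line):
# ps -eo pcpu=,comm= | sum_cpu_by_command | head -5
```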

3. Check Application Insights

  • Navigate to Application Insights → Performance
  • Review slow requests and dependencies
  • Check for database query slowness
  • Correlate with deployment timeline

4. Review Recent Changes

  • Check Activity Log for recent deployments
  • Review application configuration changes
  • Check whether any auto-scaling events occurred recently
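
The Activity Log can also be pulled from the Azure CLI. This sketch assumes the CLI is signed in with read access. The wrapper name `recent_changes` is hypothetical, and the 24-hour window is an arbitrary choice rather than a policy value. It only reads the log.

```shell
# recent_changes RESOURCE_GROUP: Activity Log entries for the resource group
# over the last 24 hours, reduced to the fields useful for correlation.
recent_changes() {
  az monitor activity-log list \
    --resource-group "$1" \
    --offset 24h \
    --query "[].{time:eventTimestamp, operation:operationName.localizedValue, status:status.value, caller:caller}" \
    --output table
}
```

Compare the `time` column with the point where CPU began to climb in the metrics chart.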

Remediation

[!WARNING] Infrastructure as Code Policy

All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

  1. Review application logs for errors that might cause CPU loops
  2. Check for runaway background jobs or batch processes
  3. Identify the process consuming CPU (see Investigation Steps, step 2)
  4. Document findings in ServiceNow incident

Short-Term Resolution

Open ServiceNow Incident with Epic_Azure_Infrastructure_Ops:

  • Tier 3 Support will review and implement changes via incident or change request:
    • VM scaling (D4s_v5 → D8s_v5, D8s_v5 → D16s_v5)
    • Scale set instance count adjustment
    • Temporary rate limiting configuration
  • All changes implemented through Terraform/IaC
  • No manual Azure Portal changes

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

  • Engineering Team will implement permanent solutions:
    • Application performance profiling and optimization
    • Caching layer implementation (Redis, Memcached)
    • Database query optimization
    • Architecture improvements (async processing, microservices)
    • Auto-scaling configuration via Terraform
  • All solutions implemented through CI/CD pipeline
  • Changes tracked via GitHub issue → PR → deployment workflow

Escalation

  • Epic_Azure_Infrastructure_Ops: Open a ServiceNow incident if CPU stays above 95% for more than 1 hour with no resolution
  • Epic - Azure (National West): Open a ServiceNow incident if the CPU issue correlates with specific application code or database queries:
    • Application-related CPU issues
    • Database query performance issues

Related Alerts

  • High Memory Usage (may indicate memory leak causing CPU thrashing)
  • Disk IOPS Throttling (on Linux, I/O wait raises load average and can be mistaken for CPU saturation)
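
On a Linux VM, I/O wait can be separated from real CPU work by comparing two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal). The function name `iowait_pct` is hypothetical.

```shell
# iowait_pct LINE1 LINE2: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
iowait_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # user through steal
      t[NR] = total
      w[NR] = $6                             # field 6 is iowait
    }
    END {
      dt = t[2] - t[1]
      if (dt <= 0) { print "0.0"; exit }
      printf "%.1f\n", 100 * (w[2] - w[1]) / dt
    }'
}

# Usage on a Linux VM: take two samples a few seconds apart.
# a="$(grep '^cpu ' /proc/stat)"; sleep 5; b="$(grep '^cpu ' /proc/stat)"
# iowait_pct "$a" "$b"
```

A high result here alongside a Disk IOPS Throttling alert suggests the disk, not the CPU, is the bottleneck.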

Historical Context

Common causes in the OHEMR Epic environment:

  • HSWeb background processing jobs
  • Interconnect batch processing
  • Anti-virus full scans during business hours