Monitoring · Updated July 3, 2026

Runbook: Virtual Machine High CPU Utilization

Tags: runbook, azure-monitor, alerts, vm-performance, cpu, troubleshooting, incident-response, servicenow, infrastructure-as-code

Alert Details

Metric: Percentage CPU
Critical Threshold: >95% for 30 minutes
Warning Threshold: >90% for 1 hour
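
The two thresholds above can be read as rules over a trailing window of samples. A minimal sketch, assuming one-minute samples and assuming that "for 30 minutes" means the average over the window exceeds the threshold (the alert rule's actual aggregation and evaluation frequency are not stated in this runbook). `classify_cpu` and `RULES` are hypothetical names used only for illustration:

```python
from statistics import mean
from typing import Sequence

# Thresholds from Alert Details: (severity, percent, window in minutes).
# Critical is listed first so the more severe condition wins.
RULES = (("critical", 95.0, 30), ("warning", 90.0, 60))


def classify_cpu(samples: Sequence[float]) -> str:
    """Classify one-minute Percentage CPU samples, oldest first.

    Assumption: a rule fires when the average of its trailing window
    exceeds the threshold; the real alert rule may aggregate differently.
    """
    for severity, threshold, window in RULES:
        if len(samples) >= window and mean(samples[-window:]) > threshold:
            return severity
    return "ok"
```

A series shorter than a rule's window never triggers that rule, so 30 minutes of data at 92% is reported as "ok" until a full hour is available.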

Impact

Application response times may be degraded. Users may experience slow page loads or timeouts.

Investigation Steps

1. Check CPU Metrics in Azure Portal

Navigate to Azure Portal → Virtual Machines → [VM Name] → Metrics
Select "Percentage CPU" metric
Adjust time range to last 24 hours
Look for patterns: sustained vs. spike, time correlation
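
When the metric data is exported rather than eyeballed in the chart, the "sustained vs. spike" distinction can be approximated by the longest consecutive run above a threshold. This is only a sketch: the function names are hypothetical, and `sustained_len=30` is an illustrative cutoff in samples, not a value taken from the alert rule.

```python
from typing import Sequence


def longest_run_above(samples: Sequence[float], threshold: float) -> int:
    """Length of the longest consecutive run of samples above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def describe_pattern(samples: Sequence[float], threshold: float = 90.0,
                     sustained_len: int = 30) -> str:
    """Label a series 'sustained', 'spike', or 'normal'.

    sustained_len is an illustrative cutoff (number of samples).
    """
    run = longest_run_above(samples, threshold)
    if run >= sustained_len:
        return "sustained"
    return "spike" if run > 0 else "normal"
```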

2. Identify CPU-Consuming Process

[!NOTE] VM Connection Methods

Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, SSH, Serial Console)

Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group

Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)

Network Access: Bastion provides browser-based access without public IP requirements

Windows:

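# Connect via RDP (Azure Portal → VM → Connect → RDP)
# Open Task Manager → Performance → CPU
# Switch to Processes tab, sort by CPU column
# OR via Azure Serial Console (VM → Serial Console); it has no GUI, so from a PowerShell session:
# (the CPU column here is cumulative processor seconds per process, not a live percentage)
Get-Process | Sort-Object CPU -Descending | Select-Object -First 20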

Linux:

# Connect via SSH (Azure Portal → VM → Connect → SSH)
# OR via Azure Serial Console (VM → Serial Console)
top -bn1 | head -20
# Or for more detail:
ps aux --sort=-%cpu | head -20
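
If the `ps aux` output is captured into the ServiceNow incident, it can also be summarized programmatically. A minimal sketch, assuming the standard 11-column `ps aux` layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND). `parse_ps_aux` and `ProcCpu` are hypothetical names:

```python
from typing import List, NamedTuple


class ProcCpu(NamedTuple):
    pid: int
    cpu_percent: float
    command: str


def parse_ps_aux(output: str, limit: int = 5) -> List[ProcCpu]:
    """Return the top `limit` processes by %CPU from `ps aux` text."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)  # COMMAND may contain spaces
        if len(fields) < 11:
            continue  # ignore truncated or malformed lines
        rows.append(ProcCpu(int(fields[1]), float(fields[2]),
                            fields[10].strip()))
    return sorted(rows, key=lambda p: p.cpu_percent, reverse=True)[:limit]
```

Note that `ps` reports %CPU as CPU time divided by the time the process has been running, so a short recent burst shows up more clearly in `top`.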

3. Check Application Insights

Navigate to Application Insights → Performance
Review slow requests and dependencies
Check for database query slowness
Correlate with deployment timeline

4. Review Recent Changes

Check Activity Log for recent deployments
Review application configuration changes
Verify no recent auto-scaling events
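
Correlating changes with the alert comes down to asking which changes landed shortly before CPU rose. A sketch of that filter, assuming change events have already been exported (for example from the Activity Log) as timestamp and description pairs. The `changes_before` name and the 24-hour default lookback are illustrative assumptions, not part of this runbook:

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Change = Tuple[datetime, str]  # (timestamp, description)


def changes_before(alert_start: datetime, changes: Iterable[Change],
                   lookback: timedelta = timedelta(hours=24)) -> List[Change]:
    """Changes that landed within `lookback` before the alert, newest first."""
    window_start = alert_start - lookback
    hits = [c for c in changes if window_start <= c[0] <= alert_start]
    return sorted(hits, key=lambda c: c[0], reverse=True)
```

Changes closest to the alert start are listed first, since they are usually the first candidates to rule in or out.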

Remediation

[!WARNING] Infrastructure as Code Policy

All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

Review application logs for errors that might cause CPU loops
Check for runaway background jobs or batch processes
Identify the process consuming CPU (see Investigation Steps above)
Document findings in the ServiceNow incident

Short-Term Resolution

Open a ServiceNow incident with Epic_Azure_Infrastructure_Ops.

Tier 3 Support will review and implement changes via incident or change request:
- VM scaling (D4s_v5 → D8s_v5, D8s_v5 → D16s_v5)
- Scale set instance count adjustment
- Temporary rate limiting configuration
All changes implemented through Terraform/IaC
No manual Azure Portal changes

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

Engineering Team will implement permanent solutions:
- Application performance profiling and optimization
- Caching layer implementation (Redis, Memcached)
- Database query optimization
- Architecture improvements (async processing, microservices)
- Auto-scaling configuration via Terraform
All solutions implemented through CI/CD pipeline
Changes tracked via GitHub issue → PR → deployment workflow

Escalation

Epic_Azure_Infrastructure_Ops: Open a ServiceNow incident if CPU stays >95% for more than 1 hour with no resolution
Epic - Azure (National West): Open a ServiceNow incident if the CPU issue correlates with specific application code or database queries:
- Application-related CPU issues
- Database query performance issues
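
The escalation rules above can be written as a small decision function. This is a sketch only: `escalation_groups` and its parameters are hypothetical, and it assumes both groups may be engaged on the same incident when both conditions hold, which this runbook does not state explicitly.

```python
from typing import List


def escalation_groups(minutes_above_95: int, resolved: bool,
                      app_or_db_correlated: bool) -> List[str]:
    """Return the ServiceNow groups to engage for a high-CPU incident."""
    groups = []
    # CPU >95% for more than 1 hour with no resolution
    if minutes_above_95 > 60 and not resolved:
        groups.append("Epic_Azure_Infrastructure_Ops")
    # Issue tied to specific application code or database queries
    if app_or_db_correlated:
        groups.append("Epic - Azure (National West)")
    return groups
```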

Related Alerts

High Memory Usage (may indicate memory leak causing CPU thrashing)
Disk IOPS Throttling (I/O wait can appear as high CPU)
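
On Linux VMs, one way to check whether I/O wait is involved is to compare two snapshots of the aggregate `cpu` line in `/proc/stat` and compute the share of time spent in `iowait`. A minimal sketch with hypothetical function names. It assumes the standard field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice):

```python
from typing import List


def cpu_times(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def iowait_percent(before: str, after: str) -> float:
    """Share of elapsed CPU time spent in iowait between two snapshots."""
    a, b = cpu_times(before), cpu_times(after)
    # Use only the first 8 counters: guest time is already counted in user/nice.
    deltas = [y - x for x, y in zip(a[:8], b[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0
```

To use it, read the first line of `/proc/stat` twice a few seconds apart and pass both strings in. A high result points toward the Disk IOPS Throttling alert rather than a compute-bound process.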

Historical Context

Common causes in the OHEMR Epic environment:

HSWeb background processing jobs
Interconnect batch processing
Anti-virus full scans during business hours