Monitoring | Updated July 3, 2026
Runbook: Virtual Machine High CPU Utilization
Tags: runbook, azure-monitor, alerts, vm-performance, cpu, troubleshooting, incident-response, servicenow, infrastructure-as-code
Alert Details
- Metric: Percentage CPU
- Critical Threshold: >95% for 30 minutes
- Warning Threshold: >90% for 1 hour
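As a rough illustration of how these two thresholds behave, the sketch below classifies a series of CPU samples. It assumes one sample every 5 minutes (so 6 samples cover 30 minutes and 12 cover 1 hour) and treats a threshold as breached only when every sample in the trailing window exceeds it. The real alert rule's aggregation and evaluation frequency may differ, and `classify_cpu` is a hypothetical helper name, not part of any Azure tooling.

```shell
#!/bin/sh
# classify_cpu: read one CPU percentage per line on stdin (oldest first,
# 5-minute samples ASSUMED) and print CRITICAL, WARNING, or OK.
classify_cpu() {
  awk '
    { v[NR] = $1 + 0 }                      # force numeric values
    END {
      crit = (NR >= 6)                      # need 30 minutes of data
      warn = (NR >= 12)                     # need 1 hour of data
      # Critical: every one of the last 6 samples is above 95
      for (i = NR - 5; i <= NR && crit; i++) if (v[i] <= 95) crit = 0
      # Warning: every one of the last 12 samples is above 90
      for (i = NR - 11; i <= NR && warn; i++) if (v[i] <= 90) warn = 0
      if (crit) print "CRITICAL"
      else if (warn) print "WARNING"
      else print "OK"
    }'
}
```

For example, `printf '%s\n' 96 97 98 99 96 97 | classify_cpu` prints `CRITICAL`, while a single dip to 95 or below inside the window resets the result.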
Impact
Application response times may be degraded. Users may experience slow page loads or timeouts.
Investigation Steps
1. Check CPU Metrics in Azure Portal
- Navigate to Azure Portal → Virtual Machines → [VM Name] → Metrics
- Select "Percentage CPU" metric
- Adjust time range to last 24 hours
- Look for patterns: sustained vs. spike, time correlation
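The same check can be sketched from the command line. The block below is a minimal example, assuming the Azure CLI is installed and signed in; `fetch_cpu` and `summarize_cpu` are hypothetical helper names, and the `az monitor metrics list` flags shown should be verified against the installed CLI version. The summary reports the peak and the longest consecutive run above a limit, which helps separate a sustained load from a brief spike.

```shell
#!/bin/sh
# fetch_cpu <vm-resource-id>: print "timestamp<TAB>average" rows for the
# last 24 hours at 5-minute granularity. Requires a signed-in Azure CLI.
fetch_cpu() {
  az monitor metrics list \
    --resource "$1" \
    --metric "Percentage CPU" \
    --aggregation Average \
    --interval PT5M \
    --offset 24h \
    --query "value[0].timeseries[0].data[].[timeStamp, average]" \
    --output tsv
}

# summarize_cpu [limit]: read those rows on stdin; report the sample count,
# the peak, and the longest consecutive run above the limit (default 90).
summarize_cpu() {
  awk -F '\t' -v limit="${1:-90}" '
    $2 == "" || $2 == "None" { run = 0; next }   # a gap breaks the run
    {
      n++
      if ($2 + 0 > peak) { peak = $2 + 0; peak_at = $1 }
      if ($2 + 0 > limit) { run++; if (run > longest) longest = run }
      else run = 0
    }
    END {
      printf "samples=%d peak=%.1f peak_at=%s longest_run_above_%s=%d\n",
             n, peak, peak_at, limit, longest
    }'
}
```

Typical usage would be `fetch_cpu "$VM_ID" | summarize_cpu 90`, where `$VM_ID` holds the VM's full resource ID. With 5-minute samples, a longest run of 12 or more corresponds to at least an hour of sustained load.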
2. Identify CPU-Consuming Process
[!NOTE] VM Connection Methods
- Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, SSH, Serial Console)
- Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group
- Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)
- Network Access: Bastion provides browser-based access without public IP requirements
Windows:
# Connect via RDP (Azure Portal → VM → Connect → RDP)
# OR via Azure Serial Console (VM → Serial Console; command line only, no Task Manager)
# In an RDP or Bastion session, open Task Manager → Performance → CPU
# Switch to the Processes tab and sort by the CPU column
Linux:
# Connect via SSH (Azure Portal → VM → Connect → SSH)
# OR via Azure Serial Console (VM → Serial Console)
top -bn1 | head -20
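# Or for more detail (note: ps reports %CPU averaged over each process's
# lifetime, so a recent spike in a long-running process can be understated):
ps aux --sort=-%cpu | head -20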
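Because `ps` reports CPU time divided by process lifetime, it can help to measure usage over a short interval instead. The following is a minimal sketch, assuming a Linux guest with the standard `/proc/<pid>/stat` layout (utime and stime as fields 14 and 15); `stat_ticks`, `snapshot`, and `top_cpu` are hypothetical helper names. Where the sysstat package is installed, `pidstat` offers similar interval-based figures.

```shell
#!/bin/sh
# stat_ticks: read one /proc/<pid>/stat line on stdin and print
# utime + stime in clock ticks.
stat_ticks() {
  IFS= read -r line || return 1
  rest=${line##*) }   # drop "pid (comm) "; comm may contain spaces or ")"
  set -- $rest        # $1 is now field 3 (state), so utime/stime are $12/$13
  echo $(( ${12} + ${13} ))
}

# snapshot: print "pid ticks" for every process currently visible in /proc.
snapshot() {
  for f in /proc/[0-9]*/stat; do
    pid=${f#/proc/}; pid=${pid%/stat}
    # Processes can exit between the glob and the read; skip those quietly.
    ticks=$({ stat_ticks < "$f"; } 2>/dev/null) && echo "$pid $ticks"
  done
}

# top_cpu [seconds] [count]: busiest processes over the sampling interval.
# 100% corresponds to one fully busy core.
top_cpu() {
  secs=${1:-5}; count=${2:-10}
  hz=$(getconf CLK_TCK)
  before=$(snapshot); sleep "$secs"; after=$(snapshot)
  printf '%s\n' "$before" "---" "$after" |
    awk -v hz="$hz" -v secs="$secs" '
      $1 == "---" { second = 1; next }
      !second     { t0[$1] = $2; next }       # first snapshot
      ($1 in t0)  { printf "%.1f %s\n", ($2 - t0[$1]) * 100 / (hz * secs), $1 }' |
    sort -rn | head -n "$count" |
    while read -r pct pid; do
      printf '%6s%%  %-8s %s\n' "$pct" "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
    done
}
```

Running `top_cpu 5 10` samples for five seconds and lists the ten busiest processes as percentage, PID, and command name. Processes that start or exit during the interval are ignored, which is acceptable for spotting a sustained consumer.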
3. Check Application Insights
- Navigate to Application Insights → Performance
- Review slow requests and dependencies
- Check for database query slowness
- Correlate with deployment timeline
4. Review Recent Changes
- Check Activity Log for recent deployments
- Review application configuration changes
- Verify no recent auto-scaling events
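The Activity Log review can also be sketched from the command line. This example assumes a signed-in Azure CLI; `recent_changes` and `only_changes` are hypothetical helper names, and the queried field names (`eventTimestamp`, `operationName.value`, `status.value`, `caller`) are assumptions about the CLI's JSON output that should be confirmed against a real response before relying on them.

```shell
#!/bin/sh
# recent_changes <resource-group> [offset]: list Activity Log events as
# tab-separated rows. Requires a signed-in Azure CLI.
recent_changes() {
  az monitor activity-log list \
    --resource-group "$1" \
    --offset "${2:-24h}" \
    --query "[].[eventTimestamp, operationName.value, status.value, caller]" \
    --output tsv
}

# only_changes: keep rows whose operation succeeded and whose name ends in
# write, delete, or action (i.e. something was actually modified).
only_changes() {
  awk -F '\t' '$3 == "Succeeded" && $2 ~ /\/(write|delete|action)$/'
}
```

A typical invocation would be `recent_changes my-resource-group 24h | only_changes`, with `my-resource-group` standing in for the VM's resource group. Timestamps from the output can then be compared against the start of the CPU increase.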
Remediation
[!WARNING] Infrastructure as Code Policy: All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.
Investigation Actions
- Review application logs for errors that might cause CPU loops
- Check for runaway background jobs or batch processes
- Identify process consuming CPU (use investigation steps above)
- Document findings in ServiceNow incident
Short-Term Resolution
Open ServiceNow Incident with Epic_Azure_Infrastructure_Ops:
- Tier 3 Support will review and implement changes via incident or change request:
  - VM scaling (D4s_v5 → D8s_v5, D8s_v5 → D16s_v5)
  - Scale set instance count adjustment
  - Temporary rate limiting configuration
- All changes implemented through Terraform/IaC
- No manual Azure Portal changes
Long-Term Resolution
Create GitHub Issue: Epic on Azure Ops Issues
- Engineering Team will implement permanent solutions:
  - Application performance profiling and optimization
  - Caching layer implementation (Redis, Memcached)
  - Database query optimization
  - Architecture improvements (async processing, microservices)
  - Auto-scaling configuration via Terraform
- All solutions implemented through CI/CD pipeline
- Changes tracked via GitHub issue → PR → deployment workflow
Escalation
- Epic_Azure_Infrastructure_Ops: Open a ServiceNow incident if CPU stays >95% for >1 hour with no resolution
- Epic - Azure (National West): Open a ServiceNow incident if the CPU issue correlates with specific application code or database queries:
  - Application-related CPU issues
  - Database query performance issues
Related Alerts
- High Memory Usage (may indicate memory leak causing CPU thrashing)
- Disk IOPS Throttling (I/O wait can appear as high CPU)
Historical Context
Common causes in the OHEMR Epic environment:
- HSWeb background processing jobs
- Interconnect batch processing
- Anti-virus full scans during business hours