Updated July 3, 2026
Runbook: Virtual Machine Low Available Memory
Tags: runbook, azure-monitor, alerts, vm-performance, memory, troubleshooting, incident-response, servicenow, infrastructure-as-code
Alert Details
- Metric: Available Memory Bytes
- Critical Threshold: ≤2% of total memory available for 30 minutes
- Warning Threshold: ≤15% of total memory available for 1 hour
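The two thresholds can be expressed as a small classifier, which helps when checking exported metric samples by hand. This is a minimal sketch, assuming one available-memory percentage sample per minute and that each rule compares the *average* over its window (30 samples for critical, 60 for warning) with the threshold; the actual Azure Monitor rule may use a different aggregation. `classify_memory_alert` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# classify_memory_alert (hypothetical helper)
# stdin:  one available-memory percentage per line, oldest first,
#         assumed to be one sample per minute
# stdout: CRITICAL, WARNING, or OK, using the runbook thresholds
#         applied to the average of the most recent window
classify_memory_alert() {
  awk '
    { s[NR] = $1 + 0 }

    # Average of the last n samples.
    function avg(n,    i, t) {
      t = 0
      for (i = NR - n + 1; i <= NR; i++) t += s[i]
      return t / n
    }

    END {
      # Critical: average <= 2% over the last 30 minutes.
      if (NR >= 30 && avg(30) <= 2)  { print "CRITICAL"; exit }
      # Warning: average <= 15% over the last 60 minutes.
      if (NR >= 60 && avg(60) <= 15) { print "WARNING"; exit }
      # Windows that are not yet full are treated as not breached.
      print "OK"
    }'
}
```

For example, `printf '1\n%.0s' $(seq 30) | classify_memory_alert` feeds 30 samples of 1% and prints `CRITICAL`.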
Impact
The VM may become unresponsive or crash, and application errors are likely. On Linux, the OOM (out-of-memory) killer may terminate processes.
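On Linux, OOM-killer activity is recorded in the kernel log, so counting kills per process shows which workloads have already been terminated. A minimal sketch, assuming kernel messages of the form `Killed process <pid> (<name>)` (newer kernels) or `Kill process <pid> (<name>)` (older kernels); `oom_kills_by_process` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# oom_kills_by_process (hypothetical helper)
# stdin:  kernel log text, for example from dmesg or journalctl -k
# stdout: "<process-name> <kill-count>" lines, most-killed first
#
# Pipeline: match "Killed process <pid> (<name>)" or the older
# "Kill process <pid> (<name>)", keep the name inside the parentheses,
# count occurrences, sort by count, then print the name before the count.
oom_kills_by_process() {
  grep -oE 'Kill(ed)? process [0-9]+ \([^)]+\)' |
    sed -E 's/.*\((.*)\)$/\1/' |
    sort | uniq -c | sort -rn |
    sed -E 's/^ *([0-9]+) (.*)$/\2 \1/'
}
```

Typical input sources are `journalctl -k --since "24 hours ago"` or `dmesg`, either of which may require elevated privileges on the VM.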
Investigation Steps
1. Check Memory Metrics
- Azure Portal → VM → Metrics → "Available Memory Bytes"
- Compare with "Percentage Memory" metric
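- Review the 24-hour trend to distinguish a leak from a capacity issue: a steady decline in available memory suggests a leak, while a flat but consistently low level suggests the VM is undersized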
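The leak-versus-capacity call can be made more repeatable by fitting a least-squares slope to the 24-hour samples. This is a minimal sketch: `memory_trend` is a hypothetical helper, the input is assumed to be evenly spaced available-memory values (for example hourly averages), and the tolerance that separates a real downward trend from noise is an assumption to tune per VM.

```shell
#!/usr/bin/env bash
# memory_trend (hypothetical helper)
# stdin:  evenly spaced available-memory samples, oldest first (bytes or %)
# $1:     tolerance; the slope must fall below -tolerance per sample to
#         count as a downward trend (default 0)
# stdout: LIKELY_LEAK, NO_DOWNWARD_TREND, or INSUFFICIENT_DATA
memory_trend() {
  local tolerance="${1:-0}"
  awk -v tol="$tolerance" '
    # Accumulate sums for an ordinary least-squares fit of y = a + b*x,
    # using the sample index as x.
    { n++; x = n; y = $1 + 0; sx += x; sy += y; sxx += x * x; sxy += x * y }
    END {
      if (n < 2) { print "INSUFFICIENT_DATA"; exit }
      slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
      if (slope < -tol) print "LIKELY_LEAK"
      else print "NO_DOWNWARD_TREND"
    }'
}
```

`NO_DOWNWARD_TREND` on a series that is still close to the thresholds points toward capacity rather than a leak. One possible data source is the Azure CLI, for example `az monitor metrics list --resource <vm-resource-id> --metrics "Available Memory Bytes" --interval PT1H --offset 1d --aggregation Average --query "value[0].timeseries[0].data[].average" -o tsv | memory_trend`; confirm the flags against your CLI version and drop empty data points first.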
2. Identify Memory-Consuming Processes
[!NOTE] VM Connection Methods
- Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, SSH, Serial Console)
- Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group
- Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)
- Network Access: Bastion provides browser-based access without requiring a public IP on the VM
Windows:
# Connect via RDP (Azure Portal → VM → Connect → RDP)
# OR via Azure Serial Console (VM → Serial Console)
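# Task Manager → Performance → Memory (overall usage)
# Task Manager → Processes → sort by the Memory column
# Or use PowerShell (top 10 processes by working set, in MB):
Get-Process | Sort-Object -Property WS -Descending | Select-Object -First 10 -Property Name, Id, @{Name='WorkingSetMB'; Expression={[math]::Round($_.WS / 1MB, 1)}}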
Linux:
# Connect via SSH (Azure Portal → VM → Connect → SSH)
# OR via Azure Serial Console (VM → Serial Console)
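free -h                         # overall usage; "available" is memory usable without swapping
top -b -n 1 -o %MEM | head -20  # one snapshot, sorted by memory (procps-ng top)
# Per-process detail, highest memory first:
ps aux --sort=-%mem | head -20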
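Services such as Java/Tomcat or Epic Cache can run as many processes, so a per-process list can hide the combined footprint. A minimal sketch that totals resident memory per command name, assuming input from `ps -eo rss=,comm=` (RSS in KiB followed by the command name); `mem_by_command` is a hypothetical helper. RSS counts shared pages in every process that maps them, so treat the totals as upper bounds.

```shell
#!/usr/bin/env bash
# mem_by_command (hypothetical helper)
# stdin:  "<rss-KiB> <command>" lines, as printed by: ps -eo rss=,comm=
# $1:     number of rows to print (default 10)
# stdout: "<total-MiB> <command>" lines, largest total first
# RSS counts shared pages in every process, so totals can overstate usage.
mem_by_command() {
  local top="${1:-10}"
  awk '
    {
      rss = $1 + 0
      # Drop the RSS field and keep the rest of the line as the command name.
      $1 = ""
      sub(/^ +/, "")
      total[$0] += rss
    }
    END {
      for (cmd in total) printf "%.1f %s\n", total[cmd] / 1024, cmd
    }' | LC_ALL=C sort -rn | head -n "$top"
}
```

Usage on the VM: `ps -eo rss=,comm= | mem_by_command 10`.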
3. Check for Memory Leaks
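- Review application logs for out-of-memory errors or repeated object allocation
- Check IIS worker process (w3wp.exe) memory and Tomcat/Java heap usage
- Monitor memory over time (steadily increasing usage that does not drop when load subsides = likely leak)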
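For Java services such as Tomcat, one common heap-leak signal is an old generation that stays nearly full even right after a full garbage collection. A minimal sketch, assuming input from `jstat -gcutil <pid> <interval-ms>`, where `O` is old-generation occupancy in percent and `FGC` is the cumulative full-GC count; columns are located by header name because the column set varies across JDK versions. `jvm_oldgen_check` and its default 90% threshold are hypothetical, and the sampled `O` value only approximates post-GC occupancy.

```shell
#!/usr/bin/env bash
# jvm_oldgen_check (hypothetical helper)
# stdin:  output of: jstat -gcutil <pid> <interval-ms>
# $1:     old-generation occupancy (%) treated as not reclaimed (default 90)
# stdout: OLD_GEN_NOT_RECLAIMED, OLD_GEN_RECLAIMED, or NO_FULL_GC_OBSERVED,
#         based on the O value in the first sample after the latest full GC
jvm_oldgen_check() {
  local threshold="${1:-90}"
  awk -v thr="$threshold" '
    # Header row: find the O (old gen %) and FGC (full GC count) columns.
    $1 == "S0" {
      for (i = 1; i <= NF; i++) {
        if ($i == "O") oc = i
        if ($i == "FGC") fc = i
      }
      next
    }
    # Data rows: a rising FGC count means a full GC ran since the last sample.
    oc && fc {
      fgc = $fc + 0
      if (seen && fgc > prev_fgc) { after_gc = $oc + 0; gcs++ }
      prev_fgc = fgc
      seen = 1
    }
    END {
      if (!gcs) { print "NO_FULL_GC_OBSERVED"; exit }
      if (after_gc >= thr + 0) print "OLD_GEN_NOT_RECLAIMED"
      else print "OLD_GEN_RECLAIMED"
    }'
}
```

For example, `jstat -gcutil <pid> 5000 12 | jvm_oldgen_check` samples every 5 seconds for one minute; a full GC may not occur in a short window, in which case the helper reports `NO_FULL_GC_OBSERVED`.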
4. Review Recent Changes
- Recent application deployments
- Configuration changes
- New background jobs or services
Remediation
[!WARNING] Infrastructure as Code Policy
All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.
Investigation Actions
- Identify memory-leaking process (use investigation steps above)
- Review application logs for repeated object allocation
- Monitor memory trend (increasing = likely leak)
- Document findings in the ServiceNow incident
Short-Term Resolution
Open a ServiceNow incident with Epic_Azure_Infrastructure_Ops:
- Tier 3 Support will review and implement changes via incident or change request:
  - VM scaling for more RAM (D4s_v5 (16 GiB) → D8s_v5 (32 GiB), D8s_v5 → D16s_v5 (64 GiB))
  - Application service restart (if safe)
  - Load reduction configuration
  - Cache clearing procedures
- All changes are implemented through Terraform/IaC
- VM restarts are coordinated with application teams
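When documenting a scaling request, the target size in the ladder above can be estimated from observed peak usage. A minimal sketch, assuming the published Dsv5 memory sizes (D4s_v5 16 GiB, D8s_v5 32 GiB, D16s_v5 64 GiB) and the goal of keeping a chosen percentage of memory available at peak, defaulting to the 15% warning threshold; `pick_vm_size` is a hypothetical helper. It ignores OS and hypervisor overhead, and the resize itself still goes through Terraform/IaC.

```shell
#!/usr/bin/env bash
# pick_vm_size (hypothetical helper)
# $1: observed peak memory in use, GiB
# $2: percent of memory to keep available (default 15, the warning threshold)
# stdout: smallest size in the runbook ladder that fits, or NONE_IN_LADDER
pick_vm_size() {
  local used_gib="$1" headroom_pct="${2:-15}"
  awk -v used="$used_gib" -v hr="$headroom_pct" 'BEGIN {
    if (hr + 0 >= 100) { print "INVALID_HEADROOM"; exit 1 }
    # Total memory needed so that the used amount leaves hr% available.
    need = used / (1 - hr / 100)
    # Ladder from this runbook with Dsv5 memory sizes in GiB.
    n = split("D4s_v5:16 D8s_v5:32 D16s_v5:64", ladder, " ")
    for (i = 1; i <= n; i++) {
      split(ladder[i], part, ":")
      if (part[2] + 0 >= need) { print part[1]; exit }
    }
    print "NONE_IN_LADDER"
  }'
}
```

For example, `pick_vm_size 14` prints `D8s_v5`: 14 GiB in use needs about 16.5 GiB total to keep 15% available, which is more than a D4s_v5 provides.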
Long-Term Resolution
Create a GitHub issue in Epic on Azure Ops Issues:
- The Engineering Team will implement permanent solutions:
  - Application memory leak fix (code profiling and remediation)
  - Memory usage optimization
  - Connection pooling implementation
  - Memory-based auto-scaling via Terraform
  - Swap space configuration (Linux) for emergency headroom
- All solutions are implemented through the CI/CD pipeline
- Changes are tracked via the GitHub issue → PR → deployment workflow
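After swap or other headroom changes roll out through the pipeline, the result can be verified read-only from `/proc/meminfo`. A minimal sketch; `meminfo_summary` is a hypothetical helper that reports MemAvailable as a percentage of MemTotal (roughly comparable to the alert thresholds) plus swap size and usage. On images using the Azure Linux Agent, resource-disk swap is commonly provisioned with the `ResourceDisk.EnableSwap` and `ResourceDisk.SwapSizeMB` settings in `/etc/waagent.conf`, while cloud-init images may differ; which mechanism applies here is an assumption to confirm with the Engineering Team.

```shell
#!/usr/bin/env bash
# meminfo_summary (hypothetical helper)
# stdin:  /proc/meminfo-format text ("Key:   value kB")
# stdout: available_pct=<n> swap_total_mib=<n> swap_used_mib=<n>
meminfo_summary() {
  awk '
    # Strip the trailing colon from the key and keep the kB value.
    { key = $1; sub(/:$/, "", key); kb[key] = $2 + 0 }
    END {
      if (kb["MemTotal"] == 0) { print "NO_MEMTOTAL"; exit 1 }
      printf "available_pct=%.1f swap_total_mib=%d swap_used_mib=%d\n",
        100 * kb["MemAvailable"] / kb["MemTotal"],
        kb["SwapTotal"] / 1024,
        (kb["SwapTotal"] - kb["SwapFree"]) / 1024
    }'
}
```

Usage on the VM: `meminfo_summary < /proc/meminfo`; a `swap_total_mib=0` result means no swap is configured.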
Escalation
- Epic - Azure (National West): Open a ServiceNow incident for:
  - Application memory leak investigation
  - Database process memory issues (database-related memory pressure)
- Epic_Azure_Infrastructure_Ops: Open a ServiceNow incident for VM scaling assistance or persistent memory issues
Related Alerts
- VM Availability (memory exhaustion causes crashes)
- High CPU (memory paging/thrashing can appear as high CPU usage)
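On Linux, swap-based paging behind a high-CPU alert can be confirmed from the `si`/`so` (swap-in/swap-out per second) columns of `vmstat`. A minimal sketch; `swap_activity` and its default threshold of 0 are hypothetical, columns are located by header name, and the first data row is skipped because it reports averages since boot. It does not cover paging of file-backed pages. On Windows, the `Memory\Pages/sec` performance counter serves a similar purpose.

```shell
#!/usr/bin/env bash
# swap_activity (hypothetical helper)
# stdin:  output of: vmstat <delay> <count>
# $1:     average si+so (swap-in + swap-out per second) above which the
#         VM is reported as paging (default 0)
# stdout: PAGING, NO_PAGING, or INSUFFICIENT_DATA
swap_activity() {
  local threshold="${1:-0}"
  awk -v thr="$threshold" '
    # Column header row: locate the si and so columns by name.
    /^ *r +b / {
      for (i = 1; i <= NF; i++) {
        if ($i == "si") si = i
        if ($i == "so") so = i
      }
      next
    }
    # Numeric data rows; the first one reports averages since boot.
    si && so && $1 ~ /^[0-9]+$/ {
      rows++
      if (rows == 1) next
      total += $si + $so
      n++
    }
    END {
      if (!n) { print "INSUFFICIENT_DATA"; exit }
      if (total / n > thr + 0) print "PAGING"
      else print "NO_PAGING"
    }'
}
```

Usage on the VM: `vmstat 5 6 | swap_activity`, which averages five 5-second intervals after skipping the since-boot row.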
Historical Context
Common causes in the OHEMR Epic environment:
- Care Everywhere VMs: Known memory pressure issues
- Citrix VDA VMs: Training environment VMs are undersized
- Epic Cache processes with improperly configured memory limits