Monitoring
Updated July 3, 2026

Runbook: Virtual Machine Low Available Memory

Tags: runbook, azure-monitor, alerts, vm-performance, memory, troubleshooting, incident-response, servicenow, infrastructure-as-code

Alert Details

Metric: Available Memory Bytes
Critical Threshold: available memory ≤2% of total for 30 minutes
Warning Threshold: available memory ≤15% of total for 1 hour
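
The metric is reported in bytes while the thresholds are percentages, so a quick conversion helps when reading a raw sample. The sketch below assumes the percentages are relative to total VM memory; `memory_alert_level` is a hypothetical helper name, and it classifies a single sample only (the alert itself also requires the condition to hold for the stated duration).

```shell
#!/usr/bin/env bash
# Classify one "Available Memory Bytes" sample against the runbook thresholds.
# Assumption: thresholds are percentages of total VM memory.
# memory_alert_level is a hypothetical helper, not an Azure tool.
memory_alert_level() {
  local available_bytes=$1 total_bytes=$2
  awk -v a="$available_bytes" -v t="$total_bytes" 'BEGIN {
    if (t <= 0) { print "unknown"; exit }
    pct = 100 * a / t
    if (pct <= 2)       level = "critical"
    else if (pct <= 15) level = "warning"
    else                level = "ok"
    printf "%s %.1f%%\n", level, pct
  }'
}
```

For example, on a 16 GiB VM, `memory_alert_level 2147483648 17179869184` (2 GiB available) prints `warning 12.5%`.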

Impact

The VM may become unresponsive or crash, and application errors are likely. On Linux, the OOM (Out of Memory) killer may terminate processes.
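
To confirm whether the OOM killer has already acted on a Linux VM, the kernel log can be searched for its "Killed process" messages. This is a minimal sketch; `oom_victims` is a hypothetical helper name, and the pattern assumes the standard kernel message wording.

```shell
#!/usr/bin/env bash
# Extract OOM-killer victims from kernel log text supplied on stdin.
# Matches the "Killed process <pid> (<name>)" message logged by the Linux kernel
# and prints "<pid> <name>" for each match.
# oom_victims is a hypothetical helper name.
oom_victims() {
  sed -nE 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

Typical use would be `journalctl -k --no-pager | oom_victims` or `dmesg -T | oom_victims`, depending on what is available on the VM. No output means no OOM kills were found in the log that was read.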

Investigation Steps

1. Check Memory Metrics

Azure Portal → VM → Metrics → "Available Memory Bytes"
Compare with the "Available Memory Percentage" metric
Review 24-hour trend to identify leak vs. capacity issue
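
One rough way to separate a leak from a capacity issue is to fit a straight line through the 24-hour samples: a sustained decline in available memory points toward a leak, while a flat but low level points toward undersizing. The sketch below assumes one sample per line, oldest first; `memory_trend` and the 10% cutoff are illustrative assumptions, not values from this runbook.

```shell
#!/usr/bin/env bash
# Reads one available-memory sample per line (oldest first) on stdin and
# reports whether available memory is steadily falling across the window.
# memory_trend and the 10% cutoff are illustrative assumptions.
memory_trend() {
  awk '
    NF { n++; x = n; y = $1; sx += x; sy += y; sxx += x * x; sxy += x * y }
    END {
      if (n < 2) { print "insufficient-data"; exit }
      slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # least-squares slope per sample
      mean  = sy / n
      drop  = -slope * (n - 1)                           # fitted decline across the window
      if (mean > 0 && drop > 0.10 * mean) print "declining"
      else                                print "steady"
    }'
}
```

Samples could be exported with the Azure CLI, for example `az monitor metrics list --resource "$VM_ID" --metric "Available Memory Bytes" --interval PT1H --offset 24h --aggregation Average --query "value[0].timeseries[0].data[].average" -o tsv | memory_trend`, assuming the CLI is installed and signed in and `$VM_ID` (a placeholder) holds the VM resource ID. Treat the result as a hint to guide investigation, not a diagnosis.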

2. Identify Memory-Consuming Processes

[!NOTE] VM Connection Methods

Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, SSH, Serial Console)

Access Requirements: Contributor or VM Contributor role on VM or resource group

Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)

Network Access: Bastion provides browser-based access without public IP requirements

Windows:

# Connect via RDP (Azure Portal → VM → Connect → RDP)
# OR via Azure Serial Console (VM → Serial Console)
# Task Manager → Performance → Memory
# Sort Processes by Memory column
# Or use PowerShell:
Get-Process | Sort-Object -Property WS -Descending | Select-Object -First 10

Linux:

# Connect via SSH (Azure Portal → VM → Connect → SSH)
# OR via Azure Serial Console (VM → Serial Console)
free -h
top -bn1 | head -20
# Or detailed view:
ps aux --sort=-%mem | head -20
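
The `ps` listing above shows individual processes, which can hide a service that spreads its memory across many workers. As a sketch, resident memory can be summed per command name; `top_memory_commands` is a hypothetical helper name, and note that summed RSS counts shared pages once per process, so totals can overstate real usage.

```shell
#!/usr/bin/env bash
# Sum resident memory (RSS, in KiB) per command name and list the largest.
# Expects "<rss_kib> <command>" lines on stdin, e.g. from: ps -eo rss=,comm=
# top_memory_commands is a hypothetical helper name.
top_memory_commands() {
  local limit=${1:-10}
  awk '{ kib = $1; $1 = ""; sub(/^[ \t]+/, ""); rss[$0] += kib }
       END { for (c in rss) printf "%d\t%s\n", rss[c], c }' |
    sort -rn | head -n "$limit" |
    awk -F'\t' '{ printf "%.1f MiB\t%s\n", $1 / 1024, $2 }'
}
```

Example: `ps -eo rss=,comm= | top_memory_commands 5` lists the five commands with the largest combined RSS.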

3. Check for Memory Leaks

Review application logs for repeated object allocation
Check IIS/Tomcat/Java heap usage
Monitor memory over time (increasing = likely leak)
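
On Linux, monitoring a suspect process over time can be as simple as sampling its resident set size at a fixed interval and attaching the output to the incident. This is a minimal sketch that reads `VmRSS` from `/proc`; `sample_rss` is a hypothetical helper name, and the default count and interval are arbitrary.

```shell
#!/usr/bin/env bash
# Print "<UTC timestamp> <RSS KiB>" for one process at a fixed interval.
# Reads VmRSS from /proc/<pid>/status, so it is Linux-only.
# sample_rss is a hypothetical helper; defaults are 12 samples, 300 s apart.
sample_rss() {
  local pid=$1 count=${2:-12} interval=${3:-300}
  local i rss
  for ((i = 0; i < count; i++)); do
    rss=$(awk '/^VmRSS:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
    [ -n "$rss" ] || { echo "no RSS available for pid $pid" >&2; return 1; }
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rss"
    if (( i < count - 1 )); then sleep "$interval"; fi
  done
}
```

Example: `sample_rss 4321 12 300` records one hour of samples for PID 4321 (a placeholder). RSS that keeps climbing under steady load is consistent with a leak; RSS that plateaus suggests the process has reached its working set.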

4. Review Recent Changes

Recent application deployments
Configuration changes
New background jobs or services

Remediation

[!WARNING] Infrastructure as Code Policy

All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

Identify memory-leaking process (use investigation steps above)
Review application logs for repeated object allocation
Monitor memory trend (increasing = likely leak)
Document findings in ServiceNow incident

Short-Term Resolution

Open a ServiceNow incident assigned to Epic_Azure_Infrastructure_Ops.

Tier 3 Support will review and implement changes via incident or change request:
- VM scaling for more RAM (D4s_v5 → D8s_v5 raises memory from 16 GiB to 32 GiB; D8s_v5 → D16s_v5 from 32 GiB to 64 GiB)
- Application service restart (if safe)
- Load reduction configuration
- Cache clearing procedures
All changes implemented through Terraform/IaC
VM restarts coordinated with application teams

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

Engineering Team will implement permanent solutions:
- Application memory leak fix (code profiling and remediation)
- Memory usage optimization
- Connection pooling implementation
- Memory-based auto-scaling via Terraform
- Swap space configuration (Linux) for emergency headroom
All solutions implemented through CI/CD pipeline
Changes tracked via GitHub issue → PR → deployment workflow

Escalation

Epic - Azure (National West): Open ServiceNow incident for memory leak investigation or database-related memory pressure
- Application memory leak investigation
- Database process memory issues
Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for VM scaling assistance or persistent memory issues

Related Alerts

VM Availability (memory exhaustion causes crashes)
High CPU (memory paging/thrashing can surface as high CPU usage)

Historical Context

Common causes in OHEMR Epic environment:

Care Everywhere VMs: Known memory pressure issues
Citrix VDA VMs: Training environment undersized
Epic Cache processes with improper limits