Updated July 3, 2026

Runbook: Virtual Machine Low Available Memory

Tags: runbook, azure-monitor, alerts, vm-performance, memory, troubleshooting, incident-response, servicenow, infrastructure-as-code

Alert Details

  • Metric: Available Memory Bytes
  • Critical Threshold: available memory ≤2% of total for 30 minutes
  • Warning Threshold: available memory ≤15% of total for 1 hour
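The metric reports bytes while the thresholds are percentages, so a responder usually has to convert before comparing. The following is a minimal Python sketch of that conversion and of a "sustained for N minutes" check. It assumes the percentages mean available memory as a share of total memory and that samples are evenly spaced; the function names and the 5-minute sample interval are illustrative and are not taken from the alert rule itself.

```python
from typing import Sequence

CRITICAL_PCT = 2.0    # runbook: critical at <=2% for 30 minutes
WARNING_PCT = 15.0    # runbook: warning at <=15% for 1 hour


def available_pct(available_bytes: int, total_bytes: int) -> float:
    """Available memory as a percentage of total memory."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * available_bytes / total_bytes


def classify(samples_pct: Sequence[float], interval_min: int = 5) -> str:
    """Classify available-memory percentages, ordered oldest first.

    Assumption: samples are interval_min minutes apart, and a threshold
    counts as breached only if every sample in the trailing window is at
    or below it. The real alert rule may aggregate differently.
    """
    def sustained(threshold: float, window_min: int) -> bool:
        needed = max(1, window_min // interval_min)
        window = samples_pct[-needed:]
        return len(window) >= needed and all(p <= threshold for p in window)

    if sustained(CRITICAL_PCT, 30):
        return "critical"
    if sustained(WARNING_PCT, 60):
        return "warning"
    return "ok"
```

For example, a 16 GiB VM with 1 GiB available sits at 6.25%, which is inside the warning band but above the critical one.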

Impact

The VM may become unresponsive or crash, and application errors are likely. On Linux, the OOM (out-of-memory) killer may terminate processes.

Investigation Steps

1. Check Memory Metrics

  • Azure Portal → VM → Metrics → "Available Memory Bytes"
  • Compare with the "Available Memory Percentage" metric to relate the byte values to the percentage thresholds above
  • Review 24-hour trend to identify leak vs. capacity issue
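One rough way to separate a leak from a capacity problem is to fit a straight line to the 24-hour series: a steady downward slope in available memory points toward a leak, while a flat but low series points toward an undersized VM. The sketch below assumes the metric has already been exported as paired lists of hours and available-memory percentages; the function names and the -0.5% per hour cutoff are illustrative assumptions and should be tuned per workload.

```python
from typing import Sequence


def slope_per_hour(hours: Sequence[float], available_pct: Sequence[float]) -> float:
    """Least-squares slope of available memory, in percent per hour."""
    n = len(hours)
    if n < 2 or n != len(available_pct):
        raise ValueError("need two or more paired samples")
    mean_x = sum(hours) / n
    mean_y = sum(available_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in hours)
    if sxx == 0:
        raise ValueError("timestamps must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, available_pct))
    return sxy / sxx


def likely_cause(hours: Sequence[float], available_pct: Sequence[float],
                 leak_slope: float = -0.5) -> str:
    """Rough triage only; leak_slope (% per hour) is an illustrative cutoff."""
    if slope_per_hour(hours, available_pct) <= leak_slope:
        return "possible leak"
    return "possible capacity issue"
```

A sawtooth pattern (memory recovering after each service restart, then declining again) also suggests a leak but will flatten the fitted slope, so the chart should still be reviewed by eye.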

2. Identify Memory-Consuming Processes

[!NOTE] VM Connection Methods

  • Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, SSH, Serial Console)
  • Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group
  • Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)
  • Network Access: Bastion provides browser-based access without public IP requirements

Windows:

# Connect via RDP (Azure Portal → VM → Connect → RDP)
# OR via Azure Serial Console (VM → Serial Console)
# Task Manager → Performance → Memory
# Sort Processes by Memory column
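# Or use PowerShell (top 10 processes by working set, shown in MB):
Get-Process | Sort-Object -Property WorkingSet64 -Descending | Select-Object -First 10 -Property Name, Id, @{Name='WS(MB)'; Expression={[math]::Round($_.WorkingSet64 / 1MB, 1)}}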

Linux:

# Connect via SSH (Azure Portal → VM → Connect → SSH)
# OR via Azure Serial Console (VM → Serial Console)
free -h
# Batch-mode snapshot sorted by memory (top sorts by CPU by default)
top -bn1 -o %MEM | head -20
# Or detailed view:
ps aux --sort=-%mem | head -20
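To compare what the guest sees with the alert thresholds, the "available" figure that `free` reports can be read directly from `/proc/meminfo` and turned into a percentage. This is a minimal Python sketch for Linux guests only; the function names are illustrative, and it assumes a kernel that exposes the `MemAvailable` field.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into {field: integer value}.

    Values are in kB, except for the HugePages_* entries, which are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def mem_available_pct(text: str) -> float:
    """MemAvailable as a percentage of MemTotal."""
    info = parse_meminfo(text)
    return 100.0 * info["MemAvailable"] / info["MemTotal"]
```

On the VM, passing the contents of `/proc/meminfo` to `mem_available_pct` yields a number that can be compared directly with the 2% and 15% thresholds.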

3. Check for Memory Leaks

  • Review application logs for repeated object allocation
  • Check IIS/Tomcat/Java heap usage
  • Monitor memory over time (usage that climbs steadily and never returns to baseline = likely leak)
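Growth is easier to attribute when two process snapshots taken some minutes apart are compared side by side. The sketch below assumes each snapshot is the text output of `ps -eo pid,rss,comm` on a Linux guest (RSS in KiB); the function names and the `pid:command` key format are illustrative choices, not an established tool.

```python
from typing import Dict, List, Tuple


def parse_ps(text: str) -> Dict[str, int]:
    """Parse `ps -eo pid,rss,comm` output into {"pid:command": rss_kib}."""
    snapshot = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        # The header row is skipped because its first two columns are not numeric.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            snapshot[f"{parts[0]}:{parts[2].strip()}"] = int(parts[1])
    return snapshot


def rss_growth(before: Dict[str, int], after: Dict[str, int],
               top: int = 5) -> List[Tuple[str, int]]:
    """Processes present in both snapshots, ranked by RSS growth in KiB."""
    deltas = [(key, after[key] - before[key]) for key in before.keys() & after.keys()]
    growing = [d for d in deltas if d[1] > 0]
    # Largest growth first; ties broken by key so the order is deterministic.
    return sorted(growing, key=lambda d: (-d[1], d[0]))[:top]
```

A process that tops this list across several consecutive comparisons is a reasonable first candidate for the leak investigation; a single large jump may only reflect normal cache warm-up.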

4. Review Recent Changes

  • Recent application deployments
  • Configuration changes
  • New background jobs or services

Remediation

[!WARNING] Infrastructure as Code Policy

All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes.

Investigation Actions

  1. Identify memory-leaking process (use investigation steps above)
  2. Review application logs for repeated object allocation
  3. Monitor memory trend (increasing = likely leak)
  4. Document findings in ServiceNow incident

Short-Term Resolution

Open a ServiceNow incident assigned to Epic_Azure_Infrastructure_Ops:

  • Tier 3 Support will review and implement changes via incident or change request:
    • VM resize for more RAM (D4s_v5 → D8s_v5 raises memory from 16 GiB to 32 GiB; D8s_v5 → D16s_v5 from 32 GiB to 64 GiB)
    • Application service restart (if safe)
    • Load reduction configuration
    • Cache clearing procedures
  • All changes implemented through Terraform/IaC
  • VM restarts coordinated with application teams
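When documenting a resize request in the incident, it can help to state which size would bring available memory back above the warning threshold. The sketch below is an illustration only: it covers just the three sizes named in this runbook, uses the published Dsv5 memory figures (4 GiB per vCPU), and assumes the workload's memory use stays the same after the resize. The function name is hypothetical, and the actual change still goes through Terraform as described above.

```python
from typing import Optional

# Sizes named in this runbook with their memory in GiB.
SIZES_GIB = {"D4s_v5": 16, "D8s_v5": 32, "D16s_v5": 64}


def smallest_size_with_headroom(used_gib: float,
                                min_available_pct: float = 15.0) -> Optional[str]:
    """Smallest listed size that leaves more than min_available_pct available.

    Returns None when even the largest listed size is insufficient.
    """
    for name, total in sorted(SIZES_GIB.items(), key=lambda kv: kv[1]):
        if 100.0 * (total - used_gib) / total > min_available_pct:
            return name
    return None
```

If the function returns `None`, a resize within this ladder would not clear the warning threshold, which suggests addressing the memory consumer itself rather than scaling the VM.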

Long-Term Resolution

Create GitHub Issue: Epic on Azure Ops Issues

  • Engineering Team will implement permanent solutions:
    • Application memory leak fix (code profiling and remediation)
    • Memory usage optimization
    • Connection pooling implementation
    • Memory-based auto-scaling via Terraform
    • Swap space configuration (Linux) for emergency headroom
  • All solutions implemented through CI/CD pipeline
  • Changes tracked via GitHub issue → PR → deployment workflow

Escalation

  • Epic - Azure (National West): Open ServiceNow incident for memory leak investigation or database-related memory pressure
    • Application memory leak investigation
    • Database process memory issues
  • Epic_Azure_Infrastructure_Ops: Open ServiceNow incident for VM scaling assistance or persistent memory issues

Related Alerts

  • VM Availability (memory exhaustion causes crashes)
  • High CPU (memory paging/thrashing can surface as elevated CPU usage)
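When a High CPU alert fires alongside this one, checking swap activity helps confirm whether paging is the underlying cause. On Linux, `/proc/vmstat` exposes cumulative `pswpin` and `pswpout` page counters; the Python sketch below computes their rate between two reads. The function names are illustrative, and what counts as a "high" rate is workload-dependent, so no threshold is assumed here.

```python
def parse_vmstat(text: str) -> dict:
    """Parse /proc/vmstat content into {counter_name: value}."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_pages_per_sec(before: str, after: str, seconds: float) -> dict:
    """Pages swapped in and out per second between two /proc/vmstat reads."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    first, second = parse_vmstat(before), parse_vmstat(after)
    return {key: (second[key] - first[key]) / seconds
            for key in ("pswpin", "pswpout")}
```

Sustained non-zero rates in both directions while CPU is elevated are consistent with thrashing; rates near zero suggest the CPU alert has a separate cause.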

Historical Context

Common causes in the OHEMR Epic environment:

  • Care Everywhere VMs: Known memory pressure issues
  • Citrix VDA VMs: Training environment undersized
  • Epic Cache processes with improper limits