Monitoring
Updated July 3, 2026
Runbook: Windows Virtual Machine Low Disk Space
Tags: runbook, azure-monitor, alerts, vm-performance, disk-space, storage, troubleshooting, incident-response, servicenow, infrastructure-as-code
Alert Details
- Metric: LogicalDisk % Free Space
- Metric Namespace: Azure.VM.Windows.GuestMetrics
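- Critical Threshold: average free space below 10% over a 15-minute window (evaluated every 5 minutes)
- Warning Threshold: average free space below 15% over a 15-minute window (evaluated every 5 minutes)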
- Resource Type: Windows Virtual Machines (requires Azure Monitor Agent)
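To make the threshold logic concrete, here is a minimal Python sketch of how one 15-minute window of samples maps to a severity. It assumes the `Average` aggregation and `LessThan` operator shown in the Terraform configuration in this runbook. Azure Monitor performs the real evaluation; the function name `classify_window` and the sample values are hypothetical.

```python
from statistics import mean

CRITICAL_PCT = 10.0  # alert definition: less than 10% free
WARNING_PCT = 15.0   # alert definition: less than 15% free


def classify_window(free_pct_samples):
    """Classify one 15-minute window of '% Free Space' samples.

    Mirrors an Average aggregation with a LessThan operator:
    the window average must be strictly below the threshold.
    """
    if not free_pct_samples:
        raise ValueError("window has no samples")
    avg = mean(free_pct_samples)
    if avg < CRITICAL_PCT:
        return "critical"
    if avg < WARNING_PCT:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Illustrative samples only, not real telemetry.
    print(classify_window([9.0, 9.5, 11.0]))
    print(classify_window([12.0, 14.0, 16.0]))
```

Because the comparison is on the window average, a single sample dipping below a threshold does not fire the alert by itself.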
Impact
Low disk space can cause:
- Application failures and crashes
- Database corruption
- Log file write failures
- Service interruptions
- System instability
- Inability to process new data or transactions
Investigation Steps
1. Check Disk Space Metrics in Azure Portal
- Navigate to Azure Portal → Virtual Machines → [VM Name] → Metrics
- Select "LogicalDisk % Free Space" metric
- Adjust time range to last 24 hours
- Identify which logical disk (C:, D:, E:, etc.) triggered the alert
- Look for patterns: gradual growth vs. sudden spike
2. Identify Disk Usage and Large Files
[!NOTE] VM Connection Methods
- Azure Portal: VM → Connect → Choose connection method (Bastion, RDP, Serial Console)
- Access Requirements: Contributor or Virtual Machine Contributor role on the VM or its resource group
- Serial Console: Requires boot diagnostics enabled (Azure Portal → VM → Boot diagnostics)
- Network Access: Bastion provides browser-based access without public IP requirements
Windows:
# Connect via RDP (Azure Portal → VM → Connect → RDP)
# OR via Azure Serial Console (VM → Serial Console)
# Check disk space on all drives
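# Drives that report no size (for example, an empty optical drive) are skipped to avoid dividing by zero
Get-PSDrive -PSProvider FileSystem |
    Where-Object { ($_.Used + $_.Free) -gt 0 } |
    Select-Object Name, Used, Free,
        @{Name="UsedPercent";Expression={[math]::Round(($_.Used / ($_.Used + $_.Free)) * 100, 2)}},
        @{Name="FreePercent";Expression={[math]::Round(($_.Free / ($_.Used + $_.Free)) * 100, 2)}}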
# Find largest folders on C: drive (top 20)
# -Force includes hidden folders such as C:\ProgramData; -File keeps directories out of the sum
Get-ChildItem C:\ -Directory -Force -ErrorAction SilentlyContinue |
    ForEach-Object {
        $size = (Get-ChildItem $_.FullName -File -Recurse -Force -ErrorAction SilentlyContinue |
            Measure-Object -Property Length -Sum -ErrorAction SilentlyContinue).Sum
        [PSCustomObject]@{
            Path = $_.FullName
            SizeGB = [math]::Round($size / 1GB, 2)
        }
    } | Sort-Object SizeGB -Descending | Select-Object -First 20
# Find largest files (top 20)
Get-ChildItem C:\ -File -Recurse -ErrorAction SilentlyContinue |
Sort-Object Length -Descending |
Select-Object -First 20 FullName, @{Name="SizeMB";Expression={[math]::Round($_.Length / 1MB, 2)}}
# Check log file sizes
Get-ChildItem "C:\Windows\Logs" -Recurse -ErrorAction SilentlyContinue |
Measure-Object -Property Length -Sum |
Select-Object @{Name="TotalSizeGB";Expression={[math]::Round($_.Sum / 1GB, 2)}}
# Check the current user's temp folder size ($env:TEMP is per-user; check C:\Windows\Temp separately)
Get-ChildItem $env:TEMP -Recurse -ErrorAction SilentlyContinue |
Measure-Object -Property Length -Sum |
Select-Object @{Name="TotalSizeGB";Expression={[math]::Round($_.Sum / 1GB, 2)}}
3. Check Application Logs
- Review application logs for disk-related errors or warnings
- Check Epic application logs (if applicable):
- Epic Cache logs
- Interconnect logs
- Print Spool directories
- Look for failed log rotation or archiving processes
- Check database logs for growth patterns
4. Review Disk Growth Trends
- Azure Portal → VM → Metrics → "LogicalDisk % Free Space"
- Analyze historical data to understand growth rate
- Determine if this is gradual growth or sudden consumption
- Correlate with application deployments or batch job schedules
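One way to make the "gradual growth or sudden consumption" call less subjective is to look at how much of the total drop happened in a single interval. The Python sketch below is a heuristic under stated assumptions: samples are evenly spaced and exported from the metrics chart by hand, and the function name `describe_trend` and the 50% cutoff (`sudden_share`) are hypothetical choices, not part of any Azure API.

```python
def describe_trend(free_pct, sudden_share=0.5):
    """Label a series of '% Free Space' samples as flat, gradual, or sudden.

    free_pct: chronologically ordered samples at equal intervals.
    sudden_share: fraction of the total drop that one interval must
                  account for to count as 'sudden' (assumed cutoff).
    """
    if len(free_pct) < 2:
        raise ValueError("need at least two samples")
    # Per-interval loss of free space; increases in free space count as 0.
    drops = [max(a - b, 0.0) for a, b in zip(free_pct, free_pct[1:])]
    total = sum(drops)
    if total == 0:
        return "flat"
    return "sudden" if max(drops) / total >= sudden_share else "gradual"


if __name__ == "__main__":
    # Illustrative series only.
    print(describe_trend([30, 29, 28, 27, 26]))
    print(describe_trend([30, 30, 18, 18, 17]))
```

A "sudden" result is a prompt to check the timestamp of the largest drop against deployments and batch job schedules; a "gradual" result points toward log rotation or retention.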
5. Check for Common Space Consumers
Common Windows locations to check:
- C:\Windows\Logs - Windows system logs
- C:\Windows\Temp - Windows temp files
- C:\Users\*\AppData\Local\Temp - User temp files
- C:\inetpub\logs - IIS logs
- SQL Server log files
- Application-specific log directories
- Database backup files
- Windows Update cache
Remediation
[!WARNING] Infrastructure as Code Policy
All infrastructure changes must be implemented through proper incident/change management. Do not make manual changes to infrastructure.
Investigation Actions
- Identify the logical disk with low space (use investigation steps above)
- Determine the top space-consuming folders and files
- Review disk space growth trends to estimate when disk will be full
- Document findings in ServiceNow incident with:
- Affected disk (C:, D:, etc.)
- Current free space percentage
- Top 5 space-consuming folders/files
- Growth rate (GB per day/week)
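The growth rate and time-to-full figures for the incident can be estimated with simple arithmetic. The Python sketch below assumes linear growth between the oldest and newest observation, which is only a rough approximation for bursty workloads. The function name `growth_forecast` and the sample numbers are hypothetical.

```python
from datetime import datetime


def growth_forecast(samples, capacity_gb):
    """Estimate growth rate (GB/day) and days until the disk is full.

    samples: list of (datetime, used_gb) tuples, oldest first.
    capacity_gb: total size of the logical disk in GB.
    Returns (rate_gb_per_day, days_to_full); days_to_full is None
    when usage is flat or shrinking.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        raise ValueError("samples must span a positive time range")
    rate = (used1 - used0) / days
    days_to_full = (capacity_gb - used1) / rate if rate > 0 else None
    return rate, days_to_full


if __name__ == "__main__":
    # Illustrative numbers only: 100 GB used growing to 114 GB in a week on a 128 GB disk.
    observed = [(datetime(2026, 1, 1), 100.0), (datetime(2026, 1, 8), 114.0)]
    rate, eta = growth_forecast(observed, capacity_gb=128.0)
    print(f"{rate:.1f} GB/day, about {eta:.0f} days until full")
```

Record both the rate and the observation window in the incident so reviewers can judge how far to trust the projection.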
Short-Term Resolution
Open ServiceNow Incident with Epic_Azure_Infrastructure_Ops:
- Tier 3 Support will review and implement changes via incident or change request:
- Disk expansion (must be done through Terraform/IaC)
- Safe cleanup of temporary files and logs
- Log rotation configuration
- Move old data to Azure Blob Storage (archive tier)
- Database log file shrinking (if safe and appropriate)
- All changes implemented through Terraform/IaC
- No manual Azure Portal disk resizing
Safe Temporary Cleanup (coordinate with application teams):
- Clear Windows temp files (use Disk Cleanup utility)
- Archive old application logs
- Clear IIS logs older than retention period
- Remove old Windows Update files
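Before any cleanup is agreed with the application team, it helps to know how much space an age-based retention rule would actually free. The following Python sketch is a dry run that only lists candidates and deletes nothing. The function name `files_older_than`, the example path, and the 30-day retention value are hypothetical; use the retention period defined for the application.

```python
import time
from pathlib import Path


def files_older_than(root, retention_days, now=None):
    """Return (path, size_bytes) for files last modified before the cutoff.

    Read-only: nothing is deleted or moved. Results are sorted
    largest first so the biggest candidates appear at the top.
    """
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400
    old = []
    for path in Path(root).rglob("*"):
        try:
            if path.is_file():
                info = path.stat()
                if info.st_mtime < cutoff:
                    old.append((path, info.st_size))
        except OSError:
            # Skip files that vanish or cannot be read during the scan.
            continue
    return sorted(old, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Example path and retention are assumptions for illustration.
    candidates = files_older_than(r"C:\inetpub\logs", retention_days=30)
    total_mb = sum(size for _, size in candidates) / (1024 * 1024)
    print(f"{len(candidates)} files, {total_mb:.1f} MB reclaimable")
```

Attach the candidate list and total to the ServiceNow incident so the cleanup itself can be approved and carried out through the change process.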
Long-Term Resolution
Create GitHub Issue: Epic on Azure Ops Issues
- Engineering Team will implement permanent solutions:
- Proper disk sizing based on workload analysis
- Automated log rotation and archiving
- Log forwarding to centralized logging (Splunk)
- Database maintenance plans for log management
- Storage tiering strategy (hot data on VM, cold data on Blob Storage)
- Monitoring and alerting for log file growth
- Disk auto-expansion policies via Terraform
- All solutions implemented through CI/CD pipeline
- Changes tracked via GitHub issue → PR → deployment workflow
Terraform Configuration Example
module "metric_alerts_disk_space" {
  source = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
  version = "1.7.5"

  short_name = "DISK"
  explanation = "Low disk space detected. Investigation: Check disk usage trends, identify large files/folders, review application logs for disk space issues. Remediation: Clean up old logs, expand disk size, move data to alternate storage."

  target_resource_location = local.resource_location
  target_resource_type = "Microsoft.Compute/virtualMachines"
  target_name = local.target_name
  metric_name = "LogicalDisk % Free Space"
  metric_namespace = "Azure.VM.Windows.GuestMetrics"
  resource_group_name = azurerm_resource_group.ohemr-rg.name
  scopes = concat(local.alert_scopes_app_rg_ids, local.alert_scopes_odb_rg_ids)

  email_recipients = {
    prod_action_group = {
      name = "prod_action_group"
      email_address = local.recipients
      use_common_alert_schema = true
    }
  }

  event_hub = {
    event_hub_npd = {
      name = "As per region"
      event_hub_namespace = "As per region"
      event_hub_name = "diagnostic-logs"
      use_common_alert_schema = false
    }
  }

  alerts = {
    critical = {
      threshold = 10 # Less than 10% free space
      severity = 0
      aggregation = "Average"
      operator = "LessThan"
      severity_name = "critical"
      metric_details = "Disk Free Space Less than 10% ${local.environment} ${local.resource_location}"
      frequency = "PT5M"
      window_size = "PT15M"
    }
    warning = {
      threshold = 15 # Less than 15% free space
      severity = 2
      aggregation = "Average"
      operator = "LessThan"
      severity_name = "warning"
      metric_details = "Disk Free Space Less than 15% ${local.environment} ${local.resource_location}"
      frequency = "PT5M"
      window_size = "PT15M"
    }
  }
}
Escalation
- Epic_Azure_Infrastructure_Ops: Open a ServiceNow incident if free disk space is below 5% or the disk requires immediate expansion
- Epic - Azure (National West): Open a ServiceNow incident if the disk space issue is caused by application behavior or database growth, for example:
- Application log file explosion
- Database transaction log growth
- Application data retention issues
Related Alerts
- High CPU Usage (disk I/O operations can cause CPU spikes)
- Application Performance Issues (disk space can cause application errors)
- Database Performance (disk space affects database operations)
Historical Context
Common causes in OHEMR Epic environment:
- IIS log files not rotated properly
- Epic Print Spool directory growth
- SQL Server transaction log growth
- Windows Update cache accumulation
- Epic Cache local storage growth
- Temp file accumulation from batch jobs