Updated July 3, 2026

Postmortem: West Training Servers Unexpected Shutdown - March 10, 2025

Critical Issue - Root Cause Analysis - West Training Servers shut down unexpectedly on 3/10

1. Summary

16 West Training VMs were shut down unexpectedly by two failed updates that set the VM Disk Controller type to "NVMe", which is not supported on Gen1 VM images

Two Terraform updates, run by two different SPNs, affected 8 VMs each:
Update 1 caused 8 VMs to stop, deallocate, and remain off at 4:29a
Update 2 caused 8 VMs to stop, deallocate, and remain off at 6:50a
All VMs were in the Epic Non-Prod subscription
NOTE: Citrix also had a provisioning update 2 hrs later that failed because the max CPU allocation was reached, but it is not related to the above

1.1. Initial Findings

A planned change to one VM, with a successful TF plan (test), triggered a state file update on multiple VMs deployed as a set
The TF plan did not indicate that, if the changes to the related deployments failed, the running status of the VMs would be affected
Monitoring alerts did fire, but emails were NOT sent due to a known issue with the current email distribution list (DL)
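
One way to surface this kind of change before apply is to scan the machine-readable plan (the output of `terraform show -json` on a saved plan file) for in-place updates that touch the disk controller setting. The following is a minimal sketch, not the team's tooling. The attribute name `disk_controller_type` is an assumption about how the VM resource exposes the setting, and `risky_changes` is a hypothetical helper.

```python
# Attribute names to flag. "disk_controller_type" is an assumed attribute
# name; adjust it to match the provider and module actually in use.
RISKY_ATTRIBUTES = ("disk_controller_type",)


def risky_changes(plan, attributes=RISKY_ATTRIBUTES):
    """Return (address, attribute, before, after) for each flagged in-place update.

    `plan` is the parsed JSON document produced by `terraform show -json`.
    """
    findings = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        # Only in-place updates are of interest; creates and no-ops are skipped.
        if "update" not in change.get("actions", []):
            continue
        before = change.get("before") or {}
        after = change.get("after") or {}
        for attr in attributes:
            if before.get(attr) != after.get(attr):
                findings.append((rc["address"], attr, before.get(attr), after.get(attr)))
    return findings
```

A pipeline step could load the plan with `json.load` and stop for manual review whenever `risky_changes` returns a non-empty list. This flags the change but cannot predict whether Azure will reject it.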

1.2. Current State

The VMs had been off for approximately 2 hours when Epic admins logged in
The VMs were turned back on at ~11a EST

2. Alert Monitoring

All 16 VM shutdowns were detected by Alert Monitoring
The 1st set of alerts (8) was raised within 2 min, at 4:31a
The 2nd set of alerts (8) was raised within 2 min, at 6:51a
After reboot, the alert status was changed to Resolved (8)
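
The detection latency can be checked directly from the timestamps in this report. The sketch below uses only the times quoted above, which have minute granularity, so the results are approximate. The function name is hypothetical.

```python
from datetime import datetime

# (update, time the VMs stopped, time the alerts were raised) on 2025-03-10,
# taken from this postmortem; local time, minute precision only.
EVENTS = [
    ("Update 1", datetime(2025, 3, 10, 4, 29), datetime(2025, 3, 10, 4, 31)),
    ("Update 2", datetime(2025, 3, 10, 6, 50), datetime(2025, 3, 10, 6, 51)),
]


def detection_latency_minutes(events):
    """Map each update to the minutes between VM stop and alert."""
    return {
        name: (alerted - stopped).total_seconds() / 60
        for name, stopped, alerted in events
    }
```

Both latencies fall within the 2-minute window, so detection itself worked and the gap was in notification delivery.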

2.1. Initial Findings: Email Notification – failed

Monitoring alerts did fire, but emails were NOT sent due to a known issue with the current email distribution list (DL)
Still to be determined: why 8 alerts did NOT auto-resolve

2.2. Resolution

A new email notification group, ohemrcloudalerts, has been created to receive service notifications from Azure
Enabling Resource Health alerts to detect platform-related issues is planned for this PI

3. Terraform Apply Failed

The deployment to change the VM storage type from standard to NVMe failed during apply with the error:

Disk Controller Type property 'NVMe' is not supported by the OS image or disk specified for the VM. Disk Controller types supported by the OS are 'SCSI'.

3.1. Findings

Changes to VM disk controllers and storage type should be tested before deployment to an upper environment. This type of change caused the VM deployment to fail, and the Azure API left the VMs in a deallocated state
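
A pre-deployment compatibility check along these lines could catch the mismatch before any API call is made. This is a minimal sketch under stated assumptions: the inventory shape (`name`, `generation`) and the helper name are hypothetical. The `V1` entry follows the error message above. The `V2` entry is an assumption, since NVMe support also depends on the VM size and image.

```python
# Disk controller types assumed to be supported per Hyper-V generation.
# V1 -> SCSI only matches the error above; the V2 entry is an assumption.
SUPPORTED_CONTROLLERS = {
    "V1": {"SCSI"},
    "V2": {"SCSI", "NVMe"},
}


def incompatible_vms(vms, requested_controller):
    """Return names of VMs whose generation does not support the controller."""
    blocked = []
    for vm in vms:
        # Unknown generations are treated as unsupported (fail closed).
        supported = SUPPORTED_CONTROLLERS.get(vm["generation"], set())
        if requested_controller not in supported:
            blocked.append(vm["name"])
    return blocked
```

For example, with an inventory of one Gen1 and one Gen2 VM, requesting `"NVMe"` returns only the Gen1 VM, which can then be excluded from the change.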

3.2. Resolution

Update the code to ignore changes to Gen1 VMs, test it, and provide recommendations to reduce the risk of unintended resource state changes
Validate the state of the servers in the workspace after deployment to confirm their health and running state
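
The post-deployment validation could be sketched as a check on the JSON returned by `az vm list -d`, which includes a `powerState` field per VM. The helper name and the list of expected VM names are assumptions for illustration. The sketch only inspects power state, not guest health.

```python
import json


def vms_not_running(az_output, expected_names):
    """Return {vm_name: state} for expected VMs that are not running.

    `az_output` is the JSON text printed by `az vm list -d`.
    """
    states = {vm["name"]: vm.get("powerState") for vm in json.loads(az_output)}
    problems = {}
    for name in expected_names:
        # A VM absent from the listing is reported as "missing".
        state = states.get(name, "missing") or "unknown"
        if state != "VM running":
            problems[name] = state
    return problems
```

An empty result means every expected VM reported "VM running". Anything else, such as "VM deallocated", should fail the pipeline step or page the on-call.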

4. Further Recommendations

Changes to Terraform core services should be made in CloudTest before moving to higher environments
Architecture changes to existing patterns (VM configs, etc.) should be tested, validated, and then reviewed in the ARB
To prevent unintended deletes or adverse changes, Resource Locks for critical shared resources should be deployed as planned in the LLD:

Target	Level	Lock
Any core/shared networking	Resource Group	Delete
Virtual Networks	Resource Group	Delete
VNet Peerings	Resource Group	Delete
Routing Tables	Resource Group	Read
Network Security Groups	Resource Group	Read
Application Security Groups	Resource Group	Read
Virtual Appliances (NGFW, WAF, SD-WAN)	Resource Group	Delete
Domain controllers	Resource Group	Delete
Public IPs	Resource Group	Delete
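
The lock plan above can be turned into reviewable `az lock create` commands. The sketch below assumes that "Delete" in the table corresponds to the Azure lock type `CanNotDelete` and "Read" to `ReadOnly`. The resource group names and the lock naming scheme are hypothetical placeholders, and the commands are only generated as text, never executed.

```python
import shlex

# Assumed mapping from the table's "Lock" column to Azure lock types.
LOCK_TYPES = {"Delete": "CanNotDelete", "Read": "ReadOnly"}


def lock_commands(plan_rows, resource_groups):
    """Build one `az lock create` command per (target, lock) row.

    `resource_groups` maps each target to its resource group name, which
    the LLD table does not specify and must be supplied by the caller.
    """
    commands = []
    for target, lock in plan_rows:
        lock_type = LOCK_TYPES[lock]
        rg = resource_groups[target]
        args = [
            "az", "lock", "create",
            "--name", f"lock-{lock_type.lower()}-{rg}",
            "--lock-type", lock_type,
            "--resource-group", rg,
        ]
        commands.append(shlex.join(args))
    return commands
```

Generating the commands as text keeps the step reviewable. Note that a resource-group-level `ReadOnly` lock also blocks routine updates to everything in that group, so the "Read" rows deserve a second look before rollout.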