Updated July 3, 2026
Postmortem: West Training Servers Unexpected Shutdown - March 10, 2025
Tags: postmortem, vm-shutdown, terraform, azure, disk-controller, monitoring, alerts, training-environment, incident-response
Critical Issue - Root Cause Analysis - West Training Servers shut down unexpectedly on 3/10
1. Summary
16 West Training VMs were shut down unexpectedly by two failed updates that set the VM Disk Controller type to "NVMe", which is not supported on Gen1 VM images.
- Two Terraform updates, run by two different SPNs, each affected 8 VMs:
- Update 1 caused 8 VMs to stop, deallocate, and remain off at 4:29a
- Update 2 caused 8 VMs to stop, deallocate, and remain off at 6:50a
- All VMs were in Epic Non-Prod subscription
- NOTE: Citrix also had a provisioning update 2 hrs later that failed because the max CPU allocation was reached; it is not related to the above
1.1. Initial Findings
- A planned change to a VM, with a successful TF plan (test), triggered a state file update on multiple VMs deployed as a set
- The TF plan did not indicate that changes to related deployments would affect the running status of the VMs upon failure
- Monitoring alerts did fire, but emails were NOT sent due to a known issue with the current Email Distribution List (DL)
1.2. Current State
- VMs had been off for approximately 2 hours when Epic admins logged in
- VMs were turned back on at ~11a EST
2. Alert Monitoring
- All 16 VM shutdowns were detected by Alert Monitoring
- 1st set of alerts (8) was raised within 2 min, @ 4:31a
- 2nd set of alerts (8) was raised within 2 min, @ 6:51a
- After reboot, alert status was changed to Resolved (8)
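The detection delay for each set can be checked directly from the timestamps in this report. The following is a minimal sketch; the times are the ones quoted above, rounded to the minute, so the results are approximate. The function name is illustrative only.

```python
from datetime import datetime

# (update, time VMs stopped, time alerts were raised), as quoted in this postmortem.
EVENTS = [
    ("Update 1", "04:29", "04:31"),
    ("Update 2", "06:50", "06:51"),
]


def detection_latency_minutes(stopped: str, alerted: str) -> int:
    """Return whole minutes between a shutdown time and its alert time (same day)."""
    fmt = "%H:%M"
    delta = datetime.strptime(alerted, fmt) - datetime.strptime(stopped, fmt)
    return int(delta.total_seconds() // 60)


for name, stopped, alerted in EVENTS:
    latency = detection_latency_minutes(stopped, alerted)
    print(f"{name}: alerts raised {latency} min after shutdown")
```

Both sets fall inside the 2 minute window, which supports the finding that detection worked and only notification delivery failed.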
2.1. Initial Findings: Email Notification – failed
- Monitoring alerts did fire, but emails were NOT sent due to a known issue with the current Email Distribution List (DL)
- Need to determine why (8) alerts did NOT auto-resolve
2.2. Resolution
- A new Email Notification Group has been created to receive service notifications from Azure
- ohemrcloudalerts
- Enabling Resource Health alerts to detect platform related issues is planned in this PI
3. Terraform Apply Failed
- The deployment to change the VM storage type from standard to NVMe-supported failed with the error:
Disk Controller Type property 'NVMe' is not supported by the OS image or disk specified for the VM. Disk Controller types supported by the OS are 'SCSI'.
3.1. Findings
- Changes to VM disk controllers and storage type should be tested prior to deployment in an upper environment. This type of change caused the VM deployment to fail, and the Azure API put the VM in a deallocated state
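One way to surface this class of change before apply is to scan the machine-readable plan for disk controller changes on existing resources. The sketch below is an illustration, not the team's pipeline: it assumes the plan was exported with `terraform show -json`, and it assumes the provider attribute is named `disk_controller_type`. The resource addresses in the sample are hypothetical.

```python
# Attribute name is an assumption about the provider schema; adjust as needed.
WATCHED_ATTRIBUTE = "disk_controller_type"


def risky_changes(plan: dict, attribute: str = WATCHED_ATTRIBUTE) -> list[dict]:
    """List existing resources whose watched attribute differs before vs. after."""
    findings = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        before = change.get("before")
        after = change.get("after")
        if not before or not after:
            continue  # pure creates and deletes are out of scope for this check
        if before.get(attribute) != after.get(attribute):
            findings.append({
                "address": rc.get("address"),
                "before": before.get(attribute),
                "after": after.get(attribute),
            })
    return findings


# Hypothetical sample shaped like the "resource_changes" section of a JSON plan.
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "azurerm_windows_virtual_machine.training[0]",
         "change": {"actions": ["update"],
                    "before": {"disk_controller_type": "SCSI"},
                    "after": {"disk_controller_type": "NVMe"}}},
        {"address": "azurerm_windows_virtual_machine.training[1]",
         "change": {"actions": ["no-op"],
                    "before": {"disk_controller_type": "SCSI"},
                    "after": {"disk_controller_type": "SCSI"}}},
        {"address": "azurerm_windows_virtual_machine.new_vm",
         "change": {"actions": ["create"],
                    "before": None,
                    "after": {"disk_controller_type": "NVMe"}}},
    ]
}

for finding in risky_changes(SAMPLE_PLAN):
    print(f"REVIEW {finding['address']}: {finding['before']} -> {finding['after']}")
```

A check like this only flags the change for review. It does not know the VM image generation, so confirming Gen1 versus Gen2 support would still be a manual or separate step.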
3.2. Resolution
- Update code to ignore changes to Gen1 VMs, then test and provide recommendations to reduce the risk of resource state changes
- After deployment, validate the servers in the workspace to confirm their health and state
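The post-deployment validation could be sketched as a simple power-state check. This is a minimal illustration under stated assumptions: the input is a list of records with `name` and `powerState` fields, similar to what `az vm list -d` is understood to return, and "VM running" is assumed to be the healthy value. The VM names are hypothetical.

```python
# Assumed healthy value for the powerState field.
EXPECTED_STATE = "VM running"


def vms_not_running(vms: list[dict]) -> list[str]:
    """Return the sorted names of VMs whose power state is not the expected one."""
    return sorted(
        vm["name"] for vm in vms if vm.get("powerState") != EXPECTED_STATE
    )


# Hypothetical inventory captured after a deployment.
SAMPLE_VMS = [
    {"name": "west-train-01", "powerState": "VM running"},
    {"name": "west-train-02", "powerState": "VM deallocated"},
    {"name": "west-train-03", "powerState": "VM running"},
]

stopped = vms_not_running(SAMPLE_VMS)
if stopped:
    print("Deployment check FAILED, VMs not running:", ", ".join(stopped))
else:
    print("Deployment check passed, all VMs running")
```

Running a check like this as the last pipeline step would have reported the deallocated VMs at deploy time instead of relying on the email alerts.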
4. Further Recommendations
- Terraform core services changes should be done in CloudTest prior to moving to higher environments
- Architecture changes to existing patterns (VM configs, etc.) should be tested, validated, and then reviewed in the ARB
- To prevent unintended deletes or adverse changes, Resource Locks for critical shared resources should be deployed as planned in the LLD:
| Target | Level | Lock |
|---|---|---|
| Any core/shared networking | Resource Group | Delete |
| Virtual Networks | Resource Group | Delete |
| VNet Peerings | Resource Group | Delete |
| Routing Tables | Resource Group | Read |
| Network Security Groups | Resource Group | Read |
| Application Security Groups | Resource Group | Read |
| Virtual Appliances (NGFW, WAF, SD-WAN) | Resource Group | Delete |
| Domain controllers | Resource Group | Delete |
| Public IPs | Resource Group | Delete |
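The lock plan in the table can be compared against what is actually deployed. The sketch below is illustrative only: it assumes "Delete" in the table corresponds to the Azure lock level `CanNotDelete` and "Read" to `ReadOnly`, and the `existing` inventory is a hypothetical input keyed by the table's target names.

```python
# Assumed mapping from the table's wording to Azure management lock levels.
LOCK_LEVELS = {"Delete": "CanNotDelete", "Read": "ReadOnly"}

# Targets and lock types copied from the table above.
REQUIRED_LOCKS = {
    "Any core/shared networking": "Delete",
    "Virtual Networks": "Delete",
    "VNet Peerings": "Delete",
    "Routing Tables": "Read",
    "Network Security Groups": "Read",
    "Application Security Groups": "Read",
    "Virtual Appliances (NGFW, WAF, SD-WAN)": "Delete",
    "Domain controllers": "Delete",
    "Public IPs": "Delete",
}


def missing_locks(existing: dict[str, str]) -> dict[str, str]:
    """Return targets whose deployed lock level is absent or differs from the plan."""
    missing = {}
    for target, lock in REQUIRED_LOCKS.items():
        wanted = LOCK_LEVELS[lock]
        if existing.get(target) != wanted:
            missing[target] = wanted
    return missing


# Hypothetical inventory of locks currently deployed.
existing = {
    "Virtual Networks": "CanNotDelete",
    "Routing Tables": "CanNotDelete",  # wrong level, the plan calls for ReadOnly
}

for target, level in missing_locks(existing).items():
    print(f"{target}: needs {level} lock")
```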