# ODB Snapshot Refresh Process
## Introduction
This is a newly developed process and breaking changes are likely. While the usage in AWX from
a user's perspective might not change much, many of the links below may be invalid. Contact
the development team for further assistance.
This document outlines the automated database refresh process for Epic ODB instances (also known as Iris
databases). The workflow leverages Iris's envcopy tool to orchestrate the refresh, utilizing a
custom BACKUP_RESTORE hook script for cloud disk management and data integrity validation. By
coordinating a series of scripted and automated steps, this process ensures that database
environments can be refreshed quickly, reliably, and with minimal manual intervention.
The major stages of the process include preparing the source database for copying, invoking
infrastructure automation through AWX and Ansible, managing Azure managed disk snapshots and
attachments, performing file system integrity checks, and returning control to envcopy for
finalization. This method supports safe migration and refresh of database environments, enabling
consistent and predictable updates for Epic ODB instances.
## Getting Started
AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.
### AWX
The entire process can be run from AWX. You can reach AWX at the links below:
| Environment | URL |
|---|---|
| Prod | https://epic-awx.service.aide-0085665.ap.central.azu.grid.uhg.com/ |
| CloudTest | TBD |
### Getting Access
To gain access to the snapshot refresh jobs, a user must belong to the following groups in Secure:
- eoa_awx_users
- eoa_awx_infra_ops
## Pre-Refresh Considerations
- Review the required variables and ensure they are correct. See "Making Variable Changes" for more information on altering variables.
- Ensure source and target environments are healthy and ready.
- There could be Azure Resource Quota limits that prevent the hook script from creating enough Managed Disks. Work with cloud teams to address this.
- The hook script uses Azure SPNs to authenticate. Work with cloud teams if authentication errors are observed.
## Scenarios
This solution supports two types of scenarios: Inter-VM (multiple VMs) and Intra-VM (same VM). The hook script detects the scenario type by comparing `snap_refresh_source.hostname` and `snap_refresh_target.hostname`. If they match, it performs the intra-VM steps. If they do not match, those steps are omitted.
During an intra-VM refresh, the source instance's logical volume (LV) and volume group (VG) are renamed and the source LV/VG UUIDs are rewritten. This prevents name/UUID collisions when the new disks are attached.
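The detection rule described above can be sketched in a few lines. This is a minimal illustration, not the hook script's actual logic: the function name `is_intra_vm` and the case-insensitive comparison are assumptions.

```python
def is_intra_vm(source_hostname: str, target_hostname: str) -> bool:
    """Return True when source and target refer to the same VM."""
    # Assumption: hostnames are compared case-insensitively and with
    # surrounding whitespace ignored; the real hook script may differ.
    return source_hostname.strip().lower() == target_hostname.strip().lower()


# Hostnames taken from the scenario table in this document.
print(is_intra_vm("zwplodbew302.ms.ds.uhc.com", "zwplodbew303.ms.ds.uhc.com"))  # inter-VM
print(is_intra_vm("zwplodbew302.ms.ds.uhc.com", "zwplodbew302.ms.ds.uhc.com"))  # intra-VM
```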
| AWX Group | Scenario Name (if needed) | Type | Description | Source VM | Source Instance | Target VM | Target Instance |
|---|---|---|---|---|---|---|---|
| wsupmir_to_wsup2 | WSUPMIR_TO_WSUP2 | Inter-VM | WSUPMIR prod mirror to WSUP2 | zwplodbew302.ms.ds.uhc.com | WSUPMIR | zwplodbew303.ms.ds.uhc.com | WSUP2 |
| wsupclmir_to_wsup2cl | WSUPCLMIR_TO_WSUP2CL | Inter-VM | WSUPCLMIR prod mirror to WSUP2CL | zwplodbcl302.ms.ds.uhc.com | WSUPCLMIR | zwplodbcl303.ms.ds.uhc.com | WSUP2CL |
| wsupmir_to_prd | WSUPMIR_TO_PRD | Inter-VM | WSUPMIR prod mirror to PRD | zwplodbew302.ms.ds.uhc.com | WSUPMIR | zwplodbew501.ms.ds.uhc.com | PRD |
| wsupmir_to_dr | WSUPMIR_TO_DR | Inter-VM | WSUPMIR prod mirror to Prod DR | zwplodbew302.ms.ds.uhc.com | WSUPMIR | zcrplodbew601.ms.ds.uhc.com | PRD |
| wsupmir_to_rpt | WSUPMIR_TO_RPT | Inter-VM | WSUPMIR prod mirror to RPT | zwplodbew302.ms.ds.uhc.com | WSUPMIR | zwplodbew401.ms.ds.uhc.com | RPT |
| ex_source_to_target | EX_SOURCE_TO_TARGET | Intra-VM | Example group for intra-VM refreshes | zwplodbew302.ms.ds.uhc.com | SOURCEINST | zwplodbew302.ms.ds.uhc.com | TARGETINST |
## Running a Refresh
| Refresh Type | Inventory Group | AWX Link |
|---|---|---|
| ODB Refresh w/ envcopy | west | ODB Refresh |
| ODB Refresh w/o envcopy | west | ODB Refresh Without Envcopy |
- To start the process, open the AWX Job Template linked above or find the desired template from the list by clicking "Templates" under "Resources" on the left sidebar.
- This opens the Launch modal. On the "Inventory" step, the appropriate inventory should already be selected by default. Click "Next".
- On the "Inventory Groups" step, ensure that the group matching the intended scenario is selected (see above). Click "Next".
- On the "Other prompts" step, ensure that the "Limit" field contains the correct AWX group name listed above. Increase "Verbosity" if desired. Leave "Job Tags" and "Skip Tags" blank.
- The "Preview" step is the final step before launch. Ensure the Iris environments are ready for refresh. Once confirmed, click "Launch".
## Making Variable Changes
- Open the AWX Job Template linked above or find the desired template from the list by clicking "Templates" under "Resources" on the left sidebar.
- On the "Details" tab, click the link found in the "Inventory" field. This takes you to the AWX Inventory configured for this Job Template.
- You can view or edit variable data in YAML or JSON. Use the toggle beside "Variables" to switch. Click "Edit".
- Make the necessary changes to the variables and click "Save". Refer to the variable explanations below for more information.
## Required Variables
Use caution when specifying `disk_pattern` variables. These rely on regular expressions (regex) to select which disks are manipulated during the process. Be sure to validate the supplied patterns against the existing list of Azure Managed Disk names before proceeding.
| Variable | Example | Description |
|---|---|---|
| snap_refresh_project_name | | A unique identifier for this database refresh effort |
| snap_refresh_source | | Dictionary describing necessary inputs related to the source (see below) |
| snap_refresh_source.vm_name | zwplodbew001 | Host name containing source instance |
| snap_refresh_source.ip | | IP address for source VM |
| snap_refresh_source.rg | | Azure resource group containing source instance VM |
| snap_refresh_source.disk_pattern | `^.*supmir[0-9]{2}01.*$` | Regular expression that filters the list of disk names attached to the source to only those desired for refresh |
| snap_refresh_source.instance | supmir | Iris source instance name |
| snap_refresh_source.vg_name | supmir01vg | LVM volume group name of the source disks |
| snap_refresh_source.lv_name | supmir01lv | LVM logical volume name of the source disks |
| snap_refresh_target | | Dictionary describing necessary inputs related to the target (see below) |
| snap_refresh_target.vm_name | zwplodbew001 | Host name containing target instance |
| snap_refresh_target.ip | | IP address for target VM |
| snap_refresh_target.rg | | Azure resource group containing target instance VM |
| snap_refresh_target.disk_pattern | `^.*rpt[0-9]{2}01.*$` | Regular expression that filters the list of disk names attached to the target to only those desired for refresh. This is the list of disks that will be detached from the target |
| snap_refresh_target.instance | rpt | Iris target instance name |
| snap_refresh_target.vg_name | rpt01vg | LVM volume group name of the target disks |
| snap_refresh_target.lv_name | rpt01lv | LVM logical volume name of the target disks |
| snap_refresh_target.mount_point | /epic/rpt01 | Directory to which the target instance's data logical volume is mounted |
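One way to validate a `disk_pattern` before a refresh is to run it against a list of disk names and inspect what it selects. The sketch below uses the example source pattern from the table; the disk names and the helper `match_disks` are hypothetical, and Python's `re` module may not behave identically to whatever regex engine the hook script uses.

```python
import re


def match_disks(pattern: str, disk_names: list[str]) -> list[str]:
    """Return the disk names that the pattern would select."""
    compiled = re.compile(pattern)
    return [name for name in disk_names if compiled.search(name)]


# Hypothetical Azure Managed Disk names, for illustration only.
disks = [
    "zwplodbew001-supmir0101-data",
    "zwplodbew001-supmir0201-data",
    "zwplodbew001-supmir0102-data",
    "zwplodbew001-osdisk",
]

selected = match_disks(r"^.*supmir[0-9]{2}01.*$", disks)
for name in selected:
    print(name)
```

Reviewing the selected names against the disks you expect to be refreshed (and confirming nothing else is matched) is the point of the check.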
## Optional Variables
These variables are not required but can be used to alter the behavior of the refresh process:
| Variable | Default | Description |
|---|---|---|
| snap_refresh_evt_id | `<undefined>` | Specific, pre-existing EVT record ID to use. Do not use with snap_refresh_evt_template |
| snap_refresh_logfile_path | /epic/logs/snap_refresh.log | Hook script logfile location |
| snap_refresh_bin_path | /usr/local/bin/snap_refresh.sh | Full path to the hook script on the target VM |
| snap_refresh_bin_perms.owner | epicadm | User ownership for hook script location and script |
| snap_refresh_bin_perms.group | epicsys | Group ownership for hook script location and script |
| snap_refresh_bin_perms.mode | 0770 | File permissions for the hook script location |
| snap_refresh_start_after_completion | false | Indicates whether to start the target database once the refresh is done |
| snap_refresh_default_disk_pattern | adhoc | Appended to the disk name to name the snapshots |
| snap_refresh_polling_mins | 5 | Time in minutes to wait between checks for disk/snapshot hydration |
| snap_refresh_polling_retries | 48 | Number of times to retry checks for disk/snapshot hydration. Note: snap_refresh_polling_mins * snap_refresh_polling_retries represents the maximum amount of time the process will wait for hydration |
| snap_refresh_freeze_source | false | Indicates whether to freeze the source instance before taking snapshots. envcopy typically handles this, but it can be toggled to ensure consistent snapshots when testing |
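The relationship between the two polling variables can be made concrete. With the defaults above, the maximum wait is 5 × 48 = 240 minutes (4 hours). The sketch below is an assumed shape for such a polling loop, not the hook script's implementation; `wait_for_hydration` and its `is_ready` callback are hypothetical names.

```python
import time
from typing import Callable


def max_wait_minutes(polling_mins: int, polling_retries: int) -> int:
    """Upper bound on the time spent waiting for hydration."""
    return polling_mins * polling_retries


def wait_for_hydration(
    is_ready: Callable[[], bool],
    polling_mins: float,
    polling_retries: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready up to polling_retries times, sleeping between checks."""
    for _ in range(polling_retries):
        if is_ready():
            return True
        sleep(polling_mins * 60)  # convert minutes to seconds
    return False  # hydration did not finish within the allowed window


print(max_wait_minutes(5, 48))  # defaults from the table
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting.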
## Hook Script Config File
The hook script relies on a .cfg file for values such as Azure client secrets and the source/target configuration defined in the AWX group for each scenario described above. The file is placed on the server for the duration of the scenario and is deleted afterward, whether the run succeeds or fails.
```text
# Refresh Project Info
SNAP_REFRESH_PROJECT_NAME={{ snap_refresh_project_name }}
SNAP_REFRESH_INTRA_VM={{ snap_refresh_intra_vm }}

# ODB Host Info
SRC_VM_HOSTNAME={{ snap_refresh_source.vm_name }}
SRC_VM_RG={{ snap_refresh_source.rg }}
SRC_VM_DATADISK_PATTERN={{ snap_refresh_source.disk_pattern }}
SRC_VM_INST_NAME={{ snap_refresh_source.instance }}
SRC_VM_LV_NAME={{ snap_refresh_source.lv_name }}
SRC_VM_VG_NAME={{ snap_refresh_source.vg_name }}
SRC_VM_MOUNT_POINT={{ snap_refresh_source.mount_point }}
TAR_VM_HOSTNAME={{ snap_refresh_target.vm_name }}
TAR_VM_RG={{ snap_refresh_target.rg }}
TAR_VM_DATADISK_PATTERN={{ snap_refresh_target.disk_pattern }}
TAR_VM_INST_NAME={{ snap_refresh_target.instance }}
TAR_VM_LV_NAME={{ snap_refresh_target.lv_name }}
TAR_VM_VG_NAME={{ snap_refresh_target.vg_name }}
TAR_VM_MOUNT_POINT={{ snap_refresh_target.mount_point }}

# Azure CLI Config
AZ_CLIENT_ID={{ az_auth_client_id }}
AZ_CLIENT_SECRET={{ az_auth_client_secret }}
AZ_TENANT_ID={{ az_auth_tenant_id }}
AZ_SUBSCRIPTION_ID={{ az_auth_subscription_id }}

# Script Defaults
POLLING_INTERVAL={{ snap_refresh_polling_interval }}
LOGFILE_PATH={{ snap_refresh_logfile_path }}
```
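When troubleshooting, it can help to confirm that a rendered copy of this file contains plain `KEY=VALUE` lines with no leftover template placeholders. The following is a rough sketch under that assumption; `parse_cfg` and `unrendered_keys` are hypothetical helpers, the sample values are invented, and the hook script itself most likely reads the file differently.

```python
def parse_cfg(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")  # split on the first "=" only
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


def unrendered_keys(settings: dict[str, str]) -> list[str]:
    """Return keys whose values still contain a Jinja-style placeholder."""
    return [key for key, value in settings.items() if "{{" in value]


# Hypothetical rendered file; the last line was deliberately left unrendered.
SAMPLE_CFG = """\
# Refresh Project Info
SNAP_REFRESH_PROJECT_NAME=example-refresh
SRC_VM_HOSTNAME=zwplodbew302
TAR_VM_HOSTNAME={{ snap_refresh_target.vm_name }}
"""

settings = parse_cfg(SAMPLE_CFG)
print(unrendered_keys(settings))
```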
## Hook Script Error Codes
| No. | Name | Description |
| --- | ---- | ----------- |
| 4 | ERR_AZ_LOGIN | Error logging into Azure |
| 5 | ERR_AZ_LOGOUT | Error logging out of Azure |
| 6 | ERR_AZ_SNAP_CREATE | Error during snapshot creation |
| 7 | ERR_AZ_SRC_VM_DISK_INFO | Error looking up source VM disk info |
| 8 | ERR_AZ_TAR_VM_DISK_CREATE | Error creating disks |
| 10 | ERR_AZ_DISK_ATTACHMENT | Error attaching disks to target VM |
| 11 | ERR_AZ_DISK_MOUNT | Error mounting attached disks on the target VM OS |
| 12 | ERR_AZ_DISK_DETACHMENT | Error detaching "old" disks from target VM |
| 13 | ERR_AZ_DISK_MOUNT | Error mounting donor LVs to target mount point |
| 14 | ERR_DONOR_NOT_FOUND | Error finding donor LVs on target VM |
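For log triage, the table above can be turned into a lookup. This sketch copies the codes exactly as listed (including the name shared by codes 11 and 13); treating exit code 0 as success is an assumption based on the usual shell convention, and `describe_exit_code` is a hypothetical helper.

```python
# Codes, names, and descriptions copied from the table above.
HOOK_ERRORS = {
    4: ("ERR_AZ_LOGIN", "Error logging into Azure"),
    5: ("ERR_AZ_LOGOUT", "Error logging out of Azure"),
    6: ("ERR_AZ_SNAP_CREATE", "Error during snapshot creation"),
    7: ("ERR_AZ_SRC_VM_DISK_INFO", "Error looking up source VM disk info"),
    8: ("ERR_AZ_TAR_VM_DISK_CREATE", "Error creating disks"),
    10: ("ERR_AZ_DISK_ATTACHMENT", "Error attaching disks to target VM"),
    11: ("ERR_AZ_DISK_MOUNT", "Error mounting attached disks on the target VM OS"),
    12: ("ERR_AZ_DISK_DETACHMENT", 'Error detaching "old" disks from target VM'),
    13: ("ERR_AZ_DISK_MOUNT", "Error mounting donor LVs to target mount point"),
    14: ("ERR_DONOR_NOT_FOUND", "Error finding donor LVs on target VM"),
}


def describe_exit_code(code: int) -> str:
    """Translate a hook script exit code into a readable message."""
    if code == 0:
        return "success"  # assumption: 0 follows the normal shell convention
    name, description = HOOK_ERRORS.get(
        code, ("UNKNOWN", "Exit code not listed in the table")
    )
    return f"{name} ({code}): {description}"


print(describe_exit_code(10))
```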
## High Level Flow
```mermaid
sequenceDiagram
participant awx as AWX
participant target as Target
participant azure as Azure
awx->>awx: 1. invoke start playbook
awx->>target: 2. create envcopy.conf
awx->>target: 3. run scenario
target->>target: 4. freeze
target->>target: 5. fileCopy/backupRestore phase
target->>azure: 6. snapshot disks
target->>azure: 7. create disks from snapshots
target->>azure: 8. attach disks to target VM
azure->>target: 9. validate donated disks
target->>azure: 10. detach "old" disks
azure->>target: 11. mount donated LV
target->>target: 12. envcopy completion
```
1. A user launches an AWX Job Template to initiate the process.
2. A new envcopy scenario is created. A new envcopy.conf configuration file is generated from a template and playbook inputs/role defaults.
3. The new scenario is run.
4. As part of the envcopy process and scenario config, the source instance is frozen.
5. envcopy executes the custom data transfer method. This is where the custom snapshot-refresh hook script is invoked.
6. The hook script creates snapshots from the source instance's data disks.
7. Via Azure CLI, the hook script creates new managed disks from the snapshots.
8. Via Azure CLI, the hook script attaches the new managed disks to the target VM.
9. On the target instance, the hook script checks the donated disks for valid LVM and filesystem configuration.
10. Via Azure CLI, the hook script detaches the "old" or "previous" data disks from the target VM.
11. On the target VM, the hook script mounts the donated logical volume (LV) to the mount point configured in playbook inputs.
12. On the target instance, envcopy resumes control and finishes its tasks.
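Steps 6 through 8 and step 10 correspond to Azure CLI operations. The sketch below only builds and prints plausible command lines; it does not run them and is not taken from the hook script. The `az` subcommands shown exist in the Azure CLI, but the snapshot/disk naming scheme, the subscription ID, the disk names, and the helper `plan_disk_swap` are all assumptions for illustration.

```python
def snapshot_id(subscription_id: str, resource_group: str, name: str) -> str:
    """Build the Azure resource ID for a snapshot."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Compute/snapshots/{name}"
    )


def plan_disk_swap(subscription_id, src_rg, tar_rg, tar_vm, src_disk, old_disk,
                   suffix="adhoc"):
    """Return az CLI argument lists for snapshot, create, attach, and detach."""
    snap_name = f"{src_disk}-{suffix}"  # assumed naming scheme
    new_disk = f"{snap_name}-disk"      # assumed naming scheme
    return [
        # Step 6: snapshot the source data disk.
        ["az", "snapshot", "create", "--resource-group", src_rg,
         "--name", snap_name, "--source", src_disk],
        # Step 7: create a managed disk from the snapshot.
        ["az", "disk", "create", "--resource-group", tar_rg,
         "--name", new_disk,
         "--source", snapshot_id(subscription_id, src_rg, snap_name)],
        # Step 8: attach the new disk to the target VM.
        ["az", "vm", "disk", "attach", "--resource-group", tar_rg,
         "--vm-name", tar_vm, "--name", new_disk],
        # Step 10: detach the previous data disk (after validation in step 9).
        ["az", "vm", "disk", "detach", "--resource-group", tar_rg,
         "--vm-name", tar_vm, "--name", old_disk],
    ]


# All argument values here are hypothetical placeholders.
commands = plan_disk_swap("00000000-0000-0000-0000-000000000000", "src-rg",
                          "tar-rg", "zwplodbew401", "supmir0101", "rpt0101")
for command in commands:
    print(" ".join(command))
```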
### Intra-VM Refresh Branches
During an intra-VM refresh, a few extra steps are performed. Prior to snapshot creation, the source instance's logical volume (LV) and volume group (VG) are renamed and source LV/VG UUID are rewritten. This is done to prevent name/UUID collisions when the new disks are attached. Following snapshot hydration, those LVs/VGs are renamed back, unmounted, UUIDs rewritten, reactivated, then remounted.
## `envcopy` Phases
| Stage | Action |
| ----- | ------ |
| preExecution | Execute PREEXECUTION hook scripts |
| saveConfig | Run `saveConfig^%ZeENV` |
| preCopy | Run `preCopy^%ZeENV` |
| preRefresh | Run PREREFRESH hook scripts |
| fileCopy/backupRestore | Copy database files, or use the BACKUP_RESTORE hook script if configured |
| postFileCopy | Execute POSTFILECOPY hook scripts |
| startDestination | Start the target environment |
| preRunTasks | Execute PRERUNTASKS hook scripts |
| runTasks | Run `runTasks^%ZeENV` |
| postRunTasks | Execute POSTRUNTASKS hook scripts |
| destFinalRunlevel | Bring the target environment to its final runlevel |
| sourceFinalRunlevel | Bring the source environment to its final runlevel |
| final | Execute the FINAL hook scripts |
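Because the phases run in a fixed order, it can be useful to see which hook points still lie ahead once the BACKUP_RESTORE hook has finished. The sketch below encodes the table above as an ordered list; the helper `hooks_after` is hypothetical, and the pairing of each phase with a hook name is read directly from the table rather than from envcopy itself.

```python
# (phase, hook script type or None), in the order listed in the table above.
ENVCOPY_PHASES = [
    ("preExecution", "PREEXECUTION"),
    ("saveConfig", None),
    ("preCopy", None),
    ("preRefresh", "PREREFRESH"),
    ("fileCopy/backupRestore", "BACKUP_RESTORE"),
    ("postFileCopy", "POSTFILECOPY"),
    ("startDestination", None),
    ("preRunTasks", "PRERUNTASKS"),
    ("runTasks", None),
    ("postRunTasks", "POSTRUNTASKS"),
    ("destFinalRunlevel", None),
    ("sourceFinalRunlevel", None),
    ("final", "FINAL"),
]


def hooks_after(hook_name: str) -> list[str]:
    """Return the hook types that run after the given hook, in order."""
    hooks = [hook for _, hook in ENVCOPY_PHASES]
    index = hooks.index(hook_name)  # raises ValueError for unknown hooks
    return [hook for hook in hooks[index + 1:] if hook is not None]


print(hooks_after("BACKUP_RESTORE"))
```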
## Azure Authentication
This process expects the existence of an Azure Service Principal Name (SPN) with the permissions below:
- Create snapshots
- Create managed disks (from snapshots)
- Attach/detach disks
- Show managed disk information
- Show VM information
It is recommended practice to pull these credentials from a secure source such as HCP Vault. See the example playbook below.
## More Information
1. [Non-Production ODB Migrations Strategy Guide (Galaxy)](https://galaxy.epic.com/?#Browse/page=1!68!50!100315688)