Infrastructure Practices
Infrastructure as Code Excellence and Operational Standards
What's Covered: Terraform best practices, Ansible automation, troubleshooting, and performance optimization
Infrastructure as Code (IaC) Excellence
Terraform Best Practices
Canonical consumer pattern (environment repo → private registry module):
```hcl
# Good: consumer of canonical Epic on Azure module with pinned version and full tag contract
module "odb_linux_vm" {
  source  = "terraform.uhg.com/uhg-customer-modules/linux-resources/azurerm"
  version = "3.4.1"

  name                = var.vm_name
  location            = var.location # westus2 (paired with westcentralus)
  resource_group_name = var.resource_group_name
  vm_size             = var.vm_size # validated against Epic-approved SKU list

  tags = merge(local.standard_tags, {
    Component = "ApplicationServer"
  })
}

# Bad: hand-rolled deprecated resource, unpinned, missing canonical tags
resource "azurerm_virtual_machine" "vm" { # deprecated; replaced by azurerm_linux_virtual_machine
  name     = "epic-prod-vm-01"
  location = "East US" # not on the Epic on Azure region pair
  vm_size  = "Standard_D4s_v3"
}
```
Module Development Standards
```text
ohemr-epic-private-registry-<resource>/
├── main.tf                # Primary resources
├── variables.tf           # Input variables
├── outputs.tf             # Output values
├── versions.tf            # Provider requirements (pinned)
├── README.md              # Module documentation
├── examples/              # Usage examples
│   └── basic/
└── tests/                 # Terratest cases
    └── terraform_test.go
```
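The layout above reserves `versions.tf` for pinned provider requirements. A minimal sketch of such a file follows; the version constraints are illustrative assumptions, not the team's actual pins.

```hcl
# versions.tf (sketch; constraint values are assumptions)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100" # accepts 3.100 and later 3.x releases, rejects 4.x
    }
  }
}
```

The pessimistic constraint (`~>`) lets patch and minor updates through while blocking a major upgrade until the module has been tested against it.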
Module Best Practices:
```hcl
# variables.tf - Well-documented inputs
variable "vm_size" {
  description = "The size of the Virtual Machine"
  type        = string
  default     = "Standard_D2s_v3"

  validation {
    condition = contains([
      "Standard_D2s_v3", "Standard_D4s_v3", "Standard_D8s_v3"
    ], var.vm_size)
    error_message = "VM size must be a supported Epic-approved SKU."
  }
}

# outputs.tf - Useful return values
output "vm_id" {
  description = "The ID of the Virtual Machine"
  value       = azurerm_linux_virtual_machine.main.id
}

output "vm_private_ip" {
  description = "The private IP address of the Virtual Machine"
  value       = azurerm_network_interface.main.private_ip_address
}
```
Tagging contract
```hcl
locals {
  standard_tags = {
    Environment        = "production"
    Application        = "epic-odb"
    Owner              = "[email protected]"
    CostCenter         = "12345"
    DataClassification = "PHI" # PHI or NONPHI; drives backup retention and audit
    ManagedBy          = "Terraform"
  }
}
```
See [Azure Resource Tagging Strategy](../../infrastructure/tagging-strategy/) for the full tag schema, allowed values, and Azure Policy enforcement details.
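Because `DataClassification` accepts only `PHI` or `NONPHI`, a consuming configuration could reject other values at plan time, before Azure Policy sees them. The sketch below assumes a hypothetical `data_classification` input variable; it is not part of the canonical module.

```hcl
# Hypothetical input variable; the name is an assumption for illustration
variable "data_classification" {
  description = "Data classification tag value (drives backup retention and audit)"
  type        = string

  validation {
    condition     = contains(["PHI", "NONPHI"], var.data_classification)
    error_message = "DataClassification must be PHI or NONPHI."
  }
}

locals {
  # Merged into the standard tag set wherever tags are applied
  classification_tag = {
    DataClassification = var.data_classification
  }
}
```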
State Management
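**State Locking:**
- Store state in an Azure Storage Account using the `azurerm` backend
- Enable blob versioning on that storage account to keep a history of state files
- Rely on the backend's blob-lease locking to prevent concurrent writes, and do not disable locking in shared environments
- Back up state regularly and define a retention policy for old versions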
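As a rough sketch of the practices above, a backend block for Azure Storage could look like this. The resource group, storage account, container, and key names are hypothetical placeholders.

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate-example"          # hypothetical
    storage_account_name = "sttfstateexample"            # hypothetical
    container_name       = "tfstate"                     # hypothetical
    key                  = "epic/prod/terraform.tfstate" # hypothetical
    use_azuread_auth     = true                          # authenticate with Azure AD instead of an access key
  }
}
```

Backend blocks cannot reference variables, so per-environment values are usually supplied with `terraform init -backend-config=...`. Blob versioning is configured on the storage account itself, not in this block.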
Ansible Configuration Management
Ansible Standards
Playbook Structure
Standard Playbook Format:
```yaml
---
- name: Configure Epic Application Servers
  hosts: epic_app_servers
  become: yes
  gather_facts: yes

  vars:
    epic_version: "{{ epic_app_version | default('2024.1') }}"
    # `environment` is a reserved Ansible keyword, so the value is read from a
    # differently named variable (`target_environment` is an assumed name).
    epic_environment: "{{ target_environment }}"

  pre_tasks:
    - name: Update system packages
      package:
        name: "*"
        state: latest
      when: update_packages | default(false) | bool

  roles:
    - role: base-os-config
      vars:
        base_packages:
          - htop
          - vim
          - curl
    - role: epic-application
      vars:
        epic_config: "{{ epic_app_config }}"
        epic_database_host: "{{ hostvars[groups['epic_db'][0]]['ansible_host'] }}"
    - role: monitoring-agent
      vars:
        monitoring_endpoints:
          - dynatrace
          - splunk

  post_tasks:
    - name: Verify Epic services are running
      service:
        name: "{{ item }}"
        state: started
        enabled: yes
      loop:
        - epic-app
        - epic-scheduler
      register: service_status
      failed_when: service_status is failed

  handlers:
    - name: restart epic services
      service:
        name: "{{ item }}"
        state: restarted
      loop:
        - epic-app
        - epic-scheduler
```
Role Development
!!! example "Role Structure"
    ```text
    ansible-role-epic-app/
    ├── defaults/
    │   └── main.yml          # Default variables
    ├── files/
    │   └── config/           # Static files
    ├── handlers/
    │   └── main.yml          # Event handlers
    ├── meta/
    │   └── main.yml          # Role metadata
    ├── tasks/
    │   └── main.yml          # Task definitions
    ├── templates/
    │   └── epic.conf.j2      # Jinja2 templates
    ├── tests/
    │   └── test.yml          # Role tests
    ├── vars/
    │   └── main.yml          # Role variables
    └── README.md             # Role documentation
    ```
Role Task Example:
```yaml
---
# tasks/main.yml
- name: Create Epic application directory
  file:
    path: "{{ epic_app_dir }}"
    state: directory
    owner: "{{ epic_user }}"
    group: "{{ epic_group }}"
    mode: '0755'
  tags:
    - epic
    - filesystem

- name: Deploy Epic application configuration
  template:
    src: epic.conf.j2
    dest: "{{ epic_app_dir }}/epic.conf"
    owner: "{{ epic_user }}"
    group: "{{ epic_group }}"
    mode: '0644'
    backup: yes
  notify:
    - restart epic services
  tags:
    - epic
    - configuration

- name: Install Epic application packages
  package:
    name: "{{ epic_packages }}"
    state: present
  tags:
    - epic
    - packages
```
Troubleshooting & Issue Resolution
Common Infrastructure Issues
**Resolution Steps (Azure VM deployment and connectivity issues):**
1. Check Azure Service Health for region issues
2. Verify subscription quotas and limits
3. Review NSG rules and effective security rules
4. Enable boot diagnostics for troubleshooting
5. Use Azure Serial Console for direct access
**Resolution Steps (Terraform state and drift issues):**
1. Run `terraform plan -refresh-only` to detect drift (the standalone `terraform refresh` command is deprecated)
2. Use `terraform import` to bring manually created resources under Terraform management
3. Clear a stale state lock with `terraform force-unlock <LOCK_ID>`, but only after confirming that no other run still holds it
4. Pin provider versions in `versions.tf`
5. Document and revert manual changes
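For the import step, Terraform 1.5 and later also support a declarative `import` block, which lets the import be reviewed in a plan before it touches state. The sketch below uses placeholder values in angle brackets and a hypothetical resource address.

```hcl
# Declarative import (Terraform 1.5+); the address and ID segments are placeholders
import {
  to = azurerm_linux_virtual_machine.example
  id = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/virtualMachines/<vm-name>"
}
```

A matching `resource "azurerm_linux_virtual_machine" "example"` block must exist, or a draft can be generated with `terraform plan -generate-config-out=generated.tf`. Once the apply succeeds, the `import` block can be removed.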
Debugging Toolkit
Azure CLI Debugging Commands:
```bash
# Check VM power state and provisioning state
# (--show-details is needed for powerState to appear in the output)
az vm show --resource-group rg-epic-prod --name vm-epic-app-01 --show-details \
  --query '{name:name,powerState:powerState,provisioningState:provisioningState}'

# Review recent activity logs for resource group
az monitor activity-log list --resource-group rg-epic-prod \
  --max-events 10 --query '[].{Time:eventTimestamp,Operation:operationName.value,Status:status.value}'

# Check network security group rules
az network nsg rule list --resource-group rg-epic-prod --nsg-name nsg-epic-app \
  --query '[].{Name:name,Priority:priority,Access:access,Protocol:protocol,Direction:direction}'

# Get VM boot diagnostics
az vm boot-diagnostics get-boot-log --resource-group rg-epic-prod --name vm-epic-app-01

# Show OS disk configuration (name, size, caching)
az vm show --resource-group rg-epic-prod --name vm-epic-app-01 \
  --query 'storageProfile.osDisk.{name:name,diskSizeGb:diskSizeGb,caching:caching}'
```
Terraform Debugging Commands:
```bash
# Enable detailed logging
export TF_LOG=DEBUG
export TF_LOG_PATH=terraform-debug.log

# Plan with detailed output (exit code 2 means changes are pending)
terraform plan -detailed-exitcode -out=plan.tfplan

# Show current state in human-readable format
terraform show

# Check for configuration drift (state-only refresh)
terraform plan -refresh-only

# Import existing Azure resources (use the current resource type, not the deprecated azurerm_virtual_machine)
terraform import azurerm_linux_virtual_machine.example /subscriptions/.../resourceGroups/.../providers/Microsoft.Compute/virtualMachines/vm-name

# Validate configuration
terraform validate

# Check formatting
terraform fmt -check -diff
```
Ansible Debugging:
```bash
# Run playbook with increased verbosity
ansible-playbook -vvv playbook.yml

# Check inventory and host connectivity
ansible all -m ping --inventory inventory/

# Dry-run only the tasks tagged "configuration"
ansible-playbook playbook.yml --tags "configuration" --check

# Gather facts from hosts
ansible epic_app_servers -m setup --inventory inventory/

# Check playbook syntax
ansible-playbook --syntax-check playbook.yml
```
Performance Optimization
Database Performance Tuning
!!! example "Epic Database Optimization"
    **Index Optimization:**

    ```sql
    -- Monitor slow queries (top 10 by total elapsed time)
    SELECT TOP 10
        total_elapsed_time / 1000 / 1000 AS [Total Time (s)],
        execution_count,
        total_elapsed_time / execution_count / 1000 AS [Avg Time (ms)],
        SUBSTRING(
            st.text,
            (qs.statement_start_offset / 2) + 1,
            (CASE
                WHEN qs.statement_end_offset = -1
                    THEN LEN(CONVERT(nvarchar(max), st.text)) * 2
                ELSE qs.statement_end_offset
             END - qs.statement_start_offset) / 2 + 1
        ) AS query_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY total_elapsed_time DESC;
    ```
**Connection Pool Configuration:**
```yaml
# Epic database connection settings
epic_db_config:
  max_connections: 100
  initial_pool_size: 10
  connection_timeout: 30
  idle_timeout: 300
  validation_query: "SELECT 1"
```
Application Scaling Patterns
```hcl
resource "azurerm_linux_virtual_machine_scale_set" "epic_app" {
  name                = "${var.environment}-epic-app-vmss"
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = var.vm_size
  instances           = var.initial_capacity
  admin_username      = var.admin_username
  upgrade_mode        = "Manual"

  # automatic_os_upgrade_policy is omitted: the azurerm provider accepts it only
  # when upgrade_mode is "Automatic" or "Rolling", and automatic OS upgrades are
  # off by default.
  # Excerpt: the required os_disk, network_interface, image, and SSH key
  # settings are not shown here.

  tags = var.common_tags
}

# Auto-scaling rules
resource "azurerm_monitor_autoscale_setting" "epic_app" {
  name                = "${var.environment}-epic-app-autoscale"
  resource_group_name = var.resource_group_name
  location            = var.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.epic_app.id

  profile {
    name = "default"

    capacity {
      default = 2
      minimum = 1
      maximum = 10
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.epic_app.id
        operator           = "GreaterThan"
        threshold          = 75
        time_aggregation   = "Average"
        time_window        = "PT5M"
        time_grain         = "PT1M" # the provider's name for the sampling interval
        statistic          = "Average"
      }

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }
  }
}
```
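The profile above defines only a scale-out rule, so capacity added under load would not be released automatically. A matching scale-in rule could be sketched as follows; the 25% threshold and 10-minute windows are illustrative assumptions, not tuned values.

```hcl
# Sketch of a scale-in rule; place it inside the same profile block as the scale-out rule
rule {
  metric_trigger {
    metric_name        = "Percentage CPU"
    metric_resource_id = azurerm_linux_virtual_machine_scale_set.epic_app.id
    operator           = "LessThan"
    threshold          = 25      # assumed value
    time_aggregation   = "Average"
    time_window        = "PT10M" # assumed value
    time_grain         = "PT1M"
    statistic          = "Average"
  }

  scale_action {
    direction = "Decrease"
    type      = "ChangeCount"
    value     = "1"
    cooldown  = "PT10M" # assumed value
  }
}
```

Keeping a wide gap between the scale-out and scale-in thresholds, with a longer scale-in window, reduces the chance of the set oscillating between instance counts.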
Load Balancer Optimization
**Session Affinity for Epic:**
```hcl
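resource "azurerm_lb_rule" "epic_app" {
  name                           = "epic-app-rule"
  loadbalancer_id                = var.loadbalancer_id                # required; variable name is an assumption
  frontend_ip_configuration_name = var.frontend_ip_configuration_name # required; variable name is an assumption
  protocol                       = "Tcp"
  frontend_port                  = 443
  backend_port                   = 443
  enable_floating_ip             = false
  idle_timeout_in_minutes        = 4
  load_distribution              = "SourceIP" # Session affinity
  disable_outbound_snat          = false
}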
```
Advanced Infrastructure Practices
Disaster Recovery Preparation
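```hcl
resource "azurerm_backup_policy_vm" "epic" {
  # The policy name and var.* references are illustrative placeholders
  name                = "${var.environment}-epic-vm-backup"
  resource_group_name = var.resource_group_name
  recovery_vault_name = var.recovery_vault_name

  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 30
  }

  retention_weekly {
    count    = 12
    weekdays = ["Sunday"]
  }

  retention_monthly {
    count    = 12
    weekdays = ["Sunday"]
    weeks    = ["First"]
  }
}
```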
**Cross-Region Replication:**
```hcl
resource "azurerm_site_recovery_replicated_vm" "epic_dr" {
  name                                      = "${var.vm_name}-asr"
  resource_group_name                       = var.dr_resource_group_name
  recovery_vault_name                       = var.recovery_vault_name
  source_recovery_fabric_name               = var.source_fabric_name
  source_vm_id                              = azurerm_linux_virtual_machine.epic_app.id
  recovery_replication_policy_id            = var.replication_policy_id
  source_recovery_protection_container_name = var.source_container_name
  target_resource_group_id                  = var.target_resource_group_id
  target_recovery_fabric_id                 = var.target_fabric_id
  target_recovery_protection_container_id   = var.target_container_id
}
```
Infrastructure Resilience
!!! example "Multi-Region Deployment"
**Regional Failover Configuration:** The Epic on Azure standard region pair is westus2 (primary) and westcentralus (secondary).
```hcl
# Primary region resources
module "epic_primary" {
  source      = "../modules/epic-infrastructure"
  environment = var.environment
  region      = "westus2"
  is_primary  = true

  # Database configuration
  db_failover_group_enabled = true
  db_geo_backup_enabled     = true

  tags = merge(var.common_tags, {
    Region = "Primary"
  })
}

# Secondary region resources
module "epic_secondary" {
  source      = "../modules/epic-infrastructure"
  environment = var.environment
  region      = "westcentralus"
  is_primary  = false

  # Reference primary region resources
  primary_db_server_id = module.epic_primary.db_server_id

  tags = merge(var.common_tags, {
    Region = "Secondary"
  })
}

# Traffic Manager for DNS failover
resource "azurerm_traffic_manager_profile" "epic" {
  name                   = "${var.environment}-epic-tm"
  resource_group_name    = var.resource_group_name
  traffic_routing_method = "Priority"

  dns_config {
    relative_name = "${var.environment}-epic"
    ttl           = 30
  }

  monitor_config {
    protocol                     = "HTTPS"
    port                         = 443
    path                         = "/health"
    interval_in_seconds          = 30
    timeout_in_seconds           = 10
    tolerated_number_of_failures = 3
  }
}
```
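With `Priority` routing, the profile needs endpoints whose priority values determine failover order. One possible sketch uses external endpoints; the `primary_fqdn` and `secondary_fqdn` variables are hypothetical and stand in for whatever hostnames each region exposes.

```hcl
# Sketch only: endpoint names and the *_fqdn variables are assumptions
resource "azurerm_traffic_manager_external_endpoint" "primary" {
  name       = "westus2-primary"
  profile_id = azurerm_traffic_manager_profile.epic.id
  target     = var.primary_fqdn
  priority   = 1 # receives all traffic while its health probe passes
}

resource "azurerm_traffic_manager_external_endpoint" "secondary" {
  name       = "westcentralus-secondary"
  profile_id = azurerm_traffic_manager_profile.epic.id
  target     = var.secondary_fqdn
  priority   = 2 # used only when the primary endpoint is unhealthy
}
```

With a 30-second probe interval and three tolerated failures, failover detection takes roughly 90 to 120 seconds, after which clients move over as their cached DNS records (TTL 30) expire.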
Performance Monitoring Integration
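```hcl
resource "azurerm_monitor_diagnostic_setting" "epic" {
  # The setting name and var.* references are illustrative placeholders
  name                       = "${var.environment}-epic-diagnostics"
  target_resource_id         = var.target_resource_id
  log_analytics_workspace_id = var.log_analytics_workspace_id

  # Newer azurerm provider releases replace `log` with `enabled_log`
  log {
    category = "Administrative"
    enabled  = true

    retention_policy {
      enabled = true
      days    = 90
    }
  }

  metric {
    category = "AllMetrics"
    enabled  = true

    retention_policy {
      enabled = true
      days    = 90
    }
  }
}
```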
**Custom Metrics Collection:**
```yaml
# Ansible task for custom metrics
- name: Deploy Azure Monitor Agent
  include_role:
    name: azure-monitor-agent
  vars:
    ama_config:
      data_collection_rules:
        - name: "epic-performance-metrics"
          performance_counters:
            - "\\Processor(_Total)\\% Processor Time"
            - "\\Memory\\Available MBytes"
            - "\\LogicalDisk(C:)\\% Free Space"
            - "\\Network Interface(*)\\Bytes Total/sec"
          custom_logs:
            - path: "/opt/epic/logs/application.log"
              format: "json"
```
Getting Help with Infrastructure
!!! question "Infrastructure Questions"
    - Terraform Issues: Check Terraform Enterprise documentation
    - Ansible Problems: Review AWX Platform guides
    - Azure Troubleshooting: Use Azure CLI debugging commands
    - Performance Issues: Check Monitoring Strategy

!!! question "Emergency Procedures"
    - Production Outages: Contact on-call engineer immediately
    - Infrastructure Failures: Follow Incident Management process
    - Security Incidents: Escalate through security team channels
    - Data Recovery: Engage disaster recovery procedures
Infrastructure Practices | Epic on Azure Team Guidelines
These practices ensure reliable, scalable infrastructure supporting critical healthcare systems. Contribute improvements based on operational experience.
Last updated: September 2025