Updated July 3, 2026

Infrastructure Practices

Infrastructure as Code Excellence and Operational Standards

What's Covered: Terraform best practices, Ansible automation, troubleshooting, and performance optimization

Infrastructure as Code (IaC) Excellence

Terraform Best Practices

Canonical consumer pattern (environment repo → private registry module):

```hcl
# Good: consumer of canonical Epic on Azure module with pinned version and full tag contract
module "odb_linux_vm" {
  source  = "terraform.uhg.com/uhg-customer-modules/linux-resources/azurerm"
  version = "3.4.1"

  name                = var.vm_name
  location            = var.location # westus2 (paired with westcentralus)
  resource_group_name = var.resource_group_name
  vm_size             = var.vm_size  # validated against Epic-approved SKU list

  tags = merge(local.standard_tags, {
    Component = "ApplicationServer"
  })
}

# Bad: hand-rolled deprecated resource, unpinned, missing canonical tags
resource "azurerm_virtual_machine" "vm" { # deprecated; replaced by azurerm_linux_virtual_machine
  name     = "epic-prod-vm-01"
  location = "East US" # not on the Epic on Azure region pair
  vm_size  = "Standard_D4s_v3"
}
```

Module Development Standards

```text
ohemr-epic-private-registry-<resource>/
├── main.tf              # Primary resources
├── variables.tf         # Input variables
├── outputs.tf           # Output values
├── versions.tf          # Provider requirements (pinned)
├── README.md            # Module documentation
├── examples/            # Usage examples
│   └── basic/
└── tests/               # Terratest cases
    └── terraform_test.go
```

Module Best Practices:

```hcl
# variables.tf - Well-documented inputs
variable "vm_size" {
  description = "The size of the Virtual Machine"
  type        = string
  default     = "Standard_D2s_v3"

  validation {
    condition = contains([
      "Standard_D2s_v3", "Standard_D4s_v3", "Standard_D8s_v3"
    ], var.vm_size)
    error_message = "VM size must be a supported Epic-approved SKU."
  }
}

# outputs.tf - Useful return values
output "vm_id" {
  description = "The ID of the Virtual Machine"
  value       = azurerm_linux_virtual_machine.main.id
}

output "vm_private_ip" {
  description = "The private IP address of the Virtual Machine"
  value       = azurerm_network_interface.main.private_ip_address
}
```

Tagging contract

```hcl
locals {
  standard_tags = {
    Environment        = "production"
    Application        = "epic-odb"
    Owner              = "[email protected]"
    CostCenter         = "12345"
    DataClassification = "PHI"           # PHI or NONPHI — drives backup retention and audit
    ManagedBy          = "Terraform"
  }
}
```

See [Azure Resource Tagging Strategy](../../infrastructure/tagging-strategy/) for the full tag schema, allowed values, and Azure Policy enforcement details.
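
The comment on `DataClassification` in the tag block allows only two values. One way to catch a typo at plan time, before Azure Policy evaluates the resource, is a variable validation. The sketch below assumes the classification is passed in as an input variable; the names `data_classification` and `classified_tags` are hypothetical and not part of the documented tag contract.

```hcl
# Hypothetical input variable; the name and wiring are assumptions for illustration.
variable "data_classification" {
  description = "Data classification tag value: PHI or NONPHI"
  type        = string

  validation {
    condition     = contains(["PHI", "NONPHI"], var.data_classification)
    error_message = "DataClassification must be either PHI or NONPHI."
  }
}

locals {
  # Overlay the validated value on the standard tag set defined above.
  classified_tags = merge(local.standard_tags, {
    DataClassification = var.data_classification
  })
}
```

A module call could then pass `local.classified_tags` wherever it currently passes `local.standard_tags`.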

State Management

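**State Storage and Locking:**
- Store state in an Azure Storage Account blob container (the `azurerm` backend)
- Enable blob versioning on the storage account to keep state history
- Rely on state locking to stop concurrent runs from conflicting; the `azurerm` backend locks the state blob with a lease automatically
- Back up state regularly and define a retention policy for old versions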
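
A minimal backend sketch for the points above might look like the following. Every name in it (resource group, storage account, container, key) is a placeholder rather than a real team resource, and blob versioning is a property of the storage account itself, so it is not configured in this block.

```hcl
terraform {
  backend "azurerm" {
    # All values below are placeholders, not real resources.
    resource_group_name  = "rg-tfstate-example"
    storage_account_name = "sttfstateexample"
    container_name       = "tfstate"
    key                  = "epic/prod/odb.tfstate"

    # Authenticate to blob storage with Microsoft Entra ID instead of an access key.
    use_azuread_auth = true
  }
}
```

Using one `key` per environment and component keeps each state file small and limits how much a single lock blocks.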

Ansible Configuration Management

Ansible Standards

Playbook Structure

Standard Playbook Format:

```yaml
---
- name: Configure Epic Application Servers
  hosts: epic_app_servers
  become: true
  gather_facts: true
  vars:
    epic_version: "{{ epic_app_version | default('2024.1') }}"
    # 'environment' is a reserved Ansible keyword, so the input variable uses a distinct name
    epic_environment: "{{ deploy_environment }}"

  pre_tasks:
    - name: Update system packages
      package:
        name: "*"
        state: latest
      when: update_packages | default(false) | bool

  roles:
    - role: base-os-config
      vars:
        base_packages:
          - htop
          - vim
          - curl

    - role: epic-application
      vars:
        epic_config: "{{ epic_app_config }}"
        epic_database_host: "{{ hostvars[groups['epic_db'][0]]['ansible_host'] }}"

    - role: monitoring-agent
      vars:
        monitoring_endpoints:
          - dynatrace
          - splunk

  post_tasks:
    - name: Verify Epic services are running
      service:
        name: "{{ item }}"
        state: started
        enabled: true
      loop:
        - epic-app
        - epic-scheduler
      register: service_status
      failed_when: service_status.failed

  handlers:
    - name: restart epic services
      service:
        name: "{{ item }}"
        state: restarted
      loop:
        - epic-app
        - epic-scheduler
```

Role Development

!!! example "Role Structure"

    ```text
    ansible-role-epic-app/
    ├── defaults/
    │   └── main.yml          # Default variables
    ├── files/
    │   └── config/           # Static files
    ├── handlers/
    │   └── main.yml          # Event handlers
    ├── meta/
    │   └── main.yml          # Role metadata
    ├── tasks/
    │   └── main.yml          # Task definitions
    ├── templates/
    │   └── epic.conf.j2      # Jinja2 templates
    ├── tests/
    │   └── test.yml          # Role tests
    ├── vars/
    │   └── main.yml          # Role variables
    └── README.md             # Role documentation
    ```

Role Task Example:

```yaml
---
# tasks/main.yml
- name: Create Epic application directory
  file:
    path: "{{ epic_app_dir }}"
    state: directory
    owner: "{{ epic_user }}"
    group: "{{ epic_group }}"
    mode: '0755'
  tags:
    - epic
    - filesystem

- name: Deploy Epic application configuration
  template:
    src: epic.conf.j2
    dest: "{{ epic_app_dir }}/epic.conf"
    owner: "{{ epic_user }}"
    group: "{{ epic_group }}"
    mode: '0644'
    backup: yes
  notify:
    - restart epic services
  tags:
    - epic
    - configuration

- name: Install Epic application packages
  package:
    name: "{{ epic_packages }}"
    state: present
  tags:
    - epic
    - packages
```

Troubleshooting & Issue Resolution

Common Infrastructure Issues

**Resolution Steps (Azure VM and network issues):**
1. Check Azure Service Health for incidents in the affected region
2. Verify subscription quotas and limits
3. Review NSG rules and the effective security rules on the network interface
4. Enable boot diagnostics to capture the boot log
5. Use Azure Serial Console for direct access when the network path is unavailable

**Resolution Steps (Terraform state and drift issues):**
1. Run `terraform plan -refresh-only` to detect drift (the standalone `terraform refresh` command is deprecated)
2. Use `terraform import` to bring manually created resources under management
3. Release a stale state lock with `terraform force-unlock <LOCK_ID>`, only after confirming no other run is active
4. Pin provider versions in `versions.tf`
5. Document manual changes and reverse them
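
Steps 2 and 4 can also be expressed in configuration. The sketch below shows a pinned `versions.tf` and a config-driven `import` block (available from Terraform 1.5) as an alternative to the `terraform import` command. The version constraint, the resource address, and the resource ID are placeholders chosen for illustration, not team standards.

```hcl
# versions.tf: pin Terraform and the provider so plans are reproducible.
terraform {
  required_version = ">= 1.5.0" # import blocks need Terraform 1.5 or later

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100" # example constraint; use the version your modules are tested against
    }
  }
}

# Config-driven import of a manually created VM.
# The resource address and the ID are placeholders.
import {
  to = azurerm_linux_virtual_machine.example
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/rg-example/providers/Microsoft.Compute/virtualMachines/vm-example"
}
```

The `to` address must match a resource block in the configuration. If none exists yet, `terraform plan -generate-config-out=generated.tf` can draft one for review.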

Debugging Toolkit

Azure CLI Debugging Commands:

```bash
# Check VM status and properties (--show-details adds powerState to the output)
az vm show --show-details --resource-group rg-epic-prod --name vm-epic-app-01 \
  --query '{name:name,powerState:powerState,provisioningState:provisioningState}'

# Review recent activity logs for resource group
az monitor activity-log list --resource-group rg-epic-prod \
  --max-events 10 --query '[].{Time:eventTimestamp,Operation:operationName.value,Status:status.value}'

# Check network security group rules
az network nsg rule list --resource-group rg-epic-prod --nsg-name nsg-epic-app \
  --query '[].{Name:name,Priority:priority,Access:access,Protocol:protocol,Direction:direction}'

# Get VM boot diagnostics
az vm boot-diagnostics get-boot-log --resource-group rg-epic-prod --name vm-epic-app-01

# Show OS disk size and caching configuration (not in-guest disk usage)
az vm show --resource-group rg-epic-prod --name vm-epic-app-01 \
  --query 'storageProfile.osDisk.{name:name,diskSizeGb:diskSizeGb,caching:caching}'
```

Terraform Debugging Commands:

```bash
# Enable detailed logging
export TF_LOG=DEBUG
export TF_LOG_PATH=terraform-debug.log

# Plan with detailed exit codes (0 = no changes, 1 = error, 2 = changes present)
terraform plan -detailed-exitcode -out=plan.tfplan

# Show current state in human-readable format
terraform show

# Check for configuration drift (state-only refresh)
terraform plan -refresh-only

# Import existing Azure resources (use the current resource type, not the deprecated azurerm_virtual_machine)
terraform import azurerm_linux_virtual_machine.example /subscriptions/.../resourceGroups/.../providers/Microsoft.Compute/virtualMachines/vm-name

# Validate configuration
terraform validate

# Check formatting
terraform fmt -check -diff
```

Ansible Debugging:

```bash
# Run playbook with increased verbosity
ansible-playbook -vvv playbook.yml

# Check inventory and host connectivity
ansible all -m ping --inventory inventory/

# Dry-run only the tasks tagged "configuration"
ansible-playbook playbook.yml --tags "configuration" --check

# Gather facts from hosts
ansible epic_app_servers -m setup --inventory inventory/

# Check playbook syntax without running it
ansible-playbook --syntax-check playbook.yml
```

Performance Optimization

Database Performance Tuning

!!! example "Epic Database Optimization"

    **Index Optimization:**

    ```sql
    -- Monitor slow queries
    SELECT TOP 10
        qs.total_elapsed_time / 1000 / 1000 AS [Total Time (s)],
        qs.execution_count,
        qs.total_elapsed_time / qs.execution_count / 1000 AS [Avg Time (ms)],
        SUBSTRING(
            st.text,
            (qs.statement_start_offset / 2) + 1,
            ((CASE
                WHEN qs.statement_end_offset = -1
                    THEN LEN(CONVERT(nvarchar(max), st.text)) * 2
                ELSE qs.statement_end_offset
              END - qs.statement_start_offset) / 2) + 1
        ) AS query_text
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
    ORDER BY qs.total_elapsed_time DESC;
    ```

**Connection Pool Configuration:**
```yaml
# Epic database connection settings
epic_db_config:
  max_connections: 100
  initial_pool_size: 10
  connection_timeout: 30
  idle_timeout: 300
  validation_query: "SELECT 1"
```

Application Scaling Patterns

```hcl
resource "azurerm_linux_virtual_machine_scale_set" "epic_app" {
  name                = "${var.environment}-epic-app-vmss"
  resource_group_name = var.resource_group_name
  location            = var.location
  sku                 = var.vm_size
  instances           = var.initial_capacity
  admin_username      = var.admin_username
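  upgrade_mode        = "Manual"

  # The os_disk, network_interface, source_image_reference, and admin_ssh_key
  # blocks are required by the provider and omitted here for brevity.
  # automatic_os_upgrade_policy is valid only when upgrade_mode is
  # "Automatic" or "Rolling", so it is not set for manual upgrades.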

  tags = var.common_tags
}

# Auto-scaling rules
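resource "azurerm_monitor_autoscale_setting" "epic_app" {
  name                = "${var.environment}-epic-app-autoscale"
  resource_group_name = var.resource_group_name
  location            = var.location
  target_resource_id  = azurerm_linux_virtual_machine_scale_set.epic_app.id

  profile {
    name = "default"

    capacity {
      default = 2
      minimum = 1
      maximum = 10
    }

    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.epic_app.id
        operator           = "GreaterThan"
        threshold          = 75
        time_aggregation   = "Average"
        time_window        = "PT5M"
        time_grain         = "PT1M"
        statistic          = "Average"
      }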

      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT5M"
      }
    }
  }
}
```
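
The autoscale profile above defines only a scale-out rule, so capacity would stay at its high-water mark after a load spike. A matching scale-in rule could be sketched as follows. It is a fragment meant to sit inside the existing `profile` block, and the threshold, window, and cooldown values are assumptions rather than Epic-validated settings.

```hcl
    # Fragment: place inside the existing `profile` block, after the scale-out rule.
    rule {
      metric_trigger {
        metric_name        = "Percentage CPU"
        metric_resource_id = azurerm_linux_virtual_machine_scale_set.epic_app.id
        operator           = "LessThan"
        threshold          = 25 # assumed value; keep a gap below the scale-out threshold to avoid flapping
        time_aggregation   = "Average"
        time_window        = "PT10M"
        time_grain         = "PT1M"
        statistic          = "Average"
      }

      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT10M"
      }
    }
```

A longer window and cooldown on the way down than on the way up is a common choice, since removing capacity too eagerly costs more than holding it briefly.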

Load Balancer Optimization

**Session Affinity for Epic:**
```hcl
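resource "azurerm_lb_rule" "epic_app" {
  name                           = "epic-app-rule"
  loadbalancer_id                = var.loadbalancer_id                # required; placeholder variable
  frontend_ip_configuration_name = var.frontend_ip_configuration_name # required; placeholder variable
  protocol                       = "Tcp"
  frontend_port                  = 443
  backend_port                   = 443
  enable_floating_ip             = false
  idle_timeout_in_minutes        = 4
  load_distribution              = "SourceIP" # Session affinity (hash of source and destination IP)
  disable_outbound_snat          = false
}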
```

Advanced Infrastructure Practices

Disaster Recovery Preparation

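**Backup Policy:**
```hcl
resource "azurerm_backup_policy_vm" "epic" {
  name                = "${var.environment}-epic-vm-backup" # placeholder name
  resource_group_name = var.resource_group_name
  recovery_vault_name = var.recovery_vault_name

  backup {
    frequency = "Daily"
    time      = "23:00"
  }

  retention_daily {
    count = 30
  }

  retention_weekly {
    count    = 12
    weekdays = ["Sunday"]
  }

  retention_monthly {
    count    = 12
    weekdays = ["Sunday"]
    weeks    = ["First"]
  }
}
```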

**Cross-Region Replication:**
```hcl
resource "azurerm_site_recovery_replicated_vm" "epic_dr" {
  name                                      = "${var.vm_name}-asr"
  resource_group_name                       = var.dr_resource_group_name
  recovery_vault_name                       = var.recovery_vault_name
  source_recovery_fabric_name               = var.source_fabric_name
  source_vm_id                              = azurerm_linux_virtual_machine.epic_app.id
  recovery_replication_policy_id            = var.replication_policy_id
  source_recovery_protection_container_name = var.source_container_name

  target_resource_group_id                = var.target_resource_group_id
  target_recovery_fabric_id               = var.target_fabric_id
  target_recovery_protection_container_id = var.target_container_id
}
```

Infrastructure Resilience

!!! example "Multi-Region Deployment"

    **Regional Failover Configuration:** The Epic on Azure standard region pair is westus2 (primary) ↔ westcentralus (secondary).

    ```hcl
    # Primary region resources
    module "epic_primary" {
      source = "../modules/epic-infrastructure"

      environment = var.environment
      region      = "westus2"
      is_primary  = true

      # Database configuration
      db_failover_group_enabled = true
      db_geo_backup_enabled     = true

      tags = merge(var.common_tags, {
        Region = "Primary"
      })
    }

    # Secondary region resources
    module "epic_secondary" {
      source = "../modules/epic-infrastructure"

      environment = var.environment
      region      = "westcentralus"
      is_primary  = false

      # Reference primary region resources
      primary_db_server_id = module.epic_primary.db_server_id

      tags = merge(var.common_tags, {
        Region = "Secondary"
      })
    }

    # Traffic Manager for DNS failover
    resource "azurerm_traffic_manager_profile" "epic" {
      name                = "${var.environment}-epic-tm"
      resource_group_name = var.resource_group_name

      traffic_routing_method = "Priority"

      dns_config {
        relative_name = "${var.environment}-epic"
        ttl           = 30
      }

      monitor_config {
        protocol                     = "HTTPS"
        port                         = 443
        path                         = "/health"
        interval_in_seconds          = 30
        timeout_in_seconds           = 10
        tolerated_number_of_failures = 3
      }
    }
    ```
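
A Traffic Manager profile with `Priority` routing only fails over once endpoints are attached to it. The sketch below shows one way the two regions might be registered. It assumes each module exposes a `public_ip_id` output, which is hypothetical and not documented here, and the endpoint names are placeholders.

```hcl
# Sketch only: the module output `public_ip_id` is assumed, not documented.
resource "azurerm_traffic_manager_azure_endpoint" "primary" {
  name               = "epic-primary"
  profile_id         = azurerm_traffic_manager_profile.epic.id
  target_resource_id = module.epic_primary.public_ip_id
  priority           = 1 # lowest number receives traffic while healthy
}

resource "azurerm_traffic_manager_azure_endpoint" "secondary" {
  name               = "epic-secondary"
  profile_id         = azurerm_traffic_manager_profile.epic.id
  target_resource_id = module.epic_secondary.public_ip_id
  priority           = 2 # used only when the primary endpoint is unhealthy
}
```

With the 30-second probe interval and three tolerated failures configured on the profile, detection takes on the order of a couple of minutes before DNS answers switch to the secondary, plus the 30-second TTL for clients to pick up the change.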

Performance Monitoring Integration

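**Diagnostic Settings:**
```hcl
# azurerm 3.x syntax; later provider releases replace `log` with `enabled_log`.
resource "azurerm_monitor_diagnostic_setting" "epic" {
  name               = "${var.environment}-epic-diagnostics" # placeholder name
  target_resource_id = var.target_resource_id               # placeholder variable
  storage_account_id = var.diagnostics_storage_account_id   # placeholder variable

  log {
    category = "Administrative"
    enabled  = true

    retention_policy {
      enabled = true
      days    = 90
    }
  }

  metric {
    category = "AllMetrics"
    enabled  = true

    retention_policy {
      enabled = true
      days    = 90
    }
  }
}
```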

**Custom Metrics Collection:**
```yaml
# Ansible task for custom metrics
- name: Deploy Azure Monitor Agent
  include_role:
    name: azure-monitor-agent
  vars:
    ama_config:
      data_collection_rules:
        - name: "epic-performance-metrics"
          performance_counters:
            - "\\Processor(_Total)\\% Processor Time"
            - "\\Memory\\Available MBytes"
            - "\\LogicalDisk(C:)\\% Free Space"
            - "\\Network Interface(*)\\Bytes Total/sec"
          custom_logs:
            - path: "/opt/epic/logs/application.log"
              format: "json"
```

Getting Help with Infrastructure

!!! question "Infrastructure Questions"

    - Terraform Issues: Check Terraform Enterprise documentation
    - Ansible Problems: Review AWX Platform guides
    - Azure Troubleshooting: Use Azure CLI debugging commands
    - Performance Issues: Check Monitoring Strategy

!!! question "Emergency Procedures"

    - Production Outages: Contact on-call engineer immediately
    - Infrastructure Failures: Follow Incident Management process
    - Security Incidents: Escalate through security team channels
    - Data Recovery: Engage disaster recovery procedures

Infrastructure Practices | Epic on Azure Team Guidelines

These practices ensure reliable, scalable infrastructure supporting critical healthcare systems. Contribute improvements based on operational experience.

Last updated: September 2025