Monitoring | Updated July 3, 2026

Azure Metric Alert Terraform Code - Reference

Tags: monitoring, alerting, terraform, infrastructure-as-code, azure-monitor, metrics, automation, epic

Azure Metric Alert Terraform Code Documentation

Local Variables Used by All Alerts

The following local Terraform variables are used by every alert:

  • environment: Used for naming conventions and tagging.

  • resource_location: The Azure region where the resource is deployed.

  • target_name: The name of the resource that the alert targets. It is intended for virtual machine metric alerts.

  • recipients: The list of email recipients that receive the metric alerts.
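
As a minimal sketch, these locals could be declared as follows. Every value here is a hypothetical placeholder, and the exact format expected for recipients (a single address or a list) is an assumption that should be checked against the module.

```hcl
locals {
  environment       = "prod"                    # hypothetical environment label used in names and tags
  resource_location = "westus3"                 # Azure region API name
  target_name       = "example-vm-001"          # hypothetical virtual machine name
  recipients        = "epic-alerts@example.com" # hypothetical address; use the format the module expects
}
```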

Alert Monitoring and Troubleshooting

All metric alerts, when fired, are automatically forwarded to Splunk for centralized monitoring and analysis. This provides a unified view of all alert activity across the Epic infrastructure.

Splunk Integration

When any metric alert triggers (critical, warning, or informational), the alert data is sent to Splunk through configured Event Hubs. This allows for:

  • Centralized Alert Monitoring: View all Epic infrastructure alerts in one location
  • Historical Analysis: Track alert patterns and trends over time
  • Correlation Analysis: Identify relationships between different alerts and services
  • Operational Dashboards: Create custom dashboards for monitoring specific services or environments

Finding Alerts in Splunk

To search for metric alerts in Splunk, use the following search query:

index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert

This search will return all Azure Monitor metric alerts that have been triggered across the Epic infrastructure.

Common Splunk Search Refinements

You can refine your search to focus on specific aspects:

# Search for critical alerts only
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert "data.data.status"=Activated "data.data.context.severity"=0

# Search for alerts from a specific resource type (e.g., NetApp)
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert "data.data.context.resourceType"="Microsoft.NetApp/netAppAccounts/capacityPools/volumes"

# Search for alerts within a specific time range
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert earliest=-24h latest=now

# Search for alerts from a specific environment or region
index=cloud_epic_azure_nw "data.schemaId"=AzureMonitorMetricAlert reg=westus3 "epicpro"

Alert Data Structure in Splunk

Each alert entry in Splunk contains detailed information about the triggered alert, including:

  • Alert Details: Alert name, severity, status (Activated/Resolved)
  • Resource Information: Resource ID, resource type, resource group
  • Metric Data: Metric name, threshold values, actual values
  • Timing Information: When the alert fired and evaluation windows
  • Environment Context: Region, environment tags, and other metadata

This comprehensive logging enables effective monitoring, troubleshooting, and operational insights across the entire Epic infrastructure.

Metric Alert Baselines

Example: the code below defines a virtual machine alert for CPU. You can reuse this baseline for other resource types, such as load balancers, ExpressRoute circuits, or other services. To do so, set the scope variable to the resource's data.resource.id and the target_name variable to its data.resource.name.

module "metric_alerts_cpu" { # name the module after the alert type, for example metric_alerts_loadbalancer or metric_alerts_express_route

  source  = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic" # private registry address of the module

  version = "1.4.0" # module version; pinning a version is highly recommended as a best practice

  short_name = "VM" # the nickname of the metric alert

  # Optional: the resource that the alert monitors. Many examples use virtual machines (VMs), but you can target other resource types such as load balancers, application gateways, or storage accounts.
  scope = [data.azurerm_lb.loadbalancer.id]

  explanation = "CPU usage is high" # explains what the alert does; shown in the alert name in the Azure UI

  target_resource_location = local.resource_location # region that Terraform deploys to; for alerts on resources other than virtual machines, use data.resource.location

  target_resource_type     = "Microsoft.Compute/virtualMachines" # API name of the resource type that you are targeting

  target_name              = local.target_name # defaults to a virtual machine; for other resource types, use data.resource.name (the resource name) so the alert points to that resource

  metric_name              = "Percentage CPU" # This is the API name of the measurement for thresholds

  resource_group_name      = azurerm_resource_group.resource_group.name # which resource group you want to use


  email_recipients = {

    prod_action_group = { # map of email recipients

      name =  "prod_action_group" # name to give this entry

      email_address = "${local.recipients}" # uses local.recipients

      use_common_alert_schema = true # whether to use the common alert schema (true or false)

    }

  }

  event_hub = {

    event_hub_cloudtest = { # map of Event Hub settings

      name                    = "lp-cl-westus3-eventhub-cc751735" # name of eventhub

      event_hub_namespace     = "lp-cl-westus3-eventhub-cc751735" # name of eventhub namespace

      event_hub_name          = "diagnostic-logs" # name of eventhub

      use_common_alert_schema = false # if you want to use common alert schema true or false

    }

  }

  alerts = {

    critical = {

      threshold      = 95 # value that the metric is compared against to trigger the alert

      severity       = 0 # severity of the alert, from 0 (most severe) to 4 (least severe)

      aggregation    = "Average" # aggregation type used to evaluate the metric

      operator       = "GreaterThanOrEqual" # comparison operator

      severity_name  = "critical" # name of the severity, such as critical, warning, or verbose

      metric_details = "Percentage CPU Greater than 95 ${local.environment} ${local.resource_location}" # description of the alert shown in the Azure UI

      frequency      = "PT1M" # how often the alert rule is evaluated (ISO 8601 duration; PT1M is one minute)

      window_size    = "PT5M" # time range over which the metric is evaluated (PT5M is five minutes)

    }

  }

}
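
To make the adaptation to other resource types concrete, here is a minimal sketch of the same baseline pointed at a load balancer. The module inputs are the ones shown in the baseline. The module label, the load balancer name and resource group, and the choice of the DipAvailability (health probe status) metric with its threshold are illustrative assumptions, not values taken from the workspace.

```hcl
# Look up an existing load balancer (name and resource group are hypothetical).
data "azurerm_lb" "loadbalancer" {
  name                = "example-lb"
  resource_group_name = "example-network-rg"
}

module "metric_alerts_loadbalancer" {
  source  = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
  version = "1.4.0"

  short_name               = "LB"
  scope                    = [data.azurerm_lb.loadbalancer.id] # resource ID from the data source
  explanation              = "Load balancer health probe status is low"
  target_resource_location = data.azurerm_lb.loadbalancer.location # location from the data source, not local.resource_location
  target_resource_type     = "Microsoft.Network/loadBalancers"
  target_name              = data.azurerm_lb.loadbalancer.name # resource name from the data source
  metric_name              = "DipAvailability" # assumed metric; check the metrics available for your load balancer SKU
  resource_group_name      = azurerm_resource_group.resource_group.name

  email_recipients = {
    prod_action_group = {
      name                    = "prod_action_group"
      email_address           = "${local.recipients}"
      use_common_alert_schema = true
    }
  }

  event_hub = {
    event_hub_cloudtest = {
      name                    = "lp-cl-westus3-eventhub-cc751735"
      event_hub_namespace     = "lp-cl-westus3-eventhub-cc751735"
      event_hub_name          = "diagnostic-logs"
      use_common_alert_schema = false
    }
  }

  alerts = {
    critical = {
      threshold      = 90 # illustrative threshold
      severity       = 0
      aggregation    = "Average"
      operator       = "LessThan"
      severity_name  = "critical"
      metric_details = "Health Probe Status Less than 90 ${local.environment} ${local.resource_location}"
      frequency      = "PT1M"
      window_size    = "PT5M"
    }
  }
}
```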

Alert Silence Rule

There are several types of alert silence rules. A silence rule can be based on resource group, resource type, or metric alert name. Review the layout of the Terraform code to determine which option best fits your case.

module "alert_processing_rule_silenced_alert_rule" {

  source  = "terraform.uhg.com/uhg-customer-modules/private-registry-alert-processing-rule-suppression/epic" # private registry address of the module

  version = "1.5.0" # module version; pinning a version is highly recommended as a best practice

  alert_processing_rule_suppression_name =  "Alert-Processing-Rule-Suppression" # name of the alert processing rule, as shown in the Azure UI

  schedule_enabled                       = true # whether the rule follows a schedule (temporary) or is permanent; you may have to enable it anyway to be able to customize recurrence

  # optional schedule_enabled values:
  effective_from                         = "2025-03-06T01:00:00"
  effective_until                        = "2225-03-06T01:00:00"

  recurrence_daily_enabled               = true # whether the rule recurs every day

  # optional recurrence_daily_enabled values:
  start_time                             = "23:59:59" # time at which the rule starts each day
  end_time                               = "00:00:00" # time at which the rule ends each day

  recurrence_weekly_enabled              = true # whether the rule recurs weekly

  # optional recurrence_weekly_enabled values (start_time and end_time are the same arguments as above; an argument can be set only once per module block):
  days_of_week                           = ["Monday","Tuesday"] # days on which the rule is active
  # start_time                           = "23:59:59" # time at which the rule starts on those days
  # end_time                             = "00:00:00" # time at which the rule ends on those days

  time_zone                              = "Central Standard Time" # time zone used for the rule's dates and times

  resource_group_name                    = azurerm_resource_group.resource_group.name # resource group that the rule is deployed to

  short_name                             = "ohemr APRS" # short name of the rule

  explanation                            = "notifications have been silenced due to a maintenance window." # explains what the rule does

  condition                              = true # enables a condition filter; the rule can silence by resource group, resource type, or alert name (see the Terraform code to understand how to switch between the options)

  # optional values: switch between silencing by alert rule, resource group, or resource type.

  # silence alert rule option:
  alert_rule_name_enabled                = true # whether this filter is enabled

  alert_rule_name_enabled_operator       = "Equals" # comparison operator

  alert_rule_name_enabled_values         = ["Metric alert to silence"] # names of the alert rules to silence

  # silence resource type option:
  target_resource_type_enabled           = true # whether this filter is enabled

  target_resource_type_operator          = "Equals" # comparison operator

  target_resource_type_values            = ["Microsoft.Compute/virtualMachines", "Microsoft.Network/loadBalancers"] # resource types to silence

  # silence resource group option:
  target_resource_group_enabled          = true # whether this filter is enabled

  target_resource_group_operator         = "Contains" # comparison operator

  target_resource_group_values           = "resource_group_name" # resource group to silence

}
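
As a rough illustration of choosing a single option, the sketch below silences one metric alert rule during a weekly window. It uses only the inputs shown above. The module label, rule name, schedule values, and the assumption that unused filters can simply be set to false are hypothetical and should be checked against the module's variables.

```hcl
module "alert_processing_rule_silence_weekend_patching" { # hypothetical module label
  source  = "terraform.uhg.com/uhg-customer-modules/private-registry-alert-processing-rule-suppression/epic"
  version = "1.5.0"

  alert_processing_rule_suppression_name = "Weekend-Patching-Suppression" # hypothetical name
  short_name                             = "ohemr APRS"
  explanation                            = "notifications are silenced during the weekend patching window."
  resource_group_name                    = azurerm_resource_group.resource_group.name

  # Weekly schedule: Saturdays from 01:00 to 05:00 Central time (illustrative values)
  schedule_enabled          = true
  effective_from            = "2025-03-06T01:00:00"
  effective_until           = "2225-03-06T01:00:00"
  recurrence_daily_enabled  = false
  recurrence_weekly_enabled = true
  days_of_week              = ["Saturday"]
  start_time                = "01:00:00"
  end_time                  = "05:00:00"
  time_zone                 = "Central Standard Time"

  # Filter: silence a single alert rule by name
  condition                        = true
  alert_rule_name_enabled          = true
  alert_rule_name_enabled_operator = "Equals"
  alert_rule_name_enabled_values   = ["Metric alert to silence"]
  target_resource_type_enabled     = false # assumption: unused filters can be disabled
  target_resource_group_enabled    = false # assumption: unused filters can be disabled
}
```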

Windows Disk Space Metric Alerts

The disk space monitoring alerts track available disk space percentages across Windows virtual machine logical disks. These alerts help prevent storage-related outages by providing early warning when disk capacity is running low.

[!IMPORTANT] This alert is specifically designed for Windows Virtual Machines only. It uses the Azure.VM.Windows.GuestMetrics namespace which requires the Azure Monitor Agent to be installed on Windows VMs. For Linux VMs, a separate disk space monitoring solution is required.

Key Characteristics

  • Metric Name: LogicalDisk % Free Space
  • Metric Namespace: Azure.VM.Windows.GuestMetrics (requires Azure Monitor Agent)
  • Resource Type: Microsoft.Compute/virtualMachines
  • Alert Type: Percentage-based threshold monitoring

Why This Alert Is Important

Low disk space can cause:

  • Application failures and crashes
  • Database corruption
  • Log file write failures
  • Service interruptions
  • System instability

Alert Configuration

module "metric_alerts_disk_space" {
  source                   = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
  version                  = "1.7.5"
  short_name               = "DISK"
  explanation              = "Low disk space detected. Investigation: Check disk usage trends, identify large files/folders, review application logs for disk space issues. Remediation: Clean up old logs, expand disk size, move data to alternate storage."
  target_resource_location = local.resource_location
  target_resource_type     = "Microsoft.Compute/virtualMachines"
  target_name              = local.target_name
  metric_name              = "LogicalDisk % Free Space"
  metric_namespace         = "Azure.VM.Windows.GuestMetrics"
  resource_group_name      = azurerm_resource_group.ohemr-rg.name
  scopes                   = concat(local.alert_scopes_app_rg_ids, local.alert_scopes_odb_rg_ids)

  email_recipients = {
    prod_action_group = {
      name                    = "prod_action_group"
      email_address           = "${local.recipients}"
      use_common_alert_schema = true
    }
  }

  event_hub = {
    event_hub_npd = {
      name                    = "As per region"
      event_hub_namespace     = "As per region"
      event_hub_name          = "diagnostic-logs"
      use_common_alert_schema = false
    }
  }

  alerts = {
    critical = {
      threshold      = 10  # Less than 10% free space
      severity       = 0
      aggregation    = "Average"
      operator       = "LessThan"
      severity_name  = "critical"
      metric_details = "Disk Free Space Less than 10% ${local.environment} ${local.resource_location}"
      frequency      = "PT5M"
      window_size    = "PT15M"
    }

    warning = {
      threshold      = 15  # Less than 15% free space
      severity       = 2
      aggregation    = "Average"
      operator       = "LessThan"
      severity_name  = "warning"
      metric_details = "Disk Free Space Less than 15% ${local.environment} ${local.resource_location}"
      frequency      = "PT5M"
      window_size    = "PT15M"
    }
  }
}

Investigation Steps

When this alert fires:

  1. Identify the disk: Check which logical disk (C:, D:, etc.) triggered the alert
  2. Review disk usage trends: Look at historical data to understand growth patterns
  3. Find large files/folders:
    • Use TreeSize or PowerShell to identify space consumers
    • Check application log directories
    • Review temp folders and user profiles
  4. Check application logs: Look for disk-related errors or warnings

Alert Processing Rule

This alert processing rule is a stopgap for backup vault alerts, because Microsoft does not currently provide updated alerts for them. The rule uses an action group to redirect Microsoft classic alerts to Event Hubs and email. Continue using this processing rule until a modern implementation is established.

module "alert_processing_rule" {

  source  = "terraform.uhg.com/uhg-customer-modules/private-registry-alert-processing-rule/epic" # private registry address of the module

  version = "1.1.1" # module version; pinning a version is highly recommended as a best practice

  alert_processing_rule_name = "Ohemr-Alert-Processing-Rule-Backup" # name of the alert processing rule

  short_name                 = "Ohemr APR" # nickname of the alert processing rule

  explanation                = "is experiencing issues affecting backup and restore operations" # explains what the alert processing rule reports

  alert_processing_rule_operator = "Equals" # comparison operator

  alert_processing_rule_value    = ["Azure Backup"] # value(s) that the operator compares against

  resource_group_name            = azurerm_resource_group.resource_group.name # resource group that the rule is deployed to

  metric_name                    = "Backup Vault" # metric name


  email_recipients = {

    prod_action_group = { # map of email recipients

      name =  "prod_action_group" # name to give this entry

      email_address = "${local.recipients}" # uses local.recipients

      use_common_alert_schema = true # whether to use the common alert schema (true or false)

    }

  }

  event_hub = {

    event_hub_cloudtest = { # map of Event Hub settings

      name                    = "lp-cl-westus3-eventhub-cc751735" # name of eventhub

      event_hub_namespace     = "lp-cl-westus3-eventhub-cc751735" # name of eventhub namespace

      event_hub_name          = "diagnostic-logs" # name of eventhub

      use_common_alert_schema = false # if you want to use common alert schema true or false

    }

  }

}

Service Health and Administrative Alerts

When entering values for Service Health alerts, check the values in the Azure UI. Service Health uses its own values rather than the traditional API names. The best example is the service_health_locations variable, which takes Service Health location values (for example, "West US 3") rather than the API region names (for example, westus3).

If you are using an administrative alert or a similar category, you only need to change the category value. However, you must also set some sort of filter, such as resource types or operator name.

module "activity_log_alert_rule_service_health" { # for an administrative alert, name the module accordingly, for example activity_log_alert_rule_administrative

  source  = "terraform.uhg.com/uhg-customer-modules/registry-activity-log-alert/private" # private registry address of the module

  version = "1.4.0" # module version; pinning a version is highly recommended as a best practice

  activity_log_alert_name = "Activity Log Service Health" # activity log alert name

  resource_group_name     = azurerm_resource_group.ohemr-rg.name # which resource group should the alert be deployed to

  resource_group_location = "global" # location that the alert is deployed to; for Service Health it is always global

  description             = "Service Health of express route, load balancer, and virtual machines" # description of what the activity log alert rule covers

  category                = "ServiceHealth" # activity log alert category, for example ServiceHealth or Administrative

  metric_name             = "Activity Log Service Health" # name of the metric you are using

  short_name              = "ohemr LGA" # nickname of the alert

  action_group_details    = "Priority Resources" # additional description of your action group; shown in the Azure UI

  # Optional (administrative alerts): the operation performed on a resource.
  # The variable name follows this guide's wording; confirm it against the module's inputs.
  operator_name = "Microsoft.Sql/servers/firewallRules/write"

  # Optional (Service Health alerts):
  service_health = {

    service_health_priority_services = { # map of Service Health settings

      service_health_locations = ["West US 3"] # locations that Service Health monitors

      services                 = ["Load Balancer"] # Azure services that Service Health monitors

      events                   = ["Incident", "Security", "Maintenance", "ActionRequired"] # types of notifications to monitor
    }

  }

  email_recipients = {

    prod_action_group = { # map of email recipients

      name =  "prod_action_group" # name to give this entry

      email_address = "${local.recipients}" # uses local.recipients

      use_common_alert_schema = true # whether to use the common alert schema (true or false)

    }

  }

  event_hub = {

    event_hub_cloudtest = { # map of Event Hub settings

      name                    = "lp-cl-westus3-eventhub-cc751735" # name of eventhub

      event_hub_namespace     = "lp-cl-westus3-eventhub-cc751735" # name of eventhub namespace

      event_hub_name          = "diagnostic-logs" # name of eventhub

      use_common_alert_schema = false # if you want to use common alert schema true or false

    }

  }

}

Log search alert

This log search alert covers patching failures. The alert counts the patching failures that occurred and sends the result to the patching team through email and Splunk.

module "log-search-fail-patch-jobs" {

  source                  = "terraform.uhg.com/uhg-customer-modules/private-registry-log-search-alerts/epic" # private registry address of the module

  version                 = "1.1.3" # module version; pinning a version is highly recommended as a best practice

  resource_group_name     = azurerm_resource_group.ohemr-rg.name # resource group to use

  resource_group_location = local.resource_location # location that the alert is deployed to

  metric_name             = "log search"  # metric name

  short_name              = "ohemr lgs" # nickname of the alert

  action_group_details    = "failed patch jobs" # additional description of your action group; shown in the Azure UI

  identity_type           = "SystemAssigned" # identity type used by the alert; to use your own identity, you have to modify the private registry module

  log_search_alerts = {
    log_search_alert_1 = {

      metric_details                   = "patch failures" # description of the alert shown in the Azure UI

      evaluation_frequency             = "PT10M" # how often the alert rule is evaluated

      window_duration                  = "PT10M" # time range over which the query is evaluated

      severity                         = 0 # severity of the alert

      auto_mitigation_enabled          = false # whether the alert resolves automatically when the alert condition is no longer met

      workspace_alerts_storage_enabled = false # whether alert data (such as triggered alerts, alert history, or alert metadata) is stored in a Log Analytics workspace

      description                      = "detects failed patching jobs" # description of the alert shown in the Azure UI

      display_name                     = "Ohemr Alert Failed Patch Jobs" # display name of the alert

      enabled                          = true # whether the alert rule is active or disabled

      query_time_range_override        = null # overrides the default time range used when executing the Kusto query for the alert

      skip_query_validation            = true # whether validation of the Kusto query is skipped during deployment

      time_aggregation_method          = "Count" # how values are aggregated over the time window before being evaluated against the threshold

      threshold                        = 0 # value that the result must exceed to trigger the alert

      operator                         = "GreaterThan" # comparison operator, such as GreaterThan or LessThan

      metric_measure_column            = null # column of the Kusto query that contains the numerical values to evaluate

    }
  }

  dimensions = {
    dimension1 = {
      name     = "vmResourceId" # name of the dimension
      operator = "Include" # comparison operator, either Include or Exclude
      values   = ["*"] # values to include; "*" is recommended so the alert is broken down by each individual value
    }
  }

  # put your KQL query between <<-QUERY and QUERY
  log_search_query = <<-QUERY
  QUERY

  email_recipients = {

    prod_action_group = { # map of email recipients

      name =  "prod_action_group" # name to give this entry

      email_address = "${local.recipients}" # uses local.recipients

      use_common_alert_schema = true # whether to use the common alert schema (true or false)

    }

  }

  event_hub = {

    event_hub_cloudtest = { # map of Event Hub settings

      name                    = "lp-cl-westus3-eventhub-cc751735" # name of eventhub

      event_hub_namespace     = "lp-cl-westus3-eventhub-cc751735" # name of eventhub namespace

      event_hub_name          = "diagnostic-logs" # name of eventhub

      use_common_alert_schema = false # if you want to use common alert schema true or false

    }
  }
}

Azure NetApp Files (ANF) Metric Alerts

Azure NetApp Files monitoring uses a template-based approach to generate multiple metric alerts for volumes and capacity pools. Unlike traditional metric alerts that target individual resources, ANF alerts are generated dynamically from configuration variables and templates.

ANF Alert Architecture

The ANF monitoring system consists of three main components:

  1. Variables (variable.tf): Define metric configurations, thresholds, timing parameters, and specify which volumes and pools to monitor
  2. Local Templates (locals.tf): Create metric templates and calculate dynamic thresholds
  3. Resource Configuration: Volumes and pools are defined as variables in variable.tf with default values

Key Components

Volume Metric Templates

Each volume can monitor up to 19 different metrics:

  • Storage Metrics: volume_consumed_size, percentage_consumed_size, snapshot_size, inode
  • Performance Metrics: read_iops, write_iops, total_iops, other_iops
  • Latency Metrics: read_latency, write_latency
  • Throughput Metrics: read_throughput, write_throughput, total_throughput, other_throughput
  • Replication Metrics: replication_lag, replication_status, replication_transferring
  • Backup Metrics: backup_enabled, backup_operation_complete

Capacity Pool Metrics

Each capacity pool monitors:

  • Pool Storage: pool_consumed_size - monitors pool utilization against allocated size

Dynamic Threshold Calculation

ANF alerts use service-level performance calculations to set appropriate thresholds:

# Service level performance maps
netapp_service_level_throughput_per_tib = { Standard = 16, Premium = 64, Ultra = 128 }
netapp_service_level_iops_per_tib       = { Standard = 1024, Premium = 4096, Ultra = 8192 }

# Per-volume calculations based on allocated size and service level
netapp_volume_calculations = {
  for k, v in local.netapp_volume_inputs : k => {
    allocated_tib        = v.allocated_bytes > 0 ? v.allocated_bytes / local.bytes_per_tib : 0
    max_throughput_mibps = lookup(local.netapp_service_level_throughput_per_tib, v.service_level, 0) * (v.allocated_bytes > 0 ? v.allocated_bytes / local.bytes_per_tib : 0)
    max_iops             = lookup(local.netapp_service_level_iops_per_tib, v.service_level, 0) * (v.allocated_bytes > 0 ? v.allocated_bytes / local.bytes_per_tib : 0)
    service_level        = v.service_level
  }
}
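
To show the arithmetic behind these calculations, here is a self-contained sketch for a single 200 GiB volume at the Premium service level. The example_ names and the way the percentage thresholds are applied are assumptions for illustration; the actual locals.tf may derive its thresholds differently.

```hcl
locals {
  bytes_per_tib = 1099511627776 # 1024^4 bytes

  netapp_service_level_throughput_per_tib = { Standard = 16, Premium = 64, Ultra = 128 }
  netapp_service_level_iops_per_tib       = { Standard = 1024, Premium = 4096, Ultra = 8192 }

  # Hypothetical input: a 200 GiB volume in a Premium capacity pool
  example_volume = {
    allocated_bytes = 214748364800
    service_level   = "Premium"
  }

  # 214748364800 / 1099511627776 = 0.1953125 TiB
  example_allocated_tib = local.example_volume.allocated_bytes / local.bytes_per_tib

  # 64 MiB/s per TiB * 0.1953125 TiB = 12.5 MiB/s
  example_max_throughput_mibps = lookup(local.netapp_service_level_throughput_per_tib, local.example_volume.service_level, 0) * local.example_allocated_tib

  # 4096 IOPS per TiB * 0.1953125 TiB = 800 IOPS
  example_max_iops = lookup(local.netapp_service_level_iops_per_tib, local.example_volume.service_level, 0) * local.example_allocated_tib

  # Assumed threshold rule: a percentage of the calculated maximum
  example_read_iops_warning  = local.example_max_iops * 80 / 100 # 640
  example_read_iops_critical = local.example_max_iops * 95 / 100 # 760
}

output "example_thresholds" {
  value = {
    max_throughput_mibps = local.example_max_throughput_mibps
    max_iops             = local.example_max_iops
    read_iops_warning    = local.example_read_iops_warning
    read_iops_critical   = local.example_read_iops_critical
  }
}
```

An unknown service level falls back to 0 through the lookup default, which would produce a threshold of 0 rather than an error.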

ANF Alert Configuration Example

# Example of how ANF alerts are generated from templates
module "netapp_metric_alerts" {
  source  = "terraform.uhg.com/uhg-customer-modules/private-registry-metric-alerts/epic"
  version = "1.4.0"

  for_each = local.all_netapp_metrics

  short_name               = each.value.short_name
  scope                    = [each.value.target_scope]
  explanation              = each.value.explanation
  target_resource_location = local.resource_location
  target_resource_type     = each.value.target_resource_type
  target_name              = each.value.target_name
  metric_name              = each.value.metric_name
  resource_group_name      = azurerm_resource_group.resource_group.name

  email_recipients = {
    prod_action_group = {
      name                    = "prod_action_group"
      email_address           = local.recipients
      use_common_alert_schema = true
    }
  }

  event_hub = {
    event_hub_cloudtest = {
      name                    = "lp-cl-westus3-eventhub-cc751735"
      event_hub_namespace     = "lp-cl-westus3-eventhub-cc751735"
      event_hub_name          = "diagnostic-logs"
      use_common_alert_schema = false
    }
  }

  alerts = each.value.alerts
}

Adding New ANF Volumes/Pools to Monitoring

To add new ANF volumes or capacity pools to an existing monitoring workspace, follow these steps:

Step 1: Update variable.tf

Add your new volumes and pools to the variable definitions in variable.tf:

# NetApp Capacity Pools configuration
variable "netapp_capacity_pools" {
  type = map(object({
    allocated_size_bytes = number
    service_level        = string
    azure_name           = string
    account_name         = string
    resource_group_name  = string
  }))
  default = {
    existing_pool = {
      allocated_size_bytes = 1099511627776 # 1 TiB
      service_level        = "Standard"
      azure_name           = "Standard"
      account_name         = "ohemr-anf-epic-shared-cus-001"
      resource_group_name  = "ohemr-rg-west-epic-netapp-shared-cus-001"
    }
    new_pool = {
      allocated_size_bytes = 2199023255552 # 2 TiB
      service_level        = "Premium"
      azure_name           = "Premium"
      account_name         = "ohemr-anf-epic-shared-cus-001"
      resource_group_name  = "ohemr-rg-west-epic-netapp-shared-cus-001"
    }
  }
  description = "Map of NetApp capacity pools and their properties"
}

# NetApp Volumes configuration
variable "netapp_volumes" {
  type = map(object({
    max_allocated_bytes = number
    capacity_pool_name  = string
    azure_name          = string
    account_name        = string
    resource_group_name = string
  }))
  default = {
    existing_volume = {
      max_allocated_bytes = 107374182400 # 100 GiB
      capacity_pool_name  = "existing_pool"
      azure_name          = "epic-shared-volume-001"
      account_name        = "ohemr-anf-epic-shared-cus-001"
      resource_group_name = "ohemr-rg-west-epic-netapp-shared-cus-001"
    }
    new_volume = {
      max_allocated_bytes = 214748364800 # 200 GiB
      capacity_pool_name  = "new_pool"
      azure_name          = "epic-shared-volume-002"
      account_name        = "ohemr-anf-epic-shared-cus-001"
      resource_group_name = "ohemr-rg-west-epic-netapp-shared-cus-001"
    }
  }
  description = "Map of NetApp volumes and their properties"
}

Step 2: Verify Resource Names and Static Configuration

Ensure the Azure resource names match exactly what exists in your Azure subscription:

  • azure_name: The actual NetApp volume/pool name in Azure
  • account_name: The NetApp account containing the resources
  • resource_group_name: The resource group containing the NetApp account
  • capacity_pool_name: Must reference a key from the netapp_capacity_pools map

Important: The allocated_size_bytes, max_allocated_bytes, and service_level values in the monitoring workspace are static configurations. If volume sizes or service levels are changed directly in Azure outside of the monitoring workspace, these changes will not be automatically reflected in the alert thresholds. The monitoring workspace must be manually updated to reflect any changes made to the actual Azure NetApp Files resources.

Step 3: Validate Configuration

Run Terraform validation to ensure your configuration is correct:

terraform init
terraform validate
terraform plan

Step 4: Deploy Changes via TFE

Deploy the changes through the standard Terraform Enterprise (TFE) workflow:

  1. Create Pull Request: Submit your changes via a pull request
  2. Code Review: Wait for pull request review and approval
  3. TFE Deployment: Once approved and merged, TFE will automatically apply the changes to create new metric alerts

Note: Direct terraform apply commands are not used. All deployments go through TFE after proper review and approval process.

Step 5: Verify Alert Creation

After deployment, verify that alerts were created for your new resources:

  1. Check Azure Portal: Navigate to Monitor > Alerts to see the new metric alerts
  2. Expected Alerts Per Volume: Each volume will generate up to 19 metric alerts (one for each enabled metric)
  3. Expected Alerts Per Pool: Each capacity pool will generate 1 metric alert for pool consumed size

Important Considerations

Resource Sizing Requirements

  • Minimum Volume Size: Volumes with max_allocated_bytes = 0 will be excluded from monitoring
  • Service Level Impact: Different service levels (Standard/Premium/Ultra) have different IOPS and throughput limits that affect alert thresholds
  • Static Configuration: Alert thresholds are calculated based on the static values defined in variable.tf. Changes to actual Azure NetApp Files resources (size increases, service level changes) will not automatically update alert thresholds until the monitoring workspace variables are manually updated
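
The exclusion of zero-sized volumes can be pictured with a filtered for expression. This is only a sketch of the technique with hypothetical example_ names; how locals.tf actually performs the exclusion is not shown in this guide.

```hcl
locals {
  # Hypothetical volumes; only max_allocated_bytes matters for this filter
  example_volumes = {
    sized_volume   = { max_allocated_bytes = 107374182400 } # 100 GiB
    unsized_volume = { max_allocated_bytes = 0 }
  }

  # Keep only the volumes that have a size greater than zero
  example_monitored_volumes = {
    for key, volume in local.example_volumes : key => volume
    if volume.max_allocated_bytes > 0
  }
}

output "example_monitored_volume_keys" {
  value = keys(local.example_monitored_volumes) # ["sized_volume"]
}
```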

Alert Naming Convention

ANF alerts follow this naming pattern:

  • Format: {volume_key}_{metric_key} or {pool_key}_{metric_key}
  • Example: epic_shared_volume_001_read_iops_critical
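
One common way to produce keys of the form {volume_key}_{metric_key} is to merge one map per volume into a single flat map. The sketch below illustrates that idiom with hypothetical example_ names; it is an assumption about how a map such as local.all_netapp_metrics could be assembled, not a copy of the workspace code.

```hcl
locals {
  # Hypothetical inputs
  example_volumes = {
    epic_shared_volume_001 = { azure_name = "epic-shared-volume-001" }
    epic_shared_volume_002 = { azure_name = "epic-shared-volume-002" }
  }
  example_enabled_metrics = ["read_iops", "write_iops"]

  # Build one map per volume, then merge them into a single flat map
  example_volume_metrics = merge([
    for volume_key, volume in local.example_volumes : {
      for metric_key in local.example_enabled_metrics :
      "${volume_key}_${metric_key}" => {
        target_name = volume.azure_name
        metric_key  = metric_key
      }
    }
  ]...)
}

output "example_alert_keys" {
  # Four keys, from epic_shared_volume_001_read_iops to epic_shared_volume_002_write_iops
  value = keys(local.example_volume_metrics)
}
```

A flat map like this can be passed directly to for_each, which then creates one module instance per volume and metric pair.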

Metric Names (Azure API)

ANF uses specific Azure metric names:

  • VolumeLogicalSize (volume consumed size)
  • VolumeSnapshotSize (snapshot size)
  • ReadIops, WriteIops, TotalIops, OtherIops
  • AverageReadLatency, AverageWriteLatency (legacy metric names for Azure compatibility)
  • ReadThroughput, WriteThroughput, TotalThroughput, OtherThroughput

Troubleshooting Common Issues

  1. Volume Not Monitored: Check that max_allocated_bytes > 0
  2. Missing Alerts: Verify resource names match exactly in Azure
  3. Threshold Errors: Ensure service levels are spelled correctly (Standard, Premium, Ultra)
  4. Validation Errors: Check that capacity_pool_name references an existing pool key
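
The pool reference in item 4 can also be verified automatically. The following sketch uses a Terraform check block (available in Terraform 1.5 and later) against the netapp_volumes and netapp_capacity_pools variables from Step 1. The check name is hypothetical, and whether the workspace already performs a similar validation is not stated in this guide.

```hcl
# Reports a problem during plan or apply when a volume references a pool key that does not exist.
check "netapp_volume_pool_references" {
  assert {
    condition = alltrue([
      for volume in var.netapp_volumes :
      contains(keys(var.netapp_capacity_pools), volume.capacity_pool_name)
    ])
    error_message = "Every capacity_pool_name in netapp_volumes must match a key in netapp_capacity_pools."
  }
}
```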

Advanced Configuration

Custom Thresholds

You can customize alert thresholds by modifying the variables in variable.tf:

variable "netapp_read_iops_warning_percent" {
  type    = number
  default = 80  # 80% of maximum IOPS for warning
}

variable "netapp_read_iops_critical_percent" {
  type    = number
  default = 95  # 95% of maximum IOPS for critical
}
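
If you want to guard against mistyped percentages, a validation block can be added to such a variable. The sketch below is an optional variation that would replace the plain declaration of netapp_read_iops_warning_percent; the accepted range is an assumption.

```hcl
variable "netapp_read_iops_warning_percent" {
  type    = number
  default = 80 # 80% of maximum IOPS for warning

  validation {
    # Assumed valid range: greater than 0 and at most 100
    condition     = var.netapp_read_iops_warning_percent > 0 && var.netapp_read_iops_warning_percent <= 100
    error_message = "The warning percentage must be greater than 0 and at most 100."
  }
}
```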

Metric Selection

To disable specific metrics for all volumes, remove them from the enabled_metrics list in locals.tf:

enabled_metrics = [
  "volume_consumed_size", "inode", "percentage_consumed_size", "snapshot_size",
  "read_iops", "write_iops", "other_iops", "total_iops",
  "read_latency", "write_latency",
  # "replication_lag", "replication_status", "replication_transferring",  # Disabled
  "backup_enabled", "backup_operation_complete",
  "read_throughput", "write_throughput", "total_throughput", "other_throughput"
]