Updated July 3, 2026

Metric Alerts Configuration

Tags: monitoring, alerting, metrics, azure-monitor, thresholds, infrastructure, performance, epic

Metric Alerts

Thresholds are split into two categories: warning and critical. A warning threshold alerts the user to an upcoming issue, and resolving it is at the user's discretion. A critical threshold alerts the user to an immediate issue that must be resolved right away.

Critical Alerts Thresholds

Percentage CPU Greater than 95 - Critical: This threshold signifies that the CPU usage is extremely high and likely impacting system performance significantly. Immediate action may be required to prevent system overload.
OS Disk IOPS Consumed Percentage Greater than 98 - Critical: This threshold signifies that the OS disk is almost fully utilized in terms of IOPS, potentially causing severe performance degradation.
Data Disk IOPS Consumed Percentage Greater than 98 - Critical: This threshold signifies that the data disk is almost fully utilized in terms of IOPS, potentially causing severe performance degradation.
Virtual Machine Availability Less than 1 - Critical: This threshold indicates that the virtual machine is unavailable, which could mean it is stopped, rebooting, or experiencing a failure. Immediate action is required to restore availability.
Windows Disk Free Space Less than 10% - Critical: This threshold indicates that the disk is critically low on free space, potentially causing application failures, database corruption, log file write failures, or system instability. Immediate action is required to free up space or expand the disk.

Warning Alerts Thresholds

Percentage CPU Greater than 90 - Warning: This threshold indicates that the CPU usage is high and may start affecting performance. It's a signal to monitor the system closely.
OS Disk IOPS Consumed Percentage Greater than 95 - Warning: This threshold indicates that the OS disk is nearing its IOPS limit, which could affect the performance of applications relying on the OS disk.
Data Disk IOPS Consumed Percentage Greater than 95 - Warning: This threshold indicates that the data disk is nearing its IOPS limit, which could affect the performance of applications relying on the data disk.
Windows Disk Free Space Less than 15% - Warning: This threshold indicates that the disk is running low on free space and should be monitored closely. Proactive cleanup or capacity planning is recommended to prevent future issues.
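
The warning and critical thresholds above amount to a two-level check per metric. The following is a minimal Python sketch of that logic, using the threshold values listed in this section. `THRESHOLDS` and `classify` are hypothetical names for illustration only; they are not part of the alert templates.

```python
from typing import Optional

# (warning, critical, direction) per metric, copied from the lists above.
# "above" fires when the value is greater than the threshold,
# "below" fires when the value is less than the threshold.
THRESHOLDS = {
    "Percentage CPU": (90.0, 95.0, "above"),
    "OS Disk IOPS Consumed Percentage": (95.0, 98.0, "above"),
    "Data Disk IOPS Consumed Percentage": (95.0, 98.0, "above"),
    "Windows Disk Free Space": (15.0, 10.0, "below"),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return "Critical", "Warning", or None for a single metric sample."""
    warning, critical, direction = THRESHOLDS[metric]
    if direction == "above":
        if value > critical:
            return "Critical"
        if value > warning:
            return "Warning"
    else:
        if value < critical:
            return "Critical"
        if value < warning:
            return "Warning"
    return None
```

The critical check runs first, so a sample that crosses both thresholds is reported once, at the higher severity.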

ExpressRoute ARP and BGP

ArpAvailability: measures the availability of Address Resolution Protocol (ARP) responses on an ExpressRoute circuit, indicating whether the circuit can successfully resolve IP addresses to MAC addresses for routing traffic.

BgpAvailability: reflects the status of the Border Gateway Protocol (BGP) session, showing whether the routing connection between your on-premises network and Azure is established and functioning correctly.

Alert Backup

The backup alert currently uses the classic metric alert model. No modern metric alert can notify the user of a failing backup; the only option is an alert processing rule that reroutes the classic metric alert to the corresponding areas. We highly recommend updating this alert once Microsoft provides a modern version.

Silenced Resources

Some resources have their alerts silenced. If they need to be reactivated in the future, they are listed here.

ResourceGroups:

PCCAgentlessScanResourceGroup
PublicCloudManaged-ComputeScan
lp-central-logging
dig-security-rg-71b19b490c154-806775379468908544-use1 (pattern: dig-security-rg-<values>-use1; the middle values are not consistent across environments)

Maintenance Windows:

weekly Saturday-Sunday from 2:00 - 4:00 AM Central
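
Alert timestamps usually arrive in UTC, so checking them against this window requires a time zone conversion. This is a minimal Python sketch under two assumptions that the page does not state: "Central" is taken to mean the `America/Chicago` time zone, and the window is taken to apply on both Saturday and Sunday. The function name is hypothetical.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Assumption: "Central" means the America/Chicago time zone.
CENTRAL = ZoneInfo("America/Chicago")
WINDOW_START = time(2, 0)
WINDOW_END = time(4, 0)
WINDOW_DAYS = {5, 6}  # Monday is 0, so 5 = Saturday and 6 = Sunday


def in_maintenance_window(moment: datetime) -> bool:
    """True when a timezone-aware moment falls inside the weekly window."""
    local = moment.astimezone(CENTRAL)
    return local.weekday() in WINDOW_DAYS and WINDOW_START <= local.time() < WINDOW_END


# Saturday, July 5, 2025 at 08:00 UTC is 03:00 Central daylight time.
print(in_maintenance_window(datetime(2025, 7, 5, 8, 0, tzinfo=timezone.utc)))  # True
```

Converting through a named zone rather than a fixed UTC offset keeps the check correct across daylight saving changes.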

Environments:

cloudtest

Regions silenced:

The East region has been silenced in all environments until September 1, 2025.

Metric Alert Silenced:

Only one metric alert is silenced completely: lp-cl-resource-group-lock-removed.
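
The silencing rules in this section (resource groups, environment, region, and alert name) can be combined into a single check. This is a minimal Python sketch, not the actual alert processing rule configuration. It assumes the East region is identified by the `eus` prefix, that the region silence ends at the start of September 1, 2025, and that the variable middle part of the dig-security resource group name can be matched with a wildcard. All names in the code are hypothetical.

```python
import re
from datetime import date

# Silenced resource groups from the list above. The dig-security entry is a
# wildcard pattern because its middle segment differs between environments;
# the exact shape of that segment is an assumption.
SILENCED_RESOURCE_GROUPS = [
    r"PCCAgentlessScanResourceGroup",
    r"PublicCloudManaged-ComputeScan",
    r"lp-central-logging",
    r"dig-security-rg-.+-use1",
]
RG_PATTERN = re.compile(
    "|".join(f"(?:{p})" for p in SILENCED_RESOURCE_GROUPS), re.IGNORECASE
)
SILENCED_ENVIRONMENTS = {"cloudtest"}
SILENCED_ALERTS = {"lp-cl-resource-group-lock-removed"}
EAST_REGION = "eus"  # assumption: "East region" maps to the eus prefix
EAST_SILENCED_UNTIL = date(2025, 9, 1)


def is_silenced(resource_group: str, environment: str, region: str,
                alert_name: str, today: date) -> bool:
    """True when any of the silencing rules on this page applies."""
    if alert_name in SILENCED_ALERTS:
        return True
    if environment in SILENCED_ENVIRONMENTS:
        return True
    if region == EAST_REGION and today < EAST_SILENCED_UNTIL:
        return True
    return RG_PATTERN.fullmatch(resource_group) is not None
```

Passing `today` as a parameter makes the region expiry easy to test and makes it clear when that rule stops applying.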

Load Balancers

DipAvailability: measures the availability of backend pool instances in an Azure Load Balancer, indicating whether traffic can be successfully routed to healthy virtual machines.

VipAvailability: reflects the availability of the frontend IP address, showing whether the load balancer itself is reachable and functioning properly.

Load balancer alerts were originally created through the same metric alert template; however, that approach is outdated. Load balancer data is now sent through a diagnostic setting to an event hub. This scales much better than metric alerts because it does not rely on load balancer IDs.
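
The scalability benefit comes from evaluating whatever records arrive on the stream instead of binding an alert to each load balancer ID. This is a minimal Python sketch of such an evaluation, assuming the event body follows the usual Azure Monitor metric export shape (a `records` array whose items carry `metricName`, `average`, and `resourceId`). The threshold of 90 is taken from the baseline alert table on this page. The function name and the place where this logic would run are hypothetical.

```python
import json

# Threshold from the baseline table: availability below 90 raises an alert.
AVAILABILITY_THRESHOLD = 90.0
WATCHED_METRICS = {"DipAvailability", "VipAvailability"}


def find_breaches(event_body: str) -> list:
    """Return (resourceId, metricName, average) for records below threshold."""
    breaches = []
    for record in json.loads(event_body).get("records", []):
        if record.get("metricName") not in WATCHED_METRICS:
            continue
        average = record.get("average")
        if average is not None and average < AVAILABILITY_THRESHOLD:
            breaches.append((record.get("resourceId"), record["metricName"], average))
    return breaches
```

No load balancer ID appears in the code: a new load balancer is covered as soon as its diagnostic setting starts sending records.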

Public IP Address

Public IP address alerts are built the same way as load balancer alerts, through a diagnostic setting to an event hub, because metric alerts that rely on public IP address IDs do not scale.

Azure DDoS Protection Standard: helps detect and mitigate distributed denial-of-service (DDoS) attacks targeting your Azure resources. These alerts notify you when suspicious traffic patterns or actual attacks are detected.

Baseline alert table

Alert Description	Threshold Condition
Percentage CPU Greater than 90 - Warning	> 90%
Percentage CPU Greater than 95 - Critical	> 95%
OS Disk IOPS Consumed Percentage Greater than 95 - Warning	> 95%
OS Disk IOPS Consumed Percentage Greater than 98 - Critical	> 98%
Data Disk IOPS Consumed Percentage Greater than 95 - Warning	> 95%
Data Disk IOPS Consumed Percentage Greater than 98 - Critical	> 98%
Windows Disk Free Space Less than 15% - Warning	< 15%
Windows Disk Free Space Less than 10% - Critical	< 10%
Virtual Machine Availability Less than 100% - Critical	< 100%
ExpressRoute BGP Down - Critical	< 95%
ExpressRoute ARP Down - Critical	< 100%
ALB Data Path Availability Less than 90 - Critical	< 90%
ALB Health Probe Status Less than 90 - Error	< 90%

Alert Suppression Rules

Suppress alerts during maintenance windows
Suppress alerts for LaunchPad Resource Groups
Suppress cloud test resources

Activity Log Alerts: Administrative and Service Health

Service Health
Microsoft.Sql/servers/firewallRules/write
Microsoft.Sql/servers/firewallRules/delete
Microsoft.Network/networkSecurityGroups/write
Microsoft.Network/networkSecurityGroups/delete
Microsoft.ClassicNetwork/networkSecurityGroups/write
Microsoft.ClassicNetwork/networkSecurityGroups/delete
Microsoft.Network/networkSecurityGroups/securityRules/write
Microsoft.Network/networkSecurityGroups/securityRules/delete
Microsoft.ClassicNetwork/networkSecurityGroups/securityRules/write
Microsoft.ClassicNetwork/networkSecurityGroups/securityRules/delete

Resources that use diagnostic settings for alerts

Public IP availability
Load balancers (all types)

Azure NetApp Files

A diagnostic setting through an event hub is enabled for Azure NetApp Files so that activity logs are forwarded to Splunk.

Metrics for capacity pools and volumes within Azure NetApp Files need to be monitored, with alerts triggered when they exceed the defined thresholds.

Adding New NetApp Volumes or Pools

When adding new NetApp volumes or capacity pools to monitoring workspaces, follow these steps to ensure proper alert generation without naming conflicts:

Files to Update

variable.tf - Define the new volume or pool in the appropriate variable:
- For volumes: Add to netapp_volumes variable
- For pools: Add to netapp_capacity_pools variable
locals.tf - Add resource abbreviation mapping:
- Update the resource_prefix lookup table in netapp_short_name_map
- Create a unique 3-character abbreviation for the new resource

Resource Naming Convention

Each NetApp resource must have a unique 3-character abbreviation to generate unique action group names within the 12-character limit. The naming format is:

{region_prefix}{resource_prefix}{metric_suffix}

Region Prefixes:

wus3 - West US 3
cus - Central US
eus - East US

Resource Prefix Requirements:

Must be exactly 3 characters
Must be unique within the workspace
Should be meaningful/recognizable
Examples: kpr (kuiper), std (standard), wbn (wbs_pnw)
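
The naming format and its limits can be checked mechanically. This is a minimal Python sketch, assuming the region prefixes listed above and the 12-character limit stated in this section. `build_short_name` is a hypothetical helper, not a function in the Terraform code. The `BkpEn` suffix in the usage note is taken from the action group name examples on this page.

```python
MAX_SHORT_NAME_LENGTH = 12
REGION_PREFIXES = {"West US 3": "wus3", "Central US": "cus", "East US": "eus"}


def build_short_name(region: str, resource_prefix: str, metric_suffix: str) -> str:
    """Compose {region_prefix}{resource_prefix}{metric_suffix} and enforce limits."""
    if len(resource_prefix) != 3:
        raise ValueError(f"resource prefix '{resource_prefix}' must be exactly 3 characters")
    name = f"{REGION_PREFIXES[region]}{resource_prefix}{metric_suffix}"
    if len(name) > MAX_SHORT_NAME_LENGTH:
        raise ValueError(f"'{name}' exceeds {MAX_SHORT_NAME_LENGTH} characters")
    return name
```

For example, `build_short_name("West US 3", "wbn", "BkpEn")` returns `wus3wbnBkpEn`, which is exactly 12 characters. With the 4-character `wus3` prefix, the metric suffix therefore has at most 5 characters available.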

Example Resource Mapping:

resource_prefix = lookup({
  # West US3 NPD NetApp resources
  "wbs_pnw"      = "wbn"
  "standardpool" = "stp"
  "new_volume"   = "nvl"  # New 3-char abbreviation
}, local.netapp_resource_key_map[k].resource_key, "unk")
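
Before adding an entry to the lookup, it can help to check the abbreviation rules against the whole map. This is a minimal Python sketch that mirrors the example mapping above as a dictionary. `validate_prefix_map` is a hypothetical helper. Treating `unk` as reserved is an assumption, made because that value is the lookup's fallback.

```python
from collections import Counter


def validate_prefix_map(prefix_map: dict) -> list:
    """Return a list of problems found in a resource-key to abbreviation map."""
    problems = []
    for key, abbreviation in prefix_map.items():
        if len(abbreviation) != 3:
            problems.append(f"{key}: '{abbreviation}' is not exactly 3 characters")
        if abbreviation.lower() == "unk":
            problems.append(f"{key}: 'unk' is reserved for the lookup fallback")
    # Every abbreviation must be unique within the workspace.
    for abbreviation, count in Counter(prefix_map.values()).items():
        if count > 1:
            problems.append(f"'{abbreviation}' is used by {count} resources")
    return problems


example = {"wbs_pnw": "wbn", "standardpool": "stp", "new_volume": "nvl"}
print(validate_prefix_map(example))  # []
```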

Validation Requirements

⚠️ Critical: Always validate your changes by running terraform plan and checking for "unk" or "Unk" values in:

Action group names (short_name)
Alert descriptions
Resource identifiers

Validation Command:

terraform plan | grep -i unk

If "unk" or "Unk" appears in the output, it indicates:

Missing resource abbreviation in the resource_prefix lookup
Missing metric suffix mapping in metric_suffix lookup
Incorrect resource key mapping

Common Issues:

Resource name not found in lookup table → returns "unk"
Metric short name not mapped → returns "Unk"
Typos in resource names or mappings

Action Group Name Examples

✅ Correct: wus3wbnBkpEn (West US3 + wbs_pnw + Backup Enabled)
❌ Incorrect: wus3wbnUnk (indicates missing metric suffix mapping)
❌ Incorrect: wus3unkBkpEn (indicates missing resource abbreviation)
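
A plain `grep -i unk` shows that a fallback leaked into a name but not which lookup caused it. This is a minimal Python sketch that scans saved plan output and tells the two cases apart. It assumes the region prefixes and the `unk` / `Unk` fallback values described in this section. The function name is hypothetical, and the regular expression is a heuristic that only inspects names starting with a known region prefix.

```python
import re

# {region_prefix}{3-char resource_prefix}{metric_suffix}
SHORT_NAME = re.compile(r"\b(wus3|cus|eus)([A-Za-z0-9]{3})([A-Za-z0-9]*)\b")


def scan_plan(plan_text: str) -> dict:
    """Map each suspicious short name to the lookups that fell back."""
    findings = {}
    for match in SHORT_NAME.finditer(plan_text):
        _region, resource_prefix, metric_suffix = match.groups()
        problems = []
        if resource_prefix == "unk":
            problems.append("missing resource abbreviation")
        if metric_suffix == "Unk":
            problems.append("missing metric suffix mapping")
        if problems:
            findings[match.group(0)] = problems
    return findings
```

One way to use it is to save the plan as text (for example with `terraform plan -no-color > plan.txt`) and pass the file contents to `scan_plan`. An empty result means no fallback values were found in names that follow the convention.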

Baseline alert table

Alert Description	Threshold Condition
Volume utilization Greater than 80 - Warning	> 80%
Volume consumed size Greater than 85 - Critical	> 85%
Read IOPS Consumed Percentage Greater than 80 - Warning	> 80%
Read IOPS Consumed Percentage Greater than 90 - Critical	> 90%
Write IOPS Consumed Percentage Greater than 80 - Warning	> 80%
Write IOPS Consumed Percentage Greater than 90 - Critical	> 90%
Total IOPS Consumed Percentage Greater than 80 - Warning	> 80%
Total IOPS Consumed Percentage Greater than 90 - Critical	> 90%
Other IOPS Consumed Percentage Greater than 150 TiB - Warning	> 150 TiB
Other IOPS Consumed Percentage Greater than 200 TiB - Critical	> 200 TiB
Read throughput Percentage Greater than 90 - Critical	> 90%
Write throughput Percentage Greater than 90 - Critical	> 90%
Total throughput Percentage Greater than 90 - Critical	> 90%
Other throughput Percentage Greater than 3 MiB/s/TiB - Critical	> 3 MiB/s/TiB
Read latency Greater than 60 milliseconds - Critical	> 60 ms
Write latency Greater than 60 milliseconds - Critical	> 60 ms
Volume replication lag time Greater than 900 seconds - Critical	> 900 s
Capacity pool consumed size Greater than 85% - Critical	> 85%
Volume Inode Consumed Percentage Greater than 80 - Warning	> 80%
Volume Inode Consumed Percentage Greater than 90 - Critical	> 90%