Metric Alerts Configuration
Metric Alerts
Thresholds are split into two categories: warning and critical. A warning threshold alerts the user to an upcoming issue, and resolving it is at the user's discretion. A critical threshold alerts the user to an immediate issue that must be resolved right away.
Critical Alerts Thresholds
- Percentage CPU Greater than 95 - Critical: This threshold signifies that the CPU usage is extremely high and likely impacting system performance significantly. Immediate action may be required to prevent system overload.
- OS Disk IOPS Consumed Percentage Greater than 98 - Critical: This threshold signifies that the OS disk is almost fully utilized in terms of IOPS, potentially causing severe performance degradation.
- Data Disk IOPS Consumed Percentage Greater than 98 - Critical: This threshold signifies that the data disk is almost fully utilized in terms of IOPS, potentially causing severe performance degradation.
- Virtual Machine Availability Less than 1 - Critical: This threshold indicates that the virtual machine is unavailable, which could mean it is stopped, rebooting, or experiencing a failure. Immediate action is required to restore availability.
- Windows Disk Free Space Less than 10% - Critical: This threshold indicates that the disk is critically low on free space, potentially causing application failures, database corruption, log file write failures, or system instability. Immediate action is required to free up space or expand the disk.
Warning Alerts Thresholds
- Percentage CPU Greater than 90 - Warning: This threshold indicates that the CPU usage is high and may start affecting performance. It's a signal to monitor the system closely.
- OS Disk IOPS Consumed Percentage Greater than 95 - Warning: This threshold indicates that the OS disk is nearing its IOPS limit, which could affect the performance of applications relying on the OS disk.
- Data Disk IOPS Consumed Percentage Greater than 95 - Warning: This threshold indicates that the data disk is nearing its IOPS limit, which could affect the performance of applications relying on the data disk.
- Windows Disk Free Space Less than 15% - Warning: This threshold indicates that the disk is running low on free space and should be monitored closely. Proactive cleanup or capacity planning is recommended to prevent future issues.
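As a rough illustration of how the warning and critical thresholds relate, the following Python sketch classifies a single metric reading. It is not how the alerts are deployed (Azure Monitor evaluates the real rules). The `Threshold` class and `classify` function are hypothetical names, and the strict comparisons are an assumption based on the "Greater than" / "Less than" wording above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical container for one metric's limits."""
    warning: Optional[float]  # None when no warning level is defined
    critical: float
    fires_above: bool         # True: alert when value > limit; False: when value < limit


# Values copied from the critical and warning lists above.
THRESHOLDS = {
    "Percentage CPU": Threshold(90, 95, True),
    "OS Disk IOPS Consumed Percentage": Threshold(95, 98, True),
    "Data Disk IOPS Consumed Percentage": Threshold(95, 98, True),
    "Windows Disk Free Space": Threshold(15, 10, False),
    "Virtual Machine Availability": Threshold(None, 1, False),
}


def classify(metric: str, value: float) -> str:
    """Return 'critical', 'warning', or 'ok' for one reading."""
    limits = THRESHOLDS[metric]

    def breached(limit: Optional[float]) -> bool:
        if limit is None:
            return False
        return value > limit if limits.fires_above else value < limit

    # Check the critical level first so it takes precedence over warning.
    if breached(limits.critical):
        return "critical"
    if breached(limits.warning):
        return "warning"
    return "ok"
```

For example, `classify("Percentage CPU", 92)` falls between the two CPU limits and is reported as a warning, while a Windows disk with 8% free space is critical.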
ExpressRoute ARP and BGP
ArpAvailability: measures the availability of Address Resolution Protocol (ARP) responses on an ExpressRoute circuit, indicating whether the circuit can successfully resolve IP addresses to MAC addresses for routing traffic.
BgpAvailability: reflects the status of the Border Gateway Protocol (BGP) session, showing whether the routing connection between your on-premises network and Azure is established and functioning correctly.
Alert Backup
The backup alert currently uses the classic metric alert model. No modern metric alert notifies the user of a failing backup, so an alert processing rule is used to reroute the classic metric alert to the corresponding areas. We strongly recommend updating this once Microsoft provides a modern version.
Silenced Resources
Some resources have their alerts silenced. They are listed here in case they need to be reactivated in the future.
Resource groups:
- PCCAgentlessScanResourceGroup
- PublicCloudManaged-ComputeScan
- lp-central-logging
- dig-security-rg-71b19b490c154-806775379468908544-use1 (pattern: dig-security-rg-(value)-use1, where the value is not consistent across environments)
Maintenance Windows:
- Weekly, Saturday-Sunday, 2:00-4:00 AM Central
Environments:
- cloudtest
Region silenced:
- The East region has been silenced in all environments until September 1, 2025
Metric Alert Silenced:
- Only one metric alert is silenced completely: lp-cl-resource-group-lock-removed
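To make the maintenance-window rule concrete, here is a minimal Python sketch of the check. It assumes the window applies on both Saturday and Sunday, that the end time is exclusive, and that the caller passes a timestamp already expressed in Central time. The function and constant names are hypothetical; the real suppression is handled by the alert suppression rules, not by code like this.

```python
from datetime import datetime, time

# Assumption: the weekly window runs 2:00-4:00 AM on both Saturday and Sunday.
WINDOW_DAYS = {5, 6}        # datetime.weekday(): Saturday = 5, Sunday = 6
WINDOW_START = time(2, 0)
WINDOW_END = time(4, 0)     # treated as exclusive in this sketch


def in_maintenance_window(central_time: datetime) -> bool:
    """Return True if a Central-time timestamp falls inside the window."""
    return (
        central_time.weekday() in WINDOW_DAYS
        and WINDOW_START <= central_time.time() < WINDOW_END
    )
```

A timestamp such as Saturday 2:30 AM returns `True`, while the same time on a Monday returns `False`.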
Load Balancers
DipAvailability: measures the availability of backend pool instances in an Azure Load Balancer, indicating whether traffic can be successfully routed to healthy virtual machines.
VipAvailability: reflects the availability of the frontend IP address, showing whether the load balancer itself is reachable and functioning properly.
Load balancer alerts were originally created through the same metric alert template, but that approach is outdated. Load balancer data is now sent through a diagnostic setting to Event Hub. This scales much better than metric alerts because it does not rely on load balancer IDs.
Public IP Address
Public IP address alerting is built the same way as for load balancers: data is sent through a diagnostic setting to Event Hub, because metric alerts that rely on public IP address IDs do not scale well.
Azure DDoS Protection Standard: helps detect and mitigate distributed denial-of-service (DDoS) attacks targeting your Azure resources. These alerts notify you when suspicious traffic patterns or actual attacks are detected.
Baseline alert table
| Alert Description | Threshold Condition |
|---|---|
| Percentage CPU Greater than 90 - Warning | > 90% |
| Percentage CPU Greater than 95 - Critical | > 95% |
| OS Disk IOPS Consumed Percentage Greater than 95 - Warning | > 95% |
| OS Disk IOPS Consumed Percentage Greater than 98 - Critical | > 98% |
| Data Disk IOPS Consumed Percentage Greater than 95 - Warning | > 95% |
| Data Disk IOPS Consumed Percentage Greater than 98 - Critical | > 98% |
| Windows Disk Free Space Less than 15% - Warning | < 15% |
| Windows Disk Free Space Less than 10% - Critical | < 10% |
| Virtual Machine Availability Less than 100% - Critical | < 100% |
| Express Route BGP Down - Critical | < 95% |
| Express Route ARP Down - Critical | < 100% |
| ALB Data Path Availability Less than 90 - Critical | < 90% |
| ALB Health Probe Status Less than 90 - Error | < 90% |
Alert Suppression Rules
- Suppress alerts during maintenance windows
- Suppress alerts for LaunchPad Resource Groups
- Suppress cloud test resources
Activity log alerts: administrative and service health
- Service Health
- Microsoft.Sql/servers/firewallRules/write
- Microsoft.Sql/servers/firewallRules/delete
- Microsoft.Network/networkSecurityGroups/write
- Microsoft.Network/networkSecurityGroups/delete
- Microsoft.ClassicNetwork/networkSecurityGroups/write
- Microsoft.ClassicNetwork/networkSecurityGroups/delete
- Microsoft.Network/networkSecurityGroups/securityRules/write
- Microsoft.Network/networkSecurityGroups/securityRules/delete
- Microsoft.ClassicNetwork/networkSecurityGroups/securityRules/write
- Microsoft.ClassicNetwork/networkSecurityGroups/securityRules/delete
Resources that use diagnostic settings for alerts
- Public IP availability
- Load balancers (all types)
Azure NetApp Files
A diagnostic setting through Event Hub is enabled for Azure NetApp Files so that activity logs are forwarded to Splunk.
Capacity pool and volume metrics in Azure NetApp Files must be monitored, with alerts triggered when a metric exceeds its defined threshold.
Adding New NetApp Volumes or Pools
When adding new NetApp volumes or capacity pools to monitoring workspaces, follow these steps to ensure proper alert generation without naming conflicts:
Files to Update
- variable.tf - Define the new volume or pool in the appropriate variable:
  - For volumes: Add to netapp_volumes variable
  - For pools: Add to netapp_capacity_pools variable
- locals.tf - Add resource abbreviation mapping:
  - Update the resource_prefix lookup table in netapp_short_name_map
  - Create a unique 3-character abbreviation for the new resource
Resource Naming Convention
Each NetApp resource must have a unique 3-character abbreviation to generate unique action group names within the 12-character limit. The naming format is:
{region_prefix}{resource_prefix}{metric_suffix}
Region Prefixes:
- wus3 - West US 3
- cus - Central US
- eus - East US
Resource Prefix Requirements:
- Must be exactly 3 characters
- Must be unique within the workspace
- Should be meaningful/recognizable
- Examples: kpr (kuiper), std (standard), wbn (wbs_pnw)
Example Resource Mapping:
resource_prefix = lookup({
  # West US3 NPD NetApp resources
  "wbs_pnw"      = "wbn"
  "standardpool" = "stp"
  "new_volume"   = "nvl" # New 3-char abbreviation
}, local.netapp_resource_key_map[k].resource_key, "unk")
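The fallback behaviour of the lookup and the prefix requirements can be sketched outside Terraform. The Python below mirrors the mapping shown above and checks the "exactly 3 characters" and "unique" rules. It is an illustrative sketch only: `resolve_prefix` and `find_prefix_problems` are hypothetical helper names, not part of the Terraform code.

```python
from typing import Dict, List

# Mapping copied from the example above.
NETAPP_SHORT_NAMES: Dict[str, str] = {
    "wbs_pnw": "wbn",
    "standardpool": "stp",
    "new_volume": "nvl",
}
FALLBACK = "unk"  # same default the lookup in the example returns


def resolve_prefix(resource_key: str, mapping: Dict[str, str] = NETAPP_SHORT_NAMES) -> str:
    """Return the abbreviation for a resource key, or 'unk' when it is missing."""
    return mapping.get(resource_key, FALLBACK)


def find_prefix_problems(mapping: Dict[str, str]) -> List[str]:
    """List violations of the prefix rules: wrong length or duplicate abbreviation."""
    problems: List[str] = []
    first_owner: Dict[str, str] = {}
    for key, prefix in mapping.items():
        if len(prefix) != 3:
            problems.append(f"{key}: prefix '{prefix}' is not exactly 3 characters")
        if prefix in first_owner:
            problems.append(f"{key}: prefix '{prefix}' is already used by {first_owner[prefix]}")
        else:
            first_owner[prefix] = key
    return problems
```

A key that is misspelled or absent from the mapping resolves to `unk`, which is what later surfaces in the generated names.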
Validation Requirements
⚠️ Critical: Always validate your changes by running terraform plan and checking for "unk" or "Unk" values in:
- Action group names (short_name)
- Alert descriptions
- Resource identifiers
Validation Command:
terraform plan | grep -i unk
If "unk" or "Unk" appears in the output, it indicates:
- Missing resource abbreviation in the resource_prefix lookup
- Missing metric suffix mapping in metric_suffix lookup
- Incorrect resource key mapping
Common Issues:
- Resource name not found in lookup table → returns "unk"
- Metric short name not mapped → returns "Unk"
- Typos in resource names or mappings
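If the check needs to run somewhere other than a shell pipeline, the same case-insensitive search can be sketched in Python. This assumes the plan output has already been captured as text. `find_unk_lines` is a hypothetical helper, and like the grep command it also matches unrelated words that happen to contain "unk".

```python
import re
from typing import List, Tuple

# Case-insensitive match, equivalent in spirit to `grep -i unk`.
UNK_PATTERN = re.compile(r"unk", re.IGNORECASE)


def find_unk_lines(plan_output: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line containing 'unk' in any case."""
    return [
        (number, line.strip())
        for number, line in enumerate(plan_output.splitlines(), start=1)
        if UNK_PATTERN.search(line)
    ]
```

An empty result means no fallback value was found in the captured text. Any hit should be traced back to one of the causes listed above.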
Action Group Name Examples
✅ Correct: wus3wbnBkpEn (West US3 + wbs_pnw + Backup Enabled)
❌ Incorrect: wus3wbnUnk (indicates missing metric suffix mapping)
❌ Incorrect: wus3unkBkpEn (indicates missing resource abbreviation)
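The composition of these names follows the `{region_prefix}{resource_prefix}{metric_suffix}` format and the 12-character limit described earlier. A minimal Python sketch, with `build_short_name` as a hypothetical helper rather than anything in the Terraform code:

```python
MAX_SHORT_NAME_LENGTH = 12  # action group short name limit stated in this document


def build_short_name(region_prefix: str, resource_prefix: str, metric_suffix: str) -> str:
    """Concatenate the three parts and reject names over the length limit."""
    name = f"{region_prefix}{resource_prefix}{metric_suffix}"
    if len(name) > MAX_SHORT_NAME_LENGTH:
        raise ValueError(
            f"'{name}' is {len(name)} characters; the limit is {MAX_SHORT_NAME_LENGTH}"
        )
    return name
```

With the 4-character region prefix `wus3` and a 3-character resource prefix, the metric suffix can be at most 5 characters, which is why `BkpEn` fits and a longer suffix would not.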
Baseline alert table
| Alert Description | Threshold Condition |
|---|---|
| Volume utilization Greater than 80 - Warning | > 80% |
| Volume consumed size Greater than 85 - Critical | > 85% |
| Read IOPS Consumed Percentage Greater than 80 - Warning | > 80% |
| Read IOPS Consumed Percentage Greater than 90 - Critical | > 90% |
| Write IOPS Consumed Percentage Greater than 80 - Warning | > 80% |
| Write IOPS Consumed Percentage Greater than 90 - Critical | > 90% |
| Total IOPS Consumed Percentage Greater than 80 - Warning | > 80% |
| Total IOPS Consumed Percentage Greater than 90 - Critical | > 90% |
| Other IOPS Consumed Percentage Greater than 150TiB - Warning | > 150 TiB |
| Other IOPS Consumed Percentage Greater than 200TiB - Critical | > 200 TiB |
| Read throughput Percentage Greater than 90 - Critical | > 90% |
| Write throughput Percentage Greater than 90 - Critical | > 90% |
| Total throughput Percentage Greater than 90 - Critical | > 90% |
| Other throughput Percentage Greater than 3 MiB/s/TiB - Critical | > 3 MiB/s/TiB |
| Read latency Greater than 60 milliseconds - Critical | > 60 ms |
| Write latency Greater than 60 milliseconds - Critical | > 60 ms |
| Volume replication lag time Greater than 900 second - Critical | > 900 s |
| Capacity pool consumed size Greater than 85% - Critical | > 85% |
| Volume Inode Consumed Percentage Greater than 80 - Warning | > 80% |
| Volume Inode Consumed Percentage Greater than 90 - Critical | > 90% |