Monitoring & Observability
This is the central landing page for all monitoring tooling used in Epic on Azure. Use the links below to jump directly to the tool or dashboard you need.
OneView Record: Epic on Azure (AIDE_0085665) — central catalog of all tooling, data sources, and application metadata.
Access Requirements
Before you can view dashboards or receive alerts, request membership in these Microsoft Entra ID (Azure AD) groups:
| Group | Purpose | Required For |
|---|---|---|
| Monitoring_ReadOnly | Read access to Dynatrace dashboards and monitoring data | All team members |
| dtcloud_AIDE_0085665_config | Dynatrace configuration access (management zones, alerting profiles) | Platform engineers, on-call |
| cloud_splunk_east_epic_azure_nw_power | Splunk read/write for VM and network logs | All team members |
| sec_splunk_epic_azure_nw_power | Splunk security logs (firewall) | Security, network engineers |
Request these groups through Secure (MyIT). For full onboarding steps, see Phase 2: Tools Setup.
Monitoring Architecture
All monitoring data flows through a layered architecture: agents collect data, tools analyze it, alerts route through Interlink, and notifications reach on-call engineers via ServiceNow Notify.
graph TD
subgraph "Data Collection"
W_OA[Windows — OneAgent]
W_LOG[Windows — Event Logs]
L_OA[Linux — OneAgent]
L_FB[Linux — Fluent Bit]
FW[Firewalls / Appliances — Syslog]
AZ[Azure Platform Metrics]
end
subgraph "Analysis & Visualization"
DT[(Dynatrace<br/>APM + Infrastructure)]
SP[(Splunk<br/>Log Aggregation)]
AZMON[Azure Monitor<br/>Metric Alerts]
ESP[Epic System Pulse]
end
subgraph "Alert Routing"
IL[Interlink<br/>Event Aggregator]
end
subgraph "Notification"
SNOW[ServiceNow Notify<br/>Alerts to Devices]
TCC[Command Center / TCC<br/>P1 & P2 Escalation]
end
W_OA -->|OneAgent| DT
L_OA -->|OneAgent| DT
W_LOG -->|Azure Log Aggregator| SP
L_FB -->|Azure Event Hub| SP
FW -->|Syslog → Event Hub| SP
AZ --> AZMON
AZ --> DT
ESP -->|Events| DT
DT -->|Problem Notifications| IL
AZMON -->|Alert Rules| IL
SP -->|Saved Searches / Alerts| IL
IL --> SNOW
IL --> TCC
Key data flows:
- Host metrics and APM → Dynatrace OneAgent → Dynatrace SaaS → Interlink
- Logs (all hosts) → Fluent Bit / Event Logs → Azure Event Hub → Splunk → Interlink
- Azure platform metrics → Azure Monitor → Alert Rules → Interlink
- All alerts → Interlink → ServiceNow Notify → on-call devices
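To make the routing easier to reason about, the edges of the diagram above can be written down as a small adjacency map and walked programmatically. This is only an illustrative sketch: the node labels are shortened versions of the diagram labels, and the `paths` helper is a hypothetical name, not part of any tool listed on this page.

```python
from collections import deque

# Edges copied from the architecture diagram; labels are shortened for readability.
FLOWS = {
    "Windows OneAgent": ["Dynatrace"],
    "Linux OneAgent": ["Dynatrace"],
    "Windows Event Logs": ["Splunk"],
    "Linux Fluent Bit": ["Splunk"],
    "Firewalls / Appliances": ["Splunk"],
    "Azure Platform Metrics": ["Azure Monitor", "Dynatrace"],
    "Epic System Pulse": ["Dynatrace"],
    "Dynatrace": ["Interlink"],
    "Azure Monitor": ["Interlink"],
    "Splunk": ["Interlink"],
    "Interlink": ["ServiceNow Notify", "Command Center / TCC"],
}


def paths(source, target):
    """Return every path from source to target, in breadth-first order."""
    found = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        # Nodes without outgoing edges simply end the walk.
        for nxt in FLOWS.get(node, []):
            queue.append(path + [nxt])
    return found


if __name__ == "__main__":
    for p in paths("Azure Platform Metrics", "ServiceNow Notify"):
        print(" -> ".join(p))
```

Run as a script, this lists the two routes by which an Azure platform metric can reach ServiceNow Notify: one through Azure Monitor and one through Dynatrace.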
Dashboards & Tools Quick Access
Dynatrace (APM & Infrastructure)
Dynatrace OneAgent is deployed on all hosts and automatically injects into supported application components (IIS, .NET, Java, etc.), providing full-stack observability from infrastructure to application transactions.
| Dashboard | Description | Link |
|---|---|---|
| Infrastructure Insights | What needs attention — problems, resource issues | Open Dashboard |
| Infrastructure Health Overview | Epic National West health status | Open Dashboard |
| ODB Health Dashboard | Epic on Azure ODB performance and health | Open Dashboard |
| Interconnect Foreground | Interconnect Foreground monitoring | Open Dashboard |
| Interconnect Background | Interconnect Background monitoring | Open Dashboard |
| BCA | BCA monitoring | Open Dashboard |
| Epic Care Link | Epic Care Link monitoring | Open Dashboard |
| Care Everywhere | Care Everywhere monitoring | Open Dashboard |
| Welcome Web | Welcome Web monitoring | Open Dashboard |
| Epic Print Service | Epic Print Service monitoring | Open Dashboard |
| Welcome Client | Welcome Client monitoring | Open Dashboard |
| System Pulse | System Pulse monitoring | Open Dashboard |
| ODB | ODB monitoring | Open Dashboard |
Dynatrace Tenants:
| Environment | Tenant ID | URL |
|---|---|---|
| Production | skx14060 | skx14060.apps.dynatrace.com |
| Non-Production | dfr17824 | dfr17824.apps.dynatrace.com |
Filter tags: Askid:AIDE_0085665, [Azure]aide-id: AIDE_0085665
Host groups: AIDE_0085665.{environment}.azu (e.g., AIDE_0085665.prod.azu)
Network zones: AIDE_0085665.{environment}.azu
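The naming pattern above can be built and checked with a couple of helpers, which may be handy when scripting filters. This is a minimal sketch: the function names are hypothetical, and `prod` is the only environment value confirmed on this page, so any other value is an assumption.

```python
import re

AIDE_ID = "AIDE_0085665"

# Matches AIDE_0085665.{environment}.azu and captures the environment segment.
_PATTERN = re.compile(rf"^{re.escape(AIDE_ID)}\.([^.]+)\.azu$")


def host_group(environment: str) -> str:
    """Build a host group (or network zone) name for an environment."""
    return f"{AIDE_ID}.{environment}.azu"


def environment_of(name: str):
    """Return the environment segment of a name, or None if it does not match."""
    match = _PATTERN.match(name)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(host_group("prod"))
    print(environment_of("AIDE_0085665.prod.azu"))
```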
OneAgent deployment:
- Linux install path: /monitor/oneagent
- Windows install path: C:\monitor\oneagent
- Monitoring mode: Full stack (application + infrastructure)
- Deployed via Ansible: see ohemr-ansible-role-dynatrace
- Configuration-as-Code: see ohemr-dynatrace-config (Terraform)
Documentation:
- Dynatrace Problems API Guide — querying problems via API with cURL and jq
- Monitoring Strategy — where Dynatrace fits in the overall monitoring stack
Splunk (Log Aggregation & Analysis)
Splunk is the central log aggregation platform. All hosts, applications, and network appliances forward logs to Splunk via Azure Event Hub and Fluent Bit.
| Dashboard | Description | Link |
|---|---|---|
| Azure Metric Alert Dashboard | Tracked resource groups, Severity 0 (Critical) alerts | Open Dashboard |
| Azure Patching Schedule | Upcoming and recent patching maintenance windows | Open Dashboard |
Splunk Instances:
| Instance | Purpose | URL |
|---|---|---|
| Cloud Splunk East | VM, network, and infrastructure logs | est-sh.prod.cloud-splunk-optum.com |
| Security Splunk | Firewall and security event logs | sec-splunk.optum.com |
Key indexes:
| Index | Content |
|---|---|
| cloud_epic_azure_nw | VM and network infrastructure logs |
| sec_n_paloalto_panos | Palo Alto firewall logs |
Quick test search: index=cloud_epic_azure_nw | head 10
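If the quick test search needs to be generated from a script, a small builder keeps the index names in one place. This is a sketch only: `quick_search` is a hypothetical helper that returns an SPL string and does not contact Splunk. The optional `earliest` argument uses Splunk's standard time modifier (for example `-15m`).

```python
# Index names taken from the "Key indexes" table on this page.
KNOWN_INDEXES = {"cloud_epic_azure_nw", "sec_n_paloalto_panos"}


def quick_search(index, limit=10, earliest=None):
    """Return an SPL string that fetches the first `limit` events from an index."""
    if index not in KNOWN_INDEXES:
        raise ValueError(f"unknown index: {index}")
    terms = [f"index={index}"]
    if earliest:
        # Restrict the time range, e.g. "-15m" for the last 15 minutes.
        terms.append(f"earliest={earliest}")
    return f"{' '.join(terms)} | head {limit}"


if __name__ == "__main__":
    print(quick_search("cloud_epic_azure_nw"))
    print(quick_search("sec_n_paloalto_panos", limit=5, earliest="-15m"))
```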
Log retention: 90 days operational, 7 years audit
Documentation:
- Splunk Maintenance Windows — suppressing alerts during patching
- Splunk Queries Guide — useful SPL queries and search patterns
- Fluent Bit Configuration — Linux log collection agent setup
Azure Monitor (Infrastructure Metric Alerts)
Azure Monitor provides native metric alerting for all Azure resources. Alert rules are deployed via Terraform and route through Interlink.
Critical alert thresholds (Severity 0):
| Metric | Threshold | Resource Type |
|---|---|---|
| CPU | >= 95% | Virtual Machines |
| Available Memory | <= 2% | Virtual Machines |
| Data Disk IOPS | >= 98% | Virtual Machines |
| OS Disk IOPS | >= 98% | Virtual Machines |
| VM Availability | < 1 | Virtual Machines |
| Disk Free Space | < 10% | Windows VMs |
Warning thresholds (Severity 2): CPU >= 90%, Memory <= 15%, IOPS >= 95%, Disk < 15%
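The thresholds above can be expressed as a lookup table to show how a single sample maps to a severity. This is an illustrative sketch, not the deployed Terraform rules: the metric key names are hypothetical, and it assumes the Severity 2 IOPS threshold applies to both data and OS disks and that VM Availability has no warning level.

```python
import operator

# metric key -> (comparison, Severity 0 threshold, Severity 2 threshold or None)
RULES = {
    "cpu_percent": (operator.ge, 95, 90),
    "available_memory_percent": (operator.le, 2, 15),
    "data_disk_iops_percent": (operator.ge, 98, 95),
    "os_disk_iops_percent": (operator.ge, 98, 95),
    "vm_availability": (operator.lt, 1, None),
    "disk_free_percent": (operator.lt, 10, 15),
}


def severity(metric, value):
    """Return 0 (critical), 2 (warning), or None when no threshold is crossed."""
    compare, critical, warning = RULES[metric]
    # Check the critical threshold first so it takes precedence over warning.
    if compare(value, critical):
        return 0
    if warning is not None and compare(value, warning):
        return 2
    return None


if __name__ == "__main__":
    print(severity("cpu_percent", 96))
    print(severity("available_memory_percent", 10))
    print(severity("disk_free_percent", 40))
```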
Activity log alerts: Service Health, SQL Firewall changes, NSG changes
Alert processing rules (suppression):
- Maintenance windows: Saturday-Sunday 2:00-4:00 AM CST
- Cloud test resources: daily suppression
- Excluded resource groups: PCC Agentless Scan, PublicCloudManaged-ComputeScan, DIG Security, LP Central Logging
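A quick way to check whether a given timestamp falls inside the weekend maintenance window is sketched below. It rests on two assumptions that this page does not confirm: that the window recurs on each of Saturday and Sunday from 02:00 up to (but not including) 04:00, and that "CST" means a fixed UTC-6 offset with no daylight-saving adjustment. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-6 offset; daylight-saving time is not applied.
CST = timezone(timedelta(hours=-6), "CST")


def in_maintenance_window(moment: datetime) -> bool:
    """Return True if a timezone-aware datetime is Sat/Sun between 02:00 and 04:00 CST."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(CST)
    is_weekend = local.weekday() in (5, 6)  # Monday is 0, so Saturday=5, Sunday=6
    return is_weekend and 2 <= local.hour < 4


if __name__ == "__main__":
    # 2024-01-06 was a Saturday; 09:00 UTC is 03:00 at UTC-6.
    print(in_maintenance_window(datetime(2024, 1, 6, 9, 0, tzinfo=timezone.utc)))
```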
Infrastructure-as-Code:
- Alert rules: ohemr-epic-private-registry-alert-processing-rule (Terraform)
- Action groups route to Event Hub diagnostic-logs on namespace lp-cl-centralus-eventhub-6a9ba7a4
Documentation:
- Metric Alert Configuration — thresholds and alert rule details
- Metric Alert Code Explanations — alert logic documentation
- EoA Monitoring Coverage Matrix — what is monitored and by whom
Alert runbooks:
Interlink (Event Aggregation & Alert Routing)
Interlink is the central event aggregator and alerting tool. All monitoring tools (Dynatrace, Azure Monitor, Splunk) route their alerts through Interlink, which then dispatches notifications via ServiceNow Notify.
Access: interlink.optum.com (Production) | interlink-test.optum.com (Test)
Alert flow:
- Monitoring tool detects issue and fires alert
- Alert arrives in Interlink as an event
- Interlink applies correlation rules and deduplication
- Interlink dispatches notification via ServiceNow Notify
- On-call engineer receives alert on their registered device
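To illustrate what the deduplication step in this flow means in general terms, the sketch below suppresses repeat events that share a key within a time window. It is not Interlink's actual rule set, which this page does not describe: the `Event` fields, the (source, resource, condition) key, and the 300-second window are all hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # e.g. "Dynatrace", "Azure Monitor", "Splunk"
    resource: str     # hypothetical resource identifier
    condition: str    # hypothetical alert condition name
    timestamp: float  # seconds since epoch


def deduplicate(events, window_seconds=300):
    """Keep an event only if no event with the same key was kept within the window."""
    last_kept = {}
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.resource, event.condition)
        previous = last_kept.get(key)
        if previous is None or event.timestamp - previous >= window_seconds:
            kept.append(event)
            last_kept[key] = event.timestamp
    return kept


if __name__ == "__main__":
    sample = [
        Event("Dynatrace", "vm-01", "cpu", 0),
        Event("Dynatrace", "vm-01", "cpu", 60),   # repeat inside the window
        Event("Splunk", "fw-01", "deny-spike", 30),
        Event("Dynatrace", "vm-01", "cpu", 400),  # outside the window
    ]
    for e in deduplicate(sample):
        print(e)
```

Comparing against the last kept event, rather than the last seen event, means a steadily repeating alert is re-notified once per window instead of being suppressed indefinitely.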
Documentation:
- Interlink Maintenance Windows — creating and managing maintenance suppression records via API and UI
ServiceNow Notify (Alert Delivery)
ServiceNow Notify delivers alerts to on-call engineers' registered devices (phone, SMS, email) based on their ServiceNow profile configuration.
Incident routing:
| Severity | Routing | Target |
|---|---|---|
| P1, P2 (Critical/High) | Command Center / TCC | Immediate page to on-call + incident bridge |
| P3, P4 (Warning/Info) | Team / ServiceNow ticket | Team notification or auto-ticket creation |
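The routing table can be captured as a simple lookup, for example when annotating exported alerts with their expected destination. This is a sketch under the assumption that priorities are written as `P1` through `P4`; the `route` helper is a hypothetical name.

```python
# Priority -> routing, copied from the incident routing table above.
ROUTING = {
    "P1": "Command Center / TCC",
    "P2": "Command Center / TCC",
    "P3": "Team / ServiceNow ticket",
    "P4": "Team / ServiceNow ticket",
}


def route(priority: str) -> str:
    """Return the routing destination for a priority such as 'P1' or 'p3'."""
    key = priority.strip().upper()
    if key not in ROUTING:
        raise ValueError(f"unknown priority: {priority!r}")
    return ROUTING[key]


if __name__ == "__main__":
    print(route("P1"))
    print(route("p4"))
```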
Documentation:
OneView (Application Record)
OneView is the central catalog for finding all tooling, data sources, and metadata associated with Epic on Azure.
Epic on Azure record: AIDE_0085665
OneView provides:
- Application metadata and ownership
- Linked infrastructure and services
- Monitoring tool references
- Compliance and security posture
Epic System Pulse
Epic System Pulse is the native Epic monitoring tool for application-level health and performance.
Access: systempulse.uhc.com
Integration: Events flow into Dynatrace for correlation with infrastructure metrics. Manual review and classification by the Epic technical team is currently required.
Selector.AI (POC)
Selector.AI is an AIOps platform currently under evaluation as a proof-of-concept for intelligent alert correlation, root cause analysis, and automated insights across the monitoring stack.
Access: optum.selector.ai
Status: Active POC — not yet integrated into production alert routing. Contact the platform team for access and current scope.
Monitoring by Resource Type
| Resource | Primary Monitor | Secondary Monitor | Log Destination | Alert Routing |
|---|---|---|---|---|
| Azure Services (Storage, ExpressRoute) | Azure Monitor | Dynatrace OneAgent | Splunk via Event Hub | Interlink → SNOW/TCC |
| Windows VMs | Azure Monitor + Guest OS Logs | Dynatrace OneAgent | Splunk via Event Hub | Interlink → SNOW/TCC |
| Linux VMs | Azure Monitor (AMA + DCR) | Dynatrace OneAgent | Splunk via Fluent Bit | Interlink → SNOW/TCC |
| NetApp Volumes | Azure Monitor | — | Splunk via Event Hub | Interlink → SNOW/TCC |
| Firewalls / Appliances | Appliance Syslog (Palo Alto) | — | Splunk via Event Hub | Interlink → SNOW/TCC |
| Citrix | UberAgent Dashboards | — | Splunk via Kafka | Interlink → SNOW/TCC |
| Epic Application | Epic System Pulse | Dynatrace APM | — | Manual review (future: automated) |
For the full coverage matrix with team ownership and operational details, see EoA Monitoring Standards.
Performance Targets
| Metric | Target |
|---|---|
| Epic Hyperspace response time | < 2 seconds |
| Database query average | < 100ms |
| API endpoint response | < 500ms |
| File system operations | < 50ms |
| Production availability | 99.95% uptime |
| Critical services availability | 99.99% uptime |
| Planned maintenance | < 4 hours/month |
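It can help to translate the availability targets into a downtime budget. The sketch below does the arithmetic for a 30-day month, which is an assumed period. This page does not state whether planned maintenance counts against the availability figures, so the helper makes no assumption either way.

```python
def downtime_budget_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime allowed by an availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.95, 99.99):
        print(f"{target}% -> {downtime_budget_minutes(target):.2f} minutes per 30 days")
```

Over 30 days, 99.95% allows about 21.6 minutes of downtime and 99.99% allows about 4.3 minutes.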
Capacity thresholds: Warning at 75%, Critical at 85%, Emergency at 95%
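The capacity levels can be checked with a small classifier. This is a sketch that assumes each threshold is inclusive (for example, exactly 85% counts as Critical); the function name and the "OK" label for values under 75% are hypothetical.

```python
def capacity_level(utilization_percent: float) -> str:
    """Map a utilization percentage to a capacity level."""
    # Check the highest threshold first so the most severe level wins.
    if utilization_percent >= 95:
        return "Emergency"
    if utilization_percent >= 85:
        return "Critical"
    if utilization_percent >= 75:
        return "Warning"
    return "OK"


if __name__ == "__main__":
    for value in (60, 75, 90, 97):
        print(value, capacity_level(value))
```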
Monitoring Contacts
| Domain | Contact |
|---|---|
| Azure Monitoring | Clint / Indhu |
| Dynatrace | Paris |
| Splunk Logging | Clint / Indhu / Paris |
| Appliance Monitoring | Dwayne B Jones |
| Citrix | Jason |
| SQL Monitoring | Laura / Clint / John Brownlee |
| Epic System Pulse | Matt / Jordan |
Related Documentation
- Monitoring Strategy — tool selection philosophy and data flow architecture
- Fluent Bit Configuration — Linux log collection agent setup and troubleshooting
- Operations Hub — daily operational procedures
- Incident Management — escalation procedures and RACI
- Monitoring RACI Matrix — go-live readiness assessment for each tool
- Onboarding: Tools Setup — first-time access requests
Key Repositories
| Repository | Purpose | IaC Tool |
|---|---|---|
| ohemr-dynatrace-config | Dynatrace configuration-as-code (management zones, auto-tags, alerting profiles) | Terraform |
| ohemr-ansible-role-dynatrace | OneAgent and ActiveGate deployment | Ansible |
| ohemr-epic-private-registry-alert-processing-rule | Azure Monitor alert rules and processing | Terraform |
| ohemr-epic-megadoc | This documentation (monitoring section) | MkDocs |