Epic on Azure Monitoring - RACI Matrix
RACI Legend
- R = Responsible (Does the work)
- A = Accountable (Final authority/decision maker)
- C = Consulted (Provides input)
- I = Informed (Kept in the loop)
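The legend can be turned into a quick consistency check on matrix rows. The sketch below is illustrative only: the helper names `parse_cell` and `accountable_teams` are hypothetical, and it assumes cells contain only slash-separated codes (annotated cells such as "R (Citrix data)" would need the annotation stripped first).

```python
from typing import Dict, List, Set

VALID_CODES = {"R", "A", "C", "I"}


def parse_cell(cell: str) -> Set[str]:
    """Split a matrix cell such as 'A/R' or 'C/I' into its RACI codes."""
    codes = {part.strip().upper() for part in cell.split("/") if part.strip()}
    unknown = codes - VALID_CODES
    if unknown:
        raise ValueError(f"Unknown RACI code(s): {sorted(unknown)}")
    return codes


def accountable_teams(row: Dict[str, str]) -> List[str]:
    """Return the teams marked Accountable in one matrix row."""
    return [team for team, cell in row.items() if "A" in parse_cell(cell)]


# Values copied from the SystemPulse row of the RACI matrix in this document.
system_pulse = {
    "Optum Health": "C/I",
    "Optum Insight": "A/R",
    "Citrix Team": "I",
    "Infra Ops": "I",
    "Cloud Ops": "I",
    "Clinical NOC": "I",
}

print(accountable_teams(system_pulse))
```

A row that returns zero or more than one accountable team is worth a second look, since RACI convention expects a single final decision maker per item.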
Go-Live Readiness Rating Scale
- 🟢 High (4-5): Tool fully operational, team trained, processes documented, confident in ability to support
- 🟡 Medium (3): Tool functional but gaps exist, some training needed, moderate confidence
- 🔴 Low (1-2): Significant gaps, limited readiness, low confidence in current state
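As a minimal sketch, the rating scale above maps a 1-5 score to a band as follows. The function name `readiness_band` is hypothetical and not part of any tool named in this document.

```python
def readiness_band(score: int) -> str:
    """Map a 1-5 go-live readiness score to the band used in this document."""
    if not 1 <= score <= 5:
        raise ValueError("Readiness score must be between 1 and 5")
    if score >= 4:
        return "High"
    if score == 3:
        return "Medium"
    return "Low"


print(readiness_band(4))
```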
RACI Matrix
<table>
<thead>
<tr> <th width="20%"><strong>Monitoring Area/Tool</strong></th> <th width="15%" style="white-space: nowrap;"><strong>Go-Live Readiness</strong></th> <th width="4%"><strong>Optum Health<br/>(Ken Cam)</strong></th> <th width="4%"><strong>Optum Insight<br/>(Jordan Lambert)</strong></th> <th width="4%"><strong>Citrix Team<br/>(James Hallowell)</strong></th> <th width="4%"><strong>Infra Ops<br/>(Randy)</strong></th> <th width="4%"><strong>Cloud Ops<br/>(Tom)</strong></th> <th width="5%"><strong>Clinical NOC<br/>(Tom Busse)</strong></th> <th width="40%"><strong>Notes &amp; Commentary</strong></th> </tr>
</thead>
<tbody>
<tr> <td><strong>SystemPulse (Epic App Monitoring)</strong><br/><em>Includes Database/ODB Services</em><br/><a href="https://systempulse.uhc.com/SystemPulse/Monitor.aspx">Here</a></td> <td style="white-space: nowrap;">🟢 <strong>High (4)</strong></td> <td>C/I</td> <td><strong>A/R</strong></td> <td>I</td> <td>I</td> <td>I</td> <td>I</td> <td><strong>Capabilities:</strong> Epic application performance monitoring, database (ODB Services) monitoring, workflow validation. <strong>Limitations:</strong> Focused on app layer only; basic infrastructure visibility. Limited correlation with network/Citrix issues without manual effort.</td> </tr>
<tr> <td><strong>Hyperspace Web (Epic Web Access)</strong></td> <td style="white-space: nowrap;">🟢 <strong>High (4)</strong></td> <td>C/I</td> <td><strong>A/R</strong></td> <td>C</td> <td>I</td> <td>I</td> <td>I</td> <td><strong>Capabilities:</strong> Tracks web-based Epic access and session availability. <strong>Limitations:</strong> No deep session analytics; relies on Citrix for full user experience visibility. Limited troubleshooting beyond access checks. Front-end connectivity validation only.</td> </tr>
<tr> <td><strong>Citrix Monitor / UberAgent</strong></td> <td style="white-space: nowrap;">🟡 <strong>Medium (3)</strong></td> <td>I</td> <td>C</td> <td><strong>A/R</strong></td> <td>C/I</td> <td>I</td> <td>I</td> <td><strong>Capabilities:</strong> Detailed Citrix session performance metrics; tracks latency, logon times, resource utilization. <strong>Limitations:</strong> Session-level only; lacks full correlation with app/network layers. Splunk dashboards need manual tuning. Limited predictive analytics. Data retention may be insufficient for trend analysis.</td> </tr>
<tr> <td><strong>Splunk (Infra &amp; App Logs)</strong><br/><a href="https://epic.optum.com/getting-started/onboarding-tools/#monitoring-access">Request Access</a></td> <td style="white-space: nowrap;">🟡 <strong>Medium (3)</strong></td> <td>I</td> <td>C</td> <td><strong>R</strong> (Citrix data)</td> <td><strong>A/R</strong> (Infrastructure)</td> <td>C</td> <td>C</td> <td><strong>Capabilities:</strong> Aggregates logs from multiple sources; custom dashboards; correlation potential. <strong>Limitations:</strong> Real-time correlation needs advanced queries; high licensing costs. Requires query expertise for complex troubleshooting. <strong>Missing Linux Sys Logs</strong> - gap in log coverage. Epic-specific workflow correlation still maturing.</td> </tr>
<tr> <td><strong>Azure Monitor / Dashboards</strong><br/><a href="https://epic.optum.com/getting-started/onboarding-access/#azure-account">Request Access</a></td> <td style="white-space: nowrap;">🟢 <strong>High (4)</strong></td> <td>I</td> <td>C</td> <td>C</td> <td><strong>A/R</strong></td> <td>C</td> <td>I</td> <td><strong>Capabilities:</strong> Native Azure monitoring for VMs, databases, network; custom dashboards; AI-driven anomaly detection. <strong>Limitations:</strong> Gaps in hybrid visibility; requires tuning for actionable alerts vs. noise. Native Azure focus may miss Epic-specific context.</td> </tr>
<tr> <td><strong>Dynatrace</strong><br/><a href="https://epic.optum.com/getting-started/onboarding-tools/#dynatrace">Request Access</a></td> <td style="white-space: nowrap;">🟢 <strong>High (4)</strong></td> <td>I</td> <td>C</td> <td>C</td> <td>C</td> <td><strong>A/R</strong></td> <td>I</td> <td><strong>Capabilities:</strong> Full-stack APM with AI-driven insights; application topology mapping. <strong>Limitations:</strong> Deployment in progress; requires ActiveGate setup and advanced configuration for Epic. Limited app performance monitoring capability currently. Licensing constraints may limit full-stack visibility. <strong>Critical Gap:</strong> Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - limits Dynatrace visibility into Citrix layer. Not yet fully integrated with Epic workflows.</td> </tr>
<tr> <td><strong>Netscout (Network Visibility)</strong></td> <td style="white-space: nowrap;">🟡 <strong>Medium (3)</strong></td> <td>I</td> <td>C</td> <td>C</td> <td><strong>A/R</strong></td> <td>C</td> <td>I</td> <td><strong>Capabilities:</strong> Packet-level visibility; detects and mitigates DDoS attacks; flow-based analytics. <strong>Limitations:</strong> No application-level context. Integration with Epic/Netcarriers pending. Limited correlation without Splunk/Dynatrace integration. Flow-based analytics for troubleshooting only.</td> </tr>
<tr> <td><strong>DocX (Dedicated Ops for Clinical eXperience)</strong><br/><a href="https://docx.optum.com/">Here</a></td> <td style="white-space: nowrap;">🟢 <strong>High (4)</strong></td> <td>C</td> <td>C</td> <td>I</td> <td>I</td> <td>I</td> <td><strong>A/R</strong></td> <td><strong>Capabilities:</strong> Clinician-facing performance dashboard; provider-level experience tracking. <strong>Escalation Path:</strong> Clinical NOC opens incident and notifies Optum Health when DocX shows degradation. <strong>Limitations:</strong> Limited end-user experience data beyond clinician workflows. Manual escalation process (not automated). <strong>Gap:</strong> Integration with other monitoring tools for correlation needs strengthening.</td> </tr>
<tr> <td><strong>Azure Dashboards (Custom)</strong><br/><a href="https://epic.optum.com/getting-started/onboarding-access/#azure-account">Request Access</a></td> <td style="white-space: nowrap;">🟢 <strong>High (4)</strong></td> <td>I</td> <td>C</td> <td>C</td> <td><strong>A/R</strong></td> <td>C</td> <td>I</td> <td><strong>Capabilities:</strong> Custom Azure-native dashboards for infrastructure metrics. <strong>Limitations:</strong> Requires manual dashboard creation and maintenance. No anomaly detection. Limited to Azure-native metrics without integration work.</td> </tr>
<tr> <td><strong>GitHub (IaC / Platform Issues)</strong><br/><a href="https://github.com/optum-tech-compute/ohemr-ops/issues">Here</a></td> <td style="white-space: nowrap;">🟢 <strong>High (4)</strong></td> <td>I</td> <td>I</td> <td>I</td> <td>C</td> <td><strong>A/R</strong></td> <td>I</td> <td><strong>Capabilities:</strong> Issue tracking for infrastructure-as-code, CI/CD pipelines, platform automation. <strong>Scope:</strong> Non-urgent platform/code issues; separate from production incidents.</td> </tr>
</tbody>
</table>
Team Responsibilities Overview
🔵 Optum Health (Ken Cam & Team)
Primary Focus: End-user communications / CDO & Clinician site relationship
Responsibilities:
- Owns all external communications to clinicians, CDO, and clinical sites
- Primary clinical stakeholder relationship owner
- Receives notifications from Clinical NOC when DocX shows degradation requiring external communications
- Translates clinical impact to business/executive stakeholders
- Monitors clinician experience trends (via DocX, owned by Clinical NOC)
- Escalation point for widespread clinical impact requiring executive visibility
Identified Gaps:
- ⚠️ Communication templates - Need pre-drafted templates for various incident scenarios (P1, P2, resolution notices)
- Recommendation:
- Validate Clinical NOC → Optum Health escalation path with tabletop exercises
- Develop communication playbook with templates for different scenarios
- All communications to technical teams go through formal ServiceNow incidents only
🟣 Clinical NOC (Tom Busse & Team)
Primary Focus: DocX monitoring / Clinician experience monitoring / NOC operations
Responsibilities:
- A/R for DocX (clinician experience monitoring)
- Monitors clinician-facing performance dashboards 24x7
- Defined escalation path: When DocX shows degradation, Clinical NOC manually opens ServiceNow incident AND notifies Optum Health
- Routes technical issues to Optum Insight via ServiceNow incident
- Reports clinician experience trends to Optum Health for external communication
- First line of detection for clinician-impacting issues via DocX
Identified Gaps:
- ⚠️ DocX integration with other tools - Limited correlation between DocX clinician experience and SystemPulse/Citrix/Azure metrics; requires manual checking
- Recommendation:
- Enhance DocX correlation with other monitoring tools for Clinical NOC visibility (unified dashboard)
- Define SLAs: Clinical NOC response time to DocX degradation (e.g., open incident + notify Optum Health within 15 minutes for critical, 30 minutes for high)
- Weekly sync between Clinical NOC, Optum Health, and Optum Insight on clinician experience trends
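The response-time SLA suggested above could be checked with a small helper. This is a sketch under stated assumptions: the 15- and 30-minute targets are the example values from the recommendation, not agreed SLAs, and `sla_deadline` and `sla_met` are hypothetical names.

```python
from datetime import datetime, timedelta

# Example targets from the recommendation above (assumed, not yet agreed SLAs).
SLA_MINUTES = {"critical": 15, "high": 30}


def sla_deadline(detected_at: datetime, severity: str) -> datetime:
    """Latest time by which the incident must be opened and Optum Health notified."""
    return detected_at + timedelta(minutes=SLA_MINUTES[severity])


def sla_met(detected_at: datetime, actioned_at: datetime, severity: str) -> bool:
    """True if the incident was opened and Optum Health notified within the target."""
    return actioned_at <= sla_deadline(detected_at, severity)


detected = datetime(2025, 1, 1, 8, 0)
actioned = datetime(2025, 1, 1, 8, 20)
print(sla_met(detected, actioned, "critical"), sla_met(detected, actioned, "high"))
```

In this example a 20-minute response misses the critical target but meets the high target.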
🟢 Optum Insight (Jordan Lambert & Team) - Business Ops / Technical Triage
Primary Focus: Epic Application Layer & Database Performance / Technical triage and bridge between clinical teams and internal technical teams
Responsibilities:
- A/R for SystemPulse (Epic application monitoring including database/ODB Services)
- A/R for Hyperspace Web (Epic web access)
- INITIAL TECHNICAL TRIAGE for all clinician-reported issues from Help Desk
- Receives technical escalations from Clinical NOC when DocX shows degradation (via ServiceNow incident)
- Receives all escalations via formal ServiceNow incidents only - no informal/backchannel escalations
- Bridge between clinical teams (Clinical NOC, Optum Health) and internal technical teams (Infrastructure Ops, Citrix, Cloud Ops)
- Epic workflow validation and performance
- Database performance monitoring and troubleshooting
- Application-level troubleshooting and escalation
- Coordinate with Optum Health for clinical impact communication (via ServiceNow incident updates)
- Engage Infrastructure Ops, Citrix, and Cloud Ops when issues span multiple layers
Identified Gaps:
- ⚠️ Limited infrastructure visibility - SystemPulse doesn't see network/Citrix/Azure layer issues
- ⚠️ Integration gaps - Correlation between SystemPulse alerts and Citrix/network issues requires manual effort
- ⚠️ Triage process not fully documented - Need clear VBF driven runbooks for when to engage Infrastructure Ops vs. Citrix vs. Cloud Ops vs. multiple teams
- Recommendation:
- Establish integration between SystemPulse and Splunk for cross-layer correlation
🟡 Citrix Team (James Hallowell & Team)
Primary Focus: Session delivery, VDI performance, Citrix infrastructure
Responsibilities:
- A/R for Citrix Monitor / UberAgent
- R for Citrix-related data flowing into Splunk
- Session performance monitoring and troubleshooting
- Citrix infrastructure health
- Engaged by Optum Insight when session/VDI issues identified
Identified Gaps:
- ⚠️ Session-level only visibility - Limited correlation with app/network/infrastructure
- ⚠️ Splunk dashboard tuning - Manual tuning required; may not yet be optimized for Epic workflows
- ⚠️ Dynatrace OneAgent reservations - Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - creates gap in Dynatrace full-stack visibility into Citrix layer
- Recommendation:
- Work with Infrastructure Ops to integrate Citrix metrics into unified Splunk dashboards
- Decision needed: Accept Dynatrace gap for Citrix layer OR negotiate alternative (e.g., Cloud Ops manages OneAgent on Citrix VMs, Citrix team provides access)
🟠 Infrastructure Ops (Randy & Team)
Primary Focus: Azure infrastructure, network, observability (excluding Dynatrace)
Responsibilities:
- A/R for Azure Monitor / Azure Dashboards
- A/R for Splunk (infrastructure logs and correlation)
- A/R for Netscout (network visibility)
- Infrastructure health and performance
- Cross-platform correlation and troubleshooting
- Engaged by Optum Insight when infrastructure/network issues identified
Identified Gaps:
- ⚠️ Splunk Linux Sys Logs missing - Gap in log coverage for Linux systems (being addressed pre go-live with fluentbit)
- ⚠️ Netscout integration pending - Limited correlation with Epic/application context (tied to Citrix packets)
- ⚠️ Tool sprawl risk - Multiple monitoring tools may lead to alert fatigue and unclear ownership
- Recommendation:
- Add Linux Sys Logs to Splunk ingestion
- Complete Netscout → Splunk integration
🔶 Cloud Ops (Tom Hudak & Team)
Primary Focus: Dynatrace full-stack APM / Platform automation / IaC
Responsibilities:
- A/R for Dynatrace (full-stack APM)
- A/R for GitHub Issues (platform automation, IaC, CI/CD pipelines)
- Dynatrace deployment, configuration, and management
- Full-stack observability and application performance monitoring
- Platform automation and infrastructure-as-code
- Engaged by Optum Insight when Dynatrace detects issues OR platform/IaC issues arise
Identified Gaps:
- ⚠️ Dynatrace deployment in progress - Not yet providing full monitoring capability; needs continued deployment and configuration
- ⚠️ Citrix OneAgent - Critical gap: Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - limits Dynatrace visibility into Citrix session layer; creates blind spot for full-stack APM
- ⚠️ Decision needed on Citrix gap - Need executive decision: accept Dynatrace blind spot for Citrix layer OR negotiate alternative (Cloud Ops manages OneAgent on Citrix VMs)
- Recommendation:
- Continue Dynatrace deployment and Epic integration
- Escalate Citrix OneAgent decision to leadership - impacts Dynatrace ROI and full-stack visibility
- Clearly document current Dynatrace capabilities vs. post go-live roadmap
Escalation Matrix & Collaborative Support Model
Escalation Principles
Given the interconnected nature of Epic on Azure, the team that owns the monitoring tool detecting the issue becomes PRIMARY, and all other technical teams engage as SECONDARY to support correlation and troubleshooting.
Key Principles:
- Help Desk routes all clinician-reported issues to Optum Insight for initial technical triage via ServiceNow
- Clinical NOC (DocX owner) has defined escalation path: Opens ServiceNow incident + notifies Optum Health when DocX degrades; routes technical investigation to Optum Insight
- ALL escalations follow formal ServiceNow incident creation and engagement process - no informal or backchannel escalations
- Optum Insight acts as bridge between clinical teams (Clinical NOC, Optum Health) and internal technical teams (Infrastructure Ops, Citrix, Cloud Ops)
- Primary team owns initial triage and coordinates response
- All secondary teams (Optum Insight, Infrastructure Ops, Citrix, Cloud Ops) join bridge/war room for complex issues
- Optum Health (Ken Cam) owns all external communication and clinical stakeholder management
- Cross-layer correlation is expected - no team works in isolation
Escalation Matrix
| Issue Type | Entry Point | Primary Team (R/A) | Secondary Teams (C) | ServiceNow Assignment Group | Notes |
|---|---|---|---|---|---|
| Epic application performance | SystemPulse alert OR Help Desk | Optum Insight | Infra Ops, Citrix, Cloud Ops (if Dynatrace) | Epic - Azure (National West) | SystemPulse alerts (app or database); may require Azure/Citrix/Dynatrace correlation.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact. |
| Database performance (ODB) | SystemPulse database monitoring | Optum Insight | Infra Ops, Cloud Ops (if Dynatrace) | Epic - Azure (National West) | Database performance via SystemPulse; Azure SQL may require Infra Ops; Dynatrace context.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact. |
| Clinician-reported issue | Help Desk | Optum Insight (triage) | Infra Ops, Citrix, Cloud Ops | Epic - Azure (National West) | Help Desk routes to Optum Insight via ServiceNow.<br/>Clinical NOC: monitors DocX; if widespread, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms. |
| DocX clinician experience degradation | DocX dashboard (Clinical NOC) | Clinical NOC → opens ServiceNow + routes to Optum Insight (Epic - Azure (National West)) | Infra Ops, Citrix, Cloud Ops (by Optum Insight) | Route to Optum Insight | Path: Clinical NOC opens ServiceNow + notifies Optum Health (comms) AND Optum Insight (tech).<br/>No automation. Optum Health owns comms. |
| Citrix session issues | Citrix Monitor alert | Citrix Team | Optum Insight, Infra Ops, Cloud Ops (if Dynatrace - limited: no OneAgent) | USS_Virtual_Workspace | May be app, network, or Azure infra.<br/>Note: Dynatrace limited Citrix visibility (no OneAgent).<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health. Optum Health owns comms. |
| Azure infrastructure | Azure Monitor alert | Infrastructure Ops | Optum Insight, Citrix Team, Cloud Ops (if Dynatrace insights available) | Epic_Azure_Infrastructure_Ops (Prod/NonProd) | VM, network, storage issues; impacts all layers above.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact. |
| Network issues | Netscout alert | Infrastructure Ops | Optum Insight, Citrix Team, Cloud Ops (if Dynatrace insights available) | Epic_Azure_Infrastructure_Ops (Prod/NonProd) | Netscout alerts; affects app and session delivery.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact. |
| Hyperspace Web access | Hyperspace Web alert OR Help Desk | Optum Insight | Citrix, Infra Ops, Cloud Ops | Epic - Azure (National West) | Web access may be app, Citrix, or network.<br/>Clinical NOC: monitors DocX; if degraded + blocked, opens ServiceNow + notifies Optum Health. Optum Health comms. |
| Dynatrace | Dynatrace alert | Cloud Ops | Optum Insight, Infra Ops, Citrix | GitHub Issues OR Epic_Azure_Infrastructure_Ops (if production) | Full-stack APM; may indicate app or infra issue.<br/>Note: Dynatrace has limited Citrix visibility (no OneAgent on Citrix VMs).<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health. Optum Health comms. |
| Splunk | Splunk alert | Infrastructure Ops | Optum Insight, Citrix Team, Cloud Ops | Epic_Azure_Infrastructure_Ops (Prod/NonProd) | Cross-layer alerts require all teams to triage.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact. |
| Cloud Ops / Platform issues | GitHub Issues | Cloud Ops | Infrastructure Ops, Optum Insight, Citrix Team | GitHub Issues | Platform-level issues, automation, IaC, CI/CD pipelines.<br/>Clinical NOC informed if affects DocX via ServiceNow.<br/>Optum Health manages comms if affects production. |
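For runbooks or auto-assignment rules, part of the matrix above can be expressed as a lookup table. The sketch below is illustrative only and covers a subset of rows: the primary teams and assignment groups are copied from the matrix, while the `Route` type and `route_issue` function are hypothetical names, not an existing ServiceNow integration.

```python
from typing import Dict, NamedTuple


class Route(NamedTuple):
    primary_team: str
    assignment_group: str


# Subset of the escalation matrix; values copied from the table above.
ESCALATION_MATRIX: Dict[str, Route] = {
    "Epic application performance": Route("Optum Insight", "Epic - Azure (National West)"),
    "Database performance (ODB)": Route("Optum Insight", "Epic - Azure (National West)"),
    "Citrix session issues": Route("Citrix Team", "USS_Virtual_Workspace"),
    "Azure infrastructure": Route("Infrastructure Ops", "Epic_Azure_Infrastructure_Ops (Prod/NonProd)"),
    "Network issues": Route("Infrastructure Ops", "Epic_Azure_Infrastructure_Ops (Prod/NonProd)"),
}


def route_issue(issue_type: str) -> Route:
    """Look up the primary team and assignment group for an issue type."""
    try:
        return ESCALATION_MATRIX[issue_type]
    except KeyError:
        # Unlisted types are rejected rather than guessed at.
        raise LookupError(f"No route defined for issue type: {issue_type!r}")


print(route_issue("Citrix session issues"))
```

Rejecting unknown issue types, rather than silently defaulting, is a deliberate choice in this sketch so that gaps in the matrix surface early.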
Standard Response Flow
Flow 1: Clinician-Reported Issue
flowchart TD
A[Clinician reports issue to Help Desk] --> B[Help Desk creates ServiceNow incident]
B --> C[Routes to Optum Insight<br/>Epic - Azure National West]
C --> D[Optum Insight performs initial triage<br/>Checks: SystemPulse, Hyperspace Web, DB]
D --> E{Issue spans<br/>multiple layers?}
E -->|Yes| F[Creates child tasks for<br/>Infrastructure Ops/Citrix/Cloud Ops]
F --> G[Opens war room if needed]
E -->|No| H[Optum Insight continues investigation]
D --> I[Clinical NOC monitors DocX<br/>for clinical impact]
I --> J{DocX shows<br/>degradation?}
J -->|Yes| K[Clinical NOC opens separate<br/>ServiceNow incident]
K --> L[Clinical NOC notifies<br/>Optum Health]
J -->|No| M[Continue monitoring]
L --> N[Optum Health manages<br/>external communications]
G --> O[All teams correlate data<br/>from their tools]
H --> O
O --> P[Primary team coordinates<br/>resolution]
P --> Q[Optum Health communicates<br/>resolution to clinicians]
Q --> R[Post-incident review]
Key Steps:
- Triage: Optum Insight checks SystemPulse (app & database), Hyperspace Web
- Parallel Monitoring: Clinical NOC monitors DocX for clinical impact patterns
- Escalation (if needed):
- Multi-layer issue → Optum Insight creates child tasks for Infrastructure Ops/Citrix/Cloud Ops
- Clinical impact → Clinical NOC opens separate incident + notifies Optum Health
- Coordination: All teams correlate data; Primary team leads resolution
- Communication: Optum Health manages external communications throughout
- Closure: Post-incident review with all teams
Flow 2: DocX Clinician Experience Degradation
flowchart TD
A[Clinical NOC identifies<br/>degradation in DocX] --> B{Severity<br/>assessment}
B --> C[Clinical NOC opens<br/>ServiceNow incident]
C --> D[Routes to Optum Insight<br/>for technical triage]
C --> E[Notifies Optum Health<br/>ServiceNow/Teams Channel]
D --> F[Optum Insight performs<br/>technical triage]
F --> G[Checks: SystemPulse, Hyperspace Web,<br/>DB monitoring]
G --> H{Root cause<br/>identified?}
H -->|Unclear| I[Creates child tasks for<br/>Infrastructure Ops/Citrix/Cloud Ops]
H -->|Identified| J[Engages appropriate team]
E --> K[Optum Health assesses<br/>communication needs]
K --> L{Widespread<br/>impact?}
L -->|Yes| M[Prepares external<br/>communications]
L -->|No| N[Monitors situation<br/>with Clinical NOC]
I --> O[All technical teams<br/>correlate data]
J --> O
O --> P[Clinical NOC continues<br/>monitoring DocX]
P --> Q[Resolution coordinated<br/>by Optum Insight]
M --> R[Optum Health manages<br/>external comms]
N --> R
R --> S[Optum Health communicates<br/>resolution]
S --> T[Post-incident review]
Key Steps:
- Detection: Clinical NOC identifies DocX degradation (manual monitoring)
- Dual Escalation:
- Technical: Opens ServiceNow incident → Routes to Optum Insight
- Communications: Notifies Optum Health directly (Teams/Phone/Email)
- Technical Investigation: Optum Insight triages → Creates child tasks as needed
- Communication Assessment: Optum Health evaluates need for external communications
- Continuous Monitoring: Clinical NOC continues DocX monitoring throughout
- Resolution: Optum Insight coordinates technical fix
- External Comms: Optum Health manages stakeholder updates
- Closure: Post-incident review with all teams
Critical Notes:
- ✅ Defined escalation path exists - Process is mature and documented
- ✅ Dual notification ensures both technical response and external communications
Flow 3: Monitoring Tool Alert
flowchart TD
A[Alert fires in<br/>monitoring tool] --> B[Automated ServiceNow<br/>incident created]
B --> C[Assigned to tool owner's<br/>assignment group]
C --> D[Primary team performs<br/>initial triage]
D --> E{Impacts Epic<br/>application?}
E -->|Yes| F[Engage Optum Insight<br/>via ServiceNow child task]
E -->|No| G[Primary team continues<br/>investigation]
F --> H[Clinical NOC monitors<br/>DocX for clinical impact]
G --> H
H --> I{DocX shows<br/>impact?}
I -->|Yes| J[Clinical NOC opens separate<br/>ServiceNow incident]
J --> K[Clinical NOC notifies<br/>Optum Health]
I -->|No| L[Continue monitoring]
K --> M[Optum Health manages<br/>external communications]
D --> N{Cross-layer<br/>issue?}
N -->|Yes| O[Primary team creates child tasks<br/>for secondary teams]
N -->|No| P[Primary team owns resolution]
O --> Q[All teams correlate data]
P --> Q
Q --> R[Primary team coordinates<br/>resolution]
M --> S[Optum Health communicates<br/>resolution if clinical impact]
R --> S
S --> T[Post-incident review]
Key Steps:
- Automated Detection: Monitoring tool alert → ServiceNow incident created
- Assignment: Auto-routed to tool owner's assignment group
- Initial Triage: Primary team (tool owner) investigates
- Epic Impact Check: If Epic-related → Engage Optum Insight via child task
- Parallel Monitoring: Clinical NOC monitors DocX for clinical impact
- Cross-Layer Assessment: Create child tasks if issue spans multiple teams
- Correlation: All teams share data from their tools
- Resolution: Primary team coordinates; Optum Health handles external comms if needed
- Closure: Post-incident review
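The key steps of Flow 3 can be sketched as a small planning function. It is a simplified illustration, not an implementation of any ServiceNow workflow: the `Alert` dataclass, its fields, and `plan_response` are hypothetical names, and the three boolean inputs stand in for judgments the teams make during triage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    tool: str
    primary_team: str      # owner of the tool that fired the alert
    impacts_epic: bool     # outcome of the Epic impact check
    cross_layer: bool      # outcome of the cross-layer assessment
    docx_degraded: bool    # Clinical NOC's reading of DocX


def plan_response(alert: Alert, secondary_teams: List[str]) -> List[str]:
    """Return the ordered actions for a monitoring tool alert, following Flow 3."""
    steps = [
        f"Create ServiceNow incident for {alert.tool} alert",
        f"Assign to {alert.primary_team} for initial triage",
    ]
    engaged = {alert.primary_team}
    if alert.impacts_epic and "Optum Insight" not in engaged:
        steps.append("Engage Optum Insight via ServiceNow child task")
        engaged.add("Optum Insight")
    if alert.cross_layer:
        for team in secondary_teams:
            if team not in engaged:  # avoid duplicate child tasks
                steps.append(f"Create child task for {team}")
                engaged.add(team)
    if alert.docx_degraded:
        steps.append("Clinical NOC opens separate ServiceNow incident")
        steps.append("Clinical NOC notifies Optum Health")
    steps.append(f"{alert.primary_team} coordinates resolution")
    steps.append("Post-incident review")
    return steps


alert = Alert("Netscout", "Infrastructure Ops", True, True, False)
for step in plan_response(alert, ["Optum Insight", "Citrix Team", "Cloud Ops"]):
    print(step)
```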
Critical Gaps & Recommendations
🔴 High Priority Gaps
- Dynatrace OneAgent on Citrix VMs - Critical Decision Needed
- Gap: Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - creates significant blind spot for Dynatrace full-stack APM visibility into Citrix session layer
- Impact:
- Dynatrace cannot provide full-stack observability including Citrix layer
- May miss performance issues that span application → Citrix → infrastructure
- Troubleshooting requires manual correlation between Dynatrace (app/infra) and UberAgent/Citrix Monitor (session)
- Owner: Cloud Ops (Tom) + Citrix Team (James Hallowell) + Executive Decision
- Options:
- Option 1: Accept gap - Dynatrace monitors app/infra only; Citrix layer remains UberAgent/Citrix Monitor only (status quo)
- Option 2: Cloud Ops manages OneAgent on Citrix VMs; Citrix team provides access but not ongoing management
- Option 3: Negotiate limited OneAgent deployment on subset of Citrix VMs for correlation testing
- Option 4: Delay Dynatrace full deployment until Citrix team agreement OR alternative solution found
- Recommendation: TBD - Decision needed before go-live
- Timeline: Decision needed before go-live
- Workaround: Azure Monitor / Dashboards provide a complete profile of VM health and high-level visibility into it
- Splunk Linux Sys Logs Missing
- Gap: Linux system logs not currently ingested into Splunk
- Impact: Blind spot for Linux-based infrastructure issues; incomplete log correlation
- Owner: Infrastructure Ops (Randy)
- Recommendation: Prioritize Linux Sys Log ingestion into Splunk before go-live
- Timeline: Before go-live
- Workaround: Azure Monitor / Dashboards provide a complete profile of VM health and high-level visibility into it
- Dynatrace Deployment Completion
- Gap: Dynatrace deployment in progress; not yet providing full performance monitoring capability (within acknowledged scope given Citrix gap)
- Impact: Limited full-stack observability; may miss performance issues that span multiple layers
- Owner: Cloud Ops (Tom)
- Recommendation:
- Continue Dynatrace deployment and Epic integration
- Clearly document current capabilities vs. post go-live roadmap
- Clearly document Citrix blind spot in all Dynatrace documentation
- Define when issues escalate from Infrastructure Ops/Optum Insight to Cloud Ops (Dynatrace alerts vs. other tool alerts)
- Timeline: Continue through go-live; full deployment (minus Citrix gap) prior to go-live
- Workaround: Azure Monitor / Dashboards provide a complete profile of VM health and high-level visibility into it
- Optum Insight Triage Process & Runbooks
- Gap: Triage process for clinician-reported issues not fully documented; decision tree for when to engage Infrastructure Ops vs. Citrix vs. Cloud Ops vs. multiple teams needs clarity; multiple escalation sources (Help Desk, Clinical NOC - all via ServiceNow)
- Impact: Delays in engaging right teams; potential for missed cross-layer root causes; confusion on prioritization; unclear when to engage Cloud Ops (Dynatrace/platform issues)
- Owner: Optum Insight (Jordan Lambert)
- Recommendation:
- Create triage decision tree/flowchart for Optum Insight team including Cloud Ops engagement criteria
- Document runbooks for common issue patterns (including database performance issues)
- Establish clear priority: P1 incidents always take priority regardless of source (Help Desk, Clinical NOC)
- Define SLAs for initial triage
- Define when to engage Cloud Ops: Dynatrace alerts, platform/IaC issues, need for full-stack APM insights
- Timeline: Before go-live
- Netscout Integration
- Gap: Integration with Epic/Netcarriers pending; limited application context in network data
- Impact: Network issues may not be automatically correlated with app/session performance degradation visible to Optum Insight or Clinical NOC
- Owner: Infrastructure Ops (Randy)
- Recommendation: Prioritize Netscout → Splunk integration for automated network event correlation
- Timeline: Before go-live
🟡 Medium Priority Gaps
- Cross-Tool Correlation
- Gap: Limited correlation between SystemPulse (app & database), Citrix Monitor (session), Netscout (network), Azure Monitor (infrastructure), Dynatrace (full-stack - with Citrix gap), and DocX (clinical impact); support teams must manually check multiple tools
- Impact: Troubleshooting requires manual correlation across multiple tools by Optum Insight and other teams, slowing MTTR; Clinical NOC may see DocX degradation before technical teams see alerts; Dynatrace insights may be siloed in Cloud Ops
- Owner: Infrastructure Ops (Randy) - Splunk owner + Clinical NOC (Tom Busse) - DocX owner + Cloud Ops (Tom) - Dynatrace owner
- Recommendation:
- Use Splunk as correlation engine; integrate all tool data including DocX metrics and Dynatrace insights into unified dashboards
- Provide Clinical NOC access to unified dashboard for visibility into technical monitoring (helps with triage decision)
- Provide Optum Insight access to unified dashboard for rapid triage including Dynatrace data
- Define how Dynatrace data flows to Splunk for correlation
- Create correlation views: when Clinical NOC sees DocX degradation, can quickly see if SystemPulse/Citrix/Azure/Dynatrace also showing issues
- Timeline: Initial dashboards before go-live; advanced correlation post go-live
- Alert Tuning & Noise Management
- Gap: Multiple tools generating alerts; tuning still in progress across all platforms
- Impact: Alert fatigue for Optum Insight (primary triage team) and Clinical NOC (DocX monitoring); risk of missed critical alerts; ServiceNow ticket volume may overwhelm; Dynatrace alerts may add to noise
- Owner: All teams (each for their tools) + Clinical NOC (DocX) + Cloud Ops (Dynatrace)
- Recommendation:
- Establish alert taxonomy and severity standards across all tools including Dynatrace
- Weekly alert tuning sessions during first month post go-live
- Define alert suppression rules for known maintenance windows
- Configure ServiceNow auto-assignment rules
- Define Dynatrace alert routing: GitHub Issues vs. ServiceNow based on severity/impact
- Note: DocX alerts remain manual (Clinical NOC judgment); no automated ticketing
- Timeline: Ongoing through go-live and first 30 days
- DocX Integration with Technical Monitoring
- Gap: Limited correlation between DocX clinician experience and SystemPulse/Citrix/Azure/Dynatrace metrics; individual teams must manually check each tool
- Impact: Triage may take longer overall; may miss technical root cause signals
- Owner: Clinical NOC (Tom Busse) + Optum Insight (Jordan Lambert) + Infrastructure Ops (Randy - Splunk) + Cloud Ops (Tom - Dynatrace)
- Recommendation:
- Create dedicated dashboard page on epic.optum.com that is accessible by all support teams showing DocX + SystemPulse + Citrix + Azure + Dynatrace dashboard links all in one place
- Enhance Clinical NOC's ability to quickly correlate DocX degradation with technical signals including Dynatrace insights
- Timeline: Post go-live enhancement (within 60 days)
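The recommendation under Alert Tuning & Noise Management to define suppression rules for known maintenance windows could look like the sketch below. It assumes windows are stored as start/end pairs with the end exclusive; the function name `is_suppressed` and the sample window are hypothetical and not taken from any tool's configuration.

```python
from datetime import datetime
from typing import List, Tuple

Window = Tuple[datetime, datetime]  # (start, end), end exclusive


def is_suppressed(alert_time: datetime, windows: List[Window]) -> bool:
    """True if the alert fired inside any known maintenance window."""
    return any(start <= alert_time < end for start, end in windows)


# Hypothetical maintenance window, for illustration only.
windows = [(datetime(2025, 1, 5, 1, 0), datetime(2025, 1, 5, 3, 0))]

print(is_suppressed(datetime(2025, 1, 5, 2, 30), windows))
print(is_suppressed(datetime(2025, 1, 5, 3, 0), windows))
```

Because the end of the window is exclusive, an alert that fires exactly at the end time is not suppressed.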
Pre Go-Live Action Items
🔴 Critical Before Go-Live (Must Complete)
- Dynatrace OneAgent / Citrix VMs: Escalate to executive leadership for decision - Citrix team does not want OneAgent on VMs; impacts Dynatrace full-stack visibility and ROI (Owner: Cloud Ops + Citrix Team + Executive Leadership)
- Splunk: Add Linux Sys Logs ingestion (Owner: Infrastructure Ops)
- Dynatrace: Document current capabilities vs. roadmap; clearly document the Citrix blind spot; confirm the medium readiness rating (Owner: Cloud Ops)
- Cloud Ops + Optum Insight: Define escalation criteria - when does Optum Insight engage Cloud Ops (Dynatrace alerts, platform issues, need for full-stack insights) via ServiceNow (Owner: Cloud Ops + Optum Insight)
- Optum Insight: Create triage decision tree/flowchart and runbooks for common issue patterns including database performance; include prioritization for multiple escalation sources (Help Desk, Clinical NOC - all via ServiceNow); include Cloud Ops engagement criteria (Owner: Optum Insight)
- Optum Insight: Define SLAs for initial triage response times (Owner: Optum Insight)
- Help Desk: Train on routing all clinician-reported Epic issues to Optum Insight (Epic - Azure National West) via ServiceNow, unless a Citrix issue is the clear cause (route those to USS_Virtual_Workspace) (Owner: Optum Insight + Help Desk)
- Netscout: Complete integration with Splunk for network event correlation (Owner: Infrastructure Ops)
- Cross-team: Establish unified war room procedures, communication protocols, and test bridge lines; define Cloud Ops role in war room; reinforce all escalations via ServiceNow (Owner: All teams including Cloud Ops)
- All teams: Alert tuning sprint - reduce noise, validate critical alerts trigger correctly and create ServiceNow incidents; include Dynatrace alert routing definition (Owner: Each team for their tools)
- Documentation: Finalize runbooks for each tool with escalation paths and ServiceNow assignment groups clearly defined; include Cloud Ops / Dynatrace runbook; document that ALL escalations go through ServiceNow - no backchannel/informal escalations (Owner: Each team for their tools)
- ServiceNow: Configure auto-assignment rules for monitoring tool alerts to route to correct assignment groups; define Dynatrace alert routing (GitHub Issues vs. ServiceNow) (Owner: Infrastructure Ops + Cloud Ops + ServiceNow admin)
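As an illustration of the auto-assignment item above, the tool-to-group mapping can be captured as a lookup table. This is a sketch under assumptions: the group names are copied from the assignment-group table in the appendix (with the Prod/NonProd qualifier omitted), the function name is hypothetical, and Dynatrace is deliberately left unmapped because its routing is still to be defined.

```python
from typing import Optional

# Tool -> ServiceNow assignment group, taken from the appendix table.
# Dynatrace is intentionally absent: its routing is not yet defined.
ASSIGNMENT_GROUPS = {
    "SystemPulse": "Epic - Azure (National West)",
    "Hyperspace Web": "Epic - Azure (National West)",
    "Citrix Monitor": "USS_Virtual_Workspace",
    "UberAgent": "USS_Virtual_Workspace",
    "Azure Monitor": "Epic_Azure_Infrastructure_Ops",
    "Netscout": "Epic_Azure_Infrastructure_Ops",
    "Splunk": "Epic_Azure_Infrastructure_Ops",
}


def assignment_group_for(tool: str) -> Optional[str]:
    """Return the assignment group for a tool's alerts.

    Returns None for tools with no automatic assignment, such as DocX
    (manual escalation by Clinical NOC) or any tool not yet mapped.
    """
    return ASSIGNMENT_GROUPS.get(tool)
```

Returning `None` instead of a default group keeps unmapped tools visible during the alert tuning sprint, so they are not silently routed to the wrong team.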
🟡 Important Before Go-Live (High Priority)
- Splunk: Create unified dashboards pulling from SystemPulse (app & database), Citrix Monitor, Azure Monitor, Netscout, Dynatrace (noting Citrix gap), and DocX - accessible to Clinical NOC for correlation AND Optum Insight for triage (Owner: Infrastructure Ops + Clinical NOC + Cloud Ops)
- Cloud Ops + Infrastructure Ops: Define how Dynatrace data flows to Splunk for correlation (API integration, log forwarding, etc.) (Owner: Cloud Ops + Infrastructure Ops)
- SystemPulse: Tune Epic and database alerts for Azure environment baselines vs. on-prem (Owner: Optum Insight)
- All teams: Define and publish SLAs: MTTD (Mean Time To Detect), MTTI (Mean Time To Investigate), MTTR (Mean Time To Resolve); include Clinical NOC response time to DocX degradation; include Cloud Ops response time to Dynatrace alerts (Owner: All teams)
- Communication: Set up stakeholder notification lists and status page for go-live - managed by Optum Health (Owner: Optum Health)
- Optum Health: Develop communication playbook with pre-drafted templates for various incident scenarios (P1, P2, resolution notices, CDO updates) (Owner: Optum Health)
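To make the SLA item above concrete, the three metrics can be computed from incident timestamps once the teams agree on measurement points. The sketch below assumes definitions this document does not specify: MTTD runs from issue start to detection, MTTI from detection to the start of investigation, and MTTR from issue start to resolution. The `Incident` fields are hypothetical and would need to be mapped to real ServiceNow fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime        # when the issue began (assumed to be known)
    detected: datetime       # first alert or report
    investigating: datetime  # triage team begins work
    resolved: datetime       # service restored


def mean_delta(deltas):
    """Average a non-empty collection of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def sla_metrics(incidents):
    """Compute MTTD, MTTI, and MTTR for a non-empty list of incidents."""
    return {
        "MTTD": mean_delta(i.detected - i.started for i in incidents),
        "MTTI": mean_delta(i.investigating - i.detected for i in incidents),
        "MTTR": mean_delta(i.resolved - i.started for i in incidents),
    }
```

Whatever definitions are chosen, publishing them alongside the SLA targets keeps each team's numbers comparable.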
🟢 Post Go-Live Improvements (30-90 days)
- Dynatrace: Complete full deployment and Epic integration (within scope of acknowledged Citrix gap) (Owner: Cloud Ops)
- Dynatrace / Citrix Gap: Review the decision on OneAgent for Citrix VMs based on 90 days of operational data - is the gap impacting troubleshooting? (Owner: Cloud Ops + Citrix Team + Executive Leadership)
- Implement ML-based anomaly detection across integrated tools including Dynatrace (Owner: Infrastructure Ops + Cloud Ops + Clinical NOC)
- Develop predictive analytics for capacity planning (Owner: Infrastructure Ops + Cloud Ops)
- Create advanced correlation rules in Splunk to assist Optum Insight triage and Clinical NOC triage; include DocX metrics and Dynatrace insights (Owner: Infrastructure Ops + Optum Insight + Clinical NOC + Cloud Ops)
- Quarterly review of RACI and tool effectiveness with all teams including Clinical NOC, Optum Health, and Cloud Ops (Owner: All teams)
- Evaluate tool rationalization opportunities - reduce overlap where possible; evaluate Dynatrace value given Citrix gap (Owner: All teams)
- Review Clinical NOC escalation process effectiveness: Response times, false positives, missed escalations (Owner: Clinical NOC + Optum Health + Optum Insight)
- Review and optimize ServiceNow assignment group routing based on actual incident patterns; review Dynatrace alert routing effectiveness (Owner: All teams)
- Analyze Optum Insight triage data: average time to engage secondary teams, most common escalation patterns, response times to different sources (Help Desk vs. Clinical NOC); include Cloud Ops engagement patterns (Owner: Optum Insight)
- Add additional Linux system coverage to Splunk ingestion beyond initial Sys Logs (Owner: Infrastructure Ops)
Appendix: Contact Information & ServiceNow Assignment Groups
ServiceNow Engagement Model
- All monitoring alerts and incidents must be routed through ServiceNow using the appropriate assignment group.
- Help Desk routes all clinician-reported Epic issues to Optum Insight for initial technical triage via ServiceNow.
- ALL escalations follow formal ServiceNow incident creation and engagement process - no informal or backchannel escalations.
- ServiceNow is the single source of truth for all Epic on Azure incidents.
ServiceNow Assignment Groups
| Team | ServiceNow Assignment Group | Scope / Responsibilities | Tool Ownership | Primary Contact Type |
|---|---|---|---|---|
| Business Ops (Epic App DBA) <br/>aka Optum Insight | Epic - Azure (National West) | Triage clinician issues; escalations from NOC;<br/>Epic app, SystemPulse, Hyperspace Web;<br/>bridge: clinical (NOC, Health) & tech (Infra, Citrix, Cloud) | SystemPulse, Hyperspace Web | Help Desk routes via ServiceNow;<br/>NOC routes DocX via ServiceNow;<br/>ALL via ServiceNow |
| Citrix Team | USS_Virtual_Workspace | Citrix infrastructure, VDI performance, session delivery, Citrix Monitor / UberAgent;<br/>Note: Does not manage Dynatrace OneAgent on Citrix VMs | Citrix Monitor / UberAgent | Engaged by Optum Insight via ServiceNow or direct alert |
| Azure Platform Ops | Epic_Azure_Infrastructure_Ops (Prod/NonProd) | Azure infrastructure, Azure Monitor, Netscout, Splunk, network | Azure Monitor, Netscout, Splunk, Azure Dashboards | Engaged by Optum Insight via ServiceNow or direct alert |
| Clinical NOC | N/A (escalates to others) | DocX monitoring; opens ServiceNow incident and escalates to Optum Insight (technical) and notifies Optum Health (communications) when DocX degrades | DocX | NOC monitoring; defined escalation path via ServiceNow - no automated ticketing |