Support | Updated July 3, 2026

Epic on Azure Monitoring - RACI Matrix

Tags: monitoring, raci, support, responsibilities, go-live

RACI Legend

R = Responsible (Does the work)
A = Accountable (Final authority/decision maker)
C = Consulted (Provides input)
I = Informed (Kept in the loop)

Go-Live Readiness Rating Scale

🟢 High (4-5): Tool fully operational, team trained, processes documented, confident in ability to support
🟡 Medium (3): Tool functional but gaps exist, some training needed, moderate confidence
🔴 Low (1-2): Significant gaps, limited readiness, low confidence in current state

RACI Matrix

<table> <thead> <tr> <th width="20%">Monitoring Area/Tool</th> <th width="15%" style="white-space: nowrap;">Go-Live Readiness</th> <th width="4%">Optum Health (Ken Cam)</th> <th width="4%">Optum Insight (Jordan Lambert)</th> <th width="4%">Citrix Team (James Hallowell)</th> <th width="4%">Infra Ops (Randy)</th> <th width="4%">Cloud Ops (Tom)</th> <th width="5%">Clinical NOC (Tom Busse)</th> <th width="40%">Notes & Commentary</th> </tr> </thead> <tbody> <tr> <td>SystemPulse (Epic App Monitoring) Includes Database/ODB Services <a href="https://systempulse.uhc.com/SystemPulse/Monitor.aspx">Here</a></td> <td style="white-space: nowrap;">🟢 High (4)</td> <td>C/I</td> <td>A/R</td> <td>I</td> <td>I</td> <td>I</td> <td>I</td> <td>Capabilities: Epic application performance monitoring, database (ODB Services) monitoring, workflow validation. Limitations: Focused on app layer only; basic infrastructure visibility. Limited correlation with network/Citrix issues without manual effort.</td> </tr> <tr> <td>Hyperspace Web (Epic Web Access)</td> <td style="white-space: nowrap;">🟢 High (4)</td> <td>C/I</td> <td>A/R</td> <td>C</td> <td>I</td> <td>I</td> <td>I</td> <td>Capabilities: Tracks web-based Epic access and session availability. Limitations: No deep session analytics; relies on Citrix for full user experience visibility. Limited troubleshooting beyond access checks. Front-end connectivity validation only.</td> </tr> <tr> <td>Citrix Monitor / UberAgent</td> <td style="white-space: nowrap;">🟡 Medium (3)</td> <td>I</td> <td>C</td> <td>A/R</td> <td>C/I</td> <td>I</td> <td>I</td> <td>Capabilities: Detailed Citrix session performance metrics; tracks latency, logon times, resource utilization. Limitations: Session-level only; lacks full correlation with app/network layers. Splunk dashboards need manual tuning. Limited predictive analytics. 
Data retention may be insufficient for trend analysis.</td> </tr> <tr> <td>Splunk (Infra & App Logs) <a href="https://epic.optum.com/getting-started/onboarding-tools/#monitoring-access">Request Access</a></td> <td style="white-space: nowrap;">🟡 Medium (3)</td> <td>I</td> <td>C</td> <td>R (Citrix data)</td> <td>A/R (Infrastructure)</td> <td>C</td> <td>C</td> <td>Capabilities: Aggregates logs from multiple sources; custom dashboards; correlation potential. Limitations: Real-time correlation needs advanced queries; high licensing costs. Requires query expertise for complex troubleshooting. Missing Linux Sys Logs - gap in log coverage. Epic-specific workflow correlation still maturing.</td> </tr> <tr> <td>Azure Monitor / Dashboards <a href="https://epic.optum.com/getting-started/onboarding-access/#azure-account">Request Access</a></td> <td style="white-space: nowrap;">🟢 High (4)</td> <td>I</td> <td>C</td> <td>C</td> <td>A/R</td> <td>C</td> <td>I</td> <td>Capabilities: Native Azure monitoring for VMs, databases, network; custom dashboards; AI-driven anomaly detection. Limitations: Gaps in hybrid visibility; requires tuning for actionable alerts vs. noise. Native Azure focus may miss Epic-specific context.</td> </tr> <tr> <td>Dynatrace <a href="https://epic.optum.com/getting-started/onboarding-tools/#dynatrace">Request Access</a></td> <td style="white-space: nowrap;">🟢 High (4)</td> <td>I</td> <td>C</td> <td>C</td> <td>C</td> <td>A/R</td> <td>I</td> <td>Capabilities: Full-stack APM with AI-driven insights; application topology mapping. Limitations: Deployment in progress; requires ActiveGate setup and advanced configuration for Epic. Limited app performance monitoring capability currently. Licensing constraints may limit full-stack visibility. Critical Gap: Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - limits Dynatrace visibility into Citrix layer. 
Not yet fully integrated with Epic workflows.</td> </tr> <tr> <td>Netscout (Network Visibility)</td> <td style="white-space: nowrap;">🟡 Medium (3)</td> <td>I</td> <td>C</td> <td>C</td> <td>A/R</td> <td>C</td> <td>I</td> <td>Capabilities: Packet-level visibility; detects and mitigates DDoS attacks; flow-based analytics. Limitations: No application-level context. Integration with Epic/Netcarriers pending. Limited correlation without Splunk/Dynatrace integration. Flow-based analytics for troubleshooting only.</td> </tr> <tr> <td>DocX (Dedicated Ops for Clinical eXperience) <a href="https://docx.optum.com/">Here</a></td> <td style="white-space: nowrap;">🟢 High (4)</td> <td>C</td> <td>C</td> <td>I</td> <td>I</td> <td>I</td> <td>A/R</td> <td>Capabilities: Clinician-facing performance dashboard; provider-level experience tracking. Escalation Path: Clinical NOC opens incident and notifies Optum Health when DocX shows degradation. Limitations: Limited end-user experience data beyond clinician workflows. Manual escalation process (not automated). Gap: Integration with other monitoring tools for correlation needs strengthening.</td> </tr> <tr> <td>Azure Dashboards (Custom) <a href="https://epic.optum.com/getting-started/onboarding-access/#azure-account">Request Access</a></td> <td style="white-space: nowrap;">🟢 High (4)</td> <td>I</td> <td>C</td> <td>C</td> <td>A/R</td> <td>C</td> <td>I</td> <td>Capabilities: Custom Azure-native dashboards for infrastructure metrics. Limitations: Requires manual dashboard creation and maintenance. No anomaly detection. Limited to Azure-native metrics without integration work.</td> </tr> <tr> <td>GitHub (IaC / Platform Issues) <a href="https://github.com/optum-tech-compute/ohemr-ops/issues">Here</a></td> <td style="white-space: nowrap;">🟢 High (4)</td> <td>I</td> <td>I</td> <td>I</td> <td>C</td> <td>A/R</td> <td>I</td> <td>Capabilities: Issue tracking for infrastructure-as-code, CI/CD pipelines, platform automation. 
Scope: Non-urgent platform/code issues; separate from production incidents.</td> </tr> </tbody> </table>
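
A RACI matrix is easier to keep consistent when each tool has exactly one Accountable team. The sketch below transcribes the role cells from the table above and flags any tool that breaks that rule. The names `TEAMS`, `RACI`, `roles`, `accountable_teams`, and `check_single_accountable` are illustrative only and do not refer to any existing tooling.

```python
# Team columns, in the same order as the RACI table.
TEAMS = ["Optum Health", "Optum Insight", "Citrix Team",
         "Infra Ops", "Cloud Ops", "Clinical NOC"]

# Role cells copied from the RACI table, one row per monitoring tool.
RACI = {
    "SystemPulse":                ["C/I", "A/R", "I", "I", "I", "I"],
    "Hyperspace Web":             ["C/I", "A/R", "C", "I", "I", "I"],
    "Citrix Monitor / UberAgent": ["I", "C", "A/R", "C/I", "I", "I"],
    "Splunk":                     ["I", "C", "R (Citrix data)", "A/R (Infrastructure)", "C", "C"],
    "Azure Monitor / Dashboards": ["I", "C", "C", "A/R", "C", "I"],
    "Dynatrace":                  ["I", "C", "C", "C", "A/R", "I"],
    "Netscout":                   ["I", "C", "C", "A/R", "C", "I"],
    "DocX":                       ["C", "C", "I", "I", "I", "A/R"],
    "Azure Dashboards (Custom)":  ["I", "C", "C", "A/R", "C", "I"],
    "GitHub":                     ["I", "I", "I", "C", "A/R", "I"],
}


def roles(cell: str) -> set[str]:
    """Extract RACI letters from a cell such as 'A/R (Infrastructure)'."""
    letters = cell.split("(")[0]  # drop any parenthetical qualifier
    return {part.strip() for part in letters.split("/") if part.strip()}


def accountable_teams(tool: str) -> list[str]:
    """Teams whose cell for this tool contains 'A'."""
    return [team for team, cell in zip(TEAMS, RACI[tool]) if "A" in roles(cell)]


def check_single_accountable() -> dict[str, list[str]]:
    """Return the tools that do not have exactly one Accountable team."""
    return {tool: accountable_teams(tool)
            for tool in RACI
            if len(accountable_teams(tool)) != 1}
```

An empty result from `check_single_accountable()` means every tool in the matrix has a single Accountable owner. Re-running the check after any edit to the table is a cheap way to catch accidental double ownership.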

Team Responsibilities Overview

🔵 Optum Health (Ken Cam & Team)

Primary Focus: End-user communications / CDO & Clinician site relationship

Responsibilities:

Owns all external communications to clinicians, CDO, and clinical sites
Primary clinical stakeholder relationship owner
Receives notifications from Clinical NOC when DocX shows degradation requiring external communications
Translates clinical impact to business/executive stakeholders
Monitors clinician experience trends (via DocX, owned by Clinical NOC)
Escalation point for widespread clinical impact requiring executive visibility

Identified Gaps:

⚠️ Communication templates - Need pre-drafted templates for various incident scenarios (P1, P2, resolution notices)
Recommendation:
- Validate Clinical NOC → Optum Health escalation path with tabletop exercises
- Develop communication playbook with templates for different scenarios
- All communications to technical teams go through formal ServiceNow incidents only

🟣 Clinical NOC (Tom Busse & Team)

Primary Focus: DocX monitoring / Clinician experience monitoring / NOC operations

Responsibilities:

A/R for DocX (clinician experience monitoring)
Monitors clinician-facing performance dashboards 24x7
Defined escalation path: When DocX shows degradation, Clinical NOC manually opens ServiceNow incident AND notifies Optum Health
Routes technical issues to Optum Insight via ServiceNow incident
Reports clinician experience trends to Optum Health for external communication
First line of detection for clinician-impacting issues via DocX

Identified Gaps:

⚠️ DocX integration with other tools - Limited correlation between DocX clinician experience and SystemPulse/Citrix/Azure metrics; requires manual checking
Recommendation:
- Enhance DocX correlation with other monitoring tools for Clinical NOC visibility (unified dashboard)
- Define SLAs: Clinical NOC response time to DocX degradation (e.g., open incident + notify Optum Health within 15 minutes for critical, 30 minutes for high)
- Weekly sync between Clinical NOC, Optum Health, and Optum Insight on clinician experience trends
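
The response-time recommendation above can be expressed as a simple check. This is a minimal sketch that uses the example figures from the recommendation (15 minutes for critical, 30 minutes for high). Those figures are not agreed SLAs, and `RESPONSE_TARGETS` and `within_target` are hypothetical names.

```python
from datetime import datetime, timedelta

# Example targets taken from the recommendation; placeholders, not agreed SLAs.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(minutes=30),
}


def within_target(severity: str,
                  detected_at: datetime,
                  incident_opened_at: datetime,
                  optum_health_notified_at: datetime) -> bool:
    """True when BOTH actions (open incident, notify Optum Health) met the target."""
    deadline = detected_at + RESPONSE_TARGETS[severity]
    # The response is complete only once the later of the two actions is done.
    return max(incident_opened_at, optum_health_notified_at) <= deadline
```

Measuring against the later of the two timestamps reflects the wording "open incident + notify Optum Health": opening the incident on time does not satisfy the target if the notification is late.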

🟢 Optum Insight (Jordan Lambert & Team) - Business Ops / Technical Triage

Primary Focus: Epic Application Layer & Database Performance / Technical triage and bridge between clinical teams and internal technical teams

Responsibilities:

A/R for SystemPulse (Epic application monitoring including database/ODB Services)
A/R for Hyperspace Web (Epic web access)
INITIAL TECHNICAL TRIAGE for all clinician-reported issues from Help Desk
Receives technical escalations from Clinical NOC when DocX shows degradation (via ServiceNow incident)
Receives all escalations via formal ServiceNow incidents only - no informal/backchannel escalations
Bridge between clinical teams (Clinical NOC, Optum Health) and internal technical teams (Infrastructure Ops, Citrix, Cloud Ops)
Epic workflow validation and performance
Database performance monitoring and troubleshooting
Application-level troubleshooting and escalation
Coordinate with Optum Health for clinical impact communication (via ServiceNow incident updates)
Engage Infrastructure Ops, Citrix, and Cloud Ops when issues span multiple layers

Identified Gaps:

⚠️ Limited infrastructure visibility - SystemPulse doesn't see network/Citrix/Azure layer issues
⚠️ Integration gaps - Correlation between SystemPulse alerts and Citrix/network issues requires manual effort
⚠️ Triage process not fully documented - Need clear VBF driven runbooks for when to engage Infrastructure Ops vs. Citrix vs. Cloud Ops vs. multiple teams
Recommendation:
- Establish integration between SystemPulse and Splunk for cross-layer correlation

🟡 Citrix Team (James Hallowell & Team)

Primary Focus: Session delivery, VDI performance, Citrix infrastructure

Responsibilities:

A/R for Citrix Monitor / UberAgent
R for Citrix-related data flowing into Splunk
Session performance monitoring and troubleshooting
Citrix infrastructure health
Engaged by Optum Insight when session/VDI issues identified

Identified Gaps:

⚠️ Session-level only visibility - Limited correlation with app/network/infrastructure
⚠️ Splunk dashboard tuning - Manual tuning required; may not yet be optimized for Epic workflows
⚠️ Dynatrace OneAgent reservations - Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - creates gap in Dynatrace full-stack visibility into Citrix layer
Recommendation:
- Work with Infrastructure Ops to integrate Citrix metrics into unified Splunk dashboards
- Decision needed: Accept Dynatrace gap for Citrix layer OR negotiate alternative (e.g., Cloud Ops manages OneAgent on Citrix VMs, Citrix team provides access)

🟠 Infrastructure Ops (Randy & Team)

Primary Focus: Azure infrastructure, network, observability (excluding Dynatrace)

Responsibilities:

A/R for Azure Monitor / Azure Dashboards
A/R for Splunk (infrastructure logs and correlation)
A/R for Netscout (network visibility)
Infrastructure health and performance
Cross-platform correlation and troubleshooting
Engaged by Optum Insight when infrastructure/network issues identified

Identified Gaps:

⚠️ Splunk Linux Sys Logs missing - Gap in log coverage for Linux systems (being addressed before go-live with Fluent Bit)
⚠️ Netscout integration pending - Limited correlation with Epic/application context (tied to Citrix packets)
⚠️ Tool sprawl risk - Multiple monitoring tools may lead to alert fatigue and unclear ownership
Recommendation:
- Add Linux Sys Logs to Splunk ingestion
- Complete Netscout → Splunk integration

🔶 Cloud Ops (Tom Hudak & Team)

Primary Focus: Dynatrace full-stack APM / Platform automation / IaC

Responsibilities:

A/R for Dynatrace (full-stack APM)
A/R for GitHub Issues (platform automation, IaC, CI/CD pipelines)
Dynatrace deployment, configuration, and management
Full-stack observability and application performance monitoring
Platform automation and infrastructure-as-code
Engaged by Optum Insight when Dynatrace detects issues OR platform/IaC issues arise

Identified Gaps:

⚠️ Dynatrace deployment in progress - Not yet providing full monitoring capability; needs continued deployment and configuration
⚠️ Citrix OneAgent - Critical gap: Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - limits Dynatrace visibility into Citrix session layer; creates blind spot for full-stack APM
⚠️ Decision needed on Citrix gap - Need executive decision: accept Dynatrace blind spot for Citrix layer OR negotiate alternative (Cloud Ops manages OneAgent on Citrix VMs)
Recommendation:
- Continue Dynatrace deployment and Epic integration
- Escalate Citrix OneAgent decision to leadership - impacts Dynatrace ROI and full-stack visibility
- Clearly document current Dynatrace capabilities vs. post go-live roadmap

Escalation Matrix & Collaborative Support Model

🔄 Escalation Principles

Given the interconnected nature of Epic on Azure, the team that owns the monitoring tool detecting the issue becomes PRIMARY, and all other technical teams engage as SECONDARY to support correlation and troubleshooting.

Key Principles:

Help Desk routes all clinician-reported issues to Optum Insight for initial technical triage via ServiceNow
Clinical NOC (DocX owner) has defined escalation path: Opens ServiceNow incident + notifies Optum Health when DocX degrades; routes technical investigation to Optum Insight
ALL escalations follow formal ServiceNow incident creation and engagement process - no informal or backchannel escalations
Optum Insight acts as bridge between clinical teams (Clinical NOC, Optum Health) and internal technical teams (Infrastructure Ops, Citrix, Cloud Ops)
Primary team owns initial triage and coordinates response
All secondary teams (Optum Insight, Infrastructure Ops, Citrix, Cloud Ops) join bridge/war room for complex issues
Optum Health (Ken Cam) owns all external communication and clinical stakeholder management
Cross-layer correlation is expected - no team works in isolation

📊 Escalation Matrix

Issue Type	Entry Point	Primary Team (R/A)	Secondary Teams (C)	ServiceNow Assignment Group	Notes
Epic application performance	SystemPulse alert OR Help Desk	Optum Insight	Infra Ops, Citrix, Cloud Ops (if Dynatrace)	`Epic - Azure (National West)`	SystemPulse alerts (app or database); may require Azure/Citrix/Dynatrace correlation.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact.
Database performance (ODB)	SystemPulse database monitoring	Optum Insight	Infra Ops, Cloud Ops (if Dynatrace)	`Epic - Azure (National West)`	Database performance via SystemPulse; Azure SQL may require Infra Ops; Dynatrace context.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact.
Clinician-reported issue	Help Desk	Optum Insight (triage)	Infra Ops, Citrix, Cloud Ops	`Epic - Azure (National West)`	Help Desk routes to Optum Insight via ServiceNow.<br/>Clinical NOC: monitors DocX; if widespread, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms.
DocX clinician experience degradation	DocX dashboard (Clinical NOC)	Clinical NOC → opens ServiceNow + routes to Optum Insight (`Epic - Azure (National West)`)	Infra Ops, Citrix, Cloud Ops (by Optum Insight)	Route to Optum Insight	Path: Clinical NOC opens ServiceNow + notifies Optum Health (comms) AND Optum Insight (tech).<br/>No automation. Optum Health owns comms.
Citrix session issues	Citrix Monitor alert	Citrix Team	Optum Insight, Infra Ops, Cloud Ops (if Dynatrace - limited: no OneAgent)	`USS_Virtual_Workspace`	May be app, network, or Azure infra.<br/>Note: Dynatrace limited Citrix visibility (no OneAgent).<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health. Optum Health owns comms.
Azure infrastructure	Azure Monitor alert	Infrastructure Ops	Optum Insight, Citrix Team, Cloud Ops (if Dynatrace insights available)	`Epic_Azure_Infrastructure_Ops (Prod/NonProd)`	VM, network, storage issues; impacts all layers above.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact.
Network issues	Netscout alert	Infrastructure Ops	Optum Insight, Citrix Team, Cloud Ops (if Dynatrace insights available)	`Epic_Azure_Infrastructure_Ops (Prod/NonProd)`	Netscout alerts; affects app and session delivery.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact.
Hyperspace Web access	Hyperspace Web alert OR Help Desk	Optum Insight	Citrix, Infra Ops, Cloud Ops	`Epic - Azure (National West)`	Web access may be app, Citrix, or network.<br/>Clinical NOC: monitors DocX; if degraded + blocked, opens ServiceNow + notifies Optum Health. Optum Health comms.
Dynatrace	Dynatrace alert	Cloud Ops	Optum Insight, Infra Ops, Citrix	GitHub Issues OR `Epic_Azure_Infrastructure_Ops` (if production)	Full-stack APM; may indicate app or infra issue.<br/>Note: Dynatrace has limited Citrix visibility (no OneAgent on Citrix VMs).<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health. Optum Health comms.
Splunk	Splunk alert	Infrastructure Ops	Optum Insight, Citrix Team, Cloud Ops	`Epic_Azure_Infrastructure_Ops (Prod/NonProd)`	Cross-layer alerts require all teams to triage.<br/>Clinical NOC: monitors DocX; if degraded, opens ServiceNow + notifies Optum Health.<br/>Optum Health manages comms if clinical impact.
Cloud Ops / Platform issues	GitHub Issues	Cloud Ops	Infrastructure Ops, Optum Insight, Citrix Team	GitHub Issues	Platform-level issues, automation, IaC, CI/CD pipelines.<br/>Clinical NOC informed if affects DocX via ServiceNow.<br/>Optum Health manages comms if affects production.
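
For the rows with a single, unconditional assignment group, the matrix above reduces to a lookup table. The sketch below covers only those rows. The DocX, Dynatrace, and Cloud Ops / Platform rows are left out because their routing is conditional. `ESCALATION`, `DEFAULT_ROUTE`, and `route` are illustrative names, and the fallback to Optum Insight for unlisted issue types is an assumption based on its initial-triage role, not something the matrix states.

```python
# (primary team, ServiceNow assignment group) per issue type, from the matrix.
ESCALATION = {
    "Epic application performance": ("Optum Insight", "Epic - Azure (National West)"),
    "Database performance (ODB)": ("Optum Insight", "Epic - Azure (National West)"),
    "Clinician-reported issue": ("Optum Insight", "Epic - Azure (National West)"),
    "Hyperspace Web access": ("Optum Insight", "Epic - Azure (National West)"),
    "Citrix session issues": ("Citrix Team", "USS_Virtual_Workspace"),
    "Azure infrastructure": ("Infrastructure Ops", "Epic_Azure_Infrastructure_Ops (Prod/NonProd)"),
    "Network issues": ("Infrastructure Ops", "Epic_Azure_Infrastructure_Ops (Prod/NonProd)"),
    "Splunk": ("Infrastructure Ops", "Epic_Azure_Infrastructure_Ops (Prod/NonProd)"),
}

# Assumption: anything unlisted goes to Optum Insight for initial triage.
DEFAULT_ROUTE = ("Optum Insight", "Epic - Azure (National West)")


def route(issue_type: str) -> tuple[str, str]:
    """Return (primary team, assignment group) for an issue type."""
    return ESCALATION.get(issue_type, DEFAULT_ROUTE)
```

For example, `route("Network issues")` yields Infrastructure Ops and its assignment group, which mirrors the principle that the team owning the detecting tool becomes PRIMARY.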

📞 Standard Response Flow

Flow 1: Clinician-Reported Issue

flowchart TD
    A[Clinician reports issue to Help Desk] --> B[Help Desk creates ServiceNow incident]
    B --> C[Routes to Optum Insight<br/>Epic - Azure National West]
    C --> D[Optum Insight performs initial triage<br/>Checks: SystemPulse, Hyperspace Web, DB]
    D --> E{Issue spans<br/>multiple layers?}
    E -->|Yes| F[Creates child tasks for<br/>Infrastructure Ops/Citrix/Cloud Ops]
    F --> G[Opens war room if needed]
    E -->|No| H[Optum Insight continues investigation]
    D --> I[Clinical NOC monitors DocX<br/>for clinical impact]
    I --> J{DocX shows<br/>degradation?}
    J -->|Yes| K[Clinical NOC opens separate<br/>ServiceNow incident]
    K --> L[Clinical NOC notifies<br/>Optum Health]
    J -->|No| M[Continue monitoring]
    L --> N[Optum Health manages<br/>external communications]
    G --> O[All teams correlate data<br/>from their tools]
    H --> O
    O --> P[Primary team coordinates<br/>resolution]
    P --> Q[Optum Health communicates<br/>resolution to clinicians]
    Q --> R[Post-incident review]

Key Steps:

Triage: Optum Insight checks SystemPulse (app & database), Hyperspace Web
Parallel Monitoring: Clinical NOC monitors DocX for clinical impact patterns
Escalation (if needed):
- Multi-layer issue → Optum Insight creates child tasks for Infrastructure Ops/Citrix/Cloud Ops
- Clinical impact → Clinical NOC opens separate incident + notifies Optum Health
Coordination: All teams correlate data; Primary team leads resolution
Communication: Optum Health manages external communications throughout
Closure: Post-incident review with all teams

Flow 2: DocX Clinician Experience Degradation

flowchart TD
    A[Clinical NOC identifies<br/>degradation in DocX] --> B{Severity<br/>assessment}
    B --> C[Clinical NOC opens<br/>ServiceNow incident]
    C --> D[Routes to Optum Insight<br/>for technical triage]
    C --> E[Notifies Optum Health<br/>ServiceNow/Teams Channel]
    D --> F[Optum Insight performs<br/>technical triage]
    F --> G[Checks: SystemPulse, Hyperspace Web,<br/>DB monitoring]
    G --> H{Root cause<br/>identified?}
    H -->|Unclear| I[Creates child tasks for<br/>Infrastructure Ops/Citrix/Cloud Ops]
    H -->|Identified| J[Engages appropriate team]
    E --> K[Optum Health assesses<br/>communication needs]
    K --> L{Widespread<br/>impact?}
    L -->|Yes| M[Prepares external<br/>communications]
    L -->|No| N[Monitors situation<br/>with Clinical NOC]
    I --> O[All technical teams<br/>correlate data]
    J --> O
    O --> P[Clinical NOC continues<br/>monitoring DocX]
    P --> Q[Resolution coordinated<br/>by Optum Insight]
    M --> R[Optum Health manages<br/>external comms]
    N --> R
    R --> S[Optum Health communicates<br/>resolution]
    S --> T[Post-incident review]

Key Steps:

Detection: Clinical NOC identifies DocX degradation (manual monitoring)
Dual Escalation:
- Technical: Opens ServiceNow incident → Routes to Optum Insight
- Communications: Notifies Optum Health directly (Teams/Phone/Email)
Technical Investigation: Optum Insight triages → Creates child tasks as needed
Communication Assessment: Optum Health evaluates need for external communications
Continuous Monitoring: Clinical NOC continues DocX monitoring throughout
Resolution: Optum Insight coordinates technical fix
External Comms: Optum Health manages stakeholder updates
Closure: Post-incident review with all teams

Critical Notes:

✅ Defined escalation path exists - Process is mature and documented
✅ Dual notification ensures both technical response and external communications

Flow 3: Monitoring Tool Alert

flowchart TD
    A[Alert fires in<br/>monitoring tool] --> B[Automated ServiceNow<br/>incident created]
    B --> C[Assigned to tool owner's<br/>assignment group]
    C --> D[Primary team performs<br/>initial triage]
    D --> E{Impacts Epic<br/>application?}
    E -->|Yes| F[Engage Optum Insight<br/>via ServiceNow child task]
    E -->|No| G[Primary team continues<br/>investigation]
    F --> H[Clinical NOC monitors<br/>DocX for clinical impact]
    G --> H
    H --> I{DocX shows<br/>impact?}
    I -->|Yes| J[Clinical NOC opens separate<br/>ServiceNow incident]
    J --> K[Clinical NOC notifies<br/>Optum Health]
    I -->|No| L[Continue monitoring]
    K --> M[Optum Health manages<br/>external communications]
    D --> N{Cross-layer<br/>issue?}
    N -->|Yes| O[Primary team creates child tasks<br/>for secondary teams]
    N -->|No| P[Primary team owns resolution]
    O --> Q[All teams correlate data]
    P --> Q
    Q --> R[Primary team coordinates<br/>resolution]
    M --> S[Optum Health communicates<br/>resolution if clinical impact]
    R --> S
    S --> T[Post-incident review]

Key Steps:

Automated Detection: Monitoring tool alert → ServiceNow incident created
Assignment: Auto-routed to tool owner's assignment group
Initial Triage: Primary team (tool owner) investigates
Epic Impact Check: If Epic-related → Engage Optum Insight via child task
Parallel Monitoring: Clinical NOC monitors DocX for clinical impact
Cross-Layer Assessment: Create child tasks if issue spans multiple teams
Correlation: All teams share data from their tools
Resolution: Primary team coordinates; Optum Health handles external comms if needed
Closure: Post-incident review
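
The three decision points in Flow 3 (Epic impact, cross-layer scope, DocX impact) are independent of each other, so the flow can be read as a function from three yes/no answers to an ordered action list. The sketch below is illustrative only. `flow3_actions` is a hypothetical name, and the action strings paraphrase the flowchart nodes.

```python
def flow3_actions(impacts_epic: bool,
                  cross_layer: bool,
                  docx_shows_impact: bool) -> list[str]:
    """Ordered actions for a monitoring-tool alert, following Flow 3."""
    actions = [
        "Automated ServiceNow incident assigned to tool owner's group",
        "Primary team performs initial triage",
    ]
    if impacts_epic:
        actions.append("Engage Optum Insight via ServiceNow child task")
    if cross_layer:
        actions.append("Primary team creates child tasks for secondary teams")
    if docx_shows_impact:
        # Clinical NOC acts in parallel with the technical teams.
        actions.append("Clinical NOC opens separate ServiceNow incident")
        actions.append("Clinical NOC notifies Optum Health")
        actions.append("Optum Health manages external communications")
    actions.append("Primary team coordinates resolution")
    actions.append("Post-incident review")
    return actions
```

Writing the flow this way makes it easy to see that Optum Health is engaged only when DocX shows clinical impact, while the primary team's triage and the post-incident review happen on every path.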

Critical Gaps & Recommendations

🔴 High Priority Gaps

Dynatrace OneAgent on Citrix VMs - Critical Decision Needed
- Gap: Citrix team does not want to add/manage OneAgent on their VMs in addition to UberAgent - creates significant blind spot for Dynatrace full-stack APM visibility into Citrix session layer
- Impact:
  - Dynatrace cannot provide full-stack observability including Citrix layer
  - May miss performance issues that span application → Citrix → infrastructure
  - Troubleshooting requires manual correlation between Dynatrace (app/infra) and UberAgent/Citrix Monitor (session)
- Owner: Cloud Ops (Tom) + Citrix Team (James Hallowell) + Executive Decision
- Options:
  - Option 1: Accept gap - Dynatrace monitors app/infra only; Citrix layer remains UberAgent/Citrix Monitor only (status quo)
  - Option 2: Cloud Ops manages OneAgent on Citrix VMs; Citrix team provides access but not ongoing management
  - Option 3: Negotiate limited OneAgent deployment on subset of Citrix VMs for correlation testing
  - Option 4: Delay Dynatrace full deployment until Citrix team agreement OR alternative solution found
- Recommendation: TBD - Decision needed before go-live
- Timeline: Decision needed before go-live
- Workaround: Azure Monitor / Dashboards provides a complete profile and high-level visibility into VM health
Splunk Linux Sys Logs Missing
- Gap: Linux system logs not currently ingested into Splunk
- Impact: Blind spot for Linux-based infrastructure issues; incomplete log correlation
- Owner: Infrastructure Ops (Randy)
- Recommendation: Prioritize Linux Sys Log ingestion into Splunk before go-live
- Timeline: Before go-live
- Workaround: Azure Monitor / Dashboards provides a complete profile and high-level visibility into VM health
Dynatrace Deployment Completion
- Gap: Dynatrace deployment in progress; not yet providing full performance monitoring capability (within acknowledged scope given Citrix gap)
- Impact: Limited full-stack observability; may miss performance issues that span multiple layers
- Owner: Cloud Ops (Tom)
- Recommendation:
  - Continue Dynatrace deployment and Epic integration
  - Clearly document current capabilities vs. post go-live roadmap
  - Clearly document Citrix blind spot in all Dynatrace documentation
  - Define when issues escalate from Infrastructure Ops/Optum Insight to Cloud Ops (Dynatrace alerts vs. other tool alerts)
- Timeline: Continue through go-live; full deployment (minus Citrix gap) prior to go-live
- Workaround: Azure Monitor / Dashboards provides a complete profile and high-level visibility into VM health
Optum Insight Triage Process & Runbooks
- Gap: Triage process for clinician-reported issues not fully documented; decision tree for when to engage Infrastructure Ops vs. Citrix vs. Cloud Ops vs. multiple teams needs clarity; multiple escalation sources (Help Desk, Clinical NOC - all via ServiceNow)
- Impact: Delays in engaging right teams; potential for missed cross-layer root causes; confusion on prioritization; unclear when to engage Cloud Ops (Dynatrace/platform issues)
- Owner: Optum Insight (Jordan Lambert)
- Recommendation:
  - Create triage decision tree/flowchart for Optum Insight team including Cloud Ops engagement criteria
  - Document runbooks for common issue patterns (including database performance issues)
  - Establish clear priority: P1 incidents always take priority regardless of source (Help Desk, Clinical NOC)
  - Define SLAs for initial triage
  - Define when to engage Cloud Ops: Dynatrace alerts, platform/IaC issues, need for full-stack APM insights
- Timeline: Before go-live
Netscout Integration
- Gap: Integration with Epic/Netcarriers pending; limited application context in network data
- Impact: Network issues may not be automatically correlated with app/session performance degradation visible to Optum Insight or Clinical NOC
- Owner: Infrastructure Ops (Randy)
- Recommendation: Prioritize Netscout → Splunk integration for automated network event correlation
- Timeline: Before go-live
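
The Optum Insight triage recommendation above states that P1 incidents take priority regardless of source. One way to encode that rule is to sort the triage queue by priority and then by age, leaving the source out of the sort key. This is a minimal sketch. `Incident` and `triage_order` are hypothetical names, and the incident numbers in any usage are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    number: str          # illustrative identifier, not a real ticket
    priority: int        # 1 = P1 (highest), 2 = P2, ...
    source: str          # "Help Desk" or "Clinical NOC"
    opened_at: datetime


def triage_order(queue: list[Incident]) -> list[Incident]:
    """Order by priority first, then oldest first; source is deliberately ignored."""
    return sorted(queue, key=lambda inc: (inc.priority, inc.opened_at))
```

Because `source` is not part of the sort key, a P1 from the Help Desk is handled ahead of an older P2 from Clinical NOC, and the reverse holds as well.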

🟡 Medium Priority Gaps

Cross-Tool Correlation
- Gap: Limited correlation between SystemPulse (app & database), Citrix Monitor (session), Netscout (network), Azure Monitor (infrastructure), Dynatrace (full-stack - with Citrix gap), and DocX (clinical impact); support teams must manually check multiple tools
- Impact: Troubleshooting requires manual correlation across multiple tools by Optum Insight and other teams, slowing MTTR; Clinical NOC may see DocX degradation before technical teams see alerts; Dynatrace insights may be siloed in Cloud Ops
- Owner: Infrastructure Ops (Randy) - Splunk owner + Clinical NOC (Tom Busse) - DocX owner + Cloud Ops (Tom) - Dynatrace owner
- Recommendation:
  - Use Splunk as correlation engine; integrate all tool data including DocX metrics and Dynatrace insights into unified dashboards
  - Provide Clinical NOC access to unified dashboard for visibility into technical monitoring (helps with triage decision)
  - Provide Optum Insight access to unified dashboard for rapid triage including Dynatrace data
  - Define how Dynatrace data flows to Splunk for correlation
  - Create correlation views: when Clinical NOC sees DocX degradation, can quickly see if SystemPulse/Citrix/Azure/Dynatrace also showing issues
- Timeline: Initial dashboards before go-live; advanced correlation post go-live
Alert Tuning & Noise Management
- Gap: Multiple tools generating alerts; tuning still in progress across all platforms
- Impact: Alert fatigue for Optum Insight (primary triage team) and Clinical NOC (DocX monitoring); risk of missed critical alerts; ServiceNow ticket volume may overwhelm; Dynatrace alerts may add to noise
- Owner: All teams (each for their tools) + Clinical NOC (DocX) + Cloud Ops (Dynatrace)
- Recommendation:
  - Establish alert taxonomy and severity standards across all tools including Dynatrace
  - Weekly alert tuning sessions during first month post go-live
  - Define alert suppression rules for known maintenance windows
  - Configure ServiceNow auto-assignment rules
  - Define Dynatrace alert routing: GitHub Issues vs. ServiceNow based on severity/impact
  - Note: DocX alerts remain manual (Clinical NOC judgment); no automated ticketing
- Timeline: Ongoing through go-live and first 30 days
DocX Integration with Technical Monitoring
- Gap: Limited correlation between DocX clinician experience and SystemPulse/Citrix/Azure/Dynatrace metrics; individual teams must manually check each tool
- Impact: Triage may take longer overall; may miss technical root cause signals
- Owner: Clinical NOC (Tom Busse) + Optum Insight (Jordan Lambert) + Infrastructure Ops (Randy - Splunk) + Cloud Ops (Tom - Dynatrace)
- Recommendation:
  - Create a dedicated page on epic.optum.com, accessible to all support teams, that links to the DocX, SystemPulse, Citrix, Azure, and Dynatrace dashboards in one place
  - Enhance Clinical NOC's ability to quickly correlate DocX degradation with technical signals including Dynatrace insights
- Timeline: Post go-live enhancement (within 60 days)
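
Two of the alert-tuning recommendations above (suppression during known maintenance windows, and Dynatrace alert routing between GitHub Issues and ServiceNow) can be combined into one small routing function. The sketch below uses the only routing rule the escalation matrix gives, which is that production alerts go to `Epic_Azure_Infrastructure_Ops`. The severity/impact criteria are still to be defined and are not modelled here. `in_maintenance` and `route_dynatrace_alert` are hypothetical names.

```python
from datetime import datetime

Window = tuple[datetime, datetime]  # (start, end) of a maintenance window


def in_maintenance(ts: datetime, windows: list[Window]) -> bool:
    """True when the timestamp falls inside any window (end is exclusive)."""
    return any(start <= ts < end for start, end in windows)


def route_dynatrace_alert(ts: datetime,
                          is_production: bool,
                          windows: list[Window]) -> str:
    """Pick a destination for a Dynatrace alert, or suppress it."""
    if in_maintenance(ts, windows):
        return "suppressed"
    if is_production:
        return "ServiceNow: Epic_Azure_Infrastructure_Ops"
    return "GitHub Issues"
```

Checking the maintenance window first keeps planned work from generating tickets in either system, which is the point of the suppression recommendation.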

Pre Go-Live Action Items

🔴 Critical Before Go-Live (Must Complete)

🟡 Important Before Go-Live (High Priority)

Splunk: Create unified dashboards pulling from SystemPulse (app & database), Citrix Monitor, Azure Monitor, Netscout, Dynatrace (noting Citrix gap), and DocX - accessible to Clinical NOC for correlation AND Optum Insight for triage (Owner: Infrastructure Ops + Clinical NOC + Cloud Ops)
Cloud Ops + Infrastructure Ops: Define how Dynatrace data flows to Splunk for correlation (API integration, log forwarding, etc.) (Owner: Cloud Ops + Infrastructure Ops)
SystemPulse: Tune Epic and database alerts for Azure environment baselines vs. on-prem (Owner: Optum Insight)
All teams: Define and publish SLAs: MTTD (Mean Time To Detect), MTTI (Investigate), MTTR (Resolve); include Clinical NOC response time to DocX degradation; include Cloud Ops response time to Dynatrace alerts (Owner: All teams)
Communication: Set up stakeholder notification lists and status page for go-live - managed by Optum Health (Owner: Optum Health)
Optum Health: Develop communication playbook with pre-drafted templates for various incident scenarios (P1, P2, resolution notices, CDO updates) (Owner: Optum Health)
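
Publishing MTTD, MTTI, and MTTR requires agreeing on which timestamps each metric spans. The sketch below assumes one possible set of definitions: MTTD runs from issue start to detection, MTTI from detection to the start of investigation, and MTTR from issue start to resolution. These definitions, and the names `IncidentTimes` and `sla_metrics`, are assumptions for illustration and should be replaced by whatever the teams agree on.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class IncidentTimes:
    started_at: datetime                # when the issue began
    detected_at: datetime               # alert fired or DocX degradation seen
    investigation_started_at: datetime  # primary team began triage
    resolved_at: datetime


def _mean_minutes(deltas) -> float:
    """Average a sequence of timedeltas and express it in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def sla_metrics(incidents: list[IncidentTimes]) -> dict[str, float]:
    """Mean detect / investigate / resolve times in minutes (assumed definitions)."""
    return {
        "MTTD": _mean_minutes(i.detected_at - i.started_at for i in incidents),
        "MTTI": _mean_minutes(i.investigation_started_at - i.detected_at for i in incidents),
        "MTTR": _mean_minutes(i.resolved_at - i.started_at for i in incidents),
    }
```

Note that `statistics.mean` raises `StatisticsError` on an empty list, so a reporting job would need to handle periods with no incidents.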

🟢 Post Go-Live Improvements (30-90 days)

Appendix: Contact Information & ServiceNow Assignment Groups

ServiceNow Engagement Model

All monitoring alerts and incidents must be routed through ServiceNow using the appropriate assignment group.
Help Desk routes all clinician-reported Epic issues to Optum Insight for initial technical triage via ServiceNow.
ALL escalations follow formal ServiceNow incident creation and engagement process - no informal or backchannel escalations.
ServiceNow is the single source of truth for all Epic on Azure incidents.

ServiceNow Assignment Groups

Team	ServiceNow Assignment Group	Scope / Responsibilities	Tool Ownership	Primary Contact Type
Business Ops (Epic App DBA) <br/>aka Optum Insight	`Epic - Azure (National West)`	Triage clinician issues; escalations from NOC;<br/>Epic app, SystemPulse, Hyperspace Web;<br/>bridge: clinical (NOC, Health) & tech (Infra, Citrix, Cloud)	SystemPulse, Hyperspace Web	Help Desk routes via ServiceNow;<br/>NOC routes DocX via ServiceNow;<br/>ALL via ServiceNow
Citrix Team	`USS_Virtual_Workspace`	Citrix infrastructure, VDI performance, session delivery, Citrix Monitor / UberAgent;<br/>Note: Does not manage Dynatrace OneAgent on Citrix VMs	Citrix Monitor / UberAgent	Engaged by Optum Insight via ServiceNow or direct alert
Azure Platform Ops	`Epic_Azure_Infrastructure_Ops (Prod/NonProd)`	Azure infrastructure, Azure Monitor, Netscout, Splunk, network	Azure Monitor, Netscout, Splunk, Azure Dashboards	Engaged by Optum Insight via ServiceNow or direct alert
Clinical NOC	N/A (escalates to others)	DocX monitoring; opens ServiceNow incident and escalates to Optum Insight (technical) and notifies Optum Health (communications) when DocX degrades	DocX	NOC monitoring; defined escalation path via ServiceNow - no automated ticketing