PagerDuty Chair Highlights AI Agent Failure Risks

PagerDuty Executive Chair Jenn Tejada told Forbes on July 2, 2026 that AI is shifting from experimental use into production and warned that agentic systems introduce failure modes like model drift, which is harder to detect than a traditional software crash because symptoms surface only after an agent has already taken multiple flawed actions. She pointed to record hyperscaler AI infrastructure spending, estimated near $725 billion for 2026 according to BNP Paribas, as evidence of how fast this shift is happening, and argued that AIOps platforms need to monitor AI agents alongside conventional infrastructure so humans can intervene before a small failure compounds into an outage like the one that hit AWS in October 2025. For engineering and SRE teams building agentic workflows, her core point is practical: instrument for drift now, because agent failures do not announce themselves the way a crashed service does.
The operational-risk profile of AI is changing as more agentic systems move into production, and PagerDuty's Executive Chair frames the shift through a specific engineering problem: agent and model drift show up differently than a conventional software crash, so the observability stack most teams already run will not catch it on its own. That gap, not the AI itself, is the practical thing SRE and ML platform teams need to solve before scaling agentic workflows.
What happened
Jenn Tejada, Executive Chair and former CEO of PagerDuty, told Forbes contributor Martine Paris in an interview published July 2, 2026 that AI is shifting "from the experimentation phase to production," pointing to a wave of infrastructure investment that includes Meta's data center buildout and workforce training programs. She cited hyperscaler AI capital spending estimated at $725 billion for 2026, according to BNP Paribas, roughly double the prior year's estimate. Tejada, who took PagerDuty public in 2019, said the shift will produce smaller "one and two pizza" engineering teams alongside more complex systems, and with them a higher risk of the kind of cascading failure seen in the AWS outage of October 2025.
Technical context
Tejada described model and agent drift as a failure mode that is harder to see than a service that simply stops working: "when AI drifts, it's actually harder to see, and you don't see until it's executed that drift in a number of ways, and now it's evolved into multiple failures," she told Forbes. That matches how the wider AIOps and observability field is treating drift in 2026: rather than a discrete outage, drift is a gradual accuracy or behavior degradation that traditional uptime and latency monitoring will miss, which is why AIOps vendors increasingly frame the discipline as correlating model-level signals (accuracy, output distribution) with conventional infrastructure telemetry instead of treating them as separate monitoring domains.
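One way to picture the "output distribution" signal described above is a drift score that compares an agent's recent decisions against a baseline window. The sketch below is illustrative only: it uses the Population Stability Index (PSI), a common drift heuristic that the interview does not mention, and the function name `psi`, the decision labels, and the 0.25 alert threshold are all assumptions rather than anything PagerDuty or Tejada describes.

```python
import math
from collections import Counter


def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two samples of categorical outputs.

    0.0 means the two distributions match; larger values mean more drift.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    b_counts, c_counts = Counter(baseline), Counter(current)
    score = 0.0
    for category in set(baseline) | set(current):
        # Floor each share at eps so an unseen category does not produce log(0).
        b = max(b_counts[category] / len(baseline), eps)
        c = max(c_counts[category] / len(current), eps)
        score += (c - b) * math.log(c / b)
    return score


# Hypothetical decision logs: last month's baseline versus the most recent window.
baseline_decisions = ["approve"] * 80 + ["escalate"] * 20
recent_decisions = ["approve"] * 50 + ["escalate"] * 50

drift_score = psi(baseline_decisions, recent_decisions)
# 0.25 is a widely used rule-of-thumb cutoff, treated here as an assumption.
drift_alert = drift_score > 0.25
```

The agent in this example never crashes or slows down, so uptime and latency checks stay green while the mix of decisions shifts. A score like this would be one model-level signal to correlate with infrastructure telemetry, not a complete detector.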
For practitioners
Tejada's framing points to a concrete requirement: teams running agentic systems need something that watches agent behavior continuously and can interrupt or pause it when outputs look wrong, not just alert after a downstream service fails. In practice that means extending existing incident-response tooling with agent-specific telemetry (decisions, tool calls, confidence signals) plus a human-in-the-loop override path, so a drifting agent gets flagged and paused before it compounds into the kind of multi-system failure Tejada references.
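As a rough illustration of the telemetry-plus-override pattern described above, the following sketch records each agent action with a confidence value and pauses the agent once too many recent actions fall below a threshold, after which only an explicit human sign-off resumes it. Everything here is hypothetical: the `AgentGuard` class, its field names, and the default thresholds are assumptions for illustration, not a PagerDuty API or anything specified in the interview.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentGuard:
    """Hypothetical guard that pauses an agent when recent confidence degrades."""

    window: int = 20                  # number of recent actions to evaluate
    min_confidence: float = 0.6       # below this, an action counts as low confidence
    max_low_conf_ratio: float = 0.3   # pause once this share of the window is low
    paused: bool = False
    resumed_by: list = field(default_factory=list)
    _recent: deque = field(default_factory=deque)

    def record(self, tool: str, confidence: float) -> bool:
        """Log one tool call; return True if the agent may keep acting."""
        if self.paused:
            return False
        self._recent.append((tool, confidence))
        if len(self._recent) > self.window:
            self._recent.popleft()
        # Only judge once a full window of actions has been observed.
        if len(self._recent) == self.window:
            low = sum(1 for _, c in self._recent if c < self.min_confidence)
            if low / self.window > self.max_low_conf_ratio:
                self.paused = True
        return not self.paused

    def resume(self, approved_by: str) -> None:
        """Human-in-the-loop override: clear the window and let the agent continue."""
        self.resumed_by.append(approved_by)
        self._recent.clear()
        self.paused = False


guard = AgentGuard(window=5)
results = [
    guard.record(tool, conf)
    for tool, conf in [
        ("search", 0.9), ("search", 0.8), ("refund", 0.9),
        ("refund", 0.4), ("refund", 0.3),
    ]
]
```

In this run the fifth action pushes the low-confidence share to 2 of 5, above the 0.3 limit, so the guard pauses the agent and rejects further actions until someone calls `guard.resume(...)`. A real deployment would route the pause into existing incident tooling instead of holding state in memory.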
What to watch
Watch for AIOps and observability vendors to keep expanding beyond infrastructure-only monitoring into model- and agent-level signals, and for more public postmortems of agent-driven incidents that would validate, or complicate, Tejada's model-drift framing with real failure data rather than a single executive's characterization.
Key Points
- PagerDuty's Jenn Tejada says AI agent failures surface as gradual model drift rather than a clean crash, delaying detection until damage compounds.
- Hyperscaler AI capex is estimated near $725 billion for 2026, nearly double last year's spending, according to BNP Paribas figures Tejada cited.
- Practitioners need agent-level telemetry and human override paths in incident tooling, not just infrastructure alerts, to catch drift before it cascades.
Scoring Rationale
A single, well-sourced Forbes interview in which PagerDuty's Executive Chair articulates a concrete, verifiable operational-risk framework (model drift, AIOps, human-in-the-loop) for agentic AI in production, corroborated by real hyperscaler capex figures; genuinely useful for SRE and ML platform practitioners, but it is one executive's characterization in an interview format rather than multi-sourced reporting on a new development, so it lands at the lower end of notable.


