Context
A Pega Infinity 23 platform running on-premises had limited visibility into its own behavior: performance issues surfaced as user complaints, and root-cause analysis meant log archaeology. I led the implementation of Dynatrace APM across the full stack.
Challenge
Pega platforms are layered — browser, Tomcat, the Pega rules engine, database — and a slow screen can originate in any of them. Without end-to-end tracing, every incident started from zero.
What I did
- Instrumented every tier: Dynatrace OneAgent across all Pega application components, real-time data acquisition on Apache Tomcat instances, custom dashboards for Pega-specific metrics (framework functions, HTTP request flows).
- Resource monitoring & alerting: CPU, memory and disk utilization tracking with automated threshold alerts — moving the team from reactive to proactive.
- Distributed tracing: end-to-end PurePaths tracing of user actions across all tiers, mapping data flow between the application and database layers to pinpoint bottlenecks.
- Frontend truth: Session Replay for user-behavior analysis and JavaScript error tracking for issues that only exist in the browser.
Outcome
- Unplanned downtime down 65% — threshold alerts caught resource exhaustion before users did.
- MTTR down 25% — tracing replaced guesswork in incident response.
- 100% performance visibility — every request, from click to SQL, is observable.
This project is also where platform work meets data work: monitoring is time-series analysis, alerting is threshold modeling, and dashboard design is data storytelling — the same instincts I now apply to decisioning analytics.