Drill Into an Anomaly
- View the Anomalies tab.
- Double-click an anomaly to open the detailed view.
Initially, the page describes everything that is occurring at the anomaly's Start Time. To review how things change later in the anomaly's lifecycle, click events further along its timeline.
Examine the Anomaly Description
The anomaly description reports the Business Transaction that the anomaly affects, the severity level of the selected state transition event, and the top deviating Business Transaction metrics.
In this example, these are:
- Business Transaction: /r/Checkout
- Severity Level: Critical
- Top Deviating Metrics: Average Response Time
The deviating metric is Average Response Time, which indicates that the problem is Checkout responding slowly.
Examine the Timeline
The state transition events mark the moments when the anomaly moves between Warning and Critical states.
- The timeline in this example begins in the Critical state, followed 30 minutes later by a transition to the Warning state, which lasts only eight minutes.
- Because this simple anomaly starts in the Critical state and remains there for most of its lifecycle, we can probably learn all we need to know from the initial event.
By contrast, patterns that appear in more complicated timelines may help you to understand anomalies. For example, this timeline from a different anomaly repeatedly toggles from a brief Warning state to a longer Critical state:
In this case, you should examine several state change events to determine what clues toggling between states offers about problems in your application.
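To compare timelines like these, it can help to total how long an anomaly spends in each state. The following is a minimal Python sketch of that arithmetic. The `time_in_states` function, the event list, and the timestamps are hypothetical illustrations, not data exported from the product. The durations mirror the simple example above: 30 minutes in Critical, then eight minutes in Warning.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def time_in_states(events, end_time):
    """Sum the time spent in each state.

    events: list of (timestamp, state) tuples sorted by timestamp
            (hypothetical representation of state transition events).
    end_time: the moment the anomaly ended.
    """
    totals = defaultdict(timedelta)
    # Each state lasts until the next transition, or until the anomaly ends.
    boundaries = [timestamp for timestamp, _ in events[1:]] + [end_time]
    for (start, state), stop in zip(events, boundaries):
        totals[state] += stop - start
    return dict(totals)


# Hypothetical timestamps chosen to match the simple timeline in the text.
anomaly_start = datetime(2024, 1, 1, 10, 0)
events = [
    (anomaly_start, "Critical"),
    (anomaly_start + timedelta(minutes=30), "Warning"),
]
anomaly_end = anomaly_start + timedelta(minutes=38)

durations = time_in_states(events, anomaly_end)
for state, duration in durations.items():
    print(f"{state}: {duration}")
```

For a timeline that toggles repeatedly, the same function accumulates every Warning and Critical interval, which makes it easy to see which state dominates.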
Examine the Flow Map
The example flow map shows the following:
- The START label shows that the Business Transaction begins with the OrderService tier.
- Between the OrderService tier and its numerous dependencies, two tiers are red—these are the tiers where the system has found Suspected Causes.
You can now focus on determining which of the red tiers contains the root cause of the anomaly.
Examine the Top Suspected Causes
The Top Suspected Causes show likely root causes of a Business Transaction performance problem. You can traverse up to the following entities in the call paths to find the root cause of the anomaly:
- Services, such as payment service and order service
- Backends, such as database backend and HTTP backend
- Cross-applications
- Infra machine entity-server
In the following example, we want to know why Checkout is responding slowly. The first Suspected Cause is a Process CPU Burnt issue on the eretail.prod.payment01_1 node of the PaymentService1 tier:
Hover over the Suspected Cause to highlight the relevant entities in the flow map. Everything but the critical path fades away, revealing that OrderService, where the Business Transaction starts and which had a degraded response time, relies on PaymentService1:
The second Suspected Cause is an HTTP call on OrderService itself.
Hover to highlight the affected entities.
Which Suspected Cause is the root cause? Which is only a symptom of the overall problem?
- We have a plausible root cause in the Process CPU Burnt issue on the PaymentService1 tier, which is ranked likeliest by the system.
- Meanwhile, the HTTP call on OrderService bears some analysis:
  - An HTTP call includes both a request and a response.
  - We know that the tier on the other end, PaymentService1, has its own problem.
  - Therefore, we can infer that the HTTP response from PaymentService1 is what makes the call slow.
Now we see that both Suspected Causes originate with PaymentService1, and the HTTP call issue is really a side-effect of the Process CPU Burnt issue. The system's ranking makes sense.
As we continue to investigate, if we decide that the Process CPU Burnt issue is not the root cause, we can reconsider the HTTP call.
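The reasoning used to separate the root cause from the symptom can be sketched as a simple rule: a slow call is likely a symptom when the tier it calls has a problem of its own. The Python below is only an illustration of that rule under stated assumptions. The `SuspectedCause` class, its fields, and the `classify` function are hypothetical and do not reflect how the system actually ranks Suspected Causes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SuspectedCause:
    """Hypothetical record of one Suspected Cause."""
    rank: int
    tier: str                         # tier where the issue was observed
    issue: str
    calls_tier: Optional[str] = None  # downstream tier, for call-type issues


def classify(causes):
    """Label each cause as a root-cause candidate or a likely symptom."""
    # Tiers that have a problem of their own, not a slow call to another tier.
    tiers_with_own_issue = {c.tier for c in causes if c.calls_tier is None}
    labels = {}
    for cause in causes:
        if cause.calls_tier in tiers_with_own_issue:
            # The call is slow because the tier on the other end is struggling.
            labels[cause.issue] = "likely symptom"
        else:
            labels[cause.issue] = "root cause candidate"
    return labels


# The two Suspected Causes from the example in the text.
causes = [
    SuspectedCause(1, "PaymentService1", "Process CPU Burnt"),
    SuspectedCause(2, "OrderService", "HTTP call", calls_tier="PaymentService1"),
]

labels = classify(causes)
for issue, label in labels.items():
    print(f"{issue}: {label}")
```

If the Process CPU Burnt issue were later ruled out and removed from the list, the HTTP call would no longer point at a tier with its own issue, so the same rule would return it as a candidate again.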