Troubleshooting Anomalies
To demonstrate techniques for using Anomaly Detection and Automated Root Cause Analysis effectively, this example follows an anomaly from the moment it surfaces until its root cause is confirmed. The troubleshooting process can begin with any of the multiple ways to view anomalies. This example assumes that you start from the Anomalies tab.
Drill Into an Anomaly
- In the Anomalies tab, view the list of anomalies.
- Double-click an anomaly to open the detailed view.
Initially, the page describes everything that is occurring during the anomaly's Start Time. To review how things change later in the anomaly's lifecycle, click events further along its timeline.
Examine the Anomaly Description
The anomaly description identifies the affected Business Transaction, the severity level of the selected state transition event, and the top deviating Business Transaction metrics.
In this example, these are:
- Business Transaction: /r/Checkout
- Severity Level: Critical
- Top Deviating Metrics: Average Response Time
The deviating metric is Average Response Time, indicating that slow checkout responses are the problem.
Examine the Timeline
The state transition events mark the moments when the anomaly moves between Warning and Critical states.
- The timeline in this example begins in the Critical state, followed 30 minutes later by a transition to the Warning state, which lasts only eight minutes.
- Because this simple anomaly starts in the Critical state and remains there for most of its lifecycle, we can probably learn all we need to know from the initial event.
By contrast, patterns that appear in more complicated timelines may help you understand anomalies. For example, this timeline from a different anomaly repeatedly toggles from a brief Warning state to a longer Critical state:
In this case, you should examine several state change events to determine what clues toggling between states offers about problems in your application.
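When reasoning about a toggling timeline, it can help to total the time the anomaly spends in each state. The following Python sketch does this for a hypothetical list of state transition events; the timestamps and states are invented for illustration and are not taken from the product.

```python
# Hypothetical sketch: total the time an anomaly spends in each state,
# given its state transition events. The times and states are made up.
from datetime import datetime

transitions = [
    ("07:45", "CRITICAL"),
    ("08:15", "WARNING"),
    ("08:23", "CRITICAL"),
    ("08:41", "WARNING"),
    ("08:49", "NORMAL"),  # anomaly resolves
]

durations = {}
for (start, state), (end, _next_state) in zip(transitions, transitions[1:]):
    t0 = datetime.strptime(start, "%H:%M")
    t1 = datetime.strptime(end, "%H:%M")
    minutes = (t1 - t0).total_seconds() / 60
    durations[state] = durations.get(state, 0) + minutes

for state, minutes in durations.items():
    print(f"{state}: {minutes:.0f} min")  # e.g. CRITICAL: 48 min, WARNING: 16 min
```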
Examine the Flow Map
In the example flow map:
- The START label shows that the Business Transaction begins with the OrderService tier.
- Between the OrderService tier and its numerous dependencies, two tiers are red—these are the tiers where the system has found Suspected Causes.
You can now focus on determining which of the red tiers contains the root cause of the anomaly.
Examine the Top Suspected Causes
The Top Suspected Causes show likely root causes of a Business Transaction performance problem. You can traverse up to the following entities in the call paths to find the root cause of the anomaly (a simple illustration follows this list):
- Services, such as a payment service or order service
- Backends, such as a database backend or HTTP backend
- Cross-applications
- Infrastructure machine entities, such as a server
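As a rough illustration of what traversing a call path means (not the product's actual logic), the following Python sketch walks a hypothetical call path and flags entities whose key metric falls outside its Expected Range. The entity names, metrics, and ranges here are assumptions for illustration.

```python
# Hypothetical sketch: flag call-path entities whose key metric is outside
# its expected range. Entities, metrics, and ranges are invented examples.
call_path = [
    {"entity": "ApacheWebServer",    "metric": "Errors per Minute",          "value": 2.0,  "expected": (0, 5.0)},
    {"entity": "frontend15novauto",  "metric": "Errors per Minute",          "value": 16.0, "expected": (0, 13.5)},
    {"entity": "order-db (backend)", "metric": "Average Response Time (ms)", "value": 40.0, "expected": (0, 120.0)},
]

suspected_causes = [
    e for e in call_path
    if not (e["expected"][0] <= e["value"] <= e["expected"][1])
]

for e in suspected_causes:
    print(f"Suspected cause: {e['entity']} "
          f"({e['metric']} = {e['value']}, expected {e['expected'][0]}-{e['expected'][1]})")
```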
In the following example, we want to know why the business transaction /order is throwing an error. The first suspected cause is a front-end issue on frontend15novauto.
Hover over the Suspected Cause to highlight the relevant entities in the flow map. Everything but the critical path fades away, revealing that ApacheWebServer, where the Business Transaction starts, relies on frontend15novauto, which had an anomaly in the Errors per Minute metric.
Drill Into a Suspected Cause
Click More Details for the Suspected Cause to review:
- Simplified timeline
- Metrics graphed over time
Two types of graphed metrics display:
- Top Deviating Metrics for the Business Transaction
- Suspected Cause Metrics
Examine Top Deviating Metrics for the Business Transaction
Deviating Business Transaction metrics can indicate why an anomaly was important enough to surface. (The system does not surface anomalies for every transitory or slight deviation in metrics. Such anomalies would be of dubious value, since their customer impact is minimal. For the same reason, anomalies are not surfaced for Business Transactions that have a CPM of under 20.)
Each deviating metric is shown as a thin blue line (the metric's value) against a wide gray band (the metric's Expected Range).
You can:
- Scroll along the graph to compare a metric’s value with its Expected Range at any time point.
- Hover over a time point to view the metric's value and Expected Range in numerical form.
In this example:
- The deviating metric's (Errors per Minute) spike remained elevated for about 50 minutes (7:45 AM to 8:35 AM), then subsided back into the Expected Range.
- Six minutes after the metric returned to its Expected Range, the Severity Level changed from Critical to Warning, and twelve minutes after that, to Normal.
Hovering over time points displays the period of deviation: Errors per Minute was around 16 or above, while its Expected Range was 0 to 13.5. With a key metric elevated by this much, it made sense for the system to surface this anomaly.
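As a simplified illustration of the kind of check this implies (not the product's actual anomaly detection algorithm), the following Python sketch flags a sustained breach of an Expected Range using values that approximate this example. The sample data and the three-consecutive-samples rule are assumptions.

```python
# Illustrative sketch only: flag a sustained breach of an Expected Range.
# The sample values and the consecutive-sample rule are assumptions,
# not the product's actual anomaly detection algorithm.
def breaches_expected_range(values, expected_low, expected_high, min_consecutive=3):
    """Return True if the metric stays outside its Expected Range for
    at least `min_consecutive` consecutive samples."""
    streak = 0
    for v in values:
        if v < expected_low or v > expected_high:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0
    return False

# Errors per Minute samples during the example window (values approximated).
errors_per_minute = [12.0, 14.0, 16.2, 17.5, 16.8, 15.9, 16.1, 13.0, 9.0]
print(breaches_expected_range(errors_per_minute, 0, 13.5))  # True
```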
The Top Deviating Metrics timeline also displays the evaluation period of an anomaly in grey. The evaluation period is the duration over which data is analyzed to detect the anomaly. This timeline helps you precisely identify when the issue started. The following image displays the evaluation period:
Examine Suspected Cause Metrics
You view, scroll through, and hover over Suspected Cause Metrics in the same way as Top Deviating Metrics.
For example, the Suspected Cause metric is displayed for frontend15novauto. The Expected Range of the MemoryUsed% metric is 0 to 4.5%, but its value is 8%, which is above the Expected Range. This is the root cause of the anomaly.
Results
We used Anomaly Detection and Automated Root Cause Analysis to quickly find the root cause of a Business Transaction performance problem. What kind of time and effort did Anomaly Detection and Automated Root Cause Analysis save us in this case?
Recall that the tier where the Business Transaction /order starts has multiple services as dependencies. Anomaly Detection and Automated Root Cause Analysis identified the most relevant tier, and the most relevant metric on that tier, as the origin of the issue.
You were spared the tedious process of investigating multiple metrics on each dependency in turn. Instead, you confirmed or negated Suspected Causes with a quick glance at timelines, flow maps, and metrics. Anomaly Detection and Automated Root Cause Analysis performed the vast majority of the work in root cause analysis, presenting you with the information you needed to quickly form and verify a hypothesis.
(Optional) Inspect Snapshots and Execute Actions
When you view an anomaly, you can inspect:
- Business Transaction snapshots from the time period of the anomaly.
- Actions executed as the result of policies you have configured for the Business Transaction.
These are optional parts of the standard troubleshooting flow and are typically done as follow-up.
In this example:
- Suppose we want more context for the MemoryUsed% issue on frontend15novauto. We can view snapshots of transactions affected by that issue. Double-click a snapshot in the list to open it in its own pane and, if desired, drill down into details and call graphs.
- It is common to send messages to a ticketing system when an anomaly occurs. In this case, we posted to Slack so that our Ops team can see the alert on their phones.
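For reference, posting such a message to Slack through an incoming webhook looks roughly like the sketch below. The webhook URL and message fields are placeholders; in practice, the policy action you configure sends the notification for you.

```python
# Hypothetical sketch: post an anomaly notification to a Slack incoming webhook.
# The webhook URL and message text are placeholders for illustration only.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

message = {
    "text": (
        ":rotating_light: Anomaly on Business Transaction /order\n"
        "Severity: Critical\n"
        "Top suspected cause: frontend15novauto (MemoryUsed% above Expected Range)"
    )
}

req = urllib.request.Request(
    SLACK_WEBHOOK_URL,
    data=json.dumps(message).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 indicates Slack accepted the message
```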