Finding and removing outliers

This section describes outliers. For a complete list of topics on detecting anomalies, detecting patterns, and time series forecasting see About advanced statistics, in this manual.

What is an outlier?

An outlier is a data point that is far removed from the typical distribution of data points.

In some situations you might want to simply identify the outliers. In other situations you might want to remove the outliers so that they do not skew your statistical results or create issues displaying data in your charts.

Statistical outliers

A statistical outlier is a data point that is far removed from some measure of centrality. Typical measures of centrality are mean, median, and mode. Mean is the average value. Median is the value at the middle of a sorted list of all values. Mode is the most frequent value.

An image that shows a graph of the mean, median, and mode curves.
Centrality measure Splunk centrality commands
Mean stats avg(field)
Median stats median(field)
Mode stats mode(field)

There are different types of statistical outliers. Outliers can be far from the mean, far from the median, or a small number relative to the mode. An outlier can also be a data point that is far removed from the typical range of data points.

Identifying outliers

There are several methods you can use to identify outliers. Often these methods involve calculating some measure of centrality and then identifying the outliers.

In some cases outliers are identified when you notice an anomaly. For example, when you chart the data and you notice that the axis is skewed. An outlier can be the culprit.

Use the number of statistical outliers as a bellwether. If you see more statistical outliers than usual, that phenomenon itself is an anomaly.

To calculate the centrality measure, you can use the following commands.

Centrality measure Splunk centrality commands
Standard deviation stats stdev(field)
Quartiles and percentiles stats perc75(field) perc25(field)
Top 3 most frequent values top 3 field

In many cases, you need additional commands to calculate the information that you are looking for. The following sections contain some of these examples.

Calculate the lower and upper boundaries of an acceptable range to identify outliers

The following example takes the first 500 events from the quote.csv file. The streamstats command and a moving window of 100 events are used to calculate the average and standard deviation. The average and standard deviation are used with the eval command to calculate the lower and upper boundaries. A sensitivity is added into the calculation by multiplying the stdev by 2. The eval command uses those boundaries to identify the outliers. The outliers are then sorted in descending order.

Use the interquartile range (IQR) to identify outliers

The following example takes the first 500 events from the quote.csv file. The eventstats command is used to calculate the median, the 25th percentile (p25), and the 75th percentile (p75). The IQR is calculated with the eval command by subtracting the percentiles. The median and the IQR are used with the eval command to calculate the lower and upper boundaries. A sensitivity is added into the calculation by multiplying the IQR by 20.The eval command uses those boundaries to identify the outliers. The outliers are then sorted in descending order.

Removing outliers in charts

You can use the outlier command to remove outlying numerical values from your search results.

You have the option to remove or transform the events with outliers. The remove option removes the events. The transform option truncates the outlying value to the threshold for outliers. The threshold is specified with the param option.

A value is considered an outlier if the value is outside of the param threshold multiplied by the inter-quartile range (IQR). The default value for param is 2.5.

Create a chart of web server events, transform the outlying values

For a timechart of web server events, transform the outlying average CPU values.

Remove outliers that interfere with displaying the y-axis in a chart

Sometimes when you create a chart, a small number of values are so far from the other values that the chart is rendered unreadable. You can remove the outliers so that the chart values are visible.

Remove outliers using the three-sigma rule across transactions

This example uses the eventstats command to calculate the average and the standard deviation. The three-sigma limit is then calculated. The where command filters search results. Only the events with the duration less than the three-sigma limit are returned.

Manage alerts for outliers

When setting up an alert, it is important to review the outlier threshold you have set.

* If the threshold value is too low, too many alerts are returned for non-critical outliers

* If the threshold value is too high, not enough alerts are returned and you might not identify the outliers

Typically a small percentage of events in your data are outliers. If you have 1,000,000 events a day and 5% of the events are outliers, setting an alert would trigger 50,000 alert actions unless you specify a throttle. A throttle suppresses additional alerts that have the same field value in a given time range.

For example, your search returns on average 100 events every minute. You only want to be alerted when the status for an event is 404. You can setup the alert to perform the alert action one time every 60 seconds instead of alerting you for every event that has a 404 status in that 60 second window.

Set alert throttling

You set throttling as part of setting an alert.

1. Determine what percent of your events are outliers.

2. Under Settings, choose Searches, reports, and alerts.

3. Under Schedule and alert, click the Schedule this search check box. The screen expands to display the scheduling and alerting options.

4. Specify the alert condition and mode.

5. Mark the Throttling check box and specify when the throttling expires.

6. Specify the alert action.

7. Click Save.