Analyze error spans in Splunk APM

With Splunk APM error detection, you can isolate specific causes of errors in your system and applications.

How Splunk APM detects error spans

Each span in Splunk APM captures a single operation. Splunk APM considers a span to be an error span if the operation that the span captures results in an error as defined by the following conditions:

- The otel.status_code field for the span is ERROR. otel.status_code is set in the Splunk Distribution of the OpenTelemetry instrumentation using the native OTel field span.status. span.status, and subsequently otel.status_code, are set based on either the HTTP status code or the gRPC status code.
  - See How OpenTelemetry handles HTTP status codes to learn which status code values set otel.status_code to ERROR in the OpenTelemetry instrumentation.
  - See How OpenTelemetry handles gRPC status codes to learn which rpc.grpc.status_code tag values set otel.status_code to ERROR in the OpenTelemetry instrumentation.
- The error tag for the span is set to a truthy value, which is any value other than False or 0.

See the Span Status section of the OpenTelemetry Transformation to non-OTLP Formats specification at https://opentelemetry.io/docs/specs/otel/common/mapping-to-non-otlp/#span-status to learn more about otel.status_code. See the Set Status section of the OpenTelemetry Tracing API specification on GitHub https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status to learn more about span.status.
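The two conditions above can be sketched as a small predicate. This is only an illustration, not Splunk APM's implementation: the function name is_error_span and the representation of a span as a plain dictionary of tags are assumptions. The text does not say how string values such as "false" are treated, so the sketch applies the "any value other than False or 0" rule literally.

```python
from typing import Any, Mapping


def is_error_span(tags: Mapping[str, Any]) -> bool:
    """Return True if the span's tags meet either error condition.

    `tags` is a hypothetical dict of span fields, used here for illustration.
    """
    # Condition 1: the otel.status_code field is ERROR.
    if tags.get("otel.status_code") == "ERROR":
        return True
    # Condition 2: the error tag is present and is any value other than
    # False or 0 (applied literally; string handling is an assumption).
    if "error" in tags:
        return tags["error"] not in (False, 0)
    return False
```

For example, is_error_span({"error": "timeout"}) evaluates to True, while is_error_span({"error": 0}) evaluates to False.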

How OpenTelemetry handles HTTP status codes

The following table provides an overview of how HTTP status codes are used to set the span.status field, and subsequently otel.status_code, in OpenTelemetry instrumentation in accordance with OpenTelemetry semantic conventions. To learn more, see the OpenTelemetry semantic conventions for HTTP spans on GitHub https://github.com/open-telemetry/semantic-conventions/blob/main/model/http/spans.yaml.


Error type	Server-side spans `span.kind = SERVER`	Client-side spans `span.kind = CLIENT`
`1xx`, `2xx`, and `3xx`	`otel.status_code` is unset, unless there’s another error in the span.	`otel.status_code` is unset, unless there’s another error in the span.
`4xx`	Not considered a server-side error. `otel.status_code` unset.	Counted as a client error. `otel.status_code` set to `ERROR`.
`5xx`	`otel.status_code` set to `ERROR`.	`otel.status_code` set to `ERROR`.
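As a rough sketch, the table above can be expressed as a lookup function. The function name otel_status_from_http is hypothetical, None stands in for "unset", and the sketch ignores the "unless there's another error in the span" case, which depends on information beyond the HTTP status code.

```python
from typing import Optional


def otel_status_from_http(status_code: int, span_kind: str) -> Optional[str]:
    """Return "ERROR" or None (unset) for an HTTP status code.

    Only the SERVER and CLIENT span kinds covered by the table are handled.
    """
    if span_kind not in ("SERVER", "CLIENT"):
        raise ValueError(f"span kind not covered by the table: {span_kind}")
    if 500 <= status_code <= 599:
        # 5xx is an error on both sides.
        return "ERROR"
    if 400 <= status_code <= 499:
        # 4xx is an error only from the client's point of view.
        return "ERROR" if span_kind == "CLIENT" else None
    # 1xx, 2xx, and 3xx leave the status unset.
    return None
```

Under these assumptions, a 404 leaves a server-side span unset but marks the matching client-side span as ERROR.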

How OpenTelemetry handles gRPC status codes

To determine if a gRPC span counts towards the error rate for a service, Splunk APM looks at the otel.status_code field as set by OpenTelemetry instrumentation. The following logic is applied by the instrumentation in accordance with OpenTelemetry semantic conventions:


Code	Status	Server-side spans `span.kind = SERVER`	Client-side spans `span.kind = CLIENT`
0	OK	unset	unset
1	CANCELLED	unset	ERROR
2	UNKNOWN	ERROR	ERROR
3	INVALID_ARGUMENT	unset	ERROR
4	DEADLINE_EXCEEDED	ERROR	ERROR
5	NOT_FOUND	unset	ERROR
6	ALREADY_EXISTS	unset	ERROR
7	PERMISSION_DENIED	unset	ERROR
8	RESOURCE_EXHAUSTED	unset	ERROR
9	FAILED_PRECONDITION	unset	ERROR
10	ABORTED	unset	ERROR
11	OUT_OF_RANGE	unset	ERROR
12	UNIMPLEMENTED	ERROR	ERROR
13	INTERNAL	ERROR	ERROR
14	UNAVAILABLE	ERROR	ERROR
15	DATA_LOSS	ERROR	ERROR
16	UNAUTHENTICATED	unset	ERROR

To learn more about how gRPC status codes are handled, see the OpenTelemetry semantic conventions for RPC spans on GitHub https://github.com/open-telemetry/semantic-conventions/blob/main/model/rpc/spans.yaml.
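The gRPC table above can be condensed into a short lookup, shown here as a sketch. The names SERVER_ERROR_CODES and otel_status_from_grpc are hypothetical, and None stands in for "unset".

```python
from typing import Optional

# gRPC codes that set ERROR on server-side spans, per the table:
# UNKNOWN, DEADLINE_EXCEEDED, UNIMPLEMENTED, INTERNAL, UNAVAILABLE, DATA_LOSS.
SERVER_ERROR_CODES = frozenset({2, 4, 12, 13, 14, 15})


def otel_status_from_grpc(code: int, span_kind: str) -> Optional[str]:
    """Return "ERROR" or None (unset) for a gRPC status code."""
    if not 0 <= code <= 16:
        raise ValueError(f"unknown gRPC status code: {code}")
    if span_kind == "CLIENT":
        # Every code except OK (0) is an error on the client side.
        return "ERROR" if code != 0 else None
    if span_kind == "SERVER":
        return "ERROR" if code in SERVER_ERROR_CODES else None
    raise ValueError(f"span kind not covered by the table: {span_kind}")
```

For instance, NOT_FOUND (5) leaves a server-side span unset but sets ERROR on the client-side span, while INTERNAL (13) sets ERROR on both.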

How error spans are counted in MetricSets

To generate endpoint-level Monitoring MetricSets, Splunk APM turns endpoint spans, which are spans with span.kind = SERVER or span.kind = CONSUMER, into error metric data. If a span is considered an error per the Error rules in Splunk APM, that span counts towards errors in the Monitoring MetricSet for the endpoint associated with that span.

Service-level Monitoring MetricSets are based on the number of error spans in each of the service endpoints.
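The counting described above can be sketched as follows. This is an illustration only, not how Splunk APM computes MetricSets internally: the span dictionaries, their keys ("service", "endpoint", "kind", "is_error"), and both function names are assumptions.

```python
from collections import Counter
from typing import Any, Iterable, Mapping, Tuple

# Endpoint spans are spans with span.kind = SERVER or span.kind = CONSUMER.
ENDPOINT_SPAN_KINDS = frozenset({"SERVER", "CONSUMER"})


def count_endpoint_errors(
    spans: Iterable[Mapping[str, Any]],
) -> "Counter[Tuple[str, str]]":
    """Count error spans per (service, endpoint), using endpoint spans only."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for span in spans:
        if span["kind"] in ENDPOINT_SPAN_KINDS and span["is_error"]:
            counts[(span["service"], span["endpoint"])] += 1
    return counts


def count_service_errors(
    endpoint_counts: Mapping[Tuple[str, str], int],
) -> "Counter[str]":
    """Roll endpoint-level error counts up to the service level."""
    totals: "Counter[str]" = Counter()
    for (service, _endpoint), n in endpoint_counts.items():
        totals[service] += n
    return totals
```

In this sketch, an error span with kind CLIENT contributes nothing to either count, because it is not an endpoint span.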

Server-side and client-side error counting

Splunk APM captures all spans from all instrumented services, including spans that capture requests a service makes to other services, called client-side spans, and spans that capture requests a service receives, called server-side spans. In certain cases, when a service returns an error, the error can be registered in both the initiating span and the receiving span. To avoid duplicated error reports, Splunk APM counts only the server-side error spans in MetricSets and error totals.

For example, when service_a makes a call to service_b and both services are fully instrumented, Splunk APM receives the following two spans:

- span_1, a span with span.kind = CLIENT that captures service_a making the call to service_b
- span_2, a span with span.kind = SERVER that captures service_b receiving the request

If service_b returns a 500 error, both spans receive that error. To avoid double-counting errors, Splunk APM counts only the server-side span, span_2, as an error in MetricSets and error totals.

What is the difference between an error and a root cause error?

To help you identify the root cause of an error, Splunk APM differentiates between errors and root cause errors. For example, the request and error graph in Tag Spotlight differentiates root cause errors from total errors with a darker color:

This screenshot shows the graph of requests and errors for paymentservice in Tag Spotlight. Total errors have a light pink area plot on the graph, and root cause errors are darker pink.

When a particular span within a trace results in an error, the error can propagate through other spans in the trace. Any span determined to contain an error based on the criteria described in How Splunk APM detects error spans is an error span. Splunk APM designates the originating error of a chain of error spans as the root cause error.

For example, consider the checkout trace in the following screenshot:

This screenshot shows an example of the Splunk APM trace view.

The checkout service makes HTTP requests to the authorization service, the checkout service, and the payment service. The HTTP request to the payment service results in a 402 "Payment Required" error. Because the request to the payment service failed, the initiating requests to checkout service and http.Request also result in errors.

In this case, the source error, or root cause error, is the 402 error in the payment service. The 500 errors appearing in the checkout and api services are subsequent errors.

The root cause error count indicates the count of these root cause errors, while the standard error count indicates the total count of all root cause errors as well as any subsequent errors.
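One way to picture the distinction is the sketch below. It rests on an assumption that the text does not confirm: an error span is treated as a root cause error when none of its direct child spans is itself an error span. The span dictionaries, their keys ("span_id", "parent_id", "is_error"), and the function name are hypothetical.

```python
from typing import Any, Iterable, Mapping, Set


def root_cause_error_ids(spans: Iterable[Mapping[str, Any]]) -> Set[str]:
    """Return IDs of error spans that have no error span among their children.

    Assumption: errors propagate from child to parent, so an error span with
    no erroring child is where its chain of errors originated.
    """
    spans = list(spans)
    error_ids = {s["span_id"] for s in spans if s["is_error"]}
    # Parents whose error may have been caused by an erroring child span.
    parents_with_error_child = {
        s["parent_id"]
        for s in spans
        if s["is_error"] and s["parent_id"] is not None
    }
    return error_ids - parents_with_error_child


# A hypothetical three-level trace in which the error starts in span "c".
trace = [
    {"span_id": "a", "parent_id": None, "is_error": True},
    {"span_id": "b", "parent_id": "a", "is_error": True},
    {"span_id": "c", "parent_id": "b", "is_error": True},
    {"span_id": "d", "parent_id": "a", "is_error": False},
]
root_causes = root_cause_error_ids(trace)
total_errors = sum(1 for s in trace if s["is_error"])
```

Here the root cause error count is 1 (span "c") and the total error count is 3, because the total includes the root cause error plus the two subsequent errors in spans "b" and "a".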

Customize the error logic in Splunk APM

Note:

You must have a role with the UPDATE_CONFIG capability to configure the error rate metric. This capability is included in the admin role.

If you're an admin and want to create and assign a custom role with this capability, see Custom roles in Splunk Observability Cloud and Assign roles to users in Splunk Observability Cloud.

By default, Splunk APM counts server-side spans with 5xx status codes as errors. The default configuration applies to all services.

To define which HTTP error codes contribute to the error rate metric in APM, you can update the default configuration for all services or configure a custom override for a specific service. Custom overrides take priority over the default configuration.

Error code metric configurations are applied at the organization level, as services may belong to more than one environment.
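The priority rule above, where a custom override replaces the default configuration for a service, can be pictured with the sketch below. It is purely illustrative: Splunk APM manages these settings through its UI, and the data structures, the function names, and the example service name are assumptions rather than a real configuration format or API.

```python
from typing import FrozenSet, Mapping

# Default: 5xx status codes on server-side spans count as errors.
DEFAULT_ERROR_CODES: FrozenSet[int] = frozenset(range(500, 600))


def error_codes_for(
    service: str,
    overrides: Mapping[str, FrozenSet[int]],
    default: FrozenSet[int] = DEFAULT_ERROR_CODES,
) -> FrozenSet[int]:
    """Return the error codes for a service; a custom override takes priority."""
    return overrides.get(service, default)


def counts_toward_error_rate(
    service: str,
    status_code: int,
    overrides: Mapping[str, FrozenSet[int]],
) -> bool:
    """Decide whether a server-side HTTP status code counts as an error."""
    return status_code in error_codes_for(service, overrides)


# Hypothetical override: also treat 429 as an error for one service.
overrides = {"paymentservice": DEFAULT_ERROR_CODES | {429}}
```

With this override in place, a 429 counts toward the error rate for paymentservice but not for any service that still uses the default configuration.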

Update the default configuration for the error rate metric

1. In the Splunk Observability Cloud main menu, select Settings > APM error code configuration.
2. In the Default configuration table, select the edit icon under Actions.
3. Follow the on-screen steps to update the default configuration for all services.

Configure a custom override for the error rate metric

1. In the Splunk Observability Cloud main menu, select Settings > APM error code configuration.
2. Select Configure custom override.
3. Follow the on-screen steps to configure a custom override for a service.
