Metrics in Splunk Observability Cloud
Introduction to metrics, data points, and metric time series in Splunk Observability Cloud.
In Splunk Observability Cloud, metric data consists of a numerical measurement called a metric, the metric type, and one or more dimensions. Each piece of data in this form is a data point. For example, a data point can be the CPU utilization of host server1 with metric type gauge, metric value 0.7, dimensions "hostname":"server1" and "host_location":"Tokyo", and the timestamp 1557225030000.
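The example data point above can be written out as a payload in the JSON shape accepted by the ingest API's `/v2/datapoint` endpoint. This is a sketch of the wire format; check the developer portal for the authoritative reference.

```python
import json

# The data point from the example above, shaped as an ingest payload.
# The top-level key selects the metric type (here, a gauge).
data_point = {
    "gauge": [
        {
            "metric": "cpu.utilization",      # metric name
            "value": 0.7,                     # metric value
            "dimensions": {                   # key-value pairs describing the source
                "hostname": "server1",
                "host_location": "Tokyo",
            },
            "timestamp": 1557225030000,       # Unix time in milliseconds (optional)
        }
    ]
}

payload = json.dumps(data_point)
print(payload)
```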
A metric time series (MTS) contains all the data points that have the same metric name, metric type, and set of dimensions. Splunk Observability Cloud automatically creates MTS from incoming data points. For example, the following data points for the cpu.utilization metric with the same "hostname":"server1" and "location":"Tokyo" dimensions, but with different values and timestamps, make up a single MTS.
MTS are used in Splunk Infrastructure Monitoring to populate charts and generate alerts.
Metrics
A metric is a measurable number that varies over time. Multiple sources of the same general type, such as host machines, usually report the metric values for a single set of metric names. For example, a server cluster that has 100 host machines might report a single set of metrics named cpu.utilization, api.calls, and dropped.packets, although metric values might be different for each machine.
Don't use metric names that start with the reserved prefixes sf. or sf_metric.
Metric type
There are four types of metrics: gauge, counter, cumulative counter, and histogram. See more in Metric types.
| Metric type | Description | Example |
|---|---|---|
| Gauge | Value of a measurement at a specific point in time | CPU utilization percentage of a server |
| Cumulative counter | Total number of occurrences or items since the measurement began | Total number of Splunk Infrastructure Monitoring API calls served since starting the web server |
| Counter | Number of new occurrences or items since the last measurement | The number of packets that fail to reach their destinations over each 24-hour period |
| Histogram | Distribution of measurements across time. Splunk Observability Cloud supports explicit bucket histograms. | Response time (performance) or successful screen loads (availability) |
Metric category
There are about 20 metric categories in Splunk Observability Cloud. A metric's category, especially the custom category, can impact billing.
To learn about all metric categories and how to identify them, see Metric categories.
Metric resolution
By default, Splunk Observability Cloud processes metrics at a 10-second resolution. If metrics have a native resolution coarser than 10 seconds, then Splunk Observability Cloud processes the metrics at their native resolution.
Optionally, metrics can be ingested at a higher resolution of 1 second. High-resolution metrics enable exceptionally fine-grained and low-latency visibility and alerting for your infrastructure, applications, and business performance.
To ingest metrics at 1-second resolution, set the sf_hires dimension to 1 in any MTS.
Metric metadata
Metrics can have associated metadata such as dimensions, custom properties, or tags. Learn more in Metadata: Dimensions, custom properties, tags, and attributes.
To add or edit dimensions:

- Use the API. See how in our developer portal.
Data points
A data point contains a metric name and value, the type of the metric, and the dimensions of the metric. Dimensions are the key-value pairs that identify the source of the reported value. Infrastructure Monitoring assumes that incoming data points contain a metric as well as a dimension, or a unique key-value pair that describes some aspect of the metric source.
A data point consists of the following components:
| Component | Description | Examples |
|---|---|---|
| Metric type | The specified metric type determines the way that Splunk Observability Cloud works with the metric. To learn more about metric types, see Metric types. | One of four metric types: gauge, counter, cumulative counter, or histogram |
| Metric name | A metric name identifies the values that you send into Infrastructure Monitoring. To learn more about metric naming constraints, see Naming conventions for metrics and dimensions. | cpu.utilization |
| Metric value | The measurement from your system, represented as a number. Metric values must be a signed integer, float, or numeric string in decimal or fixed-point notation. The system stores them as 64-bit integers. See more in the Send Traces, Metrics and Events API documentation. | 0.7 |
| Dimensions | Key-value pairs that describe some aspect of the source of the metric. A data point can have one or more dimensions. The most common dimension is a source. For example, a dimension can be a host or instance for infrastructure metrics, or it can be an application component or service tier for application metrics. Dimensions are considered metric metadata. To learn more about dimensions, see Metadata: Dimensions, custom properties, tags, and attributes. | "hostname": "server1" |
| Timestamp (optional) | Either the time that the data is sent by the software, or the time at which the data arrives in Splunk Observability Cloud, as Unix time in milliseconds. | 1557225030000 |
Metric time series
A metric time series (MTS) is a collection of data points that have the same metric and the same set of dimensions.
For example, the following sets of data points are in three separate MTS:

- MTS1: Gauge metric cpu.utilization, dimension "hostname": "host1"
- MTS2: Gauge metric cpu.utilization, dimension "source_host": "host1"
- MTS3: Gauge metric cpu.utilization, dimension "hostname": "host2"

MTS 2 has the same host value as MTS 1, but not the same dimension key. MTS 3 has the same dimension key as MTS 1, but not the same host name value.
Splunk Observability Cloud retains inactive MTS for 13 months.
Use unique dimensions to create independent MTS
It’s important to configure the Collector or ingest to provide at least one dimension that identifies a unique entity.
For example, when you report on the CPU utilization of 10 hosts in a cluster, the metric is the CPU utilization.
If each host in the cluster shares the exact same dimensions with all the other hosts, the cluster generates only one MTS. As a result, you might have difficulty differentiating and monitoring the CPU utilization of each individual host in the cluster.
However, if each host in the cluster has at least one unique dimension (typically a unique hostname), the cluster generates 10 MTS, or one for each host. Each MTS represents the CPU utilization over time for a single host.
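How data points group into MTS can be sketched with a hypothetical helper (not part of any Splunk SDK) that builds an MTS identity from the metric name, metric type, and dimension set, using the three-MTS example above:

```python
def mts_key(metric, metric_type, dimensions):
    # An MTS is identified by metric name, metric type, and the full
    # set of dimension key-value pairs.
    return (metric, metric_type, frozenset(dimensions.items()))

points = [
    ("cpu.utilization", "gauge", {"hostname": "host1"}),    # MTS 1
    ("cpu.utilization", "gauge", {"source_host": "host1"}), # MTS 2: same value, different key
    ("cpu.utilization", "gauge", {"hostname": "host2"}),    # MTS 3: same key, different value
    ("cpu.utilization", "gauge", {"hostname": "host1"}),    # same MTS as the first point
]

distinct_mts = {mts_key(m, t, d) for m, t, d in points}
print(len(distinct_mts))  # 3
```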
Metric types
Learn about the metric types in Splunk Observability Cloud: gauges, cumulative counters, histograms, and counters.
In Splunk Observability Cloud, there are four types of metrics: gauges, counters, cumulative counters, and histograms.
The following table lists the types of supported metrics and their default rollups in Splunk Observability Cloud:
| Metric | Description | Rollup |
|---|---|---|
| Gauge | Represent data that has a specific value at each point in time. Gauge metrics can increase or decrease. | Average |
| Counter | Represent a count of occurrences in a time interval. Counter metrics can only increase during the time interval. | Sum |
| Cumulative counter | Represent a running count of occurrences, and measure the change in the value of the metric from the previous data point. | Delta |
| Histogram | Represent a distribution of measurements or metrics, with complete percentile data available. Data is distributed into equally sized intervals or "buckets". | Histogram |
The type of the metric determines which default rollup function Splunk Observability Cloud applies to summarize individual incoming data points to match a specified data resolution. A rollup is a statistical function that takes all the data points in a metric time series (MTS) over a time period and outputs a single data point. Splunk Observability Cloud applies rollups after it retrieves the data points from storage but before it applies analytics functions. To learn more about rollups and data resolution, see Rollups in Data resolution and rollups in charts.
Gauges
Fan speed, CPU utilization, memory usage, and time spent processing a request are examples of gauge metric data.
Splunk Observability Cloud applies the SignalFlow average() function to data points for gauge metrics. When you specify a ten second resolution for a line graph plot, and Splunk Observability Cloud is receiving data for the metric every second, each point on the line represents the average of 10 data points.
Counters
Number of requests handled, emails sent, and errors encountered are examples of counter metric data. The machine or app that generates the counter increments its value every time something happens and resets the value at the end of each reporting interval.
Splunk Observability Cloud applies the SignalFlow sum() function to data points for counter metrics. When you specify a ten second resolution for a line graph plot, and Splunk Observability Cloud is receiving data for the metric every second, each point on the line represents the sum of 10 data points.
Cumulative counters
Number of successful jobs, number of logged-in users, and number of warnings are examples of cumulative counter metric data. Cumulative counter metrics differ from counter metrics in the following ways:
- Cumulative counters only reset to 0 when the monitored machine or application restarts, or when the counter value reaches the maximum representable value (2^32 or 2^64).
- In most cases, you're interested in how much the metric value changed between measurements.
Splunk Observability Cloud applies the SignalFlow delta() function to data points for cumulative counter metrics. When you specify a ten second resolution for a line graph plot, and Splunk Observability Cloud is receiving data for the metric every second, each point on the line represents the change between the first data point received and the 10th data point received. As a result, you don't have to write custom SignalFlow to apply the delta() function yourself, and the plot line shows the variation directly.
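Assuming one data point per second and a 10-second resolution, the three default rollups described above can be sketched as plain aggregations over each 10-point window (an illustration, not the SignalFlow implementation):

```python
# Ten data points received during one 10-second window (one per second).
window = [4, 6, 5, 7, 3, 5, 6, 4, 8, 2]

# Gauge: average rollup
gauge_rollup = sum(window) / len(window)

# Counter: sum rollup
counter_rollup = sum(window)

# Cumulative counter: delta rollup, the change across the window.
# Cumulative values only ever grow until the counter resets.
cumulative = [100, 104, 110, 115, 122, 125, 130, 136, 140, 148]
delta_rollup = cumulative[-1] - cumulative[0]

print(gauge_rollup, counter_rollup, delta_rollup)  # 5.0 50 48
```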
Histograms
Histograms can summarize data in ways that are difficult to reproduce with other metrics. Thanks to the buckets, the distribution of your continuous data over time is easier to explore, as you don't have to analyze the entire dataset to see where all the data points are. At the same time, histograms help reduce your subscription usage.
Splunk Observability Cloud applies the SignalFlow histogram() function to data points for histogram metrics, with a default percentile value of 90. You can apply several other functions to histograms, like min, max, count, sum, percentile, and cumulative_distribution_function.
For more information, see Histogram metrics in Splunk Observability Cloud.
Metric categories
Learn about metric categories in Splunk Observability Cloud.
Metric categories for realms us0 and us1
The following metric categories are used in the realms us0 and us1:
| Billing class | Metrics included |
|---|---|
| Custom metrics | Metrics reported to Splunk Observability Cloud outside of those reported by default, such as host, container, or bundled metrics. Custom metrics might result in increased data ingest costs. |
| APM Monitoring MetricSets | Includes metrics from APM Monitoring MetricSets. See Learn about Monitoring MetricSets in Splunk APM for more information. |
| RUM Monitoring MetricSets | Includes metrics from RUM Monitoring MetricSets. See Filter and troubleshoot with custom tags for more information. |
| Default/bundled metrics (Infrastructure) | |
| Default/bundled metrics (APM) | |
| Other metrics | Internal metrics |
Metric categories for other realms
The following metric categories are used for any realms that aren’t us0 or us1:
| Category type | Description |
|---|---|
| 0 | No information about the category type of the metric. Note: Category type information for metrics is only available after 03/16/2023. Any metric created before that date has category type 0. |
| 1 | Host |
| 2 | Container |
| 3 | Custom. Metrics reported to Splunk Observability Cloud outside of those reported by default, such as host, container, or bundled metrics. Custom metrics might result in increased data ingest costs. |
| 4 | Hi-resolution |
| 5 | Internal |
| 6 | Tracing metrics |
| 7 | Bundled. In host-based subscription plans, additional metrics sent through Infrastructure Monitoring public cloud integrations that are not attributed to specific hosts or containers. |
| 8 | APM hosts |
| 9 | APM container |
| 10 | APM identity |
| 11 | APM bundled metrics |
| 12 | APM Troubleshooting MetricSets. This category is not part of the report. |
| 13 | APM Monitoring MetricSets |
| 14 | Infrastructure Monitoring function |
| 15 | APM function |
| 16 | RUM Troubleshooting MetricSets. This category is not part of the report. |
| 17 | RUM Monitoring MetricSets |
| 18 | Network Explorer metrics |
| 19 | Runtime metrics |
| 20 | Synthetics metrics |
Identify and track the category of a metric
In host-based plans, the category of a metric might impact billing.
To keep track of the types of metrics you're ingesting, Splunk Observability Cloud provides different tools and reports:

- Custom metric report: shows information on MTS associated with data points sent from hosts or containers, as well as information related to custom, high-resolution, and bundled MTS, for a specified date.
- Metric Pipeline Management usage report: gives a detailed breakdown of your MTS creation and usage.
- Custom org metrics: track specific organization metrics. See more in View organization metrics for Splunk Observability Cloud.
Use SignalFlow to look up a metric's category
You can use SignalFlow to query for the sf_mtsCategoryType dimension, which indicates the metric category.
For example, to look for the top 10 custom metrics you're ingesting, use the following query with the * wildcard character:

```
A = data('*', filter=filter('sf_mtsCategoryType', '3')).count(by="sf_metric").top(10).publish(label='A')
```
To only look at specific metrics, use their specific metric name.
Learn more in SignalFlow and analytics.
Histogram metrics in Splunk Observability Cloud
Splunk Observability Cloud natively supports histograms. All histogram metric data you send to Splunk Observability Cloud through OpenTelemetry feeds charts, alerts, and other features.
Splunk Observability Cloud supports histogram data. You can use the histogram metric data you send from instrumented applications and services to Splunk Observability Cloud to create charts, detectors, and more.
Understanding histograms
A histogram represents the distribution of observations. Histograms require numerical, continuous values. Examples of continuous values include time, size, or temperature. The following chart is a visual representation of a histogram for response times in milliseconds:
Histograms store data in buckets, which are adjacent intervals with numeric boundaries. The buckets or bars in the previous histogram span 100 milliseconds. The size of each bar is determined by the number of observations inside each interval. The higher the bar, the more data points fall within the interval.
You can calculate the total number of observations, the minimum and maximum value, the sum of all values, the average value, and discrete percentile values in every histogram. Splunk Observability Cloud provides a SignalFlow function for histograms, which you can use to customize histograms or perform calculations on the data.
Histograms are useful to compare different datasets at a glance, and to identify trends in your data that might be otherwise hard to detect. For example, histograms can answer questions like "What was the 90th percentile of response time for the database yesterday?"
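The kind of percentile lookup described above can be approximated from bucket counts alone. The following sketch uses linear interpolation inside a bucket; it is an illustration of the idea, not the exact SignalFlow histogram() implementation:

```python
def bucket_percentile(boundaries, counts, p):
    """Approximate the p-th percentile from explicit buckets.

    boundaries: sorted bucket boundaries, e.g. [100, 200, 300]
    counts: observations per bucket; len(counts) == len(boundaries) + 1
    """
    total = sum(counts)
    target = p / 100 * total
    seen = 0
    for i, c in enumerate(counts):
        if seen + c >= target and c > 0:
            lo = boundaries[i - 1] if i > 0 else 0
            hi = boundaries[i] if i < len(boundaries) else lo
            # Interpolate linearly within the bucket.
            return lo + (hi - lo) * (target - seen) / c
        seen += c
    return boundaries[-1]

# Response times in ms, buckets (0,100], (100,200], (200,300], (300, inf).
print(bucket_percentile([100, 200, 300], [10, 60, 25, 5], 90))  # 280.0
```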
When to use histogram metrics
Histograms can summarize data in ways that are difficult to reproduce using other metrics. With histogram buckets, you can explore the distribution of your continuous data over time without needing to analyze the entire dataset to see all of the data points. Histograms can combine multiple statistics into a single datapoint, such as sum, min, max, and count, along with the buckets.
Service level objectives (SLO)
Histograms are particularly suited for representing performance and availability service level objectives (SLO). Examples of such SLOs are checking that the nth percentile of all requests is processed in less than a certain duration, or that the nth percentile of screens in your app loads successfully.
Unlike metrics covering a single percentile or quantile, histograms contain the percentiles or quantiles you need to track in a single metric. This facilitates exploring data in depth after initial detections. For example, if you get an alert for the 99th percentile for response time, using histograms you can explore other percentiles.
See Introduction to service level objective (SLO) management in Splunk Observability Cloud for more information.
Histogram instead of calculated metrics
Histograms contain data that you can use to calculate percentiles and other statistics in Splunk Observability Cloud instead of calculating them using your infrastructure. Sending histograms also results in fewer MTS sent, which reduces your subscription usage.
For example, if you're sending the service.response_time.upper_90 and service.response_time.upper_95 metrics to track the response time of a key service in your infrastructure at the 90th and 95th percentiles, you can send histogram data for the entire distribution of response times instead, eliminating the need to send 2 separate MTS.
Explicit bucket histograms
Explicit bucket histograms are histograms with predefined bucket boundaries. The advantage of defining bucket boundaries yourself is that you can use limits that make sense in your situation.
For example, the following Java code creates an OpenTelemetry histogram with explicit bucket boundaries:
```java
// Assumed imports from the OpenTelemetry Java SDK; the incubator package
// path can vary between SDK versions.
import io.opentelemetry.api.incubator.metrics.ExtendedLongHistogramBuilder;
import io.opentelemetry.api.metrics.DoubleHistogramBuilder;
import io.opentelemetry.api.metrics.LongHistogram;
import io.opentelemetry.api.metrics.Meter;
import java.util.Arrays;
import java.util.List;

void exampleWithCustomBuckets(Meter meter) {
    DoubleHistogramBuilder originalBuilder = meter.histogramBuilder("people.ages");
    ExtendedLongHistogramBuilder builder = (ExtendedLongHistogramBuilder) originalBuilder.ofLongs();
    List<Long> bucketBoundaries = Arrays.asList(0L, 5L, 12L, 18L, 24L, 40L, 50L, 80L, 115L);
    LongHistogram histogram =
        builder
            .setAdvice(advice -> advice.setExplicitBucketBoundaries(bucketBoundaries))
            .setDescription("A distribution of people's ages")
            .setUnit("years")
            .build();
    addDataToHistogram(histogram); // elided helper that records values
}
```
Get histogram data into Splunk Observability Cloud
For instructions on how to get histogram data into Splunk Observability Cloud and how to migrate existing reporting elements, see Get histogram data into Splunk Observability Cloud.
Get histogram data into Splunk Observability Cloud
You can collect histogram data using a variety of receivers, including the Prometheus receiver, and send them to Splunk Observability Cloud using the OpenTelemetry Collector. See Prometheus receiver.
The version of the SignalFx exporter in the upstream OpenTelemetry Collector Contrib project doesn't support send_otlp_histograms and, therefore, can't be used to send histogram data.
Export histogram data with the SignalFx exporter
The version of the SignalFx exporter in the Splunk Distribution of the OpenTelemetry Collector supports the parameter send_otlp_histograms and is the recommended method to send histogram data.
The SignalFx exporter can preserve histogram bucket data, which you can use to extract various statistics from the metric at charting time, such as the 90th percentile or the mean.
To send histogram data to Splunk Observability Cloud with the SignalFx exporter, set send_otlp_histograms: true in your Collector values.yaml file. For example:
```yaml
exporters:
  signalfx:
    access_token: "${SPLUNK_ACCESS_TOKEN}"
    api_url: "${SPLUNK_API_URL}"
    ingest_url: "${SPLUNK_INGEST_URL}"
    sync_host_metadata: true
    correlation:
    send_otlp_histograms: true
```
Export histogram data with the OTLP/HTTP exporter
To send histogram data to Splunk Observability Cloud with the OTLP/HTTP exporter, configure the metrics_endpoint and the traces_endpoint fields:
```yaml
exporters:
  otlphttp:
    metrics_endpoint: https://ingest.<realm>.observability.splunkcloud.com/v2/datapoint/otlp
    traces_endpoint: https://ingest.<realm>.observability.splunkcloud.com/v2/trace/otlp
    headers:
      "X-SF-Token": "mytoken"
    tls:
      insecure: true
    timeout: 10s
```
Best practices when sending bucket histogram data
When sending bucket histogram data to Splunk Observability Cloud, follow these best practices:
- Send minimum and maximum values, unless you're sending cumulative data. The minimum value must be lower than the maximum value, otherwise the data point is dropped.
- Use no more than 31 bucket boundaries when sending histograms. Histograms with more than 31 bucket boundaries (32 buckets) are dropped.
- Make sure that bucket boundaries don't overlap or repeat, and send them in order.
- Send values as a signed integer, float, or numeric string in decimal or fixed-point notation. Splunk Observability Cloud stores them as 64-bit integers.
- Check that the sum of all histogram bucket counts is equal to the count field, and that the number of bucket boundaries is equal to the bucket count minus 1. Histograms that don't comply with these criteria are dropped.
- When sending cumulative data, for example from Prometheus, use delta aggregation temporality. See Considerations on delta aggregation temporality for instructions on how to configure delta temporality in your system.
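The constraints above can be sketched as a client-side validation helper. This is a hypothetical function for illustration; the exact server-side checks may differ.

```python
def valid_bucket_histogram(boundaries, counts, total_count,
                           minimum=None, maximum=None, cumulative=False):
    # Boundaries must be ordered, without repeats or overlaps.
    if sorted(set(boundaries)) != list(boundaries):
        return False
    # No more than 31 bucket boundaries (32 buckets).
    if len(boundaries) > 31:
        return False
    # The number of boundaries equals the bucket count minus 1.
    if len(boundaries) != len(counts) - 1:
        return False
    # The bucket counts must add up to the count field.
    if sum(counts) != total_count:
        return False
    # Min and max are required unless the data is cumulative; min < max.
    if not cumulative:
        if minimum is None or maximum is None or minimum >= maximum:
            return False
    return True

print(valid_bucket_histogram([100, 200, 300], [10, 60, 25, 5], 100,
                             minimum=12, maximum=480))  # True
print(valid_bucket_histogram([100, 100, 300], [10, 60, 25, 5], 100,
                             minimum=12, maximum=480))  # False: repeated boundary
```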
Considerations on delta aggregation temporality
When handling cumulative histograms, you must set the delta aggregation temporality flag. If you do not, the cumulative histograms will lack minimum and maximum values. This might cause a percentile calculation to give an incorrect value.
To activate delta aggregation temporality in your instrumentation, set the OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE environment variable to delta. See the compliance matrix in the OpenTelemetry Specification repository to check SDK support for your language.
Send histogram data using the API
If you need to bypass the OpenTelemetry Collector, send histogram data directly to Splunk Observability Cloud using the /v2/datapoint/otlp endpoint of the ingest API. The endpoint accepts data in OTLP, serialized as Protobuf, over HTTP. The gRPC scheme is not supported.
To learn how to send histogram metric data using the API, see /datapoint/otlp in the Splunk Developer Portal.
Migrate your dashboards, functions, charts, and detectors
To migrate your existing dashboards, functions, charts, and detectors to histograms, follow these steps:
1. Make sure that you're sending histogram data using the Splunk Distribution of the OpenTelemetry Collector version 0.98 or higher. Lower versions can't send histogram data in OTLP format using the SignalFx exporter.
2. Edit your charts to use the new histogram() function. See histogram() in the SignalFlow reference documentation.
Troubleshooting
If you are a Splunk Observability Cloud customer and are not able to see your data in Splunk Observability Cloud, you can get help in the following ways.
Available to Splunk Observability Cloud customers
- Submit a case in the Splunk Support Portal.
- Contact Splunk Support.
Available to prospective customers and free trial users
- Ask a question and get answers through community support at Splunk Answers.
- Join the Splunk community #observability Slack channel to communicate with customers, partners, and Splunk employees worldwide.
Naming conventions for metrics and dimensions
Naming conventions for metrics and dimensions in Splunk Observability Cloud.
Read this document to learn about naming conventions and recommendations for custom metrics and dimensions in Splunk Observability Cloud.
Don't use metric names that start with the reserved prefixes sf. or sf_metric.
Types of data in Splunk Observability Cloud
Splunk Observability Cloud works with imported or existing data as well as custom data.
Imported data
When you use an existing data collection integration such as the collectd agent or the AWS CloudWatch integration, the integration defines metric, dimension, and event names for you. To learn more, see Metric name standards.
To make it easier for you to find and work with metrics coming in from different sources, Splunk Infrastructure Monitoring pulls, transforms, and returns the data in a unified format called virtual metrics. See Virtual metrics in Splunk Infrastructure Monitoring for more information.
Custom data
When you send custom metrics, dimensions, or events (key-value pairs you send to mark specific events such as a release) to Splunk Infrastructure Monitoring, you choose your own names.
Send custom data to Splunk Observability Cloud
To learn how to send custom metrics in Splunk Observability Cloud using our API, see the developer portal.
If you’re using the OpenTelemetry Collector, you can create a receiver to Send custom metrics to Splunk Observability Cloud.
Modify naming schemes you sent to other metric systems
If you’re working with metrics that you had previously sent to other metric systems, such as Graphite, modify the naming scheme to leverage the full feature set of Splunk Observability Cloud.
Metric name standards
Metrics are distinct numeric measurements generated by system infrastructure, application instrumentation, or other hardware or software, which change over time. For example:
- Count of GET requests received
- Percent of total memory in use
- Network response time in milliseconds
Read more on metrics in Metrics, data points, and metric time series in Splunk Observability Cloud.
Use descriptive names
Metric names can have up to 256 characters. If the value is longer, the metric might be dropped.
Use names that help you identify what the metric is related to.
| Metric information | Example |
|---|---|
| Measurement description | |
| Measurement units | |
| Metric category | |
If you apply a calculation to the metric before you send it, use the calculation as part of the description. For example, if you calculate the ninety-fifth percentile of measurements and send the result in a metric, use p95 as part of the metric name.
On the other hand, some information is better suited for dimensions instead of metric names, such as the description of the hardware or software being measured. For example, don't use production1 in a metric name to indicate that the measurement is for a particular host. To learn more, see Type of information suitable for dimensions.
Use metric names to indicate metric types
Follow these best practices to use names to indicate different metric types:
- Give each metric its own name.
- When you define your own metric, give each metric a name that includes a reference to the metric type.
- Avoid assigning custom metric names that include dimensions. For example, if you have 100 server instances and you want to create a custom metric that tracks the number of disk writes for each one, differentiate between the instances with a dimension.
Create metric names using a hierarchical structure
Start at the highest level, then add more specific values as you proceed.
- Start with a domain or namespace that the metric belongs to, such as analytics or web.
- Next, add the entity that the metric measures, such as jobs or http.
- At your discretion, add intermediate names, such as errors.
- Finish with a unit of measurement. For example, the SignalFlow analytics service reports the following metrics:
  - analytics.jobs.total: Gauge metric that periodically measures the current number of executing jobs
  - analytics.thrift.execute.count: Counter metric that's incremented each time a new job starts
  - analytics.thrift.execute.time: Gauge metric that measures the time needed to process a job execution request
  - analytics.jobs_by_state: Counter metric with a dimension key called state, incremented each time a job reaches a particular state

In this example, all of these metrics have a dimension key called hostname with values such as analytics-1, analytics-2, and so forth. These metrics also have a customer dimension key with values org-x, org-y, and so on. The dimensions provide an infrastructure-focused or a customer-focused view of the analytics service usage. For more information on gauge metrics, see Identify metric types.
Dimension names and value standards
Dimensions are arbitrary key-value pairs you associate with metrics. While metrics identify a measurement, dimensions identify a specific aspect of the system that’s generating the measurement or characterizes the measurement. Use dimensions to:
- Classify different streams of data points for a metric.
- Simplify filtering and aggregation. For example, SignalFlow lets you filter and aggregate data streams by one or more dimensions.
Dimensions can be numeric or nonnumeric. Some dimensions, such as host name and value, come from a system you’re monitoring. You can also create your own dimensions.
Dimension key and value requirements
Dimension key names are UTF-8 strings with a maximum length of 128 characters (512 bytes). For example, if a dimension's key:value pair is ("mydim", "myvalue"), "mydim" is limited to 128 characters. Key names:

- Must start with an uppercase or lowercase letter. The rest of the name can contain letters, numbers, underscores (_), hyphens (-), and periods (.), but can't contain blank spaces.
- Must not start with the underscore character (_).
- Must not start with the prefix sf_, except for dimensions defined by Splunk Observability Cloud, such as sf_hires.
Dimension values are UTF-8 strings with a maximum length of 256 characters (1024 bytes). For example, if a dimension's key:value pair is ("mydim", "myvalue"), "myvalue" is limited to 256 characters. If the value is longer, the data point might be dropped. Numbers are represented as numeric strings.
You can have up to 36 dimensions per MTS. If this limit is exceeded, the data point is dropped, and a message is logged.
To ensure readability, keep names and values to 40 characters or less.
For example:

- "hostname": "production1"
- "region": "emea"
Considerations for metric and dimension names in your organization
Create consistent names for your organization:
- Use a single consistent delimiter in metric names. A consistent delimiter helps you search with wildcards. Use periods or underscores as delimiters; don't use colons or slashes.
- Avoid changing metric and dimension names. If you change a name, you have to update the charts and detectors that use the old name. Infrastructure Monitoring doesn't do this automatically.
- Since you're not the only person using the metric or dimension, use names that are easy to identify and understand. Follow established conventions. To find out the conventions in your organization, browse your metrics using the Metric Finder.
Guidelines for working with low and high cardinality data
Send low-cardinality data only in metric names or dimension key names. Low-cardinality data has a small number of distinct values. For example, the metric name web.http.error.count for a gauge metric that reports the number of HTTP request errors has a single value. This name is also readable and self-explanatory. For more information on gauge metrics, see Identify metric types.
High-cardinality data has a large number of distinct values. For example, timestamps are high-cardinality data. Only send this kind of high-cardinality data in dimension values. If you send high-cardinality data in metric names, Infrastructure Monitoring might not ingest the data. Infrastructure Monitoring rejects metrics with names that contain timestamps. High-cardinality data does have legitimate uses. For example, in containerized environments, container_id is usually a high-cardinality field. If you include container_id in a metric name such as system.cpu.utilization.<container_id>, instead of having one MTS, you have as many MTS as you have containers.
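The container_id example can be sketched by counting the distinct metric names each naming choice produces (hypothetical data):

```python
containers = [f"c{i}" for i in range(1000)]

# High-cardinality value embedded in the metric name: one metric
# name per container.
in_name = {f"system.cpu.utilization.{cid}" for cid in containers}

# The same value carried as a dimension: a single metric name.
# The MTS still split per container, but the name space stays
# small and searchable.
in_dimension = {("system.cpu.utilization", ("container_id", cid))
                for cid in containers}

print(len(in_name))                          # 1000 metric names
print(len({m for m, _ in in_dimension}))     # 1 metric name
```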
When to use metrics or dimensions
Use metrics when tracking different metric types
In Infrastructure Monitoring, all metrics belong to a specific metric type, with a specific default rollup. To learn more about metric types, see Metric types.
To track a measurable value using two different metric types, use two metrics instead of one metric with two dimensions.
For example, suppose you have a network_latency measurement that you want to send as two different metric types: a gauge metric (the average network latency in milliseconds) and a counter metric (the total number of latency values sent in an interval). In this case, send the measurement using two different metric names, such as network_latency.average and network_latency.count, instead of one metric name with two dimensions type:average and type:count.
Type of information suitable for dimensions
See some examples of the types of information you can add as dimensions:

- Categories rather than measurements: if doing an arithmetic operation on dimension values results in something meaningful, you don't have a dimension.
- Metadata for filtering, grouping, or aggregating.
- Name of the entity being measured: for example, hostname, production1.
- Metadata with a large number of possible values: use one dimension key for many different dimension values.
- Nonnumeric values: numeric dimension values are usually labels rather than measurements.
Example: Custom metrics and dimensions to measure HTTP errors
Let’s imagine you want to track the following data to oversee HTTP errors:
- Number of errors
- HTTP response code for each error
- Host that reported the error
- Service (app) that returned the error
Suppose you identify your data with a long metric name instead of a metric name and a dimension. For example, web.http.myhost.checkout.error.500.count might be a long metric name that represents the number of HTTP response code 500 errors reported by the host named myhost for the service checkout.
If you use web.http.myhost.checkout.error.500.count, you might encounter the following issues:
- To visualize this data in a Splunk Infrastructure Monitoring chart, you have to run a wildcard query with the syntax web.http.*.*.error.*.count.
- To sum up the errors by host, service, or error type, you have to change the query.
- You can't use filters or dashboard variables in your chart.
- You have to define a separate metric name to track HTTP 400 errors, or errors reported by other hosts, or errors reported by other services.
Instead, use dimensions to track the same data:
- Define a metric name that describes the measurement you want, which is the number of HTTP errors: web.http.error.count. The metric name includes the following:
  - web: your name for a family of metrics for web measurements
  - http.error: your name for the protocol you're measuring (http) and an aspect of the protocol (error)
  - count: the unit of measure
- Define dimensions that categorize the errors. The dimensions include the following:
  - host: the host that reported the error
  - service: the service that returned the error
  - error_type: the HTTP response code for the error

This way, to visualize the error data using a chart, you can search for "error count" to locate the metric by name. When you create the chart, you can filter and aggregate incoming metric time series by host, service, error_type, or all three. You can add a dashboard filter so that when you view the chart in a specific dashboard, you don't have to edit the chart itself.
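The filtering and aggregation that this dimension design enables can be sketched over a few hypothetical data points:

```python
from collections import Counter

# web.http.error.count data points, with the categories as dimensions.
points = [
    {"host": "myhost", "service": "checkout", "error_type": "500", "value": 3},
    {"host": "myhost", "service": "cart",     "error_type": "400", "value": 1},
    {"host": "web-2",  "service": "checkout", "error_type": "500", "value": 2},
]

# Sum errors grouped by any dimension, without changing the metric name.
by_service = Counter()
for p in points:
    by_service[p["service"]] += p["value"]
print(by_service["checkout"])  # 5

# Filter by error_type, the way a chart filter would.
errors_500 = sum(p["value"] for p in points if p["error_type"] == "500")
print(errors_500)  # 5
```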