About metrics.log
This topic is an overview of metrics.log.
- To learn about other log files, read "What Splunk logs about itself."
- For an example using metrics.log, read "Troubleshoot inputs with metrics.log."
What information is in metrics.log?
Metrics.log has a variety of introspection information for reviewing product behavior.
First, metrics.log is a periodic report, taken every 30 seconds or so, of recent Splunk software activity. 
By default, metrics.log reports the top 10 results for each type. You can change that number of series from the default by editing the value of maxseries in the [metrics] stanza in limits.conf. 
Structure of the lines
Here is a sample line from metrics.log:
 01-27-2010 15:43:54.913 INFO Metrics - group=pipeline, name=parsing, processor=utf8, cpu_seconds=0.000000, executes=66, cumulative_hits=301958
            
First, boiler plate: the timestamp, the "severity," which is always INFO for metrics events, and then the kind of event, "Metrics." 
The next field is the group. This indicates what kind of metrics data it is. There are a few groups in the file, including: 
- pipeline
- queue
- thruput
- tcpout_connections
- udpin_connections
- mpool
- per_host_regex_cpu
- per_index_regex_cpu
- per_source_regex_cpu
- per_sourcetype_regex_cpu
Pipeline messages
Pipeline messages are reports on the Splunk pipelines, which are the strung-together pieces of "machinery" that process and manipulate events flowing into and out of the Splunk system. You can see how many times data reached a given machine in the Splunk system (executes), and you can see how much cpu time each machine used (cpu_seconds). 
Plotting totals of cpu seconds by processor can show you where the cpu time is going in indexing activity. Looking at numbers for executes can give you an idea of data flow. For example if you see:
 group=pipeline, name=merging, processor=aggregator, ..., executes=998
 group=pipeline, name=merging, processor=readerin, ... , executes=698
 group=pipeline, name=merging, processor=regexreplacement, ..., executes=698
 group=pipeline, name=merging, processor=sendout, ... , executes=698
then it's pretty clear that a large portion of your items aren't making it past the aggregator. This might indicate that many of your events are multiline and are being combined in the aggregator before being passed along.
Read more about Splunk's data pipeline in "How data moves through Splunk" in the Distributed Deployment Manual.
Queue messages
Queue messages look like
  ... group=queue, name=parsingqueue, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
            
Most of these values are not interesting. But current_size, especially considered in aggregate, across events, can tell you which portions of Splunk indexing are the bottlenecks. If current_size remains near zero, then probably the indexing system is not being taxed in any way. If the queues remain near 1000, then more data is being fed into the system (at the time) than it can process in total. 
Sometimes you will see messages such as  ... group=queue, name=parsingqueue, blocked!!=true, max_size=1000, filled_count=0, empty_count=8, current_size=0, largest_size=2, smallest_size=0
            
This message contains the blocked string, indicating that it was full, and someone tried to add more, and couldn't. A queue becomes unblocked as soon as the code pulling items out of it pulls an item. Many blocked queue messages in a sequence indicate that data is not flowing at all for some reason. A few scattered blocked messages indicate that flow control is operating, and is normal for a busy indexer. 
If you want to look at the queue data in aggregate, graphing the average of current_size is probably a good starting point. 
There are queues in place for data going into the parsing pipeline, and for data between parsing and indexing. Each networking output also has its own queue, which can be useful to determine whether the data is able to be sent promptly, or alternatively whether there's some network or receiving system limitation.
Generally, filled_count and empty_count cannot be productively used for inferences. 
Thruput messages
Thruput messages (similar to the English word "throughput") come in a few varieties.
Thruput is measured in the indexing pipeline. If your data is not reaching this pipeline for some reason, it will not appear in this data. Thruput numbers relate to the size of "raw" of the items flowing through the system, which is typically the chunks of the original text from the log sources. This differs from the tcpout measurements, which measure the total byte count written to sockets, including protocol overhead as well as descriptive information like host, source, source type. 
First there is a catchall line, which looks like this:
... group=thruput, name=index_thruput, instantaneous_kbps=0.287598, instantaneous_eps=1.000000, average_kbps=0.270838, total_k_processed=74197, load_average=1.345703M This is the best line to look at when tuning performance or evaluating indexing load. It tries to capture the total indexing data load.
Note: In thruput lingo, kbps does not mean kilobits per second, it means kilobytes per second. The industry standard term would be to write this something like KBps. 
The average_kbps in the group=thruput catchall indicates the average since Splunk Enterprise started. 
instantaneous_kbps indicates the average kbps for the reporting interval (equivalent to kbps for the breakouts below.) 
The most useful figure to look at in aggregate is probably instantaneous_kbps over time. 
Following the catchall, there can be variety of breakouts of the indexing thruput, including lines like:
... group=per_host_thruput, series="jombook.splunk.com", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_index_thruput, series="_internal", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_source_thruput, series="/applications/splunk4/var/log/splunk/metrics.log", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
... group=per_sourcetype_thruput, series="splunkd", kbps=0.261530, eps=1.774194, kb=8.107422, ev=2606, avg_age=420232.710668, max_age=420241
In thruput messages the data load is broken out by host, index, source, and source type. This can be useful for answering two questions:
- Which data categories are busy?
- When were my data categories busy?
The series value identifies the host or index, etc. The kb value indicates the number of kilobytes processed since the last sample. Graphing kb in aggregate can be informative. The summary indexing status dashboard uses this data, for example.
- evis a simple total count of events during the sampling period.
- kbps, as before, indicates kilobytes per second averaged over the sampling period.
The avg_age and max_age refer to the difference between the time that the event was seen by the thruput processor in the indexing queue, and the time when the event occurred (or more accurately, the time that Splunk decided the event occurred). 
- max_ageis the largest difference, in seconds, between the current time and the perceived time of the events coming through the thruput processor.
- avg_ageis the average difference, in seconds, between the current time and the perceived time of the events coming through the thruput processor.
Note: The per_x_thruput categories are not complete. Remember that by default metrics.log shows the 10 busiest of each type, for each sampling window. If you have 2000 active forwarders, you cannot expect to see the majority of them in this data. You can adjust the sampling quantity, but this will increase the chattiness of metrics.log and the resulting indexing load and _internal index size. The sampling quantity is adjustable in limits.conf, [metrics] maxseries = num. 
Ignore the eps value, as it has accuracy issues. 
Tcpout Connections messages
Tcpout connections messages provide a variety of information about the systems that the tcpout component is currently connected to, via an open socket. One line is produced per connected system.
These lines look like this:
   ... group=tcpout_connections, name=undiag_indexers:10.159.4.67:9997:0, sourcePort=11089, 
   destIp=10.159.4.67, destPort=9997, _tcp_Bps=28339066.17, _tcp_KBps=27674.87,
   _tcp_avg_thruput=27674.87, _tcp_Kprocessed=802571, _tcp_eps=33161.10, kb=802571.21
- nameis a combination of conf stanza and entries that fully define the target system to the software.
- sourcePortis the dynamically assigned port by the operating system for this socket.
- destIPand- destPortdefine the destination of the socket.- destPortis typically statically configured, while- destIPmay be configured or returned by name resolution from the network.
All of the size-related fields are based on the number of bytes that Splunk has successfully written to the socket, or SSL-provided socket proxy. When SSL is not enabled for forwarding (the default), these numbers represent the number of bytes written to the socket, so effectively the number of bytes conveyed by the tcp transport (irrespective of overhead issues such as keepalive, tcp headers, ip headers, ethernet frames and so on). When SSL is enabled for forwarding, this number represents the number of bytes Splunk handed off to the OpenSSL layer.
- _tcp_Bpsis the bytes transmitted during the metrics interval divided by the duration of the interval (in seconds)
- _tcp_KBpsis the same value divided by 1024
- _tcp_avg_thruputis an average rate of bytes sent since the last time the tcp output processor was reinitialized/reconfigured. Typically this means an average since Splunk started.
- _tcp_KProcessedis the total number of bytes written since the processor was reinitialized/reconfigured, divided by 1024.
- _tcp_epsis the number of items transmitted during the interval divided by the direction of the interval (in seconds). Note that items will frequently not be events for universal/light forwarders (instead, data chunks)
- kbis the bytes transmitted during the metrics interval divided by 1024.
udpin messages
udpin_connections lines are essentially metering on udp input.
group=udpin_connections, 2514, sourcePort=2514, _udp_bps=0.00, _udp_kbps=0.00,
_udp_avg_thruput=0.00, _udp_kprocessed=0.00, _udp_eps=0.00
Some should be relatively self-explanatory.
- bps: bytes per second
- kbps: kilobytes per specond
- eps: events (packets) per second
- _udp_kprocessed is a running total of kilobytes processed since udp input processor start (typically since splunk start, but if reconfigured may reset).
Don't have info on avg_thruput at this time.
Be aware that it's quite achievable to max out the ability of the operating system, let alone Splunk, to handle UDP packets at high rates. This data might be useful to determine if any data is coming in at all, and at what times it rises. There is no guarantee that all packets sent to this port will be received and thus metered.
mpool messages
The mpool lines represent memory used by the Splunk indexer code only (not any other pipeline components). This information is probably not useful to anyone other than Splunk developers.
group=mpool, max_used_interval=4557, max_used=53878, avg_rsv=180, capacity=268435456, used=0
- max_used_intervalrepresents the number of bytes used during the reporting interval (since the last output).
- max_usedrepresents the maximum amount of memory, in bytes, in use at any time during the component's lifetime (most likely since last starting Splunk).
- avg_rsvis the average size of a memory allocation across the run time of the system.
- capacityis the limit on memory use for the indexer.
- usedis the current indexer's current memory use.
In this case we can see that some memory is sometimes in use, although at the time of the sample, none is in use, and that generally the use is low.
map, pipelineinputchannel name messages
These messages are primarily debugging information over the Splunk internal cache of processing state and configuration data for a given data stream (host, source, or source type).
group=map, name=pipelineinputchannel, current_size=29, inactive_channels=4,
new_channels=0, removed_channels=0, reclaimed_channels=0, timedout_channels=0,
abandoned_channels=0
- current_sizeis the number of total channels loaded in the system at the end of the sampling period.
- inactive_channelsis the number of channels that have no entries in any pipeline referring to them (typically for recently seen data but not for data currently being processed) at the end of the sampling period.
- new_channelsis the number of channels created during the sampling period, meaning that new data streams arrived, or a data stream that was aged out was created again.
- removed_channelsis the number of channels destructed during the sampling period, which means that enough pressure existed to push these channels out of set (there were too many other new data streams).
- reclaimed_channelsis the number of channels that were repurposed during the sampling period. This will happen for reasons similar to- new_channels, based on the size of the utilization, etc.
- timedout_channelsis the number of channels that became unused for a long enough time to be considered stale and the information to be culled. Typically a timedout file or data source hasn't been producing data for some time.
- abandoned_channelsis the number of channels that were terminated by Splunk forwarding, where the forwarder stopped communicating to an indexer, so the indexer shut them down.
subtask_seconds messages
Currently there is only one type of these messages, name=indexer, task=indexer_service. The message describes how the indexer has spent time over each interval window. 
group=subtask_seconds, name=indexer, task=indexer_service, replicate_semislice=0.000000, throttle_optimize=0.00015,  flushBlockSig=0.000000, retryMove_1hotBkt=0.000000, size_hotBkt=0.000000, roll_hotBkt=0.000000, chillOrFreeze=0.000000, update_checksums=0.000000, fork_recovermetadata=0.000000, rebuild_metadata=0.000300, update_bktManifest=0.000000, service_volumes=0.000105, update_bktManifest=0.000000, service_volumes=0.000105, service_maxSizes=0.000000, service_externProc=0.000645- The throttle_optimizesubtask represents time that the indexer spends waiting forsplunk_optimizeprocesses to reduce the count of.tsidxfiles to a reasonable level within hot buckets. Becausesplunk_optimizecan in some cases run more slowly merging.tsidxfiles than the indexer runs while generating them, this flow-control state must exist forsplunk_optimizeto catch up. When this throttling occurs, messages are logged tosplunkd.login categoryDatabasePartitionPolicysimilar toidx=<idxname> Throttling indexer, too many tsidx files in bucket.
- The rebuild_metadatasubtask represents time that the indexer spends generating new copies of.datafiles in hot buckets. This can mean writing outHosts.data,Sourcetypes.data,Sources.data, orStrings.data. Typically the largest cost is forStrings.datain configurations where a large amount of data is directed to this file. Setting theMetaDatalog channel to DEBUG can provide more information on which buckets and files might be involved in a slowdown.
regular expression (regex) messages
These messages represent the amount of CPU time spent processing regular expressions on events, in seconds. The group field will identify if these metrics are based on host, index, source, or sourcetype. The CPU time metrics for the regex processor are not enabled by default. See the configuration regex_cpu_profiling in limits.conf in the Admin Manual. 
Sample Event:
 09-06-2018 20:00:54.667 +0000 INFO Metrics - group=per_sourcetype_regex_cpu, series="audittrail", cpu=428, cpupe=0.9749430523917996, bytes=138929, ev=439
            
Valid group names are:
-  per_host_regex_cpu
-  per_index_regex_cpu
-  per_source_regex_cpu
-  per_sourcetype_regex_cpu
For each interval, the following fields are produced.
- cpurepresents the CPU time in seconds for a given series.
- cpuperepresents the CPU time in seconds per event for a given series.
- bytesrepresents the number of bytes processes for a given series.
- evrepresents total events for a given series.
The default interval for each measurement is 30 seconds.