Apache Spark receiver
The Apache Spark receiver fetches metrics for an Apache Spark cluster through the Apache Spark REST API.
The Apache Spark receiver monitors Apache Spark clusters and the applications running on them through the collection of performance metrics like memory utilization, CPU utilization, shuffle operations, and more. The supported pipeline type is metrics. See Process your data with pipelines for more information.
The receiver retrieves metrics through the Apache Spark REST API using the following endpoints: /metrics/json, /api/v1/applications/[app-id]/stages, /api/v1/applications/[app-id]/executors, and /api/v1/applications/[app-id]/jobs.
Prerequisites
This receiver supports Apache Spark versions 3.3.2 or higher.
Deploy the collector
See Deploy the Splunk Distribution of the OpenTelemetry Collector.
Configure the receiver
To activate the Apache Spark receiver, add apachespark to the receivers section of your configuration file:
receivers:
  apachespark:
    collection_interval: 60s
    endpoint: http://localhost:4040
    application_names:
      - PythonStatusAPIDemo
      - PythonLR
To complete the configuration, include the receiver in the metrics pipeline of the service section of your configuration file:
service:
  pipelines:
    metrics:
      receivers: [apachespark]
Advanced configurations
Activate or deactivate specific metrics
You can activate or deactivate specific metrics by setting the enabled field in the metrics section for each metric. For example:
YAML
receivers:
  samplereceiver:
    metrics:
      metric-one:
        enabled: true
      metric-two:
        enabled: false
The following is an example of host metrics receiver configuration with activated metrics:
YAML
receivers:
  hostmetrics:
    scrapers:
      process:
        metrics:
          process.cpu.utilization:
            enabled: true
Note: Deactivated metrics aren’t sent to Splunk Observability Cloud.
Billing
- If you’re in an MTS-based subscription, all metrics count towards metrics usage.
- If you’re in a host-based plan, metrics listed as active (Active: Yes) in this document are considered default and are included free of charge.
Learn more at Infrastructure Monitoring subscription usage (Host and metric plans).
Restart the collector
The restart command varies depending on what platform you deployed the collector on and what tool you used to deploy it. Here are general examples of the restart command:
- Linux
BASH
sudo systemctl restart splunk-otel-collector
- Windows
Windows with installer script:
POWERSHELL
stop-service splunk-otel-collector
start-service splunk-otel-collector
- Kubernetes
BASH
helm upgrade your-splunk-otel-collector splunk-otel-collector-chart/splunk-otel-collector -f your-override-values.yaml
where splunk-otel-collector-chart is the name you gave to the Helm chart in the helm repo add command.
Settings reference
The following settings are optional:
- collection_interval. 60s by default. Sets the interval at which this receiver collects metrics.
  - This value must be a string readable by Golang’s time.ParseDuration. Learn more at Go’s official documentation at https://pkg.go.dev/time#ParseDuration.
  - Valid time units are ns, us (or µs), ms, s, m, h.
- initial_delay. 1s by default. Determines how long this receiver waits before collecting metrics for the first time.
- endpoint. http://localhost:4040 by default. Apache Spark endpoint to connect to in the form of [http][://]{host}[:{port}].
- application_names. An array of Spark application names for which metrics are collected. If no application names are specified, metrics are collected for all Spark applications running on the cluster at the specified endpoint.
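As an illustration, the four optional settings described in this section could be combined in a single receiver block, as in the following sketch. The 30s interval and 5s delay are arbitrary example values chosen for illustration, not recommendations; the endpoint is the documented default, and the application names are the ones used in the configuration example earlier in this document.

```yaml
receivers:
  apachespark:
    # Collect every 30 seconds instead of the 60s default (example value).
    collection_interval: 30s
    # Wait 5 seconds before the first collection instead of the 1s default (example value).
    initial_delay: 5s
    # Documented default endpoint, written out explicitly.
    endpoint: http://localhost:4040
    # Limit collection to these applications; omit this key to collect for all applications.
    application_names:
      - PythonStatusAPIDemo
      - PythonLR
```

Both duration values use units accepted by time.ParseDuration, as the collection_interval setting requires.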
The full list of settings exposed for this receiver is documented in the Apache Spark receiver config repo in GitHub.
Metrics reference
The following metrics, resource attributes, and attributes are available.
The list is included from the following metadata file:
https://raw.githubusercontent.com/splunk/collector-config-tools/main/metric-metadata/apachesparkreceiver.yaml