Container monitoring and logging
The Splunk App for Data Science and Deep Learning (DSDL) leverages external containers for computationally intensive tasks. It is crucial to monitor these containers for debugging, operational awareness, seamless model development, and a stable production environment.
Learn about collecting logs, capturing performance metrics, automatically instrumenting containers with OpenTelemetry, and surfacing container health in the Splunk platform or Splunk Observability Cloud.
Overview
When you run the fit or apply commands in DSDL, a Docker, Kubernetes, or OpenShift container is spun up to run model training or inference.
Monitoring these containers lets you do the following to help inform container health:
- Collect container logs of stdout, stderr, and custom logs in the Splunk platform to debug errors or confirm job completion.
- Capture performance metrics like CPU, memory, and GPU usage in real time to diagnose slow jobs or resource bottlenecks.
- Enable OpenTelemetry instrumentation for container endpoints in Splunk Observability Cloud if toggled in the DSDL Observability settings.
- Ensure enterprise-level reliability by setting up dashboards, alerts, or autoscaling triggers based on these metrics.
DSDL includes the following logs and telemetry data to inform container health:
- Splunk _internal index logs about container management.
- Container logs of stdout and stderr, such as ML library messages and Python print output.
- Custom logs or metrics you send to Splunk HTTP Event Collector (HEC).
- OpenTelemetry data sent to Splunk Observability Cloud after you enable observability in the DSDL setup.
Container logs in the Splunk platform
The following container logs are provided in the Splunk platform.
MLTK container logs
MLTK container logs are generated when DSDL tries to start or stop a container, or encounters network issues. These logs are stored in the _internal index with "mltk-container" in the message. For example: index=_internal "mltk-container"
Automatic container logs with the Splunk platform REST API
After a container is successfully deployed, DSDL automatically collects logs through the Splunk REST API. Automatic container logs are useful for quick debugging or reviewing final outputs when a machine learning job is complete.
Note: If a container does not deploy successfully, check the _internal index logs instead.
You can view the logs by navigating in DSDL to Configuration, then Containers, and then selecting the container name. For example, __DEV__.
The selected container page shows the following details:
- Container Controls: The container image, cluster target, GPU runtime, and container mode of DEV or PROD.
- Container Details: Key-value pairs for api_url, runtime, mode, and others.
- Container Logs: A table or search result similar to the following:
| rest splunk_server=local services/mltk-container/logs/<container_name> | eval _time = strptime(_time,"%Y-%m-%dT%H:%M:%S.%9N") | sort - _time | fields - splunk_server
Note: The Container Logs section surfaces the real-time logs captured from the container after it is fully deployed.
Custom Python logging
If your notebook code logs with print(...) or with the Python logging module, the output is captured in the container's stdout or stderr streams.
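For example, the following minimal sketch shows notebook code that writes to both streams. The logger name and messages are illustrative only; any standard print or logging call is captured the same way.
import logging
import sys

# Route log records to stderr and keep the level moderate so container
# logs stay manageable.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("dsdl_notebook")  # illustrative logger name

def fit(model, df, param):
    print("starting training")                    # captured from stdout
    logger.info("training on %s rows", len(df))   # captured from stderr
    # ... training code ...
    return {"message": "model trained"}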
Resource metrics
Tracking resource usage can help you identify if your training jobs are hitting resource bottlenecks or if you need better scheduling or bigger node types.
See the following table for what metrics are available by container provider:
Container | Description |
---|---|
Docker | Use docker stats for ephemeral checks, or use a cAdvisor-based approach to forward metrics into the Splunk platform or Splunk Observability Cloud. |
Kubernetes or OpenShift | Use the Kubernetes metrics API or Splunk Connect for Kubernetes for CPU, memory, and node metrics. Note: GPU usage requires the NVIDIA device plugin or GPU operator. |
You can also set up alerts for abnormal usage patterns or container crash loops to improve resource reliability.
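For Docker, the following minimal sketch shows one way to pull a point-in-time CPU and memory snapshot for a running DSDL container with the Docker SDK for Python. The container name and the idea of forwarding the values to HEC are assumptions for illustration, not DSDL defaults.
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# Hypothetical container name; look up the real name with "docker ps".
container = client.containers.get("mltk-container-dev")

# stream=False returns a single stats snapshot instead of a generator.
stats = container.stats(stream=False)

memory_bytes = stats["memory_stats"].get("usage")              # key names can vary by cgroup version
cpu_total_ns = stats["cpu_stats"]["cpu_usage"]["total_usage"]  # cumulative CPU time in nanoseconds

print({"memory_bytes": memory_bytes, "cpu_total_ns": cpu_total_ns})
# From here you could forward the values to Splunk HEC, as described later
# in this topic, or rely on a cAdvisor-based pipeline instead.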
Automatic OpenTelemetry instrumentation
OpenTelemetry instrumentation can provide advanced insights into container endpoint usage, request durations, and data flow for HPC or microservices-based machine learning pipelines.
DSDL can automatically instrument container endpoints with OpenTelemetry after you turn on observability:
- In DSDL, go to Setup, and then Observability Settings.
- Select Yes to turn on Observability.
- Complete the required fields:
  - Splunk Observability Access Token: Add your Observability ingest token.
  - Open Telemetry Endpoint: Set your endpoint. For example, https://ingest.eu0.signalfx.com.
  - Open Telemetry Servicename: Enter the service name you want to use. For example, dsdl.
- Save your changes.
Upon completion, all container endpoints, including training and inference calls, generate Otel traces. These traces are automatically stored in Splunk Observability Cloud for deeper analysis, including request latency and container-level CPU and memory correlation.
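DSDL handles the endpoint instrumentation itself, but if you also want custom steps from your notebook code to appear as traces, the following minimal sketch uses the standard OpenTelemetry Python SDK (opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http) to send spans to an OTLP endpoint. The endpoint URL, service name, and span names here are placeholders, not values that DSDL configures for you.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Point the exporter at your OpenTelemetry endpoint, for example a local
# OpenTelemetry Collector that forwards to Splunk Observability Cloud.
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")

provider = TracerProvider(resource=Resource.create({"service.name": "dsdl"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("dsdl.notebook")

def fit(model, df, param):
    # Wrap a custom step in a span so its duration shows up as a trace.
    with tracer.start_as_current_span("custom_training_step") as span:
        span.set_attribute("rows", len(df))
        # ... training code ...
    return {"message": "model trained"}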
Sending model or training logs to the Splunk platform
Review the following options for sending model or training logs to the Splunk platform as a DSDL user.
Splunk HEC in DSDL
The Splunk HTTP Event Collector (HEC) option in DSDL lets you view partial results or step-by-step logs. You can combine HEC with container logs for full details.
In DSDL, navigate to the Setup page and provide your HEC token in the Splunk HEC Settings panel. Save your changes.
In your notebook, use the following code:
from dsdlsupport import SplunkHEC

# SplunkHEC uses the HEC settings configured on the DSDL Setup page.
hec = SplunkHEC.SplunkHEC()
# Send a single event with an explicit epoch timestamp.
hec.send({'event': {'message': 'operation done'}, 'time': 1692812611})
Logging epoch metrics
You can use epoch metrics to visualize model training progression in near real time.
See the following example of how to view epoch metrics:
def fit(model, df, param):
    for epoch in range(10):
        # ... one training epoch ...
        hec.send({
            'event': {
                'epoch': epoch,
                'loss': 0.1234,  # replace with the loss computed for this epoch
                'status': 'in_progress'
            }
        })
    return {"message": "model trained"}
In the Splunk platform, you can use a search like the following to view the epoch metrics:
index=ml_logs status=in_progress
| timechart avg(loss) by epoch
Example workflow
The following is an example workflow for monitoring container health:
- You set up the Docker or Kubernetes environment with Splunk Connect or a Splunk Observability Cloud agent.
- You launch a container:
| fit MLTKContainer algo=...
- After the container launches, DSDL automatically collects logs. In DSDL, you go to Configuration, then Containers, then <container_name> to see container details and logs.
- If Splunk Observability Cloud is enabled, container endpoints generate Otel traces. CPU and memory metrics flow to Splunk Observability Cloud.
- If DSDL calls hec.send(...), partial training logs appear in the Splunk platform.
- All data, including logs, traces, and metrics, correlates for a 360-degree view.
Container monitoring guidelines
Consider the following guidelines when implementing container monitoring:
- Limit log verbosity. HPC tasks can produce large logs. Use moderate logging levels.
- Check the _internal index. For container startup or firewall issues, search index=_internal "mltk-container".
- Secure observability. Use transport layer security (TLS) for container endpoints, and secure tokens for Splunk Observability Cloud.
- Combine with container management. For concurrency, GPU usage, or development or production containers, see Container management and scaling.
Troubleshooting container monitoring
See the following issues you might experience with container monitoring and how to resolve them:
Issue | How to troubleshoot |
---|---|
Container fails to launch | Likely caused by Docker or Kubernetes being unreachable, or a firewall setting blocking the management port. Check the _internal index for "mltk-container" messages. |
Observability is toggled on, but no Otel traces appear | Likely caused by an incorrect Splunk Observability Cloud token or endpoint, or the container configuration is not updated. |
High-performance computing (HPC) tasks with large logs are flooding your Splunk platform instance | Likely caused by overly verbose training prints or debug mode. Switch to a more moderate logging level. |
GPU usage not recognized in Splunk Observability Cloud dashboards | Likely caused by a missing NVIDIA device plugin or GPU operator in Kubernetes or OpenShift. Check node labeling, device plugin logs, and GPU operator deployment. |
HEC events missing in the Splunk platform | Likely caused by the wrong HEC token, disabled HEC, or an endpoint mismatch. Check the HEC settings on the DSDL Setup page and confirm HEC is enabled. |