Container monitoring and logging
The Splunk App for Data Science and Deep Learning (DSDL) leverages external containers for computationally intensive tasks. It is crucial to monitor these containers for debugging, operational awareness, seamless model development, and a stable production environment.
Learn about collecting logs, capturing performance metrics, automatically instrumenting containers with OpenTelemetry, and surfacing container health in the Splunk platform or Splunk Observability Cloud.
Overview
When you run the fit or apply commands in DSDL, a Docker, Kubernetes, or OpenShift container is spun up to run model training or inference.
Monitoring these containers lets you do the following to help inform container health:
- Collect container logs of stdout, stderr, and custom logs in the Splunk platform to debug errors or confirm job completion.
- Capture performance metrics like CPU, memory, and GPU usage in real time to diagnose slow jobs or resource bottlenecks.
- Enable OpenTelemetry instrumentation for container endpoints in Splunk Observability Cloud if toggled in the DSDL Observability settings.
- Ensure enterprise-level reliability by setting up dashboards, alerts, or autoscaling triggers based on these metrics.
DSDL includes the following logs and telemetry data to inform container health:
- Splunk _internal index logs about container management.
- Container logs of stdout and stderr, such as ML library messages and Python print output.
- Custom logs or metrics you send to Splunk HTTP Event Collector (HEC).
- OpenTelemetry data sent to Splunk Observability Cloud after you enable observability in the DSDL setup.
Container logs in the Splunk platform
The following container logs are provided in the Splunk platform.
MLTK container logs
MLTK container logs are generated when DSDL tries to start or stop a container, or encounters network issues. These logs are stored in the _internal index with "mltk-container" in the message. For example: index=_internal "mltk-container"
Automatic container logs with the Splunk platform REST API
After a container is successfully deployed, DSDL automatically collects logs through the Splunk REST API. Automatic container logs are useful for quick debugging or reviewing final outputs when a machine learning job is complete.
Note: If a container does not deploy successfully, check the _internal index logs instead.
You can view the logs by navigating in DSDL to Configuration, then Containers, and then selecting the container name. For example, __DEV__.
The selected container page shows the following details:
- Container Controls: The container image, cluster target, GPU runtime, and container mode of DEV or PROD.
- Container Details: Key-value pairs for api_url, runtime, mode, and others.
- Container Logs: A table or search result similar to the following:
| rest splunk_server=local services/mltk-container/logs/<container_name> | eval _time = strptime(_time,"%Y-%m-%dT%H:%M:%S.%9N") | sort - _time | fields - splunk_server
Note: The Container Logs section surfaces the real-time logs captured from the container after it is fully deployed.
Custom Python logging
If your notebook code logs with print(...) or with the Python logging module, the output is captured in the container's stdout or stderr streams.
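For example, the following minimal sketch shows notebook code that writes to both streams. The logger name and messages are illustrative only; any standard print or logging call is captured the same way.
import logging
import sys

# Route log records to stderr and keep the level moderate so container
# logs stay manageable.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger("dsdl_notebook")  # illustrative logger name

def fit(model, df, param):
    print("starting training")                    # captured from stdout
    logger.info("training on %s rows", len(df))   # captured from stderr
    # ... training code ...
    return {"message": "model trained"}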
Resource metrics
Tracking resource usage can help you identify if your training jobs are hitting resource bottlenecks or if you need better scheduling or bigger node types.
See the following table for what metrics are available by container provider:
Container | Description |
---|---|
Docker | Use docker stats for ephemeral checks, or use a cAdvisor-based approach to forward metrics into the Splunk platform or Splunk Observability Cloud. |
Kubernetes or OpenShift | Use the Kubernetes metrics API or Splunk Connect for Kubernetes for CPU, memory, and node metrics. Note: GPU usage requires the NVIDIA device plugin or GPU operator. |
You can also set up alerts for abnormal usage patterns or container crash loops to improve resource reliability.
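For Docker, the following minimal sketch shows one way to pull a point-in-time CPU and memory snapshot for a running DSDL container with the Docker SDK for Python. The container name and the idea of forwarding the values to HEC are assumptions for illustration, not DSDL defaults.
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# Hypothetical container name; look up the real name with "docker ps".
container = client.containers.get("mltk-container-dev")

# stream=False returns a single stats snapshot instead of a generator.
stats = container.stats(stream=False)

memory_bytes = stats["memory_stats"].get("usage")              # key names can vary by cgroup version
cpu_total_ns = stats["cpu_stats"]["cpu_usage"]["total_usage"]  # cumulative CPU time in nanoseconds

print({"memory_bytes": memory_bytes, "cpu_total_ns": cpu_total_ns})
# From here you could forward the values to Splunk HEC, as described later
# in this topic, or rely on a cAdvisor-based pipeline instead.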
Automatic OpenTelemetry instrumentation
OpenTelemetry instrumentation can provide advanced insights into container endpoint usage, request durations, and data flow for HPC or microservices-based machine learning pipelines.
DSDL can automatically instrument container endpoints with OpenTelemetry after you turn on observability:
- In DSDL, go to Setup, and then Observability Settings.
- Select Yes to turn on Observability.
- Complete the required fields:
  - Splunk Observability Access Token: Add your Observability ingest token.
  - Open Telemetry Endpoint: Set your endpoint. For example, https://ingest.eu0.signalfx.com.
  - Open Telemetry Servicename: Enter the service name you want to use. For example, dsdl.
- Save your changes.
Upon completion, all container endpoints, including training and inference calls, generate Otel traces. These traces are automatically stored in Splunk Observability Cloud for deeper analysis, including request latency and container-level CPU and memory correlation.
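DSDL handles the endpoint instrumentation itself, but if you also want custom steps from your notebook code to appear as traces, the following minimal sketch uses the standard OpenTelemetry Python SDK (opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http) to send spans to an OTLP endpoint. The endpoint URL, service name, and span names here are placeholders, not values that DSDL configures for you.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Point the exporter at your OpenTelemetry endpoint, for example a local
# OpenTelemetry Collector that forwards to Splunk Observability Cloud.
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")

provider = TracerProvider(resource=Resource.create({"service.name": "dsdl"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("dsdl.notebook")

def fit(model, df, param):
    # Wrap a custom step in a span so its duration shows up as a trace.
    with tracer.start_as_current_span("custom_training_step") as span:
        span.set_attribute("rows", len(df))
        # ... training code ...
    return {"message": "model trained"}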
Sending model or training logs to the Splunk platform
Review the following options for sending model or training logs to the Splunk platform as a DSDL user.
Splunk HEC in DSDL
The Splunk HTTP Event Collector (HEC) option in DSDL lets you view partial results or step-by-step logs. You can combine HEC with container logs for full details.
In DSDL, navigate to the Setup page and provide your HEC token in the Splunk HEC Settings panel. Save your changes.
In your notebook, use the following code:
from dsdlsupport import SplunkHEC

# SplunkHEC uses the HEC settings configured on the DSDL Setup page.
hec = SplunkHEC.SplunkHEC()
# Send a single event with an explicit epoch timestamp.
hec.send({'event': {'message': 'operation done'}, 'time': 1692812611})
Logging epoch metrics
You can use epoch metrics to visualize model training progression in near real time.
See the following example of how to view epoch metrics:
def fit(model, df, param):
    for epoch in range(10):
        # ... one training epoch ...
        hec.send({
            'event': {
                'epoch': epoch,
                'loss': 0.1234,  # replace with the loss computed for this epoch
                'status': 'in_progress'
            }
        })
    return {"message": "model trained"}
In the Splunk platform, you can use a search like the following to view the epoch metrics:
index=ml_logs status=in_progress
| timechart avg(loss) by epoch
Example workflow
The following is an example workflow for monitoring container health:
- You set up the Docker or Kubernetes environment with Splunk Connect or a Splunk Observability Cloud agent.
- You launch a container:
| fit MLTKContainer algo=...
- After the container launches, DSDL automatically collects logs. In DSDL, you go to Configuration, then Containers, then <container_name> to see container details and logs.
- If Splunk Observability Cloud is enabled, container endpoints generate Otel traces. CPU and memory metrics flow to Splunk Observability Cloud.
- If DSDL calls hec.send(...), partial training logs appear in the Splunk platform.
- All data, including logs, traces, and metrics, correlates for a 360-degree view.
Container monitoring guidelines
Consider the following guidelines when implementing container monitoring:
- Limit log verbosity. HPC tasks can produce large logs. Use moderate logging levels.
- Check the _internal index. For container startup or firewall issues, search index=_internal "mltk-container".
- Secure observability. Use transport layer security (TLS) for container endpoints, and secure tokens for Splunk Observability Cloud.
- Combine with container management. For concurrency, GPU usage, or development or production containers, see Container management and scaling.
Troubleshooting container monitoring
See the following issues you might experience with container monitoring and how to resolve them:
Issue | How to troubleshoot |
---|---|
Container fails to launch | Likely caused by Docker or Kubernetes being unreachable, or a firewall setting blocking the management port. Check the _internal index for "mltk-container" messages. |
Observability is toggled on, but no Otel traces appear | Likely caused by an incorrect Splunk Observability Cloud token or endpoint, or the container configuration is not updated. |
High-performance computing (HPC) tasks with large logs are flooding your Splunk platform instance | Likely caused by overly verbose training prints or debug mode. Switch to a more moderate logging level. |
GPU usage not recognized in Splunk Observability Cloud dashboards | Likely caused by a missing NVIDIA device plugin or GPU operator in Kubernetes or OpenShift. Check node labeling, device plugin logs, and GPU operator deployment. |
HEC events missing in the Splunk platform | Likely caused by the wrong HEC token, disabled HEC, or an endpoint mismatch. Check the HEC settings on the DSDL Setup page and confirm HEC is enabled. |