Advanced HPC and GPU usage
The Splunk App for Data Science and Deep Learning (DSDL) integrates with containerized machine learning environments. DSDL supports graphics processing unit (GPU) acceleration, large-scale high-performance computing (HPC) clusters, and distributed training in Docker, Kubernetes, or OpenShift.
Learn how to optimize HPC workflows, including multi-GPU usage, node labeling, ephemeral volumes, and typical HPC environment considerations.
Overview
When you run the fit or apply commands with DSDL, the external container environment can tap into the following high-performance computing resources:
- GPUs for deep learning acceleration.
- Multi-node HPC clusters for distributed training or parallel inference.
- Custom scheduling policies to handle concurrency.
DSDL provides Splunk platform-based orchestration of data flows: HPC nodes perform the heavy machine learning tasks while the Splunk platform manages search, scheduling, and logs. DSDL also offers container images with advanced libraries such as TensorFlow and PyTorch, with optional GPU support.
With DSDL you can iterate in development containers, while production HPC containers run with minimal overhead.
Advanced HPC and GPU requirements
You must meet the following requirements to use advanced HPC and GPU in DSDL:
| Requirement | Details |
|---|---|
| HPC cluster | Kubernetes or OpenShift typically orchestrates HPC resources. Alternatively, single-host Docker HPC can run multi-GPU servers. For multi-node HPC with Slurm or other schedulers, you can wrap the container environment inside those HPC job scripts. |
| GPU drivers and libraries | NVIDIA GPU nodes typically need the NVIDIA device plugin in Kubernetes or OpenShift. On Docker single-host environments, you must install the NVIDIA drivers and the NVIDIA Container Toolkit so that containers can access the GPUs. |
| HPC network and storage | HPC nodes typically use high-bandwidth interconnects such as InfiniBand or 10/40/100 GbE. DSDL containers can rely on ephemeral volumes for short-term data staging, but HPC contexts often need persistent or parallel file systems, such as NFS or GlusterFS, as shown in the sketch following this table. DSDL continues to sync notebooks and models to the Splunk platform, mitigating ephemeral volume losses. |
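The following is a minimal sketch of declaring persistent storage for HPC data staging in Kubernetes. The claim name, storage class name (nfs-client), and size are assumptions; substitute values that match your cluster's NFS or parallel file system.

```yaml
# Hypothetical PersistentVolumeClaim for staging training data on an
# NFS-backed storage class. Names and sizes are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dsdl-hpc-staging
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi
```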
Multi-GPU and node labeling
See the following for descriptions of multi-GPU and node labeling in DSDL.
GPU resource requests
In Kubernetes or OpenShift, define GPU requests in your pod or deployment specification as shown in the following example:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
```
DSDL automatically uses the GPU if your container image includes GPU libraries, for example the golden-gpu image. When the Splunk platform runs | fit MLTKContainer ..., the resulting pod is scheduled with the nvidia.com/gpu: 1 request.
Node labeling and taints
HPC clusters might label GPU nodes, for example gpu=true, or add taints so that only GPU workloads are scheduled on them.
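The following is a minimal sketch of labeling a GPU node and steering a pod onto it with a node selector and a toleration. The node name, label key, and taint shown here are illustrative assumptions; use the labels and taints your cluster administrators have defined.

```yaml
# Label and taint the GPU node (run once per node; "hpc-node-01" is a placeholder):
#   kubectl label nodes hpc-node-01 gpu=true
#   kubectl taint nodes hpc-node-01 gpu=true:NoSchedule
# Pod spec fragment that targets the labeled, tainted GPU nodes:
nodeSelector:
  gpu: "true"
tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```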
Single and multi-GPU tasks
For single-GPU tasks, request nvidia.com/gpu:1. For multi-GPU tasks, increase the request, for example nvidia.com/gpu: 2, and make sure your notebook code can use all of the allocated GPUs.
Distributed training approaches
Review the following distributed training approaches for HPC and GPU usage:
| Training approach | Description |
|---|---|
| Single container, multiple GPUs | This is the simplest approach for HPC. A single container has access to multiple GPUs. Within your notebook code, use PyTorch DataParallel or TensorFlow MirroredStrategy to leverage multiple GPUs on one host, as shown in the sketch following this table. You don't need to make any changes to your Splunk platform search logic because DSDL treats it as one container. |
| Multi-container, inter-node communication | For large HPC or multi-node distributed training, you can spawn multiple containers or pods that communicate through MPI or the PyTorch torch.distributed package. DSDL does not manage multi-node orchestration by default, but you can set it up with your HPC scheduler or an advanced Kubernetes operator. Searching or scheduling in the Splunk platform triggers the job, but the container's code handles multi-node communication. |
| DSDL integration | In advanced HPC workflows, you can call the fit or apply commands, but your notebook code must handle the distributed logic. HPC node ephemeral logs or partial metrics route to the Splunk platform through HTTP Event Collector (HEC) or container logs. |
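For the single-container, multiple-GPU approach, the following is a minimal sketch of wrapping a model with PyTorch DataParallel inside notebook training code. The model class, optimizer settings, and the fit() signature are simplified assumptions, not the exact DSDL notebook template.

```python
# Minimal multi-GPU training sketch with PyTorch DataParallel.
# model_class, the optimizer settings, and the fit() signature are
# illustrative assumptions; adapt them to your DSDL notebook template.
import torch
import torch.nn as nn

def fit(model_class, X, y, epochs=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model_class()
    # Wrap the model so batches are split across all visible GPUs on this host.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        preds = model(X.to(device))
        loss = loss_fn(preds, y.to(device))
        loss.backward()
        optimizer.step()
    return model
```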
About development and production HPC containers
See the following for key points about development and production HPC containers.
Development HPC containers
- Use JupyterLab plus GPU libraries.
- Let data scientists refine code on a single HPC node with 1 to 2 GPUs or smaller data subsets.
- Are potentially ephemeral because the container is stopped after the development session ends.
Production HPC containers
- Have minimal overhead because they include no Jupyter or development tools.
- Run multi-GPU or multi-node distributed tasks.
- Are used by scheduled searches or repeated inference jobs in the Splunk platform.
- Must have defined GPU resource requests if you need GPU acceleration.
How HPC is used in Splunk Observability
See the following for descriptions of Splunk Observability and HPC.
HPC monitoring
HPC clusters typically ship with node-level monitoring tools such as Ganglia or Prometheus. You can forward these metrics to the Splunk platform or Splunk Observability to unify HPC usage with container-level insights.
GPU telemetry
For GPU usage, consider NVIDIA DCGM or device plugin exporters that feed GPU metrics into Splunk Observability. If you turn on Splunk Observability in DSDL, you can automatically instrument each container endpoint, although HPC multi-node training might require custom tracing logic in your code.
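If you want to push GPU readings from HPC nodes to the Splunk platform yourself, the following is a minimal sketch of sending an event to an HTTP Event Collector (HEC) endpoint. The host name, token, index, sourcetype, and field names are placeholder assumptions.

```python
# Minimal sketch of forwarding a GPU utilization reading to Splunk HEC.
# The URL, token, index, and field names below are placeholders.
import requests

def send_gpu_metric(gpu_id, utilization_pct):
    event = {
        "event": {"gpu_id": gpu_id, "gpu_utilization": utilization_pct},
        "sourcetype": "hpc:gpu:metrics",
        "index": "hpc_metrics",
    }
    requests.post(
        "https://splunk.example.com:8088/services/collector/event",
        headers={"Authorization": "Splunk <hec-token>"},
        json=event,
        timeout=5,
    )
```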
Security and governance
See the following for descriptions of security and governance options available for HPC and GPU usage:
| Option | Description |
|---|---|
| Container registry and minimal GPU images | HPC clusters typically have a local Docker registry. You can build or pull GPU images such as golden-gpu and then push them to your HPC registry. Use minimal or specialized images to reduce overhead. An air-gapped DSDL setup might apply if the HPC environment has no external network access. See Install and configure the Splunk App for Data Science and Deep Learning in an air-gapped environment. |
| Role-based access to GPU nodes | In Kubernetes and OpenShift, use RBAC or taints and tolerations so that only power users, or HPC roles, can schedule GPU containers. You can also cap GPU consumption per namespace, as shown in the sketch following this table. In Docker single-host HPC, you must rely on local user constraints or Docker group membership. |
| Automatic notebook sync | Because DSDL automatically syncs notebooks back to the Splunk platform, HPC ephemeral volumes are not at risk of losing code. HPC operators can treat ephemeral container usage as stateless, letting the Splunk platform manage the notebooks. |
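As a complement to RBAC, you can cap how many GPUs a namespace can request with a Kubernetes ResourceQuota. The following is a minimal sketch; the namespace name and GPU count are assumptions for your environment.

```yaml
# Hypothetical per-namespace cap on GPU requests; the namespace and
# the number of GPUs are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: dsdl-hpc
spec:
  hard:
    requests.nvidia.com/gpu: "4"
```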
Example: HPC workflow
The following is an example of a high-performance computing (HPC) workflow:
- Create a Kubernetes HPC cluster with GPU nodes labeled as gpu=true.
- Create a custom-built my-gpu-image container image with frameworks and libraries such as Torch, CUDA, and cuDNN.
- Make sure the images.conf file points to myregistry.local/my-gpu-image:latest.
- Complete the DSDL setup fields:
  - Container type: GPU runtime.
  - Resource requests: nvidia.com/gpu:1.
  - Container mode: DEV or PROD.
- Run the following search:
  index=my_data | fit MLTKContainer algo=my_gpu_notebook features_* into app:MyHPCModel
- Kubernetes schedules a pod on a GPU node. The container loads your code, trains a PyTorch model, and streams logs or partial metrics to the Splunk platform.
- The model is now available in DSDL as app:MyHPCModel.

Note: HPC ephemeral volumes are not a concern because the code and final artifacts are synced to DSDL.
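After training, a scheduled search can run inference against new data with the saved model. The following is an illustrative example based on the workflow above; the index name is a placeholder:

  index=my_new_data | apply app:MyHPCModel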
Troubleshooting HPC and GPU usage
See the following table for issues you might experience and how to resolve them:
| Issue | Cause | Solution |
|---|---|---|
| Container never schedules on GPU node | You might be missing the nvidia.com/gpu:1 request, or the HPC node is not labeled for GPU. | Check your Kubernetes pod spec or the images.conf file, and confirm that the HPC node is labeled gpu=true. Additionally, check that the NVIDIA device plugin is running. |
| Multi-GPU training fails silently | Notebook code is not configuring multi-GPU, or it is missing distribution logic. | Check the HPC container logs (stdout and stderr), and verify that the PyTorch DataParallel or multi-node configuration is correct. |
| Docker single-host HPC container sees no GPUs | You might not be using --gpus all or runtime=nvidia in your docker run commands. | Check the Docker CLI usage or logs for "no GPU devices found" errors. |
| HPC cluster can't pull the GPU image | Private registry authentication errors, or air-gapped images might be missing. | Recheck credentials or make sure you loaded the .tar image files into the HPC node registry. |
| HPC ephemeral volumes losing code in notebooks | DSDL sync scripts or configuration might be failing. | Check the _internal "mltk-container" logs for sync errors. Note: The Splunk platform automatically persists notebooks, so ephemeral storage is acceptable. |