Advanced HPC and GPU usage
The Splunk App for Data Science and Deep Learning (DSDL) integrates with containerized machine learning environments. DSDL supports graphics processing unit (GPU) acceleration, large-scale high-performance computing (HPC) clusters, and distributed training in Docker, Kubernetes, or OpenShift.
Learn how to optimize HPC workflows, including multi-GPU usage, node labeling, ephemeral volumes, and typical HPC environment considerations.
Overview
When you run the fit or apply commands with DSDL, the external container environment can tap into the following high-performance computing resources:
- GPUs for deep learning acceleration.
- Multi-node HPC clusters for distributed training or parallel inference.
- Custom scheduling policies to handle concurrency.
DSDL provides Splunk platform-based orchestration of data flows: HPC nodes perform the heavy machine learning tasks while the Splunk platform manages search, scheduling, and logs. DSDL also offers container images with advanced libraries such as TensorFlow and PyTorch, with optional GPU support.
With DSDL you can iterate in development containers, while production HPC containers run with minimal overhead.
Advanced HPC and GPU requirements
You must meet the following requirements to use advanced HPC and GPU in DSDL:
| Requirement | Details |
|---|---|
| HPC cluster | Kubernetes or OpenShift typically orchestrates HPC resources. Alternatively, single-host Docker HPC can run multi-GPU servers. For multi-node HPC with Slurm or other schedulers, you can wrap the container environment inside those HPC job scripts. |
| GPU drivers and libraries | NVIDIA GPU nodes typically need the NVIDIA device plugin in Kubernetes or OpenShift. On Docker single-host environments, you must install the NVIDIA drivers and the NVIDIA Container Toolkit so that containers can access the GPUs. |
| HPC network and storage | HPC nodes typically use high-bandwidth interconnects such as InfiniBand or 10/40/100 GbE. DSDL containers can rely on ephemeral volumes for short-term data staging, but HPC contexts often need persistent or parallel file systems, such as NFS or GlusterFS, as shown in the sketch following this table. DSDL continues to sync notebooks and models to the Splunk platform, mitigating ephemeral volume losses. |
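The following is a minimal sketch of declaring persistent storage for HPC data staging in Kubernetes. The claim name, storage class name (nfs-client), and size are assumptions; substitute values that match your cluster's NFS or parallel file system.

```yaml
# Hypothetical PersistentVolumeClaim for staging training data on an
# NFS-backed storage class. Names and sizes are placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dsdl-hpc-staging
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi
```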
Multi-GPU and node labeling
See the following for descriptions of multi-GPU and node labeling in DSDL.
GPU resource requests
In Kubernetes or OpenShift, define GPU requests in your pod or deployment specification as shown in the following example:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
```
DSDL automatically uses the GPU if your container image includes GPU libraries, for example the golden-gpu image. When the Splunk platform runs | fit MLTKContainer ..., the resulting pod is scheduled with the nvidia.com/gpu: 1 request.
Node labeling and taints
HPC clusters might label GPU nodes, for example gpu=true, or add taints so that only GPU workloads are scheduled on them.
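The following is a minimal sketch of labeling a GPU node and steering a pod onto it with a node selector and a toleration. The node name, label key, and taint shown here are illustrative assumptions; use the labels and taints your cluster administrators have defined.

```yaml
# Label and taint the GPU node (run once per node; "hpc-node-01" is a placeholder):
#   kubectl label nodes hpc-node-01 gpu=true
#   kubectl taint nodes hpc-node-01 gpu=true:NoSchedule
# Pod spec fragment that targets the labeled, tainted GPU nodes:
nodeSelector:
  gpu: "true"
tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```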
Single and multi-GPU tasks
For single-GPU tasks, request nvidia.com/gpu:1. For multi-GPU tasks, increase the request, for example nvidia.com/gpu: 2, and make sure your notebook code can use all of the allocated GPUs.
Distributed training approaches
Review the following distributed training approaches for HPC and GPU usage:
| Training approach | Description |
|---|---|
| Single container, multiple GPUs | This is the simplest approach for HPC. A single container has access to multiple GPUs. Within your notebook code, use PyTorch DataParallel or TensorFlow MirroredStrategy to leverage multiple GPUs on one host, as shown in the sketch following this table. You don't need to make any changes to your Splunk platform search logic because DSDL treats it as one container. |
| Multi-container, inter-node communication | For large HPC or multi-node distributed training, you can spawn multiple containers or pods that communicate through MPI or the PyTorch torch.distributed package. DSDL does not manage multi-node orchestration by default, but you can set it up with your HPC scheduler or an advanced Kubernetes operator. Searching or scheduling in the Splunk platform triggers the job, but the container's code handles multi-node communication. |
| DSDL integration | In advanced HPC workflows, you can call the fit or apply commands, but your notebook code must handle the distributed logic. HPC node ephemeral logs or partial metrics route to the Splunk platform through HTTP Event Collector (HEC) or container logs. |
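For the single-container, multiple-GPU approach, the following is a minimal sketch of wrapping a model with PyTorch DataParallel inside notebook training code. The model class, optimizer settings, and the fit() signature are simplified assumptions, not the exact DSDL notebook template.

```python
# Minimal multi-GPU training sketch with PyTorch DataParallel.
# model_class, the optimizer settings, and the fit() signature are
# illustrative assumptions; adapt them to your DSDL notebook template.
import torch
import torch.nn as nn

def fit(model_class, X, y, epochs=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model_class()
    # Wrap the model so batches are split across all visible GPUs on this host.
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model = model.to(device)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        preds = model(X.to(device))
        loss = loss_fn(preds, y.to(device))
        loss.backward()
        optimizer.step()
    return model
```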
About development and production HPC containers
See the following for key points about development and production HPC containers.
Development HPC containers
- Use JupyterLab plus GPU libraries.
- Let data scientists refine code on a single HPC node with 1 to 2 GPUs or smaller data subsets.
- Are potentially ephemeral because the container is stopped after the development session ends.
Production HPC containers
- Have minimal overhead because they include no Jupyter or development tools.
- Run multi-GPU or multi-node distributed tasks.
- Are used by scheduled searches or repeated inference jobs in the Splunk platform.
- Must have defined GPU resource requests if you need GPU acceleration.
How HPC is used in Splunk Observability
See the following for descriptions of Splunk Observability and HPC.
HPC monitoring
HPC clusters typically ship with node-level monitoring tools such as Ganglia or Prometheus. You can forward these metrics to the Splunk platform or Splunk Observability to unify HPC usage with container-level insights.
GPU telemetry
For GPU usage, consider NVIDIA DCGM or device plugin exporters that feed GPU metrics into Splunk Observability. If you turn on Splunk Observability in DSDL, you can automatically instrument each container endpoint, although HPC multi-node training might require custom tracing logic in your code.
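If you want to push GPU readings from HPC nodes to the Splunk platform yourself, the following is a minimal sketch of sending an event to an HTTP Event Collector (HEC) endpoint. The host name, token, index, sourcetype, and field names are placeholder assumptions.

```python
# Minimal sketch of forwarding a GPU utilization reading to Splunk HEC.
# The URL, token, index, and field names below are placeholders.
import requests

def send_gpu_metric(gpu_id, utilization_pct):
    event = {
        "event": {"gpu_id": gpu_id, "gpu_utilization": utilization_pct},
        "sourcetype": "hpc:gpu:metrics",
        "index": "hpc_metrics",
    }
    requests.post(
        "https://splunk.example.com:8088/services/collector/event",
        headers={"Authorization": "Splunk <hec-token>"},
        json=event,
        timeout=5,
    )
```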
Security and governance
See the following for descriptions of security and governance options available for HPC and GPU usage:
| Option | Description |
|---|---|
| Container registry and minimal GPU images | HPC clusters typically have a local Docker registry. You can build or pull GPU images such as golden-gpu and then push them to your HPC registry. Use minimal or specialized images to reduce overhead. An air-gapped DSDL setup might apply if the HPC environment has no external network access. See Install and configure the Splunk App for Data Science and Deep Learning in an air-gapped environment. |
| Role-based access to GPU nodes | In Kubernetes and OpenShift, use RBAC or taints and tolerations so that only power users, or HPC roles, can schedule GPU containers. You can also cap GPU consumption per namespace, as shown in the sketch following this table. In Docker single-host HPC, you must rely on local user constraints or Docker group membership. |
| Automatic notebook sync | Because DSDL automatically syncs notebooks back to the Splunk platform, HPC ephemeral volumes are not at risk of losing code. HPC operators can treat ephemeral container usage as stateless, letting the Splunk platform manage the notebooks. |
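As a complement to RBAC, you can cap how many GPUs a namespace can request with a Kubernetes ResourceQuota. The following is a minimal sketch; the namespace name and GPU count are assumptions for your environment.

```yaml
# Hypothetical per-namespace cap on GPU requests; the namespace and
# the number of GPUs are placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: dsdl-hpc
spec:
  hard:
    requests.nvidia.com/gpu: "4"
```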
Example: HPC workflow
The following is an example of a high-performance computing (HPC) workflow:
- Create a Kubernetes HPC cluster with GPU nodes labeled as gpu=true.
- Create a custom-built my-gpu-image container image with frameworks and libraries such as Torch, CUDA, and cuDNN.
- Make sure the images.conf file points to myregistry.local/my-gpu-image:latest.
- Complete the DSDL setup fields:
  - Container type: GPU runtime.
  - Resource requests: nvidia.com/gpu:1.
  - Container mode: DEV or PROD.
- Run the following search:
  index=my_data | fit MLTKContainer algo=my_gpu_notebook features_* into app:MyHPCModel
- Kubernetes schedules a pod on a GPU node. The container loads your code, trains a PyTorch model, and streams logs or partial metrics to the Splunk platform.
- The model is now available in DSDL as app:MyHPCModel.

Note: HPC ephemeral volumes are not a concern because the code and final artifacts are synced to DSDL.
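After training, a scheduled search can run inference against new data with the saved model. The following is an illustrative example based on the workflow above; the index name is a placeholder:

  index=my_new_data | apply app:MyHPCModel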
Troubleshooting HPC and GPU usage
See the following table for issues you might experience and how to resolve them:
| Issue | Cause | Solution |
|---|---|---|
| Container never schedules on GPU node | You might be missing the nvidia.com/gpu:1 request, or the HPC node is not labeled for GPU. | Check your Kubernetes pod spec or the images.conf file, and confirm that the HPC node is labeled gpu=true. Additionally, check that the NVIDIA device plugin is running. |
| Multi-GPU training fails silently | Notebook code is not configuring multi-GPU, or it is missing distribution logic. | Check the HPC container logs (stdout and stderr), and verify that the PyTorch DataParallel or multi-node configuration is correct. |
| Docker single-host HPC container sees no GPUs | You might not be using --gpus all or runtime=nvidia in your docker run commands. | Check the Docker CLI usage or logs for "no GPU devices found" errors. |
| HPC cluster can't pull the GPU image | Private registry authentication errors, or air-gapped images might be missing. | Recheck credentials or make sure you loaded the .tar image files into the HPC node registry. |
| HPC ephemeral volumes losing code in notebooks | DSDL sync scripts or configuration might be failing. | Check the _internal "mltk-container" logs for sync errors. Note: The Splunk platform automatically persists notebooks, so ephemeral storage is acceptable. |