Advanced HPC and GPU usage

The Splunk App for Data Science and Deep Learning (DSDL) integrates with containerized machine learning environments. DSDL supports graphics processing unit (GPU) acceleration, large-scale high-performance computing (HPC) clusters, and distributed training in Docker, Kubernetes, or OpenShift.

Use the following guidance to optimize HPC workflows, including multi-GPU usage, node labeling, ephemeral volumes, and typical HPC environment considerations.

Overview

When you run the fit or apply commands with DSDL, the external container environment can tap into the following high-performance computing resources:

  • GPUs for deep learning acceleration.
  • Multi-node HPC clusters for distributed training or parallel inference.
  • Custom scheduling policies to handle concurrency.

DSDL provides Splunk platform-based orchestration of data flows: HPC nodes perform the heavy machine learning tasks, while the Splunk platform manages search, scheduling, and logging. DSDL also offers container images with advanced libraries such as TensorFlow and PyTorch, with optional GPU support.

With DSDL you can develop iteratively in development containers, while production HPC containers run with minimal overhead.

Advanced HPC and GPU requirements

You must meet the following requirements to use advanced HPC and GPU in DSDL:

Requirement: HPC cluster
  • Kubernetes or OpenShift typically orchestrates HPC resources.
  • Alternatively, single-host Docker HPC can run multi-GPU servers.
  • For multi-node HPC with Slurm or other schedulers, you can wrap the container environment inside those HPC job scripts.

Requirement: GPU drivers and libraries
  • NVIDIA GPU nodes typically need the NVIDIA device plugin in Kubernetes or OpenShift.
  • On single-host Docker environments, you must install nvidia-docker2 or the NVIDIA Container Toolkit and run containers with the --gpus option, as shown in the example after this list.
  • If you're doing deep learning, your container must include CUDA and cuDNN libraries and your deep learning frameworks.

Requirement: HPC network and storage
  • HPC nodes typically use high-bandwidth interconnects such as InfiniBand or 10/40/100 GbE.
  • DSDL containers can rely on ephemeral volumes for short-term data staging, but HPC contexts need persistent or parallel file systems, such as NFS or GlusterFS.
  • DSDL continues to sync notebooks and models to the Splunk platform, mitigating ephemeral volume losses.
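
If you run DSDL containers on a single Docker host with GPUs, you can verify that the container runtime sees the GPUs before you configure DSDL. The following commands are a minimal sketch: the CUDA image tag is only an example, and the commands assume that the NVIDIA Container Toolkit or nvidia-docker2 is already installed.

    # Run nvidia-smi inside a CUDA base image to confirm that the GPUs are visible.
    # Replace the image tag with one that matches your installed driver version.
    docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

    # Older setups that rely on nvidia-docker2 can select the runtime instead:
    docker run --rm --runtime=nvidia nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi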

Multi-GPU and node labeling

See the following for descriptions of multi-GPU and node labeling in DSDL.

GPU resource requests

In Kubernetes or OpenShift, define GPU requests in your pod or deployment specification as shown in the following example:

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

DSDL automatically uses the GPU if your container image includes GPU libraries, for example the golden-gpu image. When the Splunk platform calls | fit MLTKContainer ..., the resulting pod is scheduled with the nvidia.com/gpu: 1 request.

Node labeling and taints

HPC clusters might label GPU nodes, for example with gpu=true, or add taints so that only GPU workloads land on them.

In DSDL, set the appropriate node selector or tolerations in your container environment and Kubernetes cluster configuration if you want certain tasks only on GPU nodes.
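
The following pod specification fragment is a minimal sketch of that setup. It assumes your GPU nodes are labeled gpu=true and carry a taint keyed nvidia.com/gpu; adjust the label and taint keys to match your cluster, and replace the placeholder image name with your own.

spec:
  nodeSelector:
    gpu: "true"               # schedule only on nodes labeled gpu=true
  tolerations:
    - key: "nvidia.com/gpu"   # tolerate the taint applied to GPU nodes
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: dsdl-gpu
      image: myregistry.local/my-gpu-image:latest
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1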

Single and multi-GPU tasks

For single-GPU tasks, request nvidia.com/gpu:1.

For multi-GPU tasks such as distributed data-parallel training, request multiple GPUs in your pod spec or define a ReplicaSet with multiple GPU pods. This is an advanced option and requires your notebook code, for example PyTorch DistributedDataParallel (DDP) or Horovod, to handle multi-GPU scaling.
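
A multi-GPU request is only a change to the resource values. The following fragment is a sketch that asks for four GPUs on a single node; your notebook code still has to distribute the work across them.

resources:
  limits:
    nvidia.com/gpu: 4
  requests:
    nvidia.com/gpu: 4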

Distributed training approaches

Review the following distributed training approaches for HPC and GPU usage:

Training approach: Single container, multiple GPUs
  • This is the simplest approach for HPC: a single container uses multiple GPUs on one host.
  • Within your notebook code, use PyTorch DataParallel or TensorFlow MirroredStrategy to leverage multiple GPUs in one host. See the sketch after this list.
  • You don't need to change your Splunk platform search logic because DSDL treats the workload as one container.

Training approach: Multi-container, inter-node communication
  • For large HPC or multi-node distributed training, you can spawn multiple containers or pods that communicate through MPI or the PyTorch torch.distributed package.
  • DSDL does not manage multi-node orchestration by default, but you can set it up with your HPC scheduler or an advanced Kubernetes operator.
  • Searching or scheduling in the Splunk platform triggers the job, but the container's code handles multi-node communication.

Training approach: DSDL integration
  • In advanced HPC workflows, you can call the fit or apply commands, but your notebook code must handle the distributed logic.
  • HPC node ephemeral logs or partial metrics route to the Splunk platform through the HTTP Event Collector (HEC) or container logs.
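
As an illustration of the single-container, multiple-GPU approach, the following minimal PyTorch sketch wraps a placeholder model in DataParallel so that one container can use every GPU visible to it. The model and batch are stand-ins for your own notebook code, and the sketch assumes a CUDA-enabled PyTorch build in the container image.

import torch
import torch.nn as nn

# Placeholder model; substitute the model defined in your DSDL notebook.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    # Replicates the model across all visible GPUs and splits each batch.
    model = nn.DataParallel(model)
model = model.to(device)

x = torch.randn(128, 16, device=device)  # placeholder batch of 128 samples
y = model(x)                             # forward pass spreads across the GPUs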

About development and production HPC containers

See the following for key points about development and production HPC containers.

Development HPC containers

  • Use JupyterLab plus GPU libraries.
  • Let data scientists refine code on a single HPC node with 1 to 2 GPUs or smaller data subsets.
  • Are potentially ephemeral: after the development session ends, the container stops.

Production HPC containers

  • Run with minimal overhead because they include no Jupyter or development tools.
  • Run multi-GPU or multi-node distributed tasks.
  • Are used by scheduled searches or repeated inference jobs in the Splunk platform.
  • Must have defined GPU resource requests if you need GPU acceleration.

How HPC is used with Splunk Observability

See the following for descriptions of Splunk Observability and HPC.

HPC monitoring

HPC clusters typically ship with node-level monitoring such as Ganglia or Prometheus. You can forward these metrics to the Splunk platform or Splunk Observability to unify HPC usage with container-level insights.

GPU telemetry

For GPU usage, consider NVIDIA DCGM or device plugin exporters that feed GPU metrics into Splunk Observability. If you turn on Splunk Observability in DSDL, you can automatically instrument each container endpoint, although HPC multi-node training might require custom tracing logic in your code.

Security and governance

See the following for descriptions of security and governance options available for HPC and GPU usage:

Option: Container registry and minimal GPU images
  • HPC clusters typically have a local Docker registry. You can build or pull GPU images such as golden-gpu and then push them to your HPC registry, as shown in the example after this list.
  • Use minimal or specialized images to reduce overhead. An air-gapped DSDL setup might apply if the HPC environment has no external network access. See Install and configure the Splunk App for Data Science and Deep Learning in an air-gapped environment.

Option: Role-based access to GPU nodes
  • In Kubernetes and OpenShift, use RBAC or taints and tolerations so that only power users, or HPC roles, can schedule GPU containers.
  • In single-host Docker HPC, you must rely on local user constraints or Docker group membership.

Option: Automatic notebook sync
  • Because DSDL automatically syncs notebooks, code on ephemeral HPC volumes is not at risk of being lost.
  • HPC operators can treat ephemeral container usage as stateless, letting the Splunk platform manage the notebooks.
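
Mirroring a GPU image into a local HPC registry is typically a pull, retag, and push. The following commands are a sketch; the source image name is a placeholder, and the target registry reuses the myregistry.local example from this page.

    # Pull the GPU image from its source registry (placeholder name).
    docker pull example.com/dsdl/golden-gpu:latest

    # Retag the image for the local HPC registry and push it there.
    docker tag example.com/dsdl/golden-gpu:latest myregistry.local/golden-gpu:latest
    docker push myregistry.local/golden-gpu:latest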

Example: HPC workflow

The following is an example of a high-performance computing (HPC) workflow:

  1. Create a Kubernetes HPC cluster with GPU nodes labeled as gpu=true.
  2. Create a custom-built my-gpu-image container image with frameworks and libraries such as Torch, CUDA, and cuDNN.
  3. Make sure the images.conf file points to myregistry.local/my-gpu-image:latest.
  4. Complete the DSDL setup fields:
    1. Container type: GPU runtime.
    2. Resource requests: nvidia.com/gpu:1.
    3. Container mode: DEV or PROD.
  5. Run the following search:
    index=my_data
    | fit MLTKContainer algo=my_gpu_notebook features_* into app:MyHPCModel
    
  6. Kubernetes schedules a pod on a GPU node. The container loads your code, trains a PyTorch model, and streams logs or partial metrics to the Splunk platform.
  7. The model is now available in DSDL as app:MyHPCModel.

    Note: The loss of HPC ephemeral volumes is not a concern because the code and final artifacts are synced to DSDL.

Troubleshooting HPC and GPU usage

See the following table for issues you might experience and how to resolve them:

Issue: Container never schedules on a GPU node
Cause: You might be missing the nvidia.com/gpu:1 request, or the HPC node is not labeled for GPU.
Solution: Check your Kubernetes pod spec and the images.conf file for the GPU request, confirm the HPC node is labeled gpu=true, and check that the NVIDIA device plugin is running.

Issue: Multi-GPU training fails silently
Cause: The notebook code is not configured for multi-GPU use, or it is missing distribution logic.
Solution: Check the container's stdout and stderr logs and verify that the PyTorch DataParallel or multi-node configuration is correct.

Issue: Docker single-host HPC container sees no GPUs
Cause: You might not be using --gpus all or --runtime=nvidia in your docker run commands.
Solution: Check the Docker CLI usage and look in the container logs for "no GPU devices found" errors.

Issue: HPC cluster can't pull the GPU image
Cause: Private registry authentication is failing, or air-gapped images might be missing.
Solution: Recheck credentials or make sure you loaded the image .tar files into the HPC node registry.

Issue: HPC ephemeral volumes lose notebook code
Cause: DSDL sync scripts or configuration might be failing.
Solution: Check the _internal index for "mltk-container" logs and look for sync errors, as shown in the example search at the end of this section.

Note: The Splunk platform automatically persists notebooks, so ephemeral container storage is acceptable.
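
For example, a search like the following can surface sync errors from the container logs. This is a starting point rather than a definitive query; the exact source and field names depend on your DSDL logging configuration.

    index=_internal "mltk-container" (error OR ERROR OR fail*)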