Container management and scaling
Use external containers with the Splunk App for Data Science and Deep Learning (DSDL) to offload resource-heavy machine learning tasks from the Splunk search head. This architecture isolates potentially large workloads and allows for horizontal scaling, GPU acceleration, and robust environment management.
Review the following guidelines to manage container lifecycles, scale concurrency, and optimize resource usage when running DSDL in Docker, Kubernetes, or OpenShift.
Overview
When you run DSDL commands such as `| fit MLTKContainer ...` or `| apply ...`, DSDL communicates with an external container platform: Docker on a single host, or a cluster orchestrated by Kubernetes or OpenShift.
By default, development (DEV) mode containers include JupyterLab, TensorBoard, and other developer tools. Production (PROD) mode containers are minimal, running just the Python processes required for model training and inference.
Understanding how these containers are launched, monitored, and scaled can help with efficient resource usage and robust enterprise-grade workflows.
Container lifecycle
Review the following descriptions of a container lifecycle:
| Lifecycle stage | Description |
|---|---|
| Trigger container launch | Running a search that calls the `fit` or `apply` command prompts DSDL to launch a container if one is not already running. |
| Initialize container | DSDL calls the Docker, Kubernetes, or OpenShift API to create a container or pod based on your chosen image, for example the `golden-cpu` image. Container environment variables are set in the DSDL configuration. A sketch of such a pod specification follows this table. |
| Run active container | Once started, the container is available to handle `fit` or `apply` commands. In DEV mode, JupyterLab or TensorBoard endpoints can be accessed through mapped ports, Routes, or NodePorts. |
| Stop idle container | After a period of inactivity, and depending on your setup, DSDL might automatically stop the container to free resources. |
| Remove and clean up container | Older or crashed containers are removed from the system. On Kubernetes or OpenShift, pods are ephemeral by design and are typically cleaned up automatically. |
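For reference, the following is a minimal sketch of the kind of pod a container orchestrator creates for a DSDL container. The pod name, label, and image reference are illustrative assumptions rather than the exact objects DSDL generates; the ports correspond to the DEV-mode JupyterLab and TensorBoard endpoints described above.

```
# Hypothetical sketch of a DSDL-style pod on Kubernetes.
# Name, label, and image tag are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-dev-example
  labels:
    app: dsdl-dev-example        # hypothetical label
spec:
  containers:
    - name: mltk-container
      image: golden-cpu:latest   # assumed image reference
      ports:
        - containerPort: 8888    # JupyterLab (DEV mode)
        - containerPort: 6006    # TensorBoard (DEV mode)
```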
Comparing Docker, Kubernetes, and OpenShift containers
Review the following differences between Docker, Kubernetes, and OpenShift containers in DSDL:
| Container platform | Typical use case | Network setup | Scaling | Management |
|---|---|---|---|---|
| Single-host Docker | Smaller dev or test environments, or a single Splunk platform instance with Docker on the same machine. | By default, the Docker endpoint might be `unix://var/run/docker.sock` or `tcp://localhost:2375`, without TLS. | Usually limited to 1 container or pod per host unless you manually script multiple Docker hosts. | DSDL starts and stops the container through the Docker API. Development containers might keep running until manually stopped. |
| Kubernetes or OpenShift | Production-scale or distributed environments. | Use TLS or HTTPS from the Splunk platform to the Kubernetes or OpenShift API server on port 6443. | You can define multiple replicas, or rely on the Kubernetes Horizontal Pod Autoscaler (HPA) for concurrency. | DSDL translates container requests into pod deployments. You can configure resource requests, node labeling (GPU), and advanced security contexts. |
Concurrency and scaling patterns
Review the following descriptions of concurrency and scaling patterns in DSDL:
| Pattern | Description |
|---|---|
| Horizontal Pod Autoscaler (HPA) on Kubernetes or OpenShift | HPA can auto-scale pods based on CPU or memory usage. For DSDL, you might define an HPA that spawns additional pods when usage surpasses certain thresholds. HPA helps handle multiple concurrent Splunk platform searches that call the `fit` or `apply` commands simultaneously. |
| Docker Compose or scripts | If you're using single-host Docker, scaling typically involves manually launching multiple containers or writing scripts to do so. DSDL won't automatically create multiple containers on Docker unless you handle it outside of the Splunk platform. |
| GPU scheduling | For GPU-based containers in Kubernetes or OpenShift, assign a label or request `nvidia.com/gpu: 1` so that the container lands on GPU-enabled nodes. In Docker, ensure the NVIDIA container runtime is configured, for example with `runtime=nvidia`. See the pod sketch after this table. |
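The following is a minimal sketch of a pod specification that targets a GPU node, assuming the NVIDIA device plugin is installed on the cluster. The pod name, node label, and image reference are hypothetical.

```
# Hypothetical sketch: land a DSDL-style pod on a GPU-enabled node.
# Assumes the NVIDIA device plugin is installed; names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-gpu-example
spec:
  nodeSelector:
    gpu: "true"                  # hypothetical label on GPU nodes
  containers:
    - name: mltk-container
      image: golden-gpu:latest   # assumed image reference
      resources:
        limits:
          nvidia.com/gpu: 1      # request one GPU from the device plugin
```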
Comparing development and production containers
Review the following descriptions of development (DEV) and production (PROD) containers:
| Container type | Description |
|---|---|
| DEV containers | Use JupyterLab for interactive notebook development. TensorBoard or other development tooling might be exposed on additional ports, such as 8888 for Jupyter or 6006 for TensorBoard. Typically used short-term to refine code, test data staging, or debug advanced logic. |
| PROD containers | Use a minimal setup, containing only the Python environment necessary for model training and inference. No Jupyter or development ports are exposed, which reduces the attack surface and resource overhead. Might run as multiple replicas if concurrency is high. See the Docker Compose sketch after this table for a comparison of exposed ports. |
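To make the port difference concrete, here is a hedged Docker Compose sketch for single-host Docker. The service and image names are assumptions; the point is that only the DEV service maps the Jupyter and TensorBoard ports.

```
# Hypothetical Docker Compose sketch contrasting DEV and PROD modes.
# Service and image names are illustrative assumptions.
services:
  dsdl-dev:
    image: golden-cpu:latest   # assumed image reference
    ports:
      - "8888:8888"            # JupyterLab (DEV only)
      - "6006:6006"            # TensorBoard (DEV only)
  dsdl-prod:
    image: golden-cpu:latest   # assumed image reference
    # No development ports exposed in PROD mode.
```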
Configuring DSDL operations
Review the following overview of the operations available in DSDL:
| DSDL operation | Where to find it | Description |
|---|---|---|
| View container dashboard | In DSDL, select Configuration, then Containers. | You can see currently running containers and pods, start or stop them, and open Jupyter or TensorBoard if you're in DEV mode. |
| View logs and diagnostics | Search the Splunk `_internal` index for "mltk-container" events. | These events reveal container startup errors, such as network timeouts or Docker and Kubernetes rejections. You can also forward container logs to the Splunk platform, letting you see Python exceptions from the `fit` or `apply` commands. |
| Clean up idle containers | Configure idle containers to stop after a certain timeout in the DSDL app. | DSDL typically stops idle containers after a certain timeout. In Kubernetes and OpenShift, old pods might remain in a "Completed" or "Terminated" state, but cluster housekeeping eventually prunes them. |
Resource allocation and scheduling
Review the following options for resource allocation and scheduling in DSDL:
| Option | Description |
|---|---|
| CPU and memory requests | In Kubernetes, define requests and limits in your pod or deployment specification. This ensures the container won't exceed certain CPU or memory usage. On Docker, you can constrain resources with options such as `--cpus` and `--memory`. See the example deployment fragment after this table. |
| GPU resources | On Kubernetes or OpenShift, you must configure GPU node drivers or device plugins, such as NVIDIA's, so that pods requesting `nvidia.com/gpu: 1` schedule properly. On Docker, use the NVIDIA container runtime, for example `runtime=nvidia`. |
| Production best practices | Monitor ephemeral storage usage, because large data staging or logs can fill ephemeral volumes. Monitor container logs for OOM killer events, which indicate insufficient memory limits. |
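For example, a Kubernetes deployment fragment with CPU and memory requests and limits might look like the following sketch. The name, image, and values are illustrative assumptions, not sizing recommendations.

```
# Hypothetical requests and limits for a DSDL-style deployment.
# Values are illustrative; size them for your actual workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dsdl-prod-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dsdl-prod-example
  template:
    metadata:
      labels:
        app: dsdl-prod-example
    spec:
      containers:
        - name: mltk-container
          image: golden-cpu:latest   # assumed image reference
          resources:
            requests:
              cpu: "1"        # guaranteed CPU share
              memory: 2Gi     # guaranteed memory
            limits:
              cpu: "2"        # hard CPU ceiling
              memory: 4Gi     # hard memory ceiling; exceeding it triggers an OOM kill
```

On single-host Docker, the roughly equivalent controls are the `--cpus` and `--memory` options on `docker run`.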
Typical container management and scaling use cases
See the following common use cases for container management and scaling:
| Use case | Description |
|---|---|
| Multiple DEV containers | Each data scientist spawns a personal DEV container with Jupyter. They eventually merge and store code in Git. |
| 1 PROD container in Docker | Single-host environment with moderate concurrency. The Splunk platform calls 1 container to handle model training and inference sequentially. |
| Kubernetes high-performance computing (HPC) | Large-scale HPC environment with multiple GPU nodes. Kubernetes auto-scales pods so that concurrent machine learning tasks each get their own container. |
| OpenShift Enterprise Container Platform | Large-scale HPC environment integrated with the Red Hat security or operator frameworks. |
Troubleshooting container management and scaling
See the following issues you might experience and how to resolve them:
| Problem | Solution |
|---|---|
| Container fails to start | 1. Check `_internal` logs for "mltk-container" messages about Docker or Kubernetes API errors or timeouts. 2. Ensure the Docker REST socket or Kubernetes API server is reachable. |
| Development (DEV) container times out | If DEV containers auto-stop, adjust the idle timeout or manually keep them active in the DSDL app. |
| Resource exhaustion | If logs indicate Out of Memory (OOM) kills or CPU throttling, raise the memory or CPU requests in Kubernetes, or refine Docker resource constraints. |
| GPU not recognized | Confirm that you have the correct GPU drivers, device plugins, or `runtime=nvidia` if using Docker. Check container logs to confirm that the GPU is visible. See the Docker Compose sketch after this table. |
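If Docker doesn't pass the GPU through, one way to make the device grant explicit is the Compose device-reservation syntax. The following sketch assumes a recent Docker Compose and the NVIDIA Container Toolkit on the host; the service and image names are hypothetical.

```
# Hypothetical Compose sketch granting a container one NVIDIA GPU.
# Assumes the NVIDIA Container Toolkit is installed on the Docker host.
services:
  dsdl-gpu:
    image: golden-gpu:latest   # assumed image reference
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```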
Example: Kubernetes multi-pod setup
The following steps are for an example multi-pod setup in Kubernetes:
- Configure DSDL. In DSDL, go to Setup, then select Connect to your Kubernetes cluster.
- Define a deployment. DSDL automatically creates a deployment for DEV or PROD containers. You can edit the resource specs or add a Horizontal Pod Autoscaler (HPA) as follows:

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dsdl-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dsdl-dev
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

- Use DSDL. The Splunk platform calls the `fit` or `apply` commands, and DSDL spawns pods. If CPU usage is high, the Horizontal Pod Autoscaler (HPA) scales up.
- Observe container states. In the Kubernetes dashboard or on the DSDL Containers page, you can see how many pods are active.