Container management and scaling
Use external containers with the Splunk App for Data Science and Deep Learning (DSDL) to offload resource-heavy machine learning tasks from the Splunk search head. This architecture isolates potentially large workloads and allows for horizontal scaling, GPU acceleration, and robust environment management.
Review the following guidelines to manage container lifecycles, scale concurrency, and optimize resource usage when running DSDL in Docker, Kubernetes, or OpenShift.
Overview
When you run DSDL commands such as `| fit MLTKContainer ...` or `| apply ...`, DSDL communicates with an external container platform: Docker on a single host, or a cluster orchestrated by Kubernetes or OpenShift.
By default, development (DEV) mode containers include JupyterLab, TensorBoard, and other developer tools. Production (PROD) mode containers are minimal, running just the Python processes required for model training and inference.
Understanding how these containers are launched, monitored, and scaled can help with efficient resource usage and robust enterprise-grade workflows.
Container lifecycle
Review the following descriptions of a container lifecycle:
| Lifecycle stage | Description |
|---|---|
| Trigger container launch | Running a search that calls the `fit` or `apply` command prompts DSDL to launch a container if one is not already running. |
| Initialize container | DSDL calls the Docker, Kubernetes, or OpenShift API to create a container or pod based on your chosen image, for example the `golden-cpu` image. Container environment variables are set in the DSDL configuration. A sketch of such a pod specification follows this table. |
| Run active container | Once started, the container is available to handle `fit` or `apply` commands. In DEV mode, JupyterLab or TensorBoard endpoints can be accessed through mapped ports, Routes, or NodePorts. |
| Stop idle container | After a period of inactivity, and depending on your setup, DSDL might automatically stop the container to free resources. |
| Remove and clean up container | Older or crashed containers are removed from the system. On Kubernetes or OpenShift, pods are ephemeral by design and are typically cleaned up automatically. |
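For reference, the following is a minimal sketch of the kind of pod a container orchestrator creates for a DSDL container. The pod name, label, and image reference are illustrative assumptions rather than the exact objects DSDL generates; the ports correspond to the DEV-mode JupyterLab and TensorBoard endpoints described above.

```
# Hypothetical sketch of a DSDL-style pod on Kubernetes.
# Name, label, and image tag are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-dev-example
  labels:
    app: dsdl-dev-example        # hypothetical label
spec:
  containers:
    - name: mltk-container
      image: golden-cpu:latest   # assumed image reference
      ports:
        - containerPort: 8888    # JupyterLab (DEV mode)
        - containerPort: 6006    # TensorBoard (DEV mode)
```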
Comparing Docker, Kubernetes, and OpenShift containers
Review the following differences between Docker, Kubernetes, and OpenShift containers in DSDL:
| Container platform | Typical use case | Network setup | Scaling | Management |
|---|---|---|---|---|
| Single-host Docker | Smaller dev or test environments, or a single Splunk platform instance with Docker on the same machine. | By default, the Docker endpoint might be `unix://var/run/docker.sock` or `tcp://localhost:2375`, without TLS. | Usually limited to 1 container or pod per host unless you manually script multiple Docker hosts. | DSDL starts and stops the container through the Docker API. Development containers might keep running until manually stopped. |
| Kubernetes or OpenShift | Production-scale or distributed environments. | Use TLS or HTTPS from the Splunk platform to the Kubernetes or OpenShift API server on port 6443. | You can define multiple replicas, or rely on the Kubernetes Horizontal Pod Autoscaler (HPA) for concurrency. | DSDL translates container requests into pod deployments. You can configure resource requests, node labeling (GPU), and advanced security contexts. |
Concurrency and scaling patterns
Review the following descriptions of concurrency and scaling patterns in DSDL:
| Pattern | Description |
|---|---|
| Horizontal Pod Autoscaler (HPA) on Kubernetes or OpenShift | HPA can auto-scale pods based on CPU or memory usage. For DSDL, you might define an HPA that spawns additional pods when usage surpasses certain thresholds. HPA helps handle multiple concurrent Splunk platform searches that call the `fit` or `apply` commands simultaneously. |
| Docker Compose or scripts | If you're using single-host Docker, scaling typically involves manually launching multiple containers or writing scripts to do so. DSDL won't automatically create multiple containers on Docker unless you handle it outside of the Splunk platform. |
| GPU scheduling | For GPU-based containers in Kubernetes or OpenShift, assign a label or request `nvidia.com/gpu: 1` so that the container lands on GPU-enabled nodes. In Docker, ensure the NVIDIA container runtime is configured, for example with `runtime=nvidia`. See the pod sketch after this table. |
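The following is a minimal sketch of a pod specification that targets a GPU node, assuming the NVIDIA device plugin is installed on the cluster. The pod name, node label, and image reference are hypothetical.

```
# Hypothetical sketch: land a DSDL-style pod on a GPU-enabled node.
# Assumes the NVIDIA device plugin is installed; names are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: dsdl-gpu-example
spec:
  nodeSelector:
    gpu: "true"                  # hypothetical label on GPU nodes
  containers:
    - name: mltk-container
      image: golden-gpu:latest   # assumed image reference
      resources:
        limits:
          nvidia.com/gpu: 1      # request one GPU from the device plugin
```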
Comparing development and production containers
Review the following descriptions of development (DEV) and production (PROD) containers:
| Container type | Description |
|---|---|
| DEV containers | Use JupyterLab for interactive notebook development. TensorBoard or other development tooling might be exposed on additional ports, such as 8888 for Jupyter or 6006 for TensorBoard. Typically used short-term to refine code, test data staging, or debug advanced logic. |
| PROD containers | Use a minimal setup, containing only the Python environment necessary for model training and inference. No Jupyter or development ports are exposed, which reduces the attack surface and resource overhead. Might run as multiple replicas if concurrency is high. See the Docker Compose sketch after this table for a comparison of exposed ports. |
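To make the port difference concrete, here is a hedged Docker Compose sketch for single-host Docker. The service and image names are assumptions; the point is that only the DEV service maps the Jupyter and TensorBoard ports.

```
# Hypothetical Docker Compose sketch contrasting DEV and PROD modes.
# Service and image names are illustrative assumptions.
services:
  dsdl-dev:
    image: golden-cpu:latest   # assumed image reference
    ports:
      - "8888:8888"            # JupyterLab (DEV only)
      - "6006:6006"            # TensorBoard (DEV only)
  dsdl-prod:
    image: golden-cpu:latest   # assumed image reference
    # No development ports exposed in PROD mode.
```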
Configuring DSDL operations
Review the following overview of the operations available in DSDL:
| DSDL operation | Where to find it | Description |
|---|---|---|
| View container dashboard | In DSDL, select Configuration, then Containers. | You can see currently running containers and pods, start or stop them, and open Jupyter or TensorBoard if you're in DEV mode. |
| View logs and diagnostics | Search the Splunk `_internal` index for "mltk-container" events. | These events reveal container startup errors, such as network timeouts or Docker and Kubernetes rejections. You can also forward container logs to the Splunk platform, letting you see Python exceptions from the `fit` or `apply` commands. |
| Clean up idle containers | Configure idle containers to stop after a certain timeout in the DSDL app. | DSDL typically stops idle containers after a certain timeout. In Kubernetes and OpenShift, old pods might remain in a "Completed" or "Terminated" state, but cluster housekeeping eventually prunes them. |
Resource allocation and scheduling
Review the following options for resource allocation and scheduling in DSDL:
| Option | Description |
|---|---|
| CPU and memory requests | In Kubernetes, define requests and limits in your pod or deployment specification. This ensures the container won't exceed certain CPU or memory usage. On Docker, you can constrain resources with options such as `--cpus` and `--memory`. See the example deployment fragment after this table. |
| GPU resources | On Kubernetes or OpenShift, you must configure GPU node drivers or device plugins, such as NVIDIA's, so that pods requesting `nvidia.com/gpu: 1` schedule properly. On Docker, use the NVIDIA container runtime, for example `runtime=nvidia`. |
| Production best practices | Monitor ephemeral storage usage, because large data staging or logs can fill ephemeral volumes. Monitor container logs for OOM killer events, which indicate insufficient memory limits. |
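For example, a Kubernetes deployment fragment with CPU and memory requests and limits might look like the following sketch. The name, image, and values are illustrative assumptions, not sizing recommendations.

```
# Hypothetical requests and limits for a DSDL-style deployment.
# Values are illustrative; size them for your actual workload.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dsdl-prod-example
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dsdl-prod-example
  template:
    metadata:
      labels:
        app: dsdl-prod-example
    spec:
      containers:
        - name: mltk-container
          image: golden-cpu:latest   # assumed image reference
          resources:
            requests:
              cpu: "1"        # guaranteed CPU share
              memory: 2Gi     # guaranteed memory
            limits:
              cpu: "2"        # hard CPU ceiling
              memory: 4Gi     # hard memory ceiling; exceeding it triggers an OOM kill
```

On single-host Docker, the roughly equivalent controls are the `--cpus` and `--memory` options on `docker run`.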
Typical container management and scaling use cases
See the following common use cases for container management and scaling:
| Use case | Description |
|---|---|
| Multiple DEV containers | Each data scientist spawns a personal DEV container with Jupyter. They eventually merge and store code in Git. |
| 1 PROD container in Docker | Single-host environment with moderate concurrency. The Splunk platform calls 1 container to handle model training and inference sequentially. |
| Kubernetes high-performance computing (HPC) | Large-scale HPC environment with multiple GPU nodes. Kubernetes auto-scales pods so that concurrent machine learning tasks each get their own container. |
| OpenShift Enterprise Container Platform | Large-scale HPC environment integrated with the Red Hat security or operator frameworks. |
Troubleshooting container management and scaling
See the following issues you might experience and how to resolve them:
| Problem | Solution |
|---|---|
| Container fails to start | 1. Check `_internal` logs for "mltk-container" messages about Docker or Kubernetes API errors or timeouts. 2. Ensure the Docker REST socket or Kubernetes API server is reachable. |
| Development (DEV) container times out | If DEV containers auto-stop, adjust the idle timeout or manually keep them active in the DSDL app. |
| Resource exhaustion | If logs indicate Out of Memory (OOM) kills or CPU throttling, raise the memory or CPU requests in Kubernetes, or refine Docker resource constraints. |
| GPU not recognized | Confirm that you have the correct GPU drivers, device plugins, or `runtime=nvidia` if using Docker. Check container logs to confirm that the GPU is visible. See the Docker Compose sketch after this table. |
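If Docker doesn't pass the GPU through, one way to make the device grant explicit is the Compose device-reservation syntax. The following sketch assumes a recent Docker Compose and the NVIDIA Container Toolkit on the host; the service and image names are hypothetical.

```
# Hypothetical Compose sketch granting a container one NVIDIA GPU.
# Assumes the NVIDIA Container Toolkit is installed on the Docker host.
services:
  dsdl-gpu:
    image: golden-gpu:latest   # assumed image reference
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```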
Example: Kubernetes multi-pod setup
The following steps are for an example multi-pod setup in Kubernetes:
- Configure DSDL. In DSDL, go to Setup, then select Connect to your Kubernetes cluster.
- Define a deployment. DSDL automatically creates a deployment for DEV or PROD containers. You can edit the resource specs or add a Horizontal Pod Autoscaler (HPA) as follows:

```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dsdl-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dsdl-dev
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

- Use DSDL. The Splunk platform calls the `fit` or `apply` commands, and DSDL spawns pods. If CPU usage is high, the Horizontal Pod Autoscaler (HPA) scales up.
- Observe container states. In the Kubernetes dashboard or on the DSDL Containers page, you can see how many pods are active.