Model governance and security in the Splunk App for Data Science and Deep Learning

Train and serve advanced ML models in containerized environments with the Splunk App for Data Science and Deep Learning (DSDL). Enterprise-grade machine learning might require model governance, secure container management, and strict access controls to ensure that data, models, and container images meet compliance and operational standards.

Ensure you fulfill these model governance and security requirements for your advanced ML models.

Overview

DSDL supports the following model governance features:

  • Model retraining or versioning
  • Automatic sync for notebooks and models
  • Container image security, including private registries, image scanning, restricted GPU usage, and custom TLS certificates
  • Roles, capabilities, and container access
  • Auditing and traceability
  • Transport Layer Security (TLS) and data encryption

The following permissions are available with your models:

Permissions | Description
App context | By default, model names such as app:MyModel are recognized by DSDL.
Sharing | Splunk knowledge object sharing can be set to User, App, or Global.
User | Visible only to the model creator.
App | Shared by users of the same Splunk app.
Global | Visible across the Splunk platform and suitable for widely used HPC or production models.

Model retraining or versioning

First, run the following command to create and train a new model:

| fit MLTKContainer algo=my_notebook ... into app:MyModel

DSDL spins up a container, runs the training, and saves model artifacts under app:MyModel.

Note: The model is stored in the container environment during training, but references appear in the Splunk platform.

To retrain the model, run the following command with new data or parameters. This overwrites old artifacts:

| fit MLTKContainer algo=my_notebook ... into app:MyModel

To version the model, for example as MyModel_v2, specify a new name in the into app: clause.
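As a minimal sketch of this naming convention, the following shell helper prints a fit command that writes to a versioned model name. The helper name versioned_fit and the _v<N> suffix are illustrative assumptions, not DSDL features. The ... stands for the same elided parameters as in the commands above, and the printed command still has to be run in Splunk search.

```shell
# Hypothetical helper: print an SPL fit command that writes to a versioned model name.
# Usage: versioned_fit <base_model_name> <version_number>
versioned_fit() {
  base="$1"
  version="$2"
  # The "..." is a placeholder for your own training parameters.
  printf '| fit MLTKContainer algo=my_notebook ... into app:%s_v%s\n' "$base" "$version"
}
```

For example, versioned_fit MyModel 2 prints a command ending in into app:MyModel_v2, which leaves the artifacts under app:MyModel untouched.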

Note: Store ML-SPL and .ipynb code in Git to revert changes easily.

Automatic sync for notebooks and models

DSDL automatically stores your notebooks and model files in the Splunk platform instance. Because containers are ephemeral by default, automatic sync prevents data loss if ephemeral or NFS volumes go offline and lets new containers retrieve the same notebooks and models.

The SyncHandler and related scripts remove orphaned containers, reconcile stanzas with actual containers, and ensure ephemeral data is synced. This protects your environment from data loss, letting you focus on the machine learning workflow rather than on container lifecycle details.

Container image security

Review the following options to secure your container images.

Private registry and air-gapped images

You can use a private Docker registry or an air-gapped approach. Push the golden-cpu, golden-gpu, or custom images to your internal registry. In DSDL, go to Setup, then Container Settings, and specify the private registry URL so that DSDL pulls images from it.

Note: If your environment doesn't have internet access, transfer images with docker save and docker load, or use bulk_build.sh. Keep a separate Git or artifact repository with Dockerfiles and pinned requirements.
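The docker save and docker load transfer can be sketched as two small shell functions. This is an illustration under assumptions, not the DSDL procedure: the function names, image names, archive path, and registry host are all hypothetical, and the second function assumes you are already logged in to the internal registry.

```shell
# Sketch of moving an image into an air-gapped registry.
# Function names, image names, and the registry host are hypothetical.

# On a host with internet access: write the image to a tar archive.
export_image() {
  image="$1"      # image reference, for example <repo>/<name>:<tag>
  archive="$2"    # output file, for example golden-cpu.tar
  docker save -o "$archive" "$image"
}

# On the air-gapped host: load the archive, retag it, and push it
# to the internal registry.
import_image() {
  archive="$1"
  image="$2"
  registry="$3"   # internal registry host, for example registry.internal.example:5000
  docker load -i "$archive" &&
    docker tag "$image" "$registry/$image" &&
    docker push "$registry/$image"
}
```

Copy the archive between the two hosts with whatever transfer method your environment allows, then point the DSDL container settings at the internal registry.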

Image scanning and hardening

Follow these best practices for image scanning and hardening:

  • Use scripts from splunk-mltk-container-docker or tools such as Trivy to detect known common vulnerabilities and exposures (CVEs).
  • Remove unneeded packages for minimal images.
  • Regularly patch OS-level vulnerabilities in the base image, such as Debian or Red Hat UBI.
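As one way to apply the scanning practice above, the following sketch wraps a Trivy scan so that it returns a non-zero exit code when HIGH or CRITICAL findings exist. It assumes Trivy is installed on the build host. The function name scan_image and the severity threshold are illustrative choices, not DSDL defaults.

```shell
# Scan an image with Trivy and fail (non-zero exit) when HIGH or CRITICAL CVEs are found.
# Usage: scan_image <image-reference>
scan_image() {
  image="$1"
  trivy image --severity HIGH,CRITICAL --exit-code 1 "$image"
}
```

Because the function fails on findings, you can chain it before a push step in a build script so that a vulnerable image never reaches the registry.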

GPU resource restrictions

In Kubernetes or OpenShift, define GPU resource requests and limits so only authorized machine learning tasks can claim GPUs. In single-host Docker containers, pass --gpus or --runtime=nvidia to control GPU usage.
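DSDL normally starts containers for you, so the following sketch only shows what the --gpus flag does when you test GPU access directly on a Docker host. It assumes the NVIDIA Container Toolkit is installed. The function name run_with_gpu and the image argument are hypothetical.

```shell
# Start a throwaway container that can see only the GPU with the given device index.
# Usage: run_with_gpu <gpu-index> <image> [command...]
run_with_gpu() {
  device="$1"   # GPU index on the host, for example 0
  image="$2"
  shift 2
  docker run --rm --gpus "device=$device" "$image" "$@"
}
```

For example, run_with_gpu 0 <image> nvidia-smi should list a single GPU if the restriction works as intended.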

Embedding custom certificates for production HTTPS

In production environments, you must have trusted HTTPS on container endpoints. DSDL images can include your own TLS certificates instead of the default, self-signed certificates. The splunk-mltk-container-docker repo includes a certificates folder showing how to embed custom certificates.

Note: For development environments, you can use a self-signed certificate. For production environments, consider using your organization's CA-signed certificate for higher security.

Follow these steps:

  1. Clone the repo:
    git clone https://github.com/splunk/splunk-mltk-container-docker
  2. Place your certificates in the certificates directory, named dltk.key for the private key, and dltk.pem for the certificate.
  3. (Optional) Generate self-signed certificates for testing:
    openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
        -keyout dltk.key -out dltk.pem \
        -subj "/CN=bobobobobbo"
    
  4. Build your container image using scripts:
    ./build.sh golden-cpu-custom splunk/ 5.2.0
    
    The build copies dltk.key and dltk.pem into /dltk/.jupyter/. This sets up the container to serve HTTPS endpoints with your certificate.

CAUTION: Make sure the certificate file names are dltk.key and dltk.pem or adapt the Dockerfile references so the container recognizes them. Only these exact filenames are used at runtime.
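Before you build, you can check that the certificates directory holds a correctly named, matching pair. The following is a minimal sketch that assumes OpenSSL is installed and that the private key has no passphrase. The function name check_dltk_certs is hypothetical and is not part of the splunk-mltk-container-docker repo.

```shell
# Check that a directory holds a matching dltk.pem / dltk.key pair.
# Usage: check_dltk_certs <path-to-certificates-directory>
check_dltk_certs() {
  dir="$1"
  cert="$dir/dltk.pem"
  key="$dir/dltk.key"

  # Both files must exist, be non-empty, and use the exact expected names.
  [ -s "$cert" ] || { echo "missing or empty: $cert" >&2; return 1; }
  [ -s "$key" ] || { echo "missing or empty: $key" >&2; return 1; }

  # The public key in the certificate must equal the one derived from the private key.
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey) || return 1
  key_pub=$(openssl pkey -in "$key" -pubout) || return 1
  [ "$cert_pub" = "$key_pub" ] || { echo "certificate and key do not match" >&2; return 1; }

  # Print the subject and expiry date for a quick visual check.
  openssl x509 -in "$cert" -noout -subject -enddate
}
```

Run check_dltk_certs certificates from the root of the cloned repo. A non-zero exit code means the build would embed a missing or mismatched pair.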

Roles, capabilities, and container access

Review the following for information on roles and permissions in DSDL.

DSDL roles and capabilities

DSDL offers the following container-related capabilities:

Capability | Description
configure_mltk_container | Manages container settings such as Observability tokens and certificate configurations.
list_mltk_container | Lists containers on the container dashboard.
control_mltk_container | Starts or stops containers from the DSDL app.

Note: Consider limiting the configure_mltk_container capability to Splunk admins, control_mltk_container to data-science roles, and list_mltk_container to general users.
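One way to express this split is an authorize.conf fragment. The sketch below writes such a fragment to a path you choose, using the standard Splunk role stanza syntax of capability = enabled. The role names and the function name write_dsdl_roles are hypothetical, and where the file belongs depends on your deployment.

```shell
# Write an authorize.conf fragment that splits the DSDL capabilities across three roles.
# The role names are hypothetical; the capability names come from the table above.
# Usage: write_dsdl_roles <path-of-file-to-create>
write_dsdl_roles() {
  target="$1"
  cat > "$target" <<'EOF'
[role_dsdl_admin]
configure_mltk_container = enabled
control_mltk_container = enabled
list_mltk_container = enabled

[role_dsdl_data_scientist]
control_mltk_container = enabled
list_mltk_container = enabled

[role_dsdl_viewer]
list_mltk_container = enabled
EOF
}
```

Review the generated file before you merge it into an existing authorize.conf, because the function overwrites the target path.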

Model permissions

The following permissions are available with your models:

Permissions | Description
App context | By default, model names such as app:MyModel are recognized by DSDL.
Sharing | Shares Splunk knowledge objects at the user, app, or global level.
User | Shares the model only to the model creator.
App | Shares the model to users of the same Splunk app.
Global | Shares the model across the Splunk platform. Suitable for widely used HPC or production models.

By default, only the model creator sees the model. For HPC or large production usage, set model sharing to Global.

Securing HEC, Observability, and container endpoints

Use Splunk HEC tokens carefully if you log partial training data. If Observability is enabled, guard your Observability Access Token. If you want production-level TLS in the container, embed custom certificates as described in Embedding custom certificates for production HTTPS.

Auditing and traceability

Review the following options for model auditing and traceability.

Option | Description
Track model creation in _internal logs | Use _internal logs to help track who trained which model and when. When you run fit ... into app:MyModel, logs appear in _internal, referencing information including container staging.

For example:

index=_internal "mltk-container" "into=app:MyModel"
Audit with model summary and metadata | Running summary MyModel returns model information such as hyperparameters and creation time.

You can build a model catalog or store these events in a dedicated Splunk index for extended auditing.

Collaborate on and roll back changes with notebook versioning in Git | DSDL automatically syncs notebooks to the Splunk platform, but you can also store .ipynb files in Git for collaboration and rollback.

TLS and data encryption

Review the following table for information on TLS and data encryption in model governance and security:

Option | Description
TLS from the Splunk platform to container | Development containers can use self-signed certificates. Production containers must have properly signed certificates for TLS.

For a single-host Docker container, the container endpoints handle TLS. For Kubernetes, an Ingress object often handles TLS termination.

GPU data in transit | Data from the Splunk platform is subject to TLS encryption, even if the container uses GPUs.

Ephemeral GPU usage does not affect encryption. It does matter for ephemeral volumes, where the automatic sync to the Splunk platform mitigates data loss.
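To confirm which certificate a container endpoint actually presents, you can inspect it from the Splunk host. The following is a minimal sketch that assumes OpenSSL is installed. The function name show_endpoint_cert is hypothetical, and the host and port are whatever your container endpoint uses.

```shell
# Print issuer, subject, and validity dates of the certificate an endpoint presents.
# Usage: show_endpoint_cert <host> <port>
show_endpoint_cert() {
  host="$1"
  port="$2"
  # </dev/null closes stdin so s_client exits once the handshake completes.
  openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -issuer -subject -dates
}
```

When the issuer and subject are identical, the endpoint is serving a self-signed certificate, which suits development but not production.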

Governance and security guidelines

Review the following guidelines for model governance and security:

  • Restrict advanced container management capabilities to admin or power users.
  • Use minimal images, adding only the libraries you need.
  • Use DSDL's automatic sync to avoid ephemeral data loss, and store .ipynb files in Git for version control.
  • Scan container images with Trivy or the built-in scripts from splunk-mltk-container-docker.
  • Use custom certificates for production HTTPS in containers.
  • If Observability is toggled on in DSDL, container endpoints are auto-instrumented with OTel. Confirm your endpoint, token, and service name.

Troubleshooting model governance and security

See the following table for issues you might experience and how to resolve them:

Problem | Cause | Solution
You see the error model not found: MyModel | The model is private or in a different app context. | Adjust sharing permissions or check the container logs. Search the _internal logs for mltk-container for references to your model.
HPC node can't pull image | There is a private registry or TLS error. | Check your Docker or Kubernetes credentials, or check the registry references in your images.conf file.
Observability instrumentation not active on endpoints | Observability is toggled off or has an invalid token in DSDL. | In DSDL, go to Setup, then Observability Settings. You might need to restart the container with new configurations.
Notebooks vanish after container restarts | Ephemeral volume is wiped or NFS is gone. | Restore the notebooks with the automatic notebook and model sync in the Splunk platform. Search the _internal logs for mltk-container to find any sync errors.
You see Invalid certificate on container endpoint | The container uses self-signed or misnamed certificates, or the container lacks your official CA. | Place your real certificate and key in certificates/dltk.pem and certificates/dltk.key, and then rebuild the container. Review Docker logs for TLS load errors.