Extend the Splunk App for Data Science and Deep Learning with custom notebooks

You can define custom notebooks for specialized machine learning or deep learning tasks with the Splunk App for Data Science and Deep Learning (DSDL). By writing your own Jupyter notebooks, you can incorporate custom algorithms, advanced Python libraries, and domain-specific logic, and pull data from the Splunk platform into the same environment.

Create, export, and maintain notebooks so that they seamlessly integrate with the ML-SPL fit, apply, and summary commands.

Overview

When you develop a notebook in DSDL you can perform the following tasks:

  • Write Python code for data preprocessing, model training, or inference.
  • Expose that code to ML-SPL by defining functions such as fit and apply within special notebook cells.
  • Automatically export the code into a Python module at runtime.
  • Call those functions from Splunk platform searches.
  • Pull data directly from the Splunk platform using the Splunk Search API integration, allowing for interactive data exploration in your Jupyter notebook environment.

Note: Your custom code operates in the external container environment while staying fully integrated with Splunk platform search processing.

DSDL notebook components

A DSDL notebook typically includes the following components:

Component Description
Imports and setup Imported libraries such as NumPy, Pandas, and PyTorch. Can also define global constants or utility functions.
fit function A Python function that trains or fits your model. Accepts data as a Pandas DataFrame and hyperparameters, and returns model artifacts.
apply function (Optional) Used for inference or prediction. Accepts new data and the trained model, and returns predictions.
summary function (Optional) Provides metadata about the model such as hyperparameters or training stats.
Other utility functions (Optional) Runs data cleaning, advanced transforms, or direct data pulls using the Splunk Search API.

Note: When you save a notebook, DSDL automatically generates a corresponding .py file in the /srv/notebooks/app/ directory (or a similar directory). The .py file uses the same base name as the notebook, for example my_notebook.py. After the notebook is saved, the fit, apply, and summary functions can be called from ML-SPL.

The following example notebook includes these components:

# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py
#     notebook_metadata_filter: all
# ---

import json
import numpy as np
import pandas as pd
import os

MODEL_DIRECTORY = "/srv/app/model/data/"

# pull data from the Splunk platform into a Pandas DataFrame using the interactive search widget
from dsdlsupport import SplunkSearch as SplunkSearch
search = SplunkSearch.SplunkSearch()
df = search.as_df()
df

# initialize the model data structure
def init(df, param):
    model = {}
    model['hyperparameter'] = 42.0
    return model

param = {}  # placeholder so the notebook runs interactively; ML-SPL supplies param at runtime
model = init(df, param)
print(model)

# train or fit the model and return any information to report back
def fit(model, df, param):
    info = {"message": "model trained"}
    return info

print(fit(model, df, param))

# apply the model to the data and return predictions as a DataFrame
def apply(model, df, param):
    y_hat = df.index
    result = pd.DataFrame(y_hat, columns=['index'])
    return result

print(apply(model, df, param))

# return model metadata such as library versions
def summary(model=None):
    returns = {"version": {"numpy": np.__version__, "pandas": pd.__version__} }
    return returns

Notebook-to-module mechanism

DSDL uses the following internal mechanism, which scans the notebook for functions named fit, apply, and summary:

  1. Trigger autosave: Each time you save the notebook in JupyterLab, a conversion step occurs.
  2. Export Python: Relevant Python cells, such as the cell containing the fit function, are written into a .py module, for example /srv/notebooks/app/<notebook_name>.py.
  3. Run ML-SPL lookup: The MLTKContainer command dynamically imports <notebook_name> at runtime to run the fit, apply, and summary functions, as illustrated in the sketch at the end of this section.

You can help ensure this internal mechanism runs well in the following ways:

  • Avoid function name collisions, such as two separate fit definitions in the same notebook.
  • If you rename your notebook file, a new .py module is created but the older file isn't deleted. Remove older modules that you no longer need.
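
To see roughly what the lookup step does, you can mimic the dynamic import yourself from a notebook or terminal inside the container. The following is a minimal sketch only; the actual MLTKContainer entry point also handles data staging, parameter parsing, and model persistence, and the notebook name my_notebook is illustrative:

import importlib
import sys

import pandas as pd

# the exported .py modules live alongside the notebooks in the container
sys.path.append("/srv/notebooks/app")

# dynamically import the exported module by its base name (illustrative notebook name)
my_notebook = importlib.import_module("my_notebook")

# small placeholder inputs; in production, ML-SPL supplies the search results and parameters
df = pd.DataFrame({"feature_0": [1.0, 2.0, 3.0]})
param = {"alpha": "0.01"}

model = my_notebook.init(df, param)
print(my_notebook.fit(model, df, param))
print(my_notebook.apply(model, df, param))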

Defining and passing parameters

Document your notebook's expected parameters so users know which SPL arguments to provide. Use sensible defaults to avoid a Python KeyError when a key is missing from the param dictionary. All parameter values from ML-SPL are strings. You can convert parameters to int, float, or bool as needed.

In the following example, all ML-SPL arguments after algo=<my_notebook> are passed to your notebook's Python code as the param dictionary:

| fit MLTKContainer algo=my_notebook alpha=0.01 epochs=10 ...

def fit(model, df, param):
    alpha = float(param.get('alpha', 0.001))
    epochs = int(param.get('epochs', 10))
    ...

Note: Use param.get('key', default_value) to handle optional arguments.
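
Because every value arrives as a string, numeric and boolean flags need explicit conversion. The following is a minimal sketch using hypothetical alpha, epochs, and shuffle parameters:

def fit(model, df, param):
    # numeric parameters: convert from string, falling back to defaults
    alpha = float(param.get('alpha', 0.001))
    epochs = int(param.get('epochs', 10))

    # boolean parameter: shuffle=true arrives as the string "true"
    shuffle = str(param.get('shuffle', 'false')).lower() in ('1', 'true', 'yes')

    return {"message": "model trained", "alpha": alpha, "epochs": epochs, "shuffle": shuffle}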

Stage data with iterative development in notebooks

You can use mode=stage for iterative development and data staging. Complete the following steps:

  1. Push a subset of Splunk platform data to your notebook without training by running the fit command with mode=stage. For example:
     | fit MLTKContainer mode=stage algo=my_notebook features_* into app:MyDevModel
  2. Open JupyterLab, and define or call a helper function as follows:
    # load the staged CSV data and JSON parameters written by mode=stage
    def stage(name):
        with open("data/"+name+".csv", 'r') as f:
            df = pd.read_csv(f)
        with open("data/"+name+".json", 'r') as f:
            param = json.load(f)
        return df, param
    
    df, param = stage("MyDevModel")
    
  3. Open my_notebook.ipynb in JupyterLab to test or modify your code against the staged data.
  4. Manually call your init, fit, or apply functions on that data to debug as needed, as shown in the sketch after these steps.
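
The following is a minimal sketch of that interactive debugging loop, assuming you staged data under the name MyDevModel and that your notebook defines the stage, init, fit, and apply functions shown earlier:

# load the staged CSV data and JSON parameters
df, param = stage("MyDevModel")

# exercise the notebook functions on the staged data
model = init(df, param)
print(fit(model, df, param))
print(apply(model, df, param).head())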

Pull data directly into a notebook using the Splunk Search API

In addition to staging data with mode=stage, you can pull data directly using the Splunk Search API.

Complete the following steps:

  1. Turn on access to the Splunk platform on the DSDL Setup page. Provide your Splunk host, the management port (8089), and a valid authentication token.
  2. Import SplunkSearch in your notebook, then either use the interactive search widget or define a predefined query (see the sketch after these steps).
  3. Run the query to retrieve data into a Pandas DataFrame in your notebook.
    For example:
    from dsdlsupport import SplunkSearch
    search = SplunkSearch.SplunkSearch()
    df = search.as_df()
    df
    

    Note: If you encounter connectivity issues, confirm firewall rules or check the _internal logs for mltk-container errors referencing timeouts.
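
If you prefer a predefined query over the interactive widget, you can construct the SplunkSearch object with a search string. The following is a minimal sketch; the search keyword argument is based on the example notebooks shipped with DSDL, so verify the constructor signature in your version:

from dsdlsupport import SplunkSearch as SplunkSearch

# run a predefined search instead of opening the interactive widget
# (the search= argument is an assumption; verify it against your DSDL version)
query = "| makeresults count=10 | streamstats count as feature_0"
search = SplunkSearch.SplunkSearch(search=query)
df = search.as_df()
df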

Storing and sharing notebooks

Apply the following methods for version control and collaboration with custom models:

  • Store notebooks in a Git repo, allowing for merges, pull requests, and versioning.
  • By default, notebooks are stored in /srv/notebooks/. You can organize them by project or by team.
  • JupyterLab saves automatically, but consider manually committing .ipynb and .py changes to Git for auditing.

Advanced notebook patterns

You can use advanced notebook patterns with custom models:

Notebook pattern Description
Multiple models per notebook You can define multiple training algorithms in a single .ipynb file, but only one fit function is recognized. If you want to differentiate between them, parse extra arguments in param (see the sketch after this table) or create separate notebooks for clarity.
Additional utility functions You can define custom data preparation, feature engineering, or advanced plotting in separate Python cells. As long as they're not named fit, apply, or summary, they won't be exported to ML-SPL.
Auto-generating additional metrics You can log metrics or epoch-by-epoch logs to the Splunk platform. For example, you can write them to a CSV file that's forwarded, or send them to HTTP Event Collector (HEC) in real time.
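
For the multiple-models pattern, one common approach is to read a dispatch argument from param inside fit. The following is a minimal sketch using a hypothetical model_type parameter; train_linear and train_tree are placeholders for your own training code. In SPL, you would then pass the argument, for example | fit MLTKContainer algo=my_notebook model_type=tree features_* into app:MyTreeModel.

def train_linear(df, param):
    # placeholder: replace with your own training code
    return {"type": "linear"}

def train_tree(df, param):
    # placeholder: replace with your own training code
    return {"type": "tree"}

def fit(model, df, param):
    # hypothetical dispatch argument passed from SPL, for example model_type=tree
    model_type = param.get('model_type', 'linear')
    if model_type == 'tree':
        model['estimator'] = train_tree(df, param)
    else:
        model['estimator'] = train_linear(df, param)
    return {"message": "model trained", "model_type": model_type}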

Best practices for creating notebooks

Consider the following when creating custom notebooks:

  • DSDL only recognizes exact function names. Be mindful of any typos when defining init, fit, apply, and summary.
  • All parameter values from ML-SPL are strings. You can convert parameters to int, float, or bool as needed.
  • If your container image lacks a required library, importing it results in an ImportError. Add large libraries to the container image through Docker. See the sketch after this list for one way to surface a clearer error.
  • Use unique .ipynb filenames to help avoid conflicts or overwriting files in /srv/notebooks/app/.
  • If you rely on Splunk search from the notebook, ensure the container can reach the Splunk platform, and check firewall, DNS, and TLS settings.
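
If a notebook depends on a library that might not be present in every container image, you can fail with an explicit message at import time rather than letting a bare ImportError surface later in the search logs. This is a minimal sketch, assuming a hypothetical dependency named some_large_library:

try:
    import some_large_library  # hypothetical optional dependency
except ImportError as err:
    # surface a clear message instead of a bare ImportError
    raise ImportError(
        "some_large_library is not installed in this container image. "
        "Add it to the image through Docker before running fit or apply."
    ) from err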

Example: Create a custom notebook

The following is an example workflow of creating a custom notebook:

  1. Start a dev container in DSDL, then open JupyterLab.
  2. Create a notebook and save it as my_custom_algo.ipynb in JupyterLab.
  3. Define code: Write cells for init, fit, apply, and summary, optionally using the Splunk Search API.
  4. Pull data with df, param = stage("MyTestModel") or use the Splunk Search API.
    Test logic interactively.
  5. Save your file. DSDL exports your code to my_custom_algo.py.
  6. Train the model in the Splunk platform:
    index=my_data
    | fit MLTKContainer algo=my_custom_algo features_* into app:MyProdModel
  7. Apply the model:
    index=my_data
    | apply MyProdModel