Performance tuning and handling large datasets

Combine the power of Splunk platform search with container-based machine learning workloads using the Splunk App for Data Science and Deep Learning (DSDL). Manage large datasets with millions or billions of events carefully so that you don't push container memory and CPU usage to the limit, which can affect your costs and app performance.

When you run the fit or apply commands on multiple terabytes of data in the Splunk platform, the container environment must handle that data in memory or stream it in a distributed manner. This can lead to the following issues:

  • Container memory overruns if the container tries to load a large DataFrame at once.
  • The job times out if it exceeds the maximum search or container run time.
  • Excessive CPU usage overtaxes the high-performance computing (HPC) or container node when the data isn't sampled or partitioned.

To avoid these issues, follow these best practices for performance tuning, data partitioning, data preprocessing, sampling strategies, and resource configuration in Docker, Kubernetes, or OpenShift when working with large datasets.

Data filtering and partitioning

Review the options to filter and partition your data.

Use SPL to filter and summarize data

You can filter your data with SPL before sending it to the container as shown in the following example:

index=your_big_index
| search eventtype=anomalies source="some_path"
| stats count by user
| fit MLTKContainer ...

Aggregate or summarize large logs if you only need aggregated features. For time-series events, consider summarizing by the minute or hour.

CAUTION: Running the fit command on raw events can cause excessive memory consumption. Follow best practices for data preparation.

Use data splitting or partitioning

For large training sets, partition the data across multiple Splunk platform searches or chunked time intervals.

Search the first interval as shown in the following example:

index=your_big_index earliest=-30d latest=-15d
| fit MLTKContainer mode=stage ...

Then run another search over the next chunked interval, from latest=-15d to the present.

Note: If your algorithm or code supports incremental or partial training, your notebook can handle partial merges or checkpointing.
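
The following is a minimal sketch of what incremental training with checkpointing might look like in your notebook code. It assumes a scikit-learn estimator that supports partial_fit; the checkpoint path, column names, and class labels are illustrative, and your DSDL notebook template might structure its fit stage differently.

import os
import joblib
from sklearn.linear_model import SGDClassifier

# Illustrative checkpoint location inside the container.
CHECKPOINT_PATH = "/srv/app/model/data/big_model_checkpoint.pkl"

def train_partition(df):
    # Resume from a previous run's checkpoint if one exists,
    # otherwise start a fresh incremental model.
    if os.path.exists(CHECKPOINT_PATH):
        model = joblib.load(CHECKPOINT_PATH)
    else:
        model = SGDClassifier()

    # Column and label names are placeholders for your own features.
    X = df[["avg_metric", "count"]].values
    y = df["label"].values

    # partial_fit updates the model with only this partition of data,
    # so memory usage stays bounded by the partition size.
    model.partial_fit(X, y, classes=[0, 1])

    joblib.dump(model, CHECKPOINT_PATH)
    return model

Call train_partition once per chunked search so each run continues from the previous partition's checkpoint.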

Container resource tuning

Review the following options for container resource tuning.

CPU and memory requests

In Kubernetes or OpenShift, set the resources.requests.memory and resources.limits.memory attributes to higher values if you anticipate large in-memory DataFrames.

Example:

resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "16Gi"

In single-host Docker setups, pass --memory 16g --cpus 4 to limit the container to 16GB of memory and 4 CPU cores. Adjust these values to your usage.

GPU considerations

If your algorithm relies on GPUs, make sure the container requests GPU resources, for example with nvidia.com/gpu: 1. GPU usage helps with large neural network training, but you must optimize your code for multi-GPU or HPC if you exceed the capacity of a single GPU.
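
As a rough sketch, your notebook code can check how many GPUs the container actually received before choosing a device or a multi-GPU strategy. The layer sizes below are placeholders, and for multi-node HPC training DistributedDataParallel is generally preferable to DataParallel.

import torch
import torch.nn as nn

# Pick a device based on what the container actually received.
gpu_count = torch.cuda.device_count()
device = torch.device("cuda:0" if gpu_count > 0 else "cpu")

# A placeholder network, moved onto the chosen device.
model = nn.Linear(16, 2).to(device)

if gpu_count > 1:
    # With more than one GPU on a single node, wrap the model so batches
    # are split across devices. For multi-node HPC training, use
    # DistributedDataParallel instead.
    model = nn.DataParallel(model)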

Development and production containers

Development containers might only need a small memory limit for iterative notebook coding with a sample dataset. Production containers often require more robust resource allocations if the final dataset is large. Plan your HPC node or Docker host capacity accordingly.

Data sampling and splitting with SPL

Review the following methods for using SPL for data sampling and data splitting.

Use the sample command

Using the ML-SPL sample command as follows can randomly downsample events to 10,000, giving a quick dataset for development or prototyping:

| sample 10000

For final model training, remove or reduce sample command usage, or use partial sampling as shown in the following example:

| sample partitions=10 seed=42 | where partition_number<7

Partition large datasets

You can partition data using sample partitions=N or a modulo operator (%) in an eval expression.

If your code supports incremental training, you can feed in each partition separately as shown in the following example:

index=your_big_index
| sample partitions=10 seed=42
| where partition_number < 8
| fit MLTKContainer algo=...

Then combine or continue training with other partitions in separate runs.

Reduce data volume with data summaries

For extremely large sets of raw data, you can run an initial stats or timechart command to reduce the data volume as shown in the following example:

index=your_big_index
| stats avg(value) as avg_value, count by some_field
| fit MLTKContainer ...

Note: This approach is useful for time-series or aggregation-based machine learning tasks.

Managing memory-intensive code

Consider the following options when managing memory-intensive machine learning code.

  • In-notebook chunking: If your code reads the entire DataFrame at once, consider chunking inside the notebook. For example, use pandas.read_csv(..., chunksize=100000) if you load data from a .csv file. For multi-million row data, libraries like Dask, Vaex, or Spark in your container might help and can handle out-of-core operations. See the sketch after this list.
  • Partial-fit or streaming algorithms: Some scikit-learn or River algorithms support partial_fit for incremental learning. If you define partial_fit logic in your notebook, you can stage data chunks one-by-one from the Splunk platform.
  • HPC and distributed approaches: For extremely large datasets, you can perform distributed training with Spark, PyTorch DDP, or Horovod. DSDL can start the container, but you handle multi-node distribution. Alternatively, rely on a separate HPC job manager to orchestrate multi-node training, then push the final results back into the Splunk platform.
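
The following is a minimal sketch of in-notebook chunking with pandas, assuming the staged data is available to the container as a .csv file. The path, column names, and chunk size are illustrative.

import pandas as pd

# Illustrative path to data staged for the container.
CSV_PATH = "/srv/app/data/staged_big_data.csv"

event_counts = {}

# Read 100,000 rows at a time instead of loading the whole file,
# so memory usage stays roughly proportional to the chunk size.
for chunk in pd.read_csv(CSV_PATH, chunksize=100000):
    counts = chunk.groupby("user")["metric"].count()
    for user, count in counts.items():
        event_counts[user] = event_counts.get(user, 0) + count

summary = pd.Series(event_counts, name="event_count")
print(summary.head())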

Managing timeout and resource limits

Review the following options to manage your timeout and resource limits:

  • Max search runtime: By default, the Splunk platform can stop searches that exceed certain CPU or wall-clock times. You can increase or remove that limit if your training or inference is known to be long.

    CAUTION: Be mindful when increasing the default limits to avoid causing issues on your search head.
  • DSDL container max time: In DSDL, you can configure a maximum model runtime or an idle stop threshold. If you expect a multi-hour HPC training job, increase these timeouts so the model doesn't terminate prematurely.
  • HPC queue or Splunk search scheduler: If HPC usage is managed by a queue system such as Slurm, you might want to orchestrate jobs outside of Splunk platform scheduling. Alternatively, you can set extended Splunk search timeouts so HPC tasks can complete properly.

Using Splunk Observability for large jobs

Review the following options for using Splunk Observability for large, data-intensive jobs:

  • Resource metrics: For large workloads, track container CPU and memory or GPU usage in Splunk Observability or the _internal logs. If usage spikes or the container hits the OOM killer, you can see it in the container logs or HPC logs.
  • Step-by-step logging: If your training takes hours, consider streaming intermediate logs to the Splunk HTTP Event Collector (HEC), or writing partial logs to stdout, so you can see progress. See the sketch after this list. Alerts from the Splunk platform or Splunk Observability can notify you if usage patterns deviate from normal, for example if memory is climbing unexpectedly.
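
The following is a minimal sketch of sending a progress event to HEC from a long-running training loop. The HEC URL, token, and field names are illustrative; supply your own endpoint and token.

import requests

# Illustrative HEC endpoint and token; replace with your own values.
HEC_URL = "https://splunk.example.com:8088/services/collector/event"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"

def log_progress(epoch, loss):
    # Send one event per epoch so training progress is searchable
    # in the Splunk platform while the job is still running.
    payload = {
        "event": {"job": "big_model_training", "epoch": epoch, "loss": loss},
        "sourcetype": "dsdl:training:progress",
    }
    requests.post(
        HEC_URL,
        headers={"Authorization": "Splunk " + HEC_TOKEN},
        json=payload,
        timeout=10,
    )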

Example: Large dataset workflow

The following is an example of a large dataset workflow:

  1. Prepare your data. The following example code summarizes data before sending it to the container.
    index=huge_data
    | search user_type=customer
    | stats count, avg(metric) as avg_metric by user
    | fit MLTKContainer algo=big_notebook into app:BigModel
    
  2. Create the container. For Docker or Kubernetes, choose 16GB of memory, 4 CPUs, and possibly 1 GPU.

    Note: The code in your notebook uses scikit-learn or PyTorch with chunked data reading as needed.
  3. Set the time limits. The Splunk platform might only hold the search open for 2 hours. If so, increase the maximum search runtime or rely on an HPC queue outside of the Splunk platform.

    Note: Container logs or partial metrics appear in _internal or the container logs index.
  4. Save the model. The model is saved as app:BigModel. HPC ephemeral volumes are irrelevant because DSDL syncs the final artifacts to the Splunk platform.

Troubleshooting performance tuning and large datasets

See the following list of issues you might experience and how to resolve them:

  • Problem: The container hits the OOM killer mid-training.
    Cause: The dataset is too large or the container memory limit is too low.
    Solution: Increase memory requests in Kubernetes or the Docker --memory value. Reduce the data size using SPL, or chunk your data.
  • Problem: The Splunk platform stops the search after N minutes.
    Cause: The Splunk platform default search timeout, or the MLTK container maximum run time, is too small.
    Solution: Adjust the max_search_runtime setting or the container idle stop threshold in DSDL under Setup.
  • Problem: An HPC cluster node runs out of GPU memory.
    Cause: The model or data batch size is too large for the GPU.
    Solution: Adjust your code to reduce the batch size, use smaller model layers, or move to multi-GPU.
  • Problem: You see "RuntimeError: CUDNN_STATUS_ALLOC_FAILED" in the container logs.
    Cause: You are out of memory on the GPU, or there is another resource conflict.
    Solution: Check the container logs and consider a smaller batch size. You can also re-check HPC job scheduling if multiple GPU tasks are overlapping.
  • Problem: Partial data is loaded, but the container never finishes the fit command.
    Cause: There is insufficient filtering or no chunking method in your code, leading to a large data load.
    Solution: Use SPL to summarize or chunk the data. Consider adding partial_fit logic in the notebook.