Configure algorithm performance costs
Running machine learning algorithms affects the performance of your Splunk platform deployment. Machine learning requires compute resources, run time, memory, and disk space. Each machine learning algorithm has a different cost, which also depends on the number of input fields you select and the total number of events being processed.
The Splunk Machine Learning Toolkit's mlspl.conf file controls the settings for the ML-SPL commands fit and apply. This mlspl.conf file ships with conservative default settings to prevent the overloading of a search head. The default settings for the file are set to intelligently sample down to 100K events.
On-premises Splunk platform users with admin permissions can configure the default settings of the mlspl.conf file in the default directory, or make edits using the in-app Settings tab. Changes can be made across all algorithms, or for individual algorithms. Splunk Cloud Platform users must create a support ticket to make changes to the mlspl.conf file.
Edit default settings in the mlspl.conf file
Admin users can edit the settings of the mlspl.conf file located in the default directory.
Prerequisite
To avoid losing your configuration file changes when you upgrade the app in the future, create a copy of the mlspl.conf file with only the modified stanzas and settings. Save the copy to $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/local/ or %SPLUNK_HOME%\etc\apps\Splunk_ML_Toolkit\local\ depending on your operating system.
Access and edit the mlspl.conf file
Perform the following steps to access and edit the default algorithm settings:
- Access the default directory with the file path that corresponds to your operating system:
  - For *nix based systems: $SPLUNK_HOME/etc/apps/Splunk_ML_Toolkit/default/mlspl.conf
  - For Windows systems: %SPLUNK_HOME%\etc\apps\Splunk_ML_Toolkit\default\mlspl.conf
- In the mlspl.conf file, specify default settings for all algorithms, or for an individual algorithm. To apply global settings, use the [default] stanza. To apply algorithm-specific settings, use a stanza named for the algorithm. For example, use [LinearRegression] for the LinearRegression algorithm.
Not all global settings can be set or overwritten in an algorithm-specific section. For further information, see How to edit a configuration file in the Splunk Enterprise Admin Manual.
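As a rough illustration of the stanza layout described above, the following Python sketch builds the text of a local mlspl.conf override that contains only the modified stanzas. The helper name build_local_overrides and the example values are hypothetical and not part of MLTK. Whether a given setting can be overridden at the algorithm level is an assumption you should check against the mlspl.conf documentation for your version.

```python
import configparser
import io


def build_local_overrides(global_settings, per_algorithm):
    """Return mlspl.conf text containing only the overridden stanzas.

    global_settings: dict of setting name -> value for the [default] stanza.
    per_algorithm:   dict of algorithm name -> dict of setting name -> value.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep setting names exactly as written

    if global_settings:
        # Lowercase "default" is an ordinary section for configparser,
        # which matches the [default] stanza name used by mlspl.conf.
        parser["default"] = {k: str(v) for k, v in global_settings.items()}

    for algorithm, settings in per_algorithm.items():
        parser[algorithm] = {k: str(v) for k, v in settings.items()}

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the file shape.
    text = build_local_overrides(
        {"max_inputs": 200000},
        {"LinearRegression": {"max_fit_time": 1200}},
    )
    print(text)
```

The printed text is what you would save as mlspl.conf in the local directory named in the prerequisite, so that an app upgrade does not overwrite your changes.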
Available fields in the mlspl.conf file
The following fields can be edited within the mlspl.conf file provided you have admin permissions:
Setting name | Default value | Description |
---|---|---|
max_inputs | 100000 | The maximum number of events an algorithm considers when fitting a model. If this limit is exceeded and use_sampling is true, the fit command downsamples its input using the Reservoir Sampling algorithm before fitting a model. If use_sampling is false and this limit is exceeded, the fit command throws an error. |
use_sampling | true | Indicates whether to use Reservoir Sampling for data sets that exceed max_inputs or to instead throw an error. |
max_fit_time | 600 | The maximum time, in seconds, to spend in the "fit" phase of an algorithm. This setting does not relate to the other phases of a search such as retrieving events from an index. |
max_memory_usage_mb | 1000 | The maximum allowed memory usage, in megabytes, by the fit command while fitting a model. |
max_model_size_mb | 15 | The maximum allowed size of a model, in megabytes, created by the fit command. Some algorithms such as SVM and RandomForest might create unusually large models, which can lead to performance problems with bundle replication. |
max_distinct_cat_values | 100 | The maximum number of distinct values in a categorical feature field, or input field, that will be used in one-hot encoding. One-hot encoding is when you convert categorical values to numeric values. If the number of distinct values exceeds this limit, the field will be dropped, or excluded from analysis, and a warning appears. |
max_distinct_cat_values_for_classifiers | 100 | The maximum number of distinct values in a categorical field that is the target, or output, variable in a classifier algorithm. |
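
To make the interaction between max_inputs and use_sampling more concrete, here is a minimal sketch of textbook reservoir sampling (often called Algorithm R). It is not MLTK's actual implementation. The function name prepare_fit_input and the seed parameter are hypothetical, introduced only for this example.

```python
import random


def prepare_fit_input(events, max_inputs=100000, use_sampling=True, seed=None):
    """Return at most max_inputs events, sampled uniformly at random.

    If the input exceeds max_inputs and use_sampling is False, raise an
    error instead, mirroring the behavior described in the table above.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, event in enumerate(events):
        if index < max_inputs:
            # Fill the reservoir with the first max_inputs events.
            reservoir.append(event)
            continue
        if not use_sampling:
            raise ValueError(
                f"input exceeds max_inputs={max_inputs} and use_sampling is false"
            )
        # Keep the new event with probability max_inputs / (index + 1)
        # by picking a slot in 0..index and replacing only if it falls
        # inside the reservoir.
        slot = rng.randint(0, index)
        if slot < max_inputs:
            reservoir[slot] = event
    return reservoir


if __name__ == "__main__":
    sample = prepare_fit_input(range(1_000_000), max_inputs=1000, seed=42)
    print(len(sample))
```

Because each event is seen only once and the reservoir never grows past max_inputs, this style of sampling keeps memory bounded regardless of how many events the search returns.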
Edit default settings in the MLTK app
You can edit the default algorithm settings through the Settings tab of the Splunk Machine Learning Toolkit.
Perform the following steps to access and edit the Settings tab:
- In your Splunk platform instance, choose the Splunk Machine Learning Toolkit app.
- From the main navigation bar, choose the Settings tab.
- On the resulting Algorithm Settings page, edit all algorithm settings, or settings for a specific algorithm:
  - Choose the Edit Default Settings button in the top right of the page to change settings across all algorithms.
  - Select the name of an individual algorithm to change the settings for that particular algorithm.
- On the edit settings page, hover over any field name for additional information. Make changes to one or more fields as needed.
- Click Save when done.
Available fields on the in-app Settings page
The following fields are available to edit across all algorithms offered in MLTK:
Field name | Default value | Description |
---|---|---|
handle_new_cat | default | Action to perform when new values for a categorical or explanatory variable are encountered in partial_fit. Default sets all values of the column that correspond to the new categorical value to zeroes. Skip skips over rows that contain the new values and raises a warning. Stop stops the operation by raising an error. |
max_distinct_cat_values | 100 | The maximum number of distinct values in a categorical feature field, or input field, that will be used in one-hot encoding. One-hot encoding is when you convert categorical values to numeric values. If the number of distinct values exceeds this limit, the field will be dropped, or excluded from analysis, and a warning appears. |
max_distinct_cat_values_for_classifiers | 100 | The maximum number of distinct values in a categorical field that is the target, or output, variable in a classifier algorithm. |
max_distinct_cat_values_for_scoring | 100 | Determines the upper limit for the number of distinct values in a categorical field that is the target (or response) variable in a scoring method. If the number of distinct values exceeds this limit, the field will be dropped (with an appropriate error message). |
max_fit_time | 600 | The maximum time, in seconds, to spend in the "fit" phase of an algorithm. This setting does not relate to the other phases of a search such as retrieving events from an index. |
max_inputs | 100000 | The maximum number of events an algorithm considers when fitting a model. If this limit is exceeded and use_sampling is true, the fit command downsamples its input using the Reservoir Sampling algorithm before fitting a model. If use_sampling is false and this limit is exceeded, the fit command throws an error. |
max_memory_usage_mb | 1000 | The maximum amount of memory in megabytes that can be used by the fit command when computing the model. |
max_model_size_mb | 15 | The maximum allowed amount of space in megabytes that the final model as created by the fit command is allowed to take up on disk. Some algorithms such as SVM and RandomForest might create unusually large models, which can lead to performance problems with bundle replication. |
max_score_time | 600 | The maximum time, in seconds, to spend in the "score" phase of an algorithm. |
use_sampling | true | Indicates whether to use Reservoir Sampling for data sets that exceed max_inputs or to instead throw an error. |
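
The three handle_new_cat behaviors described above can be pictured with a small one-hot encoding sketch. This is an illustrative model under stated assumptions, not MLTK code. The function encode_with_policy, its arguments, and the warning text are hypothetical.

```python
def encode_with_policy(rows, field, known_values, handle_new_cat="default"):
    """One-hot encode rows[field] against known_values.

    handle_new_cat controls what happens when a value was not seen before:
      "default": encode the row as all zeroes for this field
      "skip":    drop the row and record a warning
      "stop":    raise an error
    Returns (encoded_rows, warnings).
    """
    if handle_new_cat not in ("default", "skip", "stop"):
        raise ValueError(f"unknown handle_new_cat policy: {handle_new_cat!r}")

    encoded = []
    warnings = []
    for row in rows:
        value = row[field]
        if value not in known_values:
            if handle_new_cat == "stop":
                raise ValueError(f"new value {value!r} in field {field!r}")
            if handle_new_cat == "skip":
                warnings.append(f"skipped row with new value {value!r}")
                continue
            # "default": fall through; no known value matches, so the
            # encoded vector below is all zeroes.
        encoded.append([1 if value == known else 0 for known in known_values])
    return encoded, warnings


if __name__ == "__main__":
    rows = [{"color": "red"}, {"color": "teal"}, {"color": "blue"}]
    known = ["red", "blue"]
    for policy in ("default", "skip"):
        print(policy, encode_with_policy(rows, "color", known, policy))
```

In this model, default keeps every row at the cost of losing information about the new value, skip keeps the encoding clean but shrinks the data set, and stop surfaces the mismatch immediately.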