Standardize data using SPL2 data types

Standardize data by constraining it to specific SPL2 data types, ensuring consistent formats for downstream operations.

By default, data in SPL2 is loosely typed. It is not restricted to specific formats or content rules, so logs and event fields that belong to the same dataset can contain a wide variety of values. For situations where your data needs to follow a consistent format, such as to meet requirements for downstream operations, you can standardize your data by constraining it to types.

A data type defines the allowed schema and range of values for a given piece of data. When you constrain your data to a particular type, you apply structure and rules to the data so that all values consistently meet the requirements described in the type definition.
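
For example, a type can constrain a string to a particular pattern. The following minimal sketch defines a hypothetical status_code type, which is not part of the example later in this topic, that only allows values consisting of a single uppercase letter:

// Allow only values that are a single uppercase letter, such as "A" or "C"
type status_code = string WHERE match($value, /^[A-Z]$/);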

To standardize your data, you can follow this high-level pattern:

  1. Choose a data type that describes the requirements you want your data to meet. If the built-in data types in SPL2 do not meet your needs, then define a custom data type that does. For more information, see Built-in data types and Custom data types.

  2. Decide whether you want to standardize indexed data at search time or incoming data at ingest time. This determines whether you need to use data types in a search module or in an Edge Processor or Ingest Processor pipeline.

  3. Check whether your data matches the chosen type, and identify any data that doesn't match the chosen type.

    • To check your data against a built-in type, you can use the IS operator in a predicate expression or use the appropriate informational eval function. See IS operator in the current manual and Informational functions in the SPL2 Search Reference.

    • To check your data against a custom type, you can use the IS operator in a predicate expression. This predicate expression can be the return expression of a custom eval function. See Checking data against custom data types.

  4. Convert the non-matching data into the chosen type, as shown in the sketch after this list.

    • To convert your data to a built-in type, use the appropriate conversion eval function. See Conversion functions in the SPL2 Search Reference.

    • To convert your data to a custom type, use a custom eval function. See Custom eval functions.

    Note: Alternatively, if you are working with data that is streaming through an Edge Processor or Ingest Processor pipeline, you can choose to drop or reroute any data that does not match the chosen type. For more information, see Validate and improve data quality using type checks.
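
As a brief illustration of steps 3 and 4, the following sketch checks a hypothetical bytes field against the built-in number type and converts any values that don't match. The index name, field name, and choice of the number type are illustrative assumptions, and tonumber is the conversion eval function for numeric values:

$typed_events = from sample_index
    // Step 3: flag events whose bytes value does not match the chosen type
    | eval needs_conversion = if(bytes IS number, false, true)
    // Step 4: convert the non-matching values into the chosen type
    | eval bytes = if(needs_conversion, tonumber(bytes), bytes)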

The following example featuring the fictitious Buttercup Games company demonstrates how to standardize logs using data types. In this example, we will standardize indexed data at search time, but the same logic is also applicable to data that is streaming through an Edge Processor or Ingest Processor pipeline.
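
If you standardize at ingest time instead, the same type checks and conversions sit between the pipeline's $source and $destination. The following minimal sketch only suggests the overall shape of such a pipeline; the placeholder eval expression and the exact scaffolding depend on your Edge Processor or Ingest Processor configuration:

$pipeline = | from $source
    // Apply the same type checks and conversion eval functions here that you
    // would use in a search module
    | eval _raw = tostring(_raw)
    | into $destination;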

For information about creating searches, see the Get started searching chapter and Modules and SPL2 statements.

For information about creating pipelines, see the Edge Processor and Ingest Processor documentation.

Example: Standardize inconsistently formatted sales logs

The Buttercup Games company sells its products through stores that are located all over the world. When a customer makes a purchase at a particular store, logs pertaining to that sales transaction are routed from a server in the store's geographical region to a centralized Splunk Cloud Platform deployment that indexes all of the sales logs.

Typically, the sales logs are emitted in the following format:

[12/Jan/2025:18:23:07] VendorID=5037 Code=C AcctID=5317605039838520
[12/Jan/2025:18:23:22] VendorID=9108 Code=A AcctID=2194850084423218
[12/Jan/2025:18:23:49] VendorID=1285 Code=F AcctID=8560077531775179

As one of the data administrators managing the store_sales index that stores these sales logs, you've configured Splunk Cloud Platform to extract pertinent details from the logs into event fields so that you can run reports on this data and help track sales activity over time.
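
For example, with those field extractions in place, a report on sales activity might count transactions by code. The following sketch is illustrative only; it assumes the extractions produce fields named VendorID, Code, and AcctID, and it performs the extraction inline with the rex command rather than through the configured extractions:

$sales_by_code = from store_sales
    // Extract the transaction details into top-level fields
    | rex field=_raw /VendorID=(?P<VendorID>\d+)\s+Code=(?P<Code>[A-Z])\s+AcctID=(?P<AcctID>\d+)/
    // Count sales transactions by code
    | stats count() by Code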

However, the company recently opened store locations in a new region, and the administrator for the servers in that region mistakenly configured the system to emit sales logs in this different format:

Thu Jan 12 2025 18:23:59 [1153] Code=D AcctID=4433276107716482
Thu Jan 12 2025 18:24:28 [7042] Code=N AcctID=4139333167069488
Thu Jan 12 2025 18:24:50 [1065] Code=L AcctID=3168124750473449

As a result, the store_sales index now contains logs that are a mix of different formats:

_time                     _raw
12 Jan 2025 6:23:07 PM    [12/Jan/2025:18:23:07] VendorID=5037 Code=C AcctID=5317605039838520
12 Jan 2025 6:23:22 PM    [12/Jan/2025:18:23:22] VendorID=9108 Code=A AcctID=2194850084423218
12 Jan 2025 6:23:49 PM    [12/Jan/2025:18:23:49] VendorID=1285 Code=F AcctID=8560077531775179
12 Jan 2025 6:23:59 PM    Thu Jan 12 2025 18:23:59 [1153] Code=D AcctID=4433276107716482
12 Jan 2025 6:24:28 PM    Thu Jan 12 2025 18:24:28 [7042] Code=N AcctID=4139333167069488
12 Jan 2025 6:24:50 PM    Thu Jan 12 2025 18:24:50 [1065] Code=L AcctID=3168124750473449

Your field extraction configurations fail to run on these new logs because the logs don't match the expected data schema, causing gaps in the report data.

To standardize the logs in the store_sales index, you do the following:

  1. Define custom data types that describe the sales log formats

  2. Identify the logs that do not match the standard format

  3. Convert the inconsistent logs into the standard format

Define custom data types that describe the sales log formats

Start by defining these 2 custom data types in a search module:

  1. The original_format type, which describes the standard format of a sales log event from a Buttercup Games store.

  2. The new_format type, which describes the format of the sales log events from the stores in the new region.

You can then check all the data in the store_sales index against these types to differentiate between the logs that are already consistent with the standard format and the ones that are inconsistent.

The following SPL2 expressions define the original_format data type. The regular expression that describes the standard sales log format is returned by a custom function named original_regex instead of being included literally in the type definition, so that the regular expression can be easily reused in other parts of the module.

type original_format = string WHERE match($value, original_regex());

function original_regex(): regex {
    return /\[(?P<tmp_day>\d+)\/(?P<tmp_month>[A-Z][a-z]*)\/(?P<tmp_year>\d+):(?P<tmp_time>\d+:\d+:\d+)\]\s+VendorID=(?P<tmp_vendor>\d+)\s+Code=(?P<tmp_code>[A-Z])\s+AcctID=(?P<tmp_acct>\d+)/;
}

The following SPL2 expressions define the new_format data type. The regular expression that describes the format of the sales logs from the new region is returned by a custom function named new_regex.

type new_format = string WHERE match($value, new_regex());

function new_regex(): regex {
    return /(?P<tmp_weekday>[A-Z][a-z]+)\s+(?P<tmp_month>[A-Z][a-z]+)\s+(?P<tmp_day>\d+)\s+(?P<tmp_year>\d+)\s+(?P<tmp_time>\d+:\d+:\d+)\s+\[(?P<tmp_vendor>\d+)\]\s+Code=(?P<tmp_code>[A-Z])\s+AcctID=(?P<tmp_acct>\d+)/;
}

Identify the logs that do not match the standard format

Use a custom eval function to check the store_sales index for logs that do not match the original_format data type.

The following SPL2 expression defines the get_format function, which evaluates any given input data and then returns one of the following values:

  1. original, if the input data matches the original_format data type.

  2. new, if the input data matches the new_format data type.

  3. <UNKNOWN>, if the input data does not match either of those data types.

function get_format($data) {
    return case(
        $data is original_format, "original",
        $data is new_format, "new",
        true, "<UNKNOWN>");
}

After defining the get_format function, you can use the following search statement to retrieve all the logs from the store_sales index, check each log to see whether it matches the original_format or new_format data type, and then store the results of the check in a field named format:

$standardized_logs = from store_sales | eval format = get_format(_raw)

This search statement returns the following results, where the format field indicates whether each log matches the original format or the new format:

_time                     format      _raw
12 Jan 2025 6:23:07 PM    original    [12/Jan/2025:18:23:07] VendorID=5037 Code=C AcctID=5317605039838520
12 Jan 2025 6:23:22 PM    original    [12/Jan/2025:18:23:22] VendorID=9108 Code=A AcctID=2194850084423218
12 Jan 2025 6:23:49 PM    original    [12/Jan/2025:18:23:49] VendorID=1285 Code=F AcctID=8560077531775179
12 Jan 2025 6:23:59 PM    new         Thu Jan 12 2025 18:23:59 [1153] Code=D AcctID=4433276107716482
12 Jan 2025 6:24:28 PM    new         Thu Jan 12 2025 18:24:28 [7042] Code=N AcctID=4139333167069488
12 Jan 2025 6:24:50 PM    new         Thu Jan 12 2025 18:24:50 [1065] Code=L AcctID=3168124750473449

Convert the inconsistent logs into the standard format

Now that you know exactly which logs in the store_sales index are inconsistently formatted and need to be amended, you can select those logs for processing and convert them into the standard sales log format.

To convert the inconsistent logs into the standard format, you need to do the following:

  1. Use the rex command to extract each piece of information from the _raw field of the log into a top-level event field.

  2. Use a custom eval function to compile the extracted pieces of information into the standard sales log format, and write the resulting log back into the _raw field.

The following SPL2 expression defines the to_original function, which takes the information from the event fields specified in the function arguments and then compiles the information into the standard sales log format:

function to_original($day: int, $month: string, $year: string, $time: string, $vendor: int, $code: string, $account: long) {
    return "[${$day}/${$month}/${$year}:${$time}] VendorID=${$vendor} Code=${$code} AcctID=${$account}";
}

The following search statement continues the example from the previous section. After identifying the inconsistent logs using the get_format function, you can use the rex command and the to_original function to convert those logs into the standard format. Finally, you can use the fields command to drop any fields that were only needed to support the conversion.

$standardized_logs = from store_sales
    | eval format = get_format(_raw)
    | rex field=_raw new_regex()
    | eval _raw = if(format=="new", 
        to_original(tmp_day, tmp_month, tmp_year, tmp_time, tmp_vendor, tmp_code, tmp_acct), _raw)
    | fields - format, 'tmp_*'

Results

The complete SPL2 module for standardizing the Buttercup Games sales logs is as follows:

type original_format = string WHERE match($value, original_regex());

type new_format = string WHERE match($value, new_regex());

function original_regex(): regex {
    return /\[(?P<tmp_day>\d+)\/(?P<tmp_month>[A-Z][a-z]*)\/(?P<tmp_year>\d+):(?P<tmp_time>\d+:\d+:\d+)\]\s+VendorID=(?P<tmp_vendor>\d+)\s+Code=(?P<tmp_code>[A-Z])\s+AcctID=(?P<tmp_acct>\d+)/;
}

function new_regex(): regex {
    return /(?P<tmp_weekday>[A-Z][a-z]+)\s+(?P<tmp_month>[A-Z][a-z]+)\s+(?P<tmp_day>\d+)\s+(?P<tmp_year>\d+)\s+(?P<tmp_time>\d+:\d+:\d+)\s+\[(?P<tmp_vendor>\d+)\]\s+Code=(?P<tmp_code>[A-Z])\s+AcctID=(?P<tmp_acct>\d+)/;
}

function get_format($data) {
    return case(
        $data is original_format, "original",
        $data is new_format, "new",
        true, "<UNKNOWN>");
}

function to_original($day: int, $month: string, $year: string, $time: string, $vendor: int, $code: string, $account: long) {
    return "[${$day}/${$month}/${$year}:${$time}] VendorID=${$vendor} Code=${$code} AcctID=${$account}";
}

$standardized_logs = from store_sales
    | eval format = get_format(_raw)
    | rex field=_raw new_regex()
    | eval _raw = if(format=="new",
        to_original(tmp_day, tmp_month, tmp_year, tmp_time, tmp_vendor, tmp_code, tmp_acct), _raw)
    | fields - format, 'tmp_*'

This module returns the following consistently formatted set of sales logs:

_time                     _raw
12 Jan 2025 6:23:07 PM    [12/Jan/2025:18:23:07] VendorID=5037 Code=C AcctID=5317605039838520
12 Jan 2025 6:23:22 PM    [12/Jan/2025:18:23:22] VendorID=9108 Code=A AcctID=2194850084423218
12 Jan 2025 6:23:49 PM    [12/Jan/2025:18:23:49] VendorID=1285 Code=F AcctID=8560077531775179
12 Jan 2025 6:23:59 PM    [12/Jan/2025:18:23:59] VendorID=1153 Code=D AcctID=4433276107716482
12 Jan 2025 6:24:28 PM    [12/Jan/2025:18:24:28] VendorID=7042 Code=N AcctID=4139333167069488
12 Jan 2025 6:24:50 PM    [12/Jan/2025:18:24:50] VendorID=1065 Code=L AcctID=3168124750473449

In addition to explaining the situation to the other data administrator, you can share the module with them to make it easier for them to understand the required data schema and correct their data.

See also