Creating and using data schemas with SPL2 data types
SPL2 provides flexible data typing, so you can create schemas with data types that add structure, validation, and handling logic to your data.
By design, SPL2 and SPL are loosely and implicitly typed languages that do not define the schema of the data. You don't need to define specific types for your data before working with it, and the language infers types as needed in order to determine if a piece of data is valid for a given operation. For example, when an expression like eval errors = 10 is processed, the errors field is implicitly typed as an integer and considered to be valid input for numerical operations.
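For example, the following search sketch relies entirely on implicit typing. The dataset name main is a placeholder, and the search assumes nothing about the schema of the underlying events:

// Implicit typing: errors is inferred to be an integer, so the
// arithmetic in the second eval expression is considered valid.
from main
| eval errors = 10
| eval error_ratio = errors / 100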
However, when working in SPL2, you can choose to create a schema by using data types. Data types are classifications that specify the allowed format and range of values for a given piece of data. By constraining your data to well-defined types, you can add structure, descriptive metadata, validation logic, and handling logic to your data. Data types serve as the basic building blocks for constructing a data schema.
Flexible data typing in SPL2
SPL2 provides an extensible data typing system that you can use to define the schema of your data on an as-needed basis:
- You can control when to constrain data to specific types and when to leave the data loosely typed.
- You can choose to constrain various levels of data to different types. For example, a dataset can contain records that match the object type, and the individual fields in the records can match other types such as string or number.
- You can expand beyond the default set of supported data types by creating fine-tuned custom data types that match the shape of your actual data.
You can also choose not to use data types, and instead rely on the loose typing logic that the Splunk platform uses by default.
When to use strong typing instead of loose typing
SPL2 is loosely typed by default. You don't need to define specific types for your data before successfully ingesting, searching, and processing it, and you can define functions that allow the input and output to be any type of data. This flexibility reduces the amount of overhead required to work with the language, and supports use cases where the schema of the data is unknown or highly variable.
However, there are also situations where it is beneficial to constrain your data to specific types. Compared to strongly typed data, loosely typed data is relatively ambiguous. For example, if an event field named customer is untyped, it is unclear if the field should contain names, ID numbers, detailed records in JSON object format, or something else. The field might even contain a mix of those values if consistency is not enforced during data entry.
You can use strong typing to eliminate this ambiguity and set logical guidelines for your data. For example, assume that the customer event field is intended to contain ID numbers only. You can make this requirement explicit by constraining the customer field to the int data type, which corresponds to integers or whole numbers.
- You can verify whether all the values in the customer field are ID numbers instead of names or other kinds of text strings. The following expression returns true if the customer value matches the int type, and returns false otherwise:

  ... | eval type_check_results = if(isint(customer), "true", "false")

- You can filter the data so that only records that have ID numbers in the customer field are allowed to continue downstream. The following expression filters the data and only retains records where the value in the customer field matches the int type:

  ... | where isint(customer)
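The same check can also be inverted to surface the problem records instead. The following sketch assumes a placeholder dataset named main and a statement name of your choosing:

// Hypothetical search: retain only the records whose customer value
// fails the integer check, so they can be reviewed and corrected.
$invalid_customers = from main
| where not isint(customer)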
Specifying precise data requirements using custom data types
In addition to using the default data types that are built into SPL2, you can also create custom data types that specify more precise requirements.
For example, assume that a valid customer ID number must be exactly 6 digits, and that the first digit cannot be a 0. In this case, constraining the customer field to the int type will not suffice, since integers include numbers that have fewer or more than 6 digits as well as numbers that have 0 as the first digit. To capture these requirements for customer ID numbers, you can define a custom data type that is based on the int type but restricts the allowed range of values. The following expression defines a custom type named id_number, which corresponds to integers that are between 100000 and 999999, inclusive:
type id_number = int where ($value BETWEEN 100000 AND 999999)
You can then constrain the customer field to this custom type.
- The following expression returns true if the customer value matches the id_number type, and returns false otherwise:

  ... | eval type_check_results = if(customer IS id_number, "true", "false")

- The following expression filters the data and only retains records where the value in the customer field matches the id_number type:

  ... | where customer IS id_number
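Taken together, the type definition and the filter might sit in the same SPL2 module, as in the following sketch. The dataset name main and the statement name valid_customers are placeholders:

// Custom type: 6-digit integers that don't start with 0.
type id_number = int where ($value BETWEEN 100000 AND 999999)

// Keep only the records whose customer value matches the custom type.
$valid_customers = from main
| where customer IS id_number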
You can also combine multiple data types to create an advanced data type that layers together the requirements defined in those individual types.
As another example, assume that the _raw field in your data contains complete customer records, which are JSON objects containing the keys customer, name, email, and vip_member. A valid customer record looks like this:
{"customer": 109351, "name": "Buttercup Games Company", "email": "info@buttercupgames.com", "vip_member": true}
You can use a combination of data types to constrain the values in the _raw field to this customer record format and ensure that the values for each key are valid. To do this, start by identifying data types that describe the validation requirement for each key in the JSON object:
| Validation requirement | Data type |
|---|---|
| The customer key must contain valid customer ID numbers: 6-digit integers that don't start with a 0. | Use the following SPL2 expression to define a custom data type named id_number: type id_number = int where ($value BETWEEN 100000 AND 999999) |
| The name key must contain text strings. | Use the built-in string type. |
| The email key must contain valid email addresses. | Define a custom data type named email_address that is based on the string type and uses a regular expression to restrict the allowed values, as sketched after this table. |
| The vip_member key must contain Boolean values. | Use the built-in boolean type. |
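The exact regular expression is a matter of choice. One plausible way to define the email_address type, assuming that the where clause of a type definition accepts the match evaluation function, is the following sketch. The pattern shown is deliberately simple and illustrative, not a complete validator for the email address format:

// Hedged sketch: accept strings shaped like name@domain.tld.
// The pattern is illustrative only; substitute your own regular
// expression for production use.
type email_address = string where match($value, "^[^@ ]+@[^@ ]+[.][^@ ]+$")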
Then, to specify how these data types must be contained in a JSON object format, combine them in a structured data type named customer_record:
type customer_record = {
customer: id_number,
name: string,
email: email_address,
vip_member: boolean
}
You can then constrain the _raw field to this customer_record type in order to ensure that the customer records are all JSON objects that contain the keys customer, name, email, and vip_member, and that the value in each key is valid according to your requirements.
By constraining a piece of data to a particular type, you can specify and enforce rules about the structure and content of that data.
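Following the same pattern as the earlier id_number checks, a type check against the structured type might look like the following sketch, where main is a placeholder dataset name:

// Hypothetical filter: retain only the events whose _raw value
// matches the customer_record type defined earlier.
from main
| where _raw IS customer_record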
Using schemas to work with your data
When you create a schema for your data using data types, you implement a metadata framework that describes the characteristics of the data, such as the relevant fields and the allowed range of values. This framework allows you to distinguish between different kinds of data, identify data of interest, and selectively filter and process specific subsets of data.
Depending on your particular needs, you might want to use schemas during ingest time, search time, or both. Constraining incoming data to types as it streams through an Edge Processor or Ingest Processor pipeline allows you to categorize that data, process and route different categories differently, and validate and correct the data before writing it to storage. In contrast, constraining search results to types allows you to categorize and selectively process data while investigating, analyzing, and reporting on indexed data.
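For instance, at ingest time a type-based filter can sit inside an Edge Processor pipeline. The following sketch assumes that the customer_record type is defined in the same module and uses the $source and $destination pipeline parameters; treat it as an outline rather than a drop-in pipeline:

// Hypothetical ingest-time pipeline: pass along only the events
// whose _raw value matches the customer_record type.
$pipeline = | from $source
| where _raw IS customer_record
| into $destination;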
Across all contexts, schematizing your data allows you to achieve a broad range of data processing outcomes, including the following:
Standardize your data
Ensure that your data follows a consistent schema by constraining the data to an appropriate type. For more information, see Standardize data using SPL2 data types.
Validate and improve data quality
Assess the quality of your data by checking whether the structure and content of the data meets the requirements specified in the type definitions. Improve the quality of your data by identifying invalid data, preventing it from reaching production environments, and correcting it. For more information, see Validate and improve data quality using type checks.
Implement data handling
Distinguish between different subsets of data based on their schemas so that you can select, process, and route data of interest in different ways. For more information, see Implement data handling logic using SPL2 data types.
See also
Related reference
Conversion functions in the SPL2 Search Reference
Informational functions in the SPL2 Search Reference