Validate and improve data quality using type checks
Validate and enhance data quality by using SPL2 type checks to ensure data meets specified criteria, identify invalid data, and implement corrective actions.
When you define a data type, you're specifying the criteria that determine whether a piece of data is valid based on its structure and contents. Data typing provides a framework for data validation and allows you to identify any invalid data. You can check your data against the appropriate types, and then use the results of these checks to gauge the quality of your data and identify remediation actions for improving it.
To validate and improve the quality of your data, you can follow this high-level pattern:
- Choose a data type that describes the criteria for valid data. If the built-in data types in SPL2 do not meet your needs, then define a custom data type that does. For more information, see Built-in data types and Custom data types.
- Check whether your data matches the chosen type.
  - To check your data against a built-in type, you can use the IS operator in a predicate expression or use the appropriate informational eval function. See IS operator in the current manual and Informational functions in the SPL2 Search Reference.
  - To check your data against a custom type, you can use the IS operator in a predicate expression. This predicate expression can be the return expression of a custom eval function. See Checking data against custom data types.
- For any data that doesn't pass the type check, inspect the data to identify the reasons for the validation errors and then process the data as needed to fix the problems.
- Implement the same type checks upstream in your data ingestion workflows to detect invalid data and route it away from production systems. For example, you can configure an Edge Processor or Ingest Processor pipeline to validate incoming data by checking it against a type, and then send valid and invalid data to different indexes. For more information, see the following documentation:
  - Process a subset of data using an Edge Processor in the Use Edge Processors manual.
  - Process a subset of data using an Ingest Processor in the Use Ingest Processors manual.
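For instance, the following sketch illustrates this pattern end to end. It assumes a hypothetical $network_events dataset with a dest_port field, defines a custom port_number type, and flags each event whose destination port falls outside the valid range:

type port_number = int WHERE ($value >= 0) AND ($value <= 65535)

$port_audit = | from $network_events
| eval port_check = if(dest_port IS port_number, "valid", "invalid")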
The following example featuring the fictitious Buttercup Games company demonstrates how to validate and improve the quality of your data using data types. In this example, we'll validate and correct indexed data at search time before configuring pipelines to validate and route incoming data.
For information about creating searches, see the Get started searching chapter and Modules and SPL2 statements.
For information about creating pipelines, see the following:
- The Working with pipelines chapter in the Use Ingest Processors manual
- The Working with pipelines chapter in the Use Edge Processors manual
Example: Validate and improve the data quality of company personnel records
The Buttercup Games company uses Splunk Cloud Platform to store and manage internal data, which includes personnel records. This data is not strictly typed, and over time, the structure and content of the data has become inconsistent as a result of multiple upgrades and migrations on the upstream systems that generate the data. As a result, the personnel index contains records with mixed schemas, such as the following:
| employee | name | status | type |
|---|---|---|---|
| 109561 | Claudia Garcia | Full-time | |
| 109562 | Ikraam Rahat | Full-time | |
| 109563 | Wei Zhang | Intern | |
| #110368 | Taylor Lee | | Contractor |
| #206895 | Sasha Patel | Current | FTE |
| #206896 | John Anderson | Retired | FTE |
Notice how the employee field sometimes contains values that are prefixed with the number sign ( # ), and how the information about each employee's contract type and employment status is variously stored in the status or type fields. These variations in the personnel records cause downstream searches and reports to sometimes return errors or contain gaps.
As a data administrator working for Buttercup Games, you want to validate the data in the personnel index against a desired standard so that you can gauge the quality of the data in the index. You also want to improve the quality of the data by fixing any issues that are identified, and preserve the quality of the data in the long term by preventing invalid records from being added to the index.
To validate and improve the quality of the data in the personnel index, you do the following:
Identify data types that describe the criteria for valid data
To validate the quality of your data, you need to be able to check each piece of data against a specific set of criteria. Start by identifying data types that describe the criteria you want your data to meet. You can use the built-in data types provided in SPL2 or define custom data types.
After speaking to your colleagues at Buttercup Games to confirm the specific criteria that the personnel data needs to meet, you identify the data types that capture those criteria. The following table describes the built-in data type or custom data type definition that captures each criterion:
| Data validation criterion | Data type |
|---|---|
| The employee field must contain employee ID numbers, which are integers from 100001 through 999999. | Use the following SPL2 expression to define a custom data type named emp_id: `type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)` |
| The name field must contain text strings. | For this requirement, you can use the built-in string data type. |
| The type field must contain one of the following values: Full-time, Part-time, Contractor, or Intern. | Use the following SPL2 expression to define a custom data type named emp_type: `type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")` |
| Each personnel record must contain these fields only: employee, name, and type. | In order to check an entire record against a data type, you must first format the record as a single data value, such as a text string or a JSON object. For example, you can convert the personnel records into JSON objects, and then validate each object against a structured data type that describes the key-value pairs that the object must contain. Use the following SPL2 expression to define a custom data type named personnel_record: `type personnel_record = {employee: emp_id, name: string, type: emp_type}` This data type matches with JSON objects that contain employee, name, and type keys whose values match the emp_id, string, and emp_type types, respectively. |
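For convenience, here are the three type definitions from this table collected into a single block, exactly as they appear in the complete modules in the Results section:

type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
type personnel_record = {
    employee: emp_id,
    name: string,
    type: emp_type
}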
Audit the index for invalid data
Check the records in the personnel index to confirm whether they meet your data validation criteria, and identify the specific problems that need to be fixed.
You can start by checking if the records meet the criterion of only including the employee, name, and type fields. To do this, run a basic search that returns all the data in the personnel index, and then review the field names that are returned. In this case, you confirm that all the records fail this criterion because they include an extra status field.
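For example, a search as simple as the following returns every record in the personnel index so that you can review the field names:

$review_fields = | from $personnel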
Then, to validate the data values in the employee, name, and type fields, you do the following:
- Check if the employee field matches the emp_id data type.
- Check if the name field matches the string data type.
- Check if the type field matches the emp_type data type.
The following search statement checks each field against the appropriate data type, and then returns the results in fields named employee_check, name_check, and type_check:
$validate_fields = | from $personnel
| eval employee_check = if(employee IS emp_id, "valid", "invalid")
| eval name_check = if(isstr(name), "valid", "invalid")
| eval type_check = if(type IS emp_type, "valid", "invalid")
This search statement returns the following results:
| employee | employee_check | name | name_check | status | type | type_check |
|---|---|---|---|---|---|---|
| 109561 | valid | Claudia Garcia | valid | Full-time | | invalid |
| 109562 | valid | Ikraam Rahat | valid | Full-time | | invalid |
| 109563 | valid | Wei Zhang | valid | Intern | | invalid |
| #110368 | invalid | Taylor Lee | valid | | Contractor | valid |
| #206895 | invalid | Sasha Patel | valid | Current | FTE | invalid |
| #206896 | invalid | John Anderson | valid | Retired | FTE | invalid |
You can then use statistical aggregation functions and mathematical functions to summarize these results for reporting purposes. The following search calculates the percentage of records that meet each field validation criterion, and returns the results in fields named valid_employee_percent, valid_name_percent, and valid_type_percent:
$data_quality_report = | from $validate_fields
| eventstats count(employee_check="valid") AS valid_employee
| eventstats count(name_check="valid") AS valid_name
| eventstats count(type_check="valid") AS valid_type
| eventstats count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)
The results of this search indicate the following:
- 50% of the records have valid employee fields.
- 100% of the records have valid name fields.
- 16.67% of the records have valid type fields.
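If you only need the summary percentages rather than per-record annotations, you can condense the report into a single result row by using stats in place of eventstats. The following variant is a sketch that reuses the predicate-counting pattern from the $data_quality_report search:

$data_quality_summary = | from $validate_fields
| stats count(employee_check="valid") AS valid_employee,
        count(name_check="valid") AS valid_name,
        count(type_check="valid") AS valid_type,
        count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)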
Correct and verify the invalid data
After reporting and discussing the results of your audit, you identify the corrections that need to be made to the personnel records:
- The number sign ( # ) needs to be removed from the invalid employee values.
- Some of the information in the status field needs to be recaptured using standardized values and moved into the type field.
- The status field is extraneous and needs to be dropped.
The following search statement fixes the validation errors in the personnel records:
$personnel_valid = from $personnel
| eval employee = trim(employee, "#")
| eval type = case(type = "", status,
type = "FTE", "Full-time",
true, type)
| fields - status
The $personnel_valid search statement returns the following corrected personnel records:
| employee | name | type |
|---|---|---|
| 109561 | Claudia Garcia | Full-time |
| 109562 | Ikraam Rahat | Full-time |
| 109563 | Wei Zhang | Intern |
| 110368 | Taylor Lee | Contractor |
| 206895 | Sasha Patel | Full-time |
| 206896 | John Anderson | Full-time |
You can double-check the validity of this data by checking each record in its entirety against the personnel_record data type.
The $final_validation search statement does the following:
- Uses the tojson function to convert each record into a JSON object.
- Checks if each object matches the personnel_record data type.
- Returns the results of the validation check in a field named record_check.
$final_validation = from $personnel_valid
| eval record_check = if(tojson() IS personnel_record, "valid", "invalid")
The record_check field returns valid for every personnel record.
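As a final spot check, you can filter for any records that failed the type check; an empty result set confirms that every record is valid. The following search is a sketch built on the $final_validation statement:

$remaining_invalid = | from $final_validation
| where record_check = "invalid"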
Prevent invalid data from being added to the index
You now have a fully valid copy of the personnel data. But to maintain the quality of this data in the long term, you need to make sure that new invalid data doesn't get added to the personnel index. You can do this by implementing data type checks upstream in your data ingestion workflow so that invalid data gets detected and redirected.
For example, if you are using Edge Processor or Ingest Processor pipelines to ingest data into the personnel index, you can configure your pipeline to route any data that fails to match the personnel_record type to a different index for assessment and amendment.
The following pipeline routes invalid data to the staging index, and allows valid data to continue to the personnel index:
import route from /splunk.ingest.commands
type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
type personnel_record = {
employee: emp_id,
name: string,
type: emp_type
}
$pipeline = | from $source
| route NOT (tojson() IS personnel_record), [
| eval index = "staging"
| into $destination
]
| eval index = "personnel"
| into $destination;
Results
The complete SPL2 module for auditing the data in the personnel index is as follows:
type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
$validate_fields = | from $personnel
| eval employee_check = if(employee IS emp_id, "valid", "invalid")
| eval name_check = if(isstr(name), "valid", "invalid")
| eval type_check = if(type IS emp_type, "valid", "invalid")
$data_quality_report = | from $validate_fields
| eventstats count(employee_check="valid") AS valid_employee
| eventstats count(name_check="valid") AS valid_name
| eventstats count(type_check="valid") AS valid_type
| eventstats count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)
The complete SPL2 module for correcting and validating the personnel records is as follows:
type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
type personnel_record = {
employee: emp_id,
name: string,
type: emp_type
}
$personnel_valid = from $personnel
| eval employee = trim(employee, "#")
| eval type = case(type = "", status,
type = "FTE", "Full-time",
true, type)
| fields - status
$final_validation = from $personnel_valid
| eval record_check = if(tojson() IS personnel_record, "valid", "invalid")
By using these modules to validate and correct the data in the personnel index, and then using the pipeline shown in the Prevent invalid data from being added to the index section to ensure that only valid data is allowed into the index, you have improved the consistency and correctness of the Buttercup Games personnel records. You have also put safeguards in place that uphold this improved level of data quality over the long term.