Validate and improve data quality using type checks

Validate and enhance data quality by using SPL2 type checks to ensure data meets specified criteria, identify invalid data, and implement corrective actions.

When you define a data type, you're specifying the criteria that determine whether a piece of data is valid based on its structure and contents. Data typing provides a framework for data validation and allows you to identify any invalid data. You can check your data against the appropriate types, and then use the results of these checks to gauge the quality of your data and identify remediation actions for improving it.

To validate and improve the quality of your data, you can follow this high-level pattern:

  1. Choose a data type that describes the criteria for valid data. If the built-in data types in SPL2 do not meet your needs, then define a custom data type that does. For more information, see Built-in data types and Custom data types.

  2. Check whether your data matches the chosen type, as illustrated in the sketch after this list.

    • To check your data against a built-in type, you can use the IS operator in a predicate expression or use the appropriate informational eval function. See IS operator in the current manual and Informational functions in the SPL2 Search Reference.

    • To check your data against a custom type, you can use the IS operator in a predicate expression. This predicate expression can be the return expression of a custom eval function. See Checking data against custom data types.

  3. For any data that doesn't pass the type check, inspect the data to identify the reasons for the validation errors and then process the data as needed to fix the problems.

  4. Implement the same type checks upstream in your data ingestion workflows to detect invalid data and route it away from production systems. For example, you can configure an Edge Processor or Ingest Processor pipeline to validate incoming data by checking it against a type, and then send valid and invalid data to different indexes.
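
For example, the following search shows what the type checks in step 2 can look like in practice. This is a minimal sketch against a hypothetical dataset named $orders with status_code and customer_name fields; the IS operator and the isstr informational function are the parts that come from SPL2 itself:

$order_checks = | from $orders
| eval code_check = if(status_code IS int, "valid", "invalid")
| eval name_check = if(isstr(customer_name), "valid", "invalid")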

The following example featuring the fictitious Buttercup Games company demonstrates how to validate and improve the quality of your data using data types. In this example, we'll validate and correct indexed data at search time before configuring pipelines to validate and route incoming data.

For information about creating searches, see the Get started searching chapter and Modules and SPL2 statements.

For information about creating pipelines, see the Edge Processor and Ingest Processor documentation.

Example: Validate and improve the data quality of company personnel records

The Buttercup Games company uses Splunk Cloud Platform to store and manage internal data, which includes personnel records. This data is not strictly typed, and over time, the structure and content of the data has become inconsistent as a result of multiple upgrades and migrations on the upstream systems that generate the data. As a result, the personnel index contains records with mixed schemas, such as the following:

employee | name           | status    | type
---------|----------------|-----------|-----------
109561   | Claudia Garcia | Full-time |
109562   | Ikraam Rahat   | Full-time |
109563   | Wei Zhang      | Intern    |
#110368  | Taylor Lee     |           | Contractor
#206895  | Sasha Patel    | Current   | FTE
#206896  | John Anderson  | Retired   | FTE

Notice how the employee field sometimes contains values that are prefixed with the number sign ( # ), and how the information about each employee's contract type and employment status is variously stored in the status or type fields. These variations in the personnel records cause downstream searches and reports to sometimes return errors or contain gaps.

As a data administrator working for Buttercup Games, you want to validate the data in the personnel index against a desired standard so that you can gauge the quality of the data in the index. You also want to improve the quality of the data by fixing any issues that are identified, and preserve the quality of the data in the long term by preventing invalid records from being added to the index.

To validate and improve the quality of the data in the personnel index, you do the following:

  1. Identify data types that describe the criteria for valid data

  2. Audit the index for invalid data

  3. Correct and verify the invalid data

  4. Prevent invalid data from being added to the index

Identify data types that describe the criteria for valid data

To validate the quality of your data, you need to be able to check each piece of data against a specific set of criteria. Start by identifying data types that describe the criteria you want your data to meet. You can use the built-in data types provided in SPL2 or define custom data types.

After speaking to your colleagues at Buttercup Games to confirm the specific criteria that the personnel data needs to meet, you identify the data types that capture those criteria. The following list pairs each data validation criterion with the built-in data type or custom data type definition that captures it:

Criterion: The employee field must contain valid employee ID numbers. A valid ID number consists of 6 digits, starting with 100001 for the first employee.

Data type: Use the following SPL2 expression to define a custom data type named emp_id:

type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)

Criterion: The name field must contain the full name of the employee.

Data type: For this requirement, you can use the built-in string type, which accepts any sequence of alphanumeric characters and special characters as a valid data value.

Criterion: The type field must contain one of the following values: Full-time, Part-time, Contractor, or Intern.

Data type: Use the following SPL2 expression to define a custom data type named emp_type:

type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")

Criterion: Each personnel record must contain these fields only: employee, name, and type.

Data type: To check an entire record against a data type, you must first format the record as a single data value, such as a text string or a JSON object. For example, you can convert the personnel records into JSON objects, and then validate each object against a structured data type that describes the key-value pairs that the object must contain.

Use the following SPL2 expression to define a custom data type named personnel_record:

type personnel_record = {
    employee: emp_id,
    name: string,
    type: emp_type
}

This data type matches JSON objects that contain employee, name, and type as top-level keys.
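
Before auditing real data, you can optionally spot-check these type definitions against sample values. The following is a minimal sketch that assumes your environment supports SPL2 dataset literals; the sample values and the $type_spot_check name are hypothetical:

$type_spot_check = | from [{employee: 109561, type: "Full-time"}, {employee: 99, type: "FTE"}]
| eval id_check = if(employee IS emp_id, "valid", "invalid")
| eval type_check = if(type IS emp_type, "valid", "invalid")

In these results, the first record returns valid for both checks, and the second record returns invalid for both.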

Audit the index for invalid data

Check the records in the personnel index to confirm whether they meet your data validation criteria, and identify the specific problems that need to be fixed.

You can start by checking whether the records meet the criterion of including only the employee, name, and type fields. To do this, run a basic search that returns all the data in the personnel index, and then review the field names that are returned. In this case, you confirm that all the records fail this criterion because they include an extra status field.
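
For example, a search as simple as the following returns every record in the index for review, assuming the index is available as a dataset named $personnel, as in the searches throughout this example:

$all_personnel = | from $personnel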

Then, to validate the data values in the employee, name, and type fields, you do the following:

  1. Check if the employee field matches the emp_id data type.

  2. Check if the name field matches the string data type.

  3. Check if the type field matches the emp_type data type.

The following search statement checks each field against the appropriate data type, and then returns the results in fields named employee_check, name_check, and type_check:

$validate_fields = | from $personnel 
| eval employee_check = if(employee IS emp_id, "valid", "invalid")
| eval name_check = if(isstr(name), "valid", "invalid")
| eval type_check = if(type IS emp_type, "valid", "invalid")

This search statement returns the following results:

employee | name           | status    | type       | employee_check | name_check | type_check
---------|----------------|-----------|------------|----------------|------------|-----------
109561   | Claudia Garcia | Full-time |            | valid          | valid      | invalid
109562   | Ikraam Rahat   | Full-time |            | valid          | valid      | invalid
109563   | Wei Zhang      | Intern    |            | valid          | valid      | invalid
#110368  | Taylor Lee     |           | Contractor | invalid        | valid      | valid
#206895  | Sasha Patel    | Current   | FTE        | invalid        | valid      | invalid
#206896  | John Anderson  | Retired   | FTE        | invalid        | valid      | invalid

You can then use statistical aggregation functions and mathematical functions to summarize these results for reporting purposes. The following search calculates the percentage of records that meet each field validation criterion, and returns the results in fields named valid_employee_percent, valid_name_percent, and valid_type_percent:

$data_quality_report = | from $validate_fields
| eventstats count(employee_check="valid") AS valid_employee
| eventstats count(name_check="valid") AS valid_name
| eventstats count(type_check="valid") AS valid_type
| eventstats count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)

The results of this search indicate the following:

  • 50% of the records have valid employee fields.

  • 100% of the records have valid name fields.

  • 16.67% of the records have valid type fields.

Correct and verify the invalid data

After reporting and discussing the results of your audit, you identify the corrections that need to be made to the personnel records:

  • The number sign ( # ) needs to be removed from the invalid employee values.

  • Some of the information in the status field needs to be recaptured using standardized values and moved into the type field.

  • The status field is extraneous and needs to be dropped.

The following search statement fixes the validation errors in the personnel records:

$personnel_valid = from $personnel
| eval employee = trim(employee, "#")
| eval type = case(type = "", status, 
    type = "FTE", "Full-time",
    true, type)
| fields - status

The $personnel_valid search statement returns the following corrected personnel records:

employee | name           | type
---------|----------------|-----------
109561   | Claudia Garcia | Full-time
109562   | Ikraam Rahat   | Full-time
109563   | Wei Zhang      | Intern
110368   | Taylor Lee     | Contractor
206895   | Sasha Patel    | Full-time
206896   | John Anderson  | Full-time

You can double-check the validity of this data by checking each record in its entirety against the personnel_record data type.

The following search statement does the following:

  1. Uses the tojson function to convert each record into a JSON object.

  2. Checks if each object matches the personnel_record data type.

  3. Returns the results of the validation check in a field named record_check.

$final_validation = from $personnel_valid
| eval record_check = if(tojson() IS personnel_record, "valid", "invalid")

The record_check field returns valid for every personnel record.

Prevent invalid data from being added to the index

You now have a fully valid copy of the personnel data. But to maintain the quality of this data in the long term, you need to make sure that new invalid data doesn't get added to the personnel index. You can do this by implementing data type checks upstream in your data ingestion workflow so that invalid data gets detected and redirected.

For example, if you are using Edge Processor or Ingest Processor pipelines to ingest data into the personnel index, you can configure your pipeline to route any data that fails to match the personnel_record type to a different index for assessment and amendment.

The following pipeline routes invalid data to the staging index, and allows valid data to continue to the personnel index:

import route from /splunk.ingest.commands

type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)

type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")

type personnel_record = {
    employee: emp_id,
    name: string,
    type: emp_type
}

$pipeline = | from $source 
| route NOT (tojson() IS personnel_record), [
    | eval index = "staging"
    | into $destination
]
| eval index = "personnel"
| into $destination;
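
If you want to preview how records would be routed before deploying the pipeline, you can reuse the same type check at search time. This is a hypothetical sketch; the $route_preview statement and the destination field name are illustrative only:

$route_preview = | from $personnel
| eval destination = if(tojson() IS personnel_record, "personnel", "staging")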

Results

The complete SPL2 module for auditing the data in the personnel index is as follows:

type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)

type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")

$validate_fields = | from $personnel 
| eval employee_check = if(employee IS emp_id, "valid", "invalid")
| eval name_check = if(isstr(name), "valid", "invalid")
| eval type_check = if(type IS emp_type, "valid", "invalid")

$data_quality_report = | from $validate_fields
| eventstats count(employee_check="valid") AS valid_employee
| eventstats count(name_check="valid") AS valid_name
| eventstats count(type_check="valid") AS valid_type
| eventstats count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)

The complete SPL2 module for correcting and validating the personnel records is as follows:

type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)

type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")

type personnel_record = {
    employee: emp_id,
    name: string,
    type: emp_type
}

$personnel_valid = from $personnel
| eval employee = trim(employee, "#")
| eval type = case(type = "", status, type = 
    "FTE", "Full-time",
    true, type)
| fields - status

$final_validation = from $personnel_valid
| eval record_check = if(tojson() IS personnel_record, "valid", "invalid")

By using these modules to validate and correct the data in the personnel index, you have improved the consistency and correctness of the Buttercup Games personnel records. By also deploying the pipeline shown in the Prevent invalid data from being added to the index section, you have implemented safeguards that keep invalid data out of the personnel index and uphold this improved level of data quality over time.
