Validate and improve data quality using type checks
Validate and enhance data quality by using SPL2 type checks to ensure data meets specified criteria, identify invalid data, and implement corrective actions.
When you define a data type, you're specifying the criteria that determine whether a piece of data is valid based on its structure and contents. Data typing provides a framework for data validation and allows you to identify any invalid data. You can check your data against the appropriate types, and then use the results of these checks to gauge the quality of your data and identify remediation actions for improving it.
To validate and improve the quality of your data, you can follow this high-level pattern:
- Choose a data type that describes the criteria for valid data. If the built-in data types in SPL2 do not meet your needs, then define a custom data type that does. For more information, see Built-in data types and Custom data types.
- Check whether your data matches the chosen type.
  - To check your data against a built-in type, you can use the IS operator in a predicate expression or use the appropriate informational eval function. See IS operator in the current manual and Informational functions in the SPL2 Search Reference.
  - To check your data against a custom type, you can use the IS operator in a predicate expression. This predicate expression can be the return expression of a custom eval function. See Checking data against custom data types.
- For any data that doesn't pass the type check, inspect the data to identify the reasons for the validation errors and then process the data as needed to fix the problems.
- Implement the same type checks upstream in your data ingestion workflows to detect invalid data and route it away from production systems. For example, you can configure an Edge Processor or Ingest Processor pipeline to validate incoming data by checking it against a type, and then send valid and invalid data to different indexes. For more information, see the following documentation:
  - Process a subset of data using an Edge Processor in the Use Edge Processors manual.
  - Process a subset of data using an Ingest Processor in the Use Ingest Processors manual.
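For instance, the following sketch illustrates this pattern end to end. It assumes a hypothetical $network_events dataset with a dest_port field, defines a custom port_number type, and flags each event whose destination port falls outside the valid range:

type port_number = int WHERE ($value >= 0) AND ($value <= 65535)

$port_audit = | from $network_events
| eval port_check = if(dest_port IS port_number, "valid", "invalid")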
The following example featuring the fictitious Buttercup Games company demonstrates how to validate and improve the quality of your data using data types. In this example, we'll validate and correct indexed data at search time before configuring pipelines to validate and route incoming data.
For information about creating searches, see the Get started searching chapter and Modules and SPL2 statements.
For information about creating pipelines, see the following:
- The Working with pipelines chapter in the Use Ingest Processors manual
- The Working with pipelines chapter in the Use Edge Processors manual
Example: Validate and improve the data quality of company personnel records
The Buttercup Games company uses Splunk Cloud Platform to store and manage internal data, which includes personnel records. This data is not strictly typed, and over time, the structure and content of the data has become inconsistent as a result of multiple upgrades and migrations on the upstream systems that generate the data. As a result, the personnel index contains records with mixed schemas, such as the following:
| employee | name | status | type |
|---|---|---|---|
| 109561 | Claudia Garcia | Full-time | |
| 109562 | Ikraam Rahat | Full-time | |
| 109563 | Wei Zhang | Intern | |
| #110368 | Taylor Lee | | Contractor |
| #206895 | Sasha Patel | Current | FTE |
| #206896 | John Anderson | Retired | FTE |
Notice how the employee field sometimes contains values that are prefixed with the number sign ( # ), and how the information about each employee's contract type and employment status is variously stored in the status or type fields. These variations in the personnel records cause downstream searches and reports to sometimes return errors or contain gaps.
As a data administrator working for Buttercup Games, you want to validate the data in the personnel index against a desired standard so that you can gauge the quality of the data in the index. You also want to improve the quality of the data by fixing any issues that are identified, and preserve the quality of the data in the long term by preventing invalid records from being added to the index.
To validate and improve the quality of the data in the personnel index, you do the following:
Identify data types that describe the criteria for valid data
To validate the quality of your data, you need to be able to check each piece of data against a specific set of criteria. Start by identifying data types that describe the criteria you want your data to meet. You can use the built-in data types provided in SPL2 or define custom data types.
After speaking to your colleagues at Buttercup Games to confirm the specific criteria that the personnel data needs to meet, you identify the data types that capture those criteria. The following table describes the built-in data type or custom data type definition that captures each criterion:
| Data validation criterion | Data type |
|---|---|
| The employee field must contain employee ID numbers, which are integers from 100001 through 999999. | Use the following SPL2 expression to define a custom data type named emp_id: `type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)` |
| The name field must contain text strings. | For this requirement, you can use the built-in string data type. |
| The type field must contain one of the following values: Full-time, Part-time, Contractor, or Intern. | Use the following SPL2 expression to define a custom data type named emp_type: `type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")` |
| Each personnel record must contain these fields only: employee, name, and type. | In order to check an entire record against a data type, you must first format the record as a single data value, such as a text string or a JSON object. For example, you can convert the personnel records into JSON objects, and then validate each object against a structured data type that describes the key-value pairs that the object must contain. Use the following SPL2 expression to define a custom data type named personnel_record: `type personnel_record = {employee: emp_id, name: string, type: emp_type}` This data type matches with JSON objects that contain employee, name, and type keys whose values match the emp_id, string, and emp_type types, respectively. |
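For convenience, here are the three type definitions from this table collected into a single block, exactly as they appear in the complete modules in the Results section:

type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
type personnel_record = {
    employee: emp_id,
    name: string,
    type: emp_type
}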
Audit the index for invalid data
Check the records in the personnel index to confirm whether they meet your data validation criteria, and identify the specific problems that need to be fixed.
You can start by checking if the records meet the criterion of only including the employee, name, and type fields. To do this, run a basic search that returns all the data in the personnel index, and then review the field names that are returned. In this case, you confirm that all the records fail this criterion because they include an extra status field.
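For example, a search as simple as the following returns every record in the personnel index so that you can review the field names:

$review_fields = | from $personnel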
Then, to validate the data values in the employee, name, and type fields, you do the following:
- Check if the employee field matches the emp_id data type.
- Check if the name field matches the string data type.
- Check if the type field matches the emp_type data type.
The following search statement checks each field against the appropriate data type, and then returns the results in fields named employee_check, name_check, and type_check:
$validate_fields = | from $personnel
| eval employee_check = if(employee IS emp_id, "valid", "invalid")
| eval name_check = if(isstr(name), "valid", "invalid")
| eval type_check = if(type IS emp_type, "valid", "invalid")
This search statement returns the following results:
| employee | employee_check | name | name_check | status | type | type_check |
|---|---|---|---|---|---|---|
| 109561 | valid | Claudia Garcia | valid | Full-time | | invalid |
| 109562 | valid | Ikraam Rahat | valid | Full-time | | invalid |
| 109563 | valid | Wei Zhang | valid | Intern | | invalid |
| #110368 | invalid | Taylor Lee | valid | | Contractor | valid |
| #206895 | invalid | Sasha Patel | valid | Current | FTE | invalid |
| #206896 | invalid | John Anderson | valid | Retired | FTE | invalid |
You can then use statistical aggregation functions and mathematical functions to summarize these results for reporting purposes. The following search calculates the percentage of records that meet each field validation criterion, and returns the results in fields named valid_employee_percent, valid_name_percent, and valid_type_percent:
$data_quality_report = | from $validate_fields
| eventstats count(employee_check="valid") AS valid_employee
| eventstats count(name_check="valid") AS valid_name
| eventstats count(type_check="valid") AS valid_type
| eventstats count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)
The results of this search indicate the following:
- 50% of the records have valid employee fields.
- 100% of the records have valid name fields.
- 16.67% of the records have valid type fields.
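If you only need the summary percentages rather than per-record annotations, you can condense the report into a single result row by using stats in place of eventstats. The following variant is a sketch that reuses the predicate-counting pattern from the $data_quality_report search:

$data_quality_summary = | from $validate_fields
| stats count(employee_check="valid") AS valid_employee,
        count(name_check="valid") AS valid_name,
        count(type_check="valid") AS valid_type,
        count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)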
Correct and verify the invalid data
After reporting and discussing the results of your audit, you identify the corrections that need to be made to the personnel records:
- The number sign ( # ) needs to be removed from the invalid employee values.
- Some of the information in the status field needs to be recaptured using standardized values and moved into the type field.
- The status field is extraneous and needs to be dropped.
The following search statement fixes the validation errors in the personnel records:
$personnel_valid = from $personnel
| eval employee = trim(employee, "#")
| eval type = case(type = "", status,
type = "FTE", "Full-time",
true, type)
| fields - status
The $personnel_valid search statement returns the following corrected personnel records:
| employee | name | type |
|---|---|---|
| 109561 | Claudia Garcia | Full-time |
| 109562 | Ikraam Rahat | Full-time |
| 109563 | Wei Zhang | Intern |
| 110368 | Taylor Lee | Contractor |
| 206895 | Sasha Patel | Full-time |
| 206896 | John Anderson | Full-time |
You can double-check the validity of this data by checking each record in its entirety against the personnel_record data type.
The $final_validation search statement does the following:
- Uses the tojson function to convert each record into a JSON object.
- Checks if each object matches the personnel_record data type.
- Returns the results of the validation check in a field named record_check.
$final_validation = from $personnel_valid
| eval record_check = if(tojson() IS personnel_record, "valid", "invalid")
The record_check field returns valid for every personnel record.
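As a final spot check, you can filter for any records that failed the type check; an empty result set confirms that every record is valid. The following search is a sketch built on the $final_validation statement:

$remaining_invalid = | from $final_validation
| where record_check = "invalid"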
Prevent invalid data from being added to the index
You now have a fully valid copy of the personnel data. But to maintain the quality of this data in the long term, you need to make sure that new invalid data doesn't get added to the personnel index. You can do this by implementing data type checks upstream in your data ingestion workflow so that invalid data gets detected and redirected.
For example, if you are using Edge Processor or Ingest Processor pipelines to ingest data into the personnel index, you can configure your pipeline to route any data that fails to match the personnel_record type to a different index for assessment and amendment.
The following pipeline routes invalid data to the staging index, and allows valid data to continue to the personnel index:
import route from /splunk.ingest.commands
type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
type personnel_record = {
employee: emp_id,
name: string,
type: emp_type
}
$pipeline = | from $source
| route NOT (tojson() IS personnel_record), [
| eval index = "staging"
| into $destination
]
| eval index = "personnel"
| into $destination;
Results
The complete SPL2 module for auditing the data in the personnel index is as follows:
type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
$validate_fields = | from $personnel
| eval employee_check = if(employee IS emp_id, "valid", "invalid")
| eval name_check = if(isstr(name), "valid", "invalid")
| eval type_check = if(type IS emp_type, "valid", "invalid")
$data_quality_report = | from $validate_fields
| eventstats count(employee_check="valid") AS valid_employee
| eventstats count(name_check="valid") AS valid_name
| eventstats count(type_check="valid") AS valid_type
| eventstats count(employee) AS total
| eval valid_employee_percent = round((valid_employee / total * 100), 2)
| eval valid_name_percent = round((valid_name / total * 100), 2)
| eval valid_type_percent = round((valid_type / total * 100), 2)
The complete SPL2 module for correcting and validating the personnel records is as follows:
type emp_id = int WHERE ($value > 100000) AND ($value <= 999999)
type emp_type = string WHERE $value IN ("Full-time", "Part-time", "Contractor", "Intern")
type personnel_record = {
employee: emp_id,
name: string,
type: emp_type
}
$personnel_valid = from $personnel
| eval employee = trim(employee, "#")
| eval type = case(type = "", status,
type = "FTE", "Full-time",
true, type)
| fields - status
$final_validation = from $personnel_valid
| eval record_check = if(tojson() IS personnel_record, "valid", "invalid")
By using these modules to validate and correct the data in the personnel index, and then using the pipeline shown in the Prevent invalid data from being added to the index section to ensure that only valid data is allowed into the index, you have improved the consistency and correctness of the Buttercup Games personnel records. You have also put safeguards in place that uphold this improved level of data quality over the long term.