Verify data quality

The CMC Data Quality dashboard provides information to Splunk Cloud Platform administrators on issues that prevented the Splunk platform from correctly parsing your incoming data. Use this dashboard to analyze and resolve common issues that happen during the ingestion process.

Your data quality can have a great impact on both your system performance and your ability to achieve accurate results from your queries. If your data quality is degraded enough, it can slow down search performance and cause inaccurate search results. Be sure to regularly check and repair any data quality issues before they become a problem.

Generally, data quality issues fall under three main categories:

  • Line breaks: When there are problems with line breaks, the Splunk platform cannot reliably split your data into the separate events that it uses for searching.
  • Timestamp parsing: When there are timestamp parsing issues, the Splunk platform cannot reliably determine the correct timestamp for the event.
  • Aggregation: When there are problems with aggregation, the ability to break out fields correctly is affected.
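
As a rough illustration of why line breaking matters, the following Python sketch splits the same raw text two ways. It does not show how the Splunk platform breaks events internally. The sample log lines and the date pattern are assumptions made up for this example.

```python
import re

# Hypothetical raw input: two events, the second spanning three lines.
raw = (
    "2024-01-05 10:00:00 INFO service started\n"
    "2024-01-05 10:00:01 ERROR request failed\n"
    "  Traceback (most recent call last):\n"
    "  ValueError: bad input\n"
)

# Naive line breaking: every newline starts a new event.
naive_events = raw.splitlines()

# Timestamp-anchored breaking: start a new event only where a line
# begins with a date. The pattern is an illustrative assumption.
boundary = re.compile(r"^(?=\d{4}-\d{2}-\d{2} )", re.MULTILINE)
anchored_events = [e.rstrip("\n") for e in boundary.split(raw) if e]

print(len(naive_events))     # 4
print(len(anchored_events))  # 2
```

With the naive split, the two traceback lines become events of their own with no timestamp, which is the kind of downstream problem the categories above describe.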

Review the Data Quality dashboard

The tables in this dashboard list the issues Splunk Cloud Platform encountered when processing your events at both the source type and source levels. To help you identify which of your own data sources have quality issues, you can exclude Splunk source types from the results.

This dashboard contains one panel with a variable in the title: Issues by source type <variable> by source.

To investigate your panels, go to Cloud Monitoring Console > Indexing > Data Quality. Use the following table to understand the dashboard interface.

Panel or Filter: Description

Time Range: Set the time range for the data display.

Include Splunk Source Types: Specify whether to include or exclude Splunk source types from the results. Choose No to exclude Splunk source types and filter the results to only your source types.

Event Processing Issues by Source Types: The results table lists the following information:
  • Sourcetype: Select to open the Issues by source type <variable> by source panel.
  • Total issues
  • Source count: Total number of individual sources contained in the source type.
  • Line breaking, timestamp parsing, and aggregation issues

  When any cell shows a number greater than 0, select the cell to view the underlying search and related information. This data will help you resolve the issue.

Issues by source type <variable> by source: The <variable> value depends on the selected sourcetype. The results table lists the following information:
  • Source: Select any source to open its related Event Line Count, Event Size, and Event Time Disparity panels.
  • Total issues
  • Line breaking, timestamp parsing, and aggregation issues

Interpret data quality results

This section discusses how to check the quality of your data and how to repair issues you may encounter. However, the concept of data quality depends on what factors you use to judge quality. For the purposes of this section, data quality means that the data is correctly parsed.

Guidelines

Finding and repairing data quality issues is unique to each environment. However, using the following guidelines can help you address your data quality:

  • It's a good idea to check your most important data sources first. Often, you can have the most impact by making a few changes to a critical data source.
  • Data quality issues may generate hundreds or thousands of errors due to one root cause. Sort by volume and work on repairing the source that generates the largest volume of errors first.
  • Repairing data quality issues is an iterative process. Repair your most critical data sources first, and then run queries against the source again to see what problems remain.
  • For your most critical sources, resolve all data quality issues. This helps to ensure that your searches are effective and your performance is optimal.
  • Run these checks on a regular cadence to keep your system healthy.
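
The prioritization described in these guidelines can be sketched in Python as follows. This is a minimal illustration, not a feature of the dashboard. The issue counts, the firewall and app_log entries, and the `prioritize` helper are all hypothetical, and the rows stand in for values you would read from the dashboard tables.

```python
# Hypothetical issue totals per source, as you might copy them
# from the dashboard tables. The numbers are made up.
rows = [
    {"sourcetype": "syslog", "source": "/var/log/suricata/stats.log", "total": 1840},
    {"sourcetype": "syslog", "source": "/var/log/messages", "total": 12},
    {"sourcetype": "firewall", "source": "/var/log/fw.log", "total": 5200},
    {"sourcetype": "app_log", "source": "/opt/app/app.log", "total": 300},
]

def prioritize(rows, critical_sourcetypes):
    """Order rows so critical source types come first, then by issue volume."""
    return sorted(
        rows,
        key=lambda r: (r["sourcetype"] not in critical_sourcetypes, -r["total"]),
    )

# Assumption: syslog is the source type you consider most important.
work_queue = prioritize(rows, critical_sourcetypes={"syslog"})
for row in work_queue:
    print(row["sourcetype"], row["source"], row["total"])
```

Rerunning the same ordering after each repair gives you an updated queue, which matches the iterative approach the guidelines recommend.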

For more information, see Resolve data quality issues in the Splunk Cloud Platform Getting Data In manual.

Example

The following example shows the process of resolving a common data quality issue using information from the CMC Data Quality dashboard, specifically, resolving timestamp parsing issues in a source. The steps to resolve your particular data quality issues may differ, but you can use this example as a general template for resolving data quality issues.

  1. In the Data Quality dashboard, view the Event Processing Issues by Source Type panel. For this example, you are most concerned with timestamp errors in the syslog source type, so you need to drill down into that source type.

    Screenshot: the Cloud Monitoring Console > Indexing > Data Quality page.

  2. Drilling down, you can see that the majority of issues are with the following source: /var/log/suricata/stats.log.

    Screenshot: the Cloud Monitoring Console > Indexing > Data Quality page with a detailed view of the syslog source type.

  3. Select the source to drill down further and see the searches against this source.

    Screenshot: a Cloud Monitoring Console > Indexing > Data Quality detail showing the search query and details about the search.

  4. From here, you can look at a specific event. You can see that the Splunk platform could not parse a timestamp within the number of characters allowed by the MAX_TIMESTAMP_LOOKAHEAD setting.

    Screenshot: a Cloud Monitoring Console > Indexing > Data Quality detail showing a timestamp parsing issue related to the MAX_TIMESTAMP_LOOKAHEAD setting.

  5. To fix this, select Settings, and then select Source types in the DATA section.
  6. In the filter, enter syslog for the source type.
  7. Select Actions > Edit. The Edit Source Type page opens.
  8. Select Timestamp > Advanced… to open the Timestamp page for editing. Ensure you are satisfied with the timestamp format and the Lookahead settings. In this case, you need to edit the Lookahead settings so that the Splunk platform can parse the timestamp correctly.

    Screenshot: editing the timestamp settings for a source type under Settings in Splunk Cloud Platform.

  9. Return to the main Edit Source Type page and go to the Advanced menu. From here you can make other changes if needed.

    Screenshot: the Edit Source Type > Advanced screen.
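
To see why the Lookahead setting in this example matters, consider the following Python sketch of a lookahead limit. It is a simplified model and not the Splunk platform's timestamp processor. The sample event, the timestamp pattern, and the `find_timestamp` helper are assumptions made up for illustration.

```python
import re
from datetime import datetime

# Illustrative timestamp pattern; real data may use other formats.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def find_timestamp(event, lookahead):
    """Return the first timestamp in the first `lookahead` characters, else None."""
    match = TIMESTAMP.search(event[:lookahead])
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S")

# Hypothetical event whose timestamp starts 46 characters into the line.
event = "host=sensor01 component=stats uptime=86400 ts=2024-01-05 10:00:00 status=ok"

print(find_timestamp(event, 30))   # lookahead too short: no timestamp found
print(find_timestamp(event, 128))  # lookahead covers the timestamp
```

In this model, a lookahead that ends before the timestamp produces no match, so the choice is either to raise the limit or to tell the parser where the timestamp begins. Which fix is appropriate for your data depends on your event format.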