AI troubleshooting agent and remediation plan in Splunk Observability Cloud

Use the AI troubleshooting agent and remediation plan to analyze root causes, test hypotheses, and resolve alerts.

When your systems experience failures and require troubleshooting, the AI troubleshooting agent automates root cause analysis so that engineers can identify failures faster. After identifying potential root causes, you can move on to the AI remediation plan to see guided steps for remediation.

Availability

The AI troubleshooting agent and remediation plan are currently available only to customers in the us1 realm of Splunk Observability Cloud. Reach out to your Splunk representative if you are interested in using these features.

Use AI troubleshooting agent to identify potential root causes

The AI troubleshooting agent is the root cause analysis capability powered by Splunk AI. Alerts and detectors determine whether services and infrastructure components are healthy. For alerts relating to Splunk APM services and Kubernetes in Infrastructure Monitoring, Splunk Observability Cloud automatically triggers the AI troubleshooting agent to run root cause analysis and display suspected root causes when you open the alert.

Note: To learn more about alerts, see Introduction to alerts and detectors in Splunk Observability Cloud. The AI troubleshooting agent and remediation plan currently support detectors and alerts with standard, default metrics. Custom metrics are not supported at this time.

When systems begin to break down, alerts in Splunk Observability Cloud notify the people responsible for maintaining your systems by sending notifications to designated users through email, Slack, or ServiceNow.

When users open the alert from an email or Slack message, or directly from the Active Alerts page in Splunk Observability Cloud, they find a summary of the alert with the following tabs:

  • Overview

  • Root Cause Analysis

  • Evidence

On the Overview tab, you see the following:

  • Summary of the alert, including the rule triggered

  • Impact analysis, which describes the blast radius of the problem across applications and services

  • Primary root cause

  • Additional troubleshooting information

Troubleshooting is integrated with the AI Assistant chat experience. For example, the root cause analysis execution status or 'chain of thoughts' appears in the chat. Root cause analysis continues to run even when you exit the alert.

On the Root Cause Analysis tab, you can rate the overall investigation by selecting the thumbs up or thumbs down in the chat; Splunk uses this information to improve or reinforce the results.

Root cause analysis runs once per user, so when you return to the alert page, the results are available immediately. You can also regenerate the analysis (for example, latency with logs) on the Root Cause page.

To minimize Mean Time to Resolution (MTTR), you can do the following when you receive an alert:

  1. Review the alert Overview tab to determine the domain and impact radius of the problem, and to see the primary root cause, the alert and detector information, and additional troubleshooting information.

  2. Further investigate potential root causes by selecting Review root causes on the Overview tab.

  3. On the Root Cause Analysis tab, select the specific root cause you want to investigate. Then select links found in the summaries of suspected root causes to read more detail about the suspicious behavior in the relevant Splunk Observability Cloud component.

  4. Additionally, you can view the Evidence tab to see relevant logs, exemplar traces, Splunk APM services, and other evidence contributing to the alert.

Use AI remediation plan to resolve alerts

After reviewing the suspected root causes and evidence provided by the troubleshooting agent, you can use the AI remediation plan in Splunk Observability Cloud to resolve the alert.

To see the guided steps for remediation, navigate to Action plan by selecting View AI-generated action plan below the suspected root cause. The AI remediation plan generates a set of hypotheses and root causes along with the associated actions.

To test the hypotheses and resolve the alert, you can do the following:

  1. Select a hypothesis and a root cause to remediate.

    A high-level graph shows the selected hypothesis, the root cause, and the associated steps you need to take to resolve the incident.

  2. Copy the suggested kubectl command or code block, then run it in your terminal or apply it in your source code location.

  3. Submit the output so the plan can determine next steps.

  4. As you iterate through the action plan, mark each step as completed. You can also undo steps as needed.

  5. When you complete all the steps, you receive a summary of your actions.
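When you run a suggested command and submit its output, it helps to capture the exit code and both output streams so that nothing is lost when you paste the result. The following is a minimal sketch of one way to do that with Python's standard library. The helper name run_and_capture and its output layout are hypothetical and are not part of Splunk Observability Cloud. Always run the exact command that the action plan suggests.

```python
import shlex
import subprocess


def run_and_capture(command, timeout_seconds=60):
    """Run a command given as a list of arguments and return a text report.

    The report contains the command line, the exit code, stdout, and stderr,
    so the complete result can be pasted back into the action plan.
    This helper is a hypothetical convenience, not a Splunk API.
    """
    header = "$ " + shlex.join(command)
    try:
        result = subprocess.run(
            command,
            capture_output=True,   # collect stdout and stderr
            text=True,             # decode bytes to str
            timeout=timeout_seconds,
            check=False,           # report non-zero exit codes instead of raising
        )
    except FileNotFoundError:
        # The executable (for example, kubectl) is not installed or not on PATH.
        return header + "\ncommand not found: " + command[0]

    return "\n".join([
        header,
        f"exit code: {result.returncode}",
        "--- stdout ---",
        result.stdout.rstrip(),
        "--- stderr ---",
        result.stderr.rstrip(),
    ])
```

For example, if the plan suggested listing pods in a namespace, you might call print(run_and_capture(["kubectl", "get", "pods", "--namespace", "default"])) and paste the printed report as the step output. That kubectl command is only an illustration. The commands in your plan depend on the hypothesis and root cause you selected.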

After going through the remediation flow, you can mark the alert as resolved and close the incident if the outcome is satisfactory.