AI troubleshooting agent and remediation plan in Splunk Observability Cloud

Use AI troubleshooting agent and remediation plan to analyze root causes, test hypotheses, and resolve alerts.

When your systems experience failures and require troubleshooting, the AI troubleshooting agent automates root cause analysis so that engineers can identify failures faster. After you identify potential root causes, move on to the AI remediation plan to see guided steps for remediation.

Use AI troubleshooting agent to identify potential root causes

Alerts and detectors determine whether services and infrastructure components are healthy. For incidents related to Splunk APM services and to Kubernetes in Infrastructure Monitoring, Splunk Observability Cloud alerts automatically trigger the AI troubleshooting agent, which performs root cause analysis and displays suspected root causes.

When systems begin to break down, Splunk Observability Cloud contacts the people responsible for maintaining them by sending alerts to designated users through email, Slack, or ServiceNow.

When users open the alert from an email or Slack message, or directly from the Active Alerts page, they find the following:

  • Summary of the incident, including the rule triggered

  • Information about the detector

  • Analysis powered by Splunk AI

  • Evidence supporting why the root cause was presented

  • Explanation of the suspected root cause or causes

  • Action plan that shows remediation steps

  • AI Assistant chat available for further conversation

AI troubleshooting agent refers to the root cause analysis powered by Splunk AI.

To minimize Mean Time to Resolution (MTTR), you can do the following when you receive an alert:

  1. Read the alert summary, alert information, and detector information to determine the domain and impact radius of the problem.

  2. To further investigate potential root causes, select the root cause you want to investigate.

  3. Select links found in the summaries of suspected root causes to read more detail about the suspicious behavior in the relevant Splunk Observability Cloud component.

  4. Select Explain root cause to see which parts of your system the troubleshooting agent analyzed and how the root cause was generated. You can also rate the evidence provided to you. You can submit only one rating per hour.

  5. View the evidence tab to see relevant logs, exemplar traces, Splunk APM services, and other evidence contributing to the alert.
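
To make the MTTR goal above concrete, the following is a minimal sketch of one common way to compute it: the average of resolution time minus trigger time across incidents. The function name and the timestamps are hypothetical and for illustration only; this is not how Splunk Observability Cloud necessarily calculates or reports MTTR.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Return the average of (resolved - triggered) over all incidents.

    Each incident is a (triggered_at, resolved_at) pair. This definition is an
    assumption for illustration, not a documented Splunk formula.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum(
        (resolved - triggered for triggered, resolved in incidents),
        timedelta(),  # start value so that sum() adds timedeltas, not ints
    )
    return total / len(incidents)


# Hypothetical incidents: one resolved in 30 minutes, one in 90 minutes.
example_incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]
print(mean_time_to_resolution(example_incidents))  # prints 1:00:00
```

Tracking this value before and after adopting the troubleshooting workflow is one way to check whether the steps above shorten resolution time in practice.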

Use AI remediation plan to resolve alerts

After reviewing the suspected root causes and the evidence provided by the troubleshooting agent, you can use the AI remediation plan in Splunk Observability Cloud.

Navigate to Action plan below the suspected root cause to see the guided steps for remediation. The AI remediation plan generates a set of hypotheses and root causes along with the associated actions.

To test the hypotheses and resolve the alert, you can do the following:

  1. Select a hypothesis and a root cause to remediate.

    You will see a high-level graph describing the selected hypothesis, root cause, and the associated steps you need to take to resolve the incident.

  2. Copy the suggested kubectl commands or code blocks, then run them in your terminal or apply them in your source code location.

  3. Submit the output so the plan can determine next steps.

  4. As you iterate through the action plan, mark each step as completed. Alternatively, you can undo steps as needed.

  5. After you complete all the steps, you receive a summary of your actions.
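
When you run a suggested command and submit its output, it can help to capture the command line, exit code, standard output, and standard error together so nothing is lost when you paste the result back into the plan. The following is a minimal sketch of that idea, assuming Python 3.8 or later is available on the machine where you run the command. The helper name `run_and_capture`, the transcript layout, and the example kubectl invocation are hypothetical and are not part of Splunk Observability Cloud.

```python
import shlex
import subprocess


def run_and_capture(command: list[str], timeout_s: float = 60.0) -> str:
    """Run a command and return a plain-text transcript of the result.

    The transcript contains the command line, the exit code, stdout, and
    stderr. The layout is an arbitrary choice made for this sketch.
    """
    result = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode output as text rather than bytes
        timeout=timeout_s,    # raises subprocess.TimeoutExpired if exceeded
        check=False,          # keep failures: error output is useful evidence
    )
    return (
        f"$ {shlex.join(command)}\n"
        f"exit code: {result.returncode}\n"
        f"--- stdout ---\n{result.stdout}\n"
        f"--- stderr ---\n{result.stderr}\n"
    )


# Example with a hypothetical command and namespace. Replace it with the exact
# command shown in your action plan:
# print(run_and_capture(["kubectl", "get", "pods", "-n", "my-namespace"]))
```

Keeping non-zero exit codes and error output in the transcript, instead of raising an exception, is deliberate: a failed command is often exactly the evidence needed to decide the next step.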

After going through the remediation flow, you can mark the alert as resolved and close the incident if the outcome is satisfactory.