Monitor AI agents with Splunk APM
Monitor the performance, token usage, quality, and cost of your AI agents with Splunk APM.
The Agents page can help you answer questions such as:
- Which of my AI agents are currently degraded in performance?
- Which AI agents are using the most tokens?
- What quality issues are currently affecting my AI agents?
- What types of quality issues are most prevalent?
Prerequisites
To monitor AI agents with Splunk APM, set up AI Agent Monitoring.
Monitor all AI agents
To monitor all AI agents, go to the Agents page from the Splunk Observability Cloud main menu. The following screenshot displays an example of the Agents page.
On the Agents page, the panels above the table display the aggregate metrics across all your agents. The table displays a list of the instrumented agents in your environment and their individual metrics.
View related logs for an AI agent
In the table of agents on the Agents page, select the icon in the Related logs column to navigate to the Logs page. This page displays a table of related logs.
Select a log from the table to view additional details about the AI agent calls. You can select the Trace ID or Span ID to display an option to navigate to the related trace or span.
Drill down into the detail view of an AI agent
In the table of agents on the Agents page, select an agent name to navigate to the detail view. The detail view for an agent displays charts for the metrics shown in the table of agents.
The following screenshot displays an example of the detail view for an agent.
Use the agent detail view to answer questions such as:
- When did my agent start experiencing errors or issues?
- Is my agent consuming a high number of tokens?
- What quality issues is my agent facing?
About AI agent quality scores
Splunk Observability Cloud evaluates interactions using quality scores on a scale of 0-100% that measure the toxicity, sentiment, bias, hallucinations, and relevance of AI agent responses. Quality scores below 50% count towards the Quality Issues field in the Agents table.
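To illustrate the thresholding described above, the following sketch counts how many evaluation scores would register as quality issues. This is illustrative only: Splunk Observability Cloud performs this aggregation itself, and the function and variable names here are hypothetical.

```python
# Illustrative sketch of the quality-issue threshold described above.
# Splunk Observability Cloud computes this server-side; names are hypothetical.
QUALITY_THRESHOLD = 50  # scores below 50% count as quality issues

def count_quality_issues(scores):
    """Count evaluation scores (0-100 scale) that fall below the threshold."""
    return sum(1 for score in scores if score < QUALITY_THRESHOLD)

# Example scores for toxicity, sentiment, bias, hallucination, and relevance
scores = [92, 48, 75, 31, 88]
print(count_quality_issues(scores))  # 2 scores fall below 50
```

In this example, the scores 48 and 31 fall below the 50% threshold, so two quality issues would count toward the Quality Issues field.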
Quality scores are calculated from instrumentation-side evaluations. The instrumentation frameworks for your AI applications trigger evaluations performed by DeepEval, an open-source evaluation framework for LLMs, and the Splunk Distribution of the OpenTelemetry Collector sends the evaluation results to Splunk Observability Cloud. Splunk Observability Cloud receives and displays evaluation results, but does not have visibility into your interaction inputs or outputs.
By default, Splunk Observability Cloud samples all collected spans to calculate quality scores. To control the sample rate, you can configure the OTEL_INSTRUMENTATION_GENAI_EVALUATION_SAMPLE_RATE setting when you instrument your application with the Splunk Distribution of the OpenTelemetry Collector. For more information on this setting, see Configure the Python agent for AI applications.
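As a sketch of how you might set this environment variable from Python before the instrumentation initializes, you could do the following. The accepted value format (a fraction between 0.0 and 1.0) is an assumption here; confirm the exact format in Configure the Python agent for AI applications.

```python
import os

# Assumption: the setting accepts a fraction between 0.0 and 1.0, where 0.5
# means evaluate roughly half of collected spans. Verify the exact format in
# the Python agent configuration documentation.
# Set this before the instrumentation initializes.
os.environ["OTEL_INSTRUMENTATION_GENAI_EVALUATION_SAMPLE_RATE"] = "0.5"
```

You can also set the variable in your deployment environment (for example, in a container spec or service definition) rather than in application code.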