Model accuracy measures in Splunk AI Assistant for SPL
When evaluating the accuracy of results generated by Splunk AI Assistant for SPL, three categories are considered:
- Token similarity: Measures how closely the generated SPL tokens match a known reference search. This metric helps determine whether a search is useful even if it cannot execute in a given Splunk platform environment, for example when the model produces placeholders for implicit or unspecified values. Coverage of fields and indexes is also checked when that information is specified in the input.
- Structural similarity: Compares the sequence of SPL commands in a candidate search against a reference search. When high-quality SPL references are used, this comparison captures a notion of structural similarity that gives insight into the efficiency and readability of the proposed search.
- Execution accuracy: Measures how well the results of an SPL search match the expected output when the search is executed on the intended index. This metric assesses the large language model's (LLM) ability to generate accurate SPL searches. A minimal sketch of how these three metrics can be computed follows this list.
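The following is a minimal sketch of the three metric families, not the assistant's actual evaluation harness. It assumes the nltk library for BLEU-style token similarity, a simple pipe-based SPL tokenization, and hypothetical helper names (`token_similarity`, `command_sequence`, `normalized_edit_distance`, `execution_accuracy`) chosen for illustration only.

```python
# Illustrative sketch of the three accuracy measures. Assumptions: nltk is
# installed for BLEU, SPL stages are split on "|", and result rows arrive as
# lists of dicts. This is not Splunk's evaluation code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def token_similarity(reference_spl: str, candidate_spl: str) -> float:
    """Token similarity: BLEU between tokenized reference and candidate SPL."""
    ref_tokens = reference_spl.replace("|", " | ").split()
    cand_tokens = candidate_spl.replace("|", " | ").split()
    return sentence_bleu([ref_tokens], cand_tokens,
                         smoothing_function=SmoothingFunction().method1)


def command_sequence(spl: str) -> list[str]:
    """Extract the leading command of each pipe-delimited stage,
    for example ['search', 'stats', 'sort']."""
    return [stage.strip().split()[0].lower()
            for stage in spl.split("|") if stage.strip()]


def normalized_edit_distance(ref_cmds: list[str], cand_cmds: list[str]) -> float:
    """Structural similarity: Levenshtein distance over command sequences,
    normalized by the longer sequence (0 = identical order, 1 = fully different)."""
    m, n = len(ref_cmds), len(cand_cmds)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_cmds[i - 1] == cand_cmds[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n] / max(m, n, 1)


def execution_accuracy(expected_rows: list[dict], actual_rows: list[dict]) -> bool:
    """Execution accuracy: the candidate search counts as correct only if its
    result set matches the expected output (order-insensitive comparison)."""
    canonical = lambda rows: sorted(tuple(sorted(r.items())) for r in rows)
    return canonical(expected_rows) == canonical(actual_rows)
```

For example, comparing `command_sequence("search index=web | stats count by status")` against `command_sequence("search index=web | top status")` yields a normalized edit distance of 0.5, because one of the two pipeline stages differs.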
| Model | BLEU Score | Matching Index | Matching Sourcetype | Command Sequence Normalized Edit Distance | Execution Accuracy |
|---|---|---|---|---|---|
| GPT-4 Turbo | 0.313 | 52.10% | 65.10% | 0.5683 | 20.40% |
| Llama 3 70B Instruct | 0.300 | 42.25% | 78.17% | 0.6477 | 8.40% |
| Splunk SAIA System | 0.493 | 82.40% | 85.90% | 0.4104 | 39.30% |