Large language model performance matrix
Stack Serverless
This page summarizes internal test results comparing large language models (LLMs) across Elastic AI Assistant for Observability and Search use cases. To learn more about these use cases, refer to AI Assistant.
Rating legend:
Excellent: Highly accurate and reliable for the use case.
Great: Strong performance with minor limitations.
Good: Adequate for many use cases, but with noticeable tradeoffs.
Poor: Significant issues; not recommended for production for the use case.
Recommended models are those rated Excellent or Great for the particular use case.
Models from third-party LLM providers.
Provider | Model | Alert questions | APM questions | Contextual insights | Documentation retrieval | Elasticsearch operations | ES\|QL generation | Execute connector | Knowledge retrieval |
---|---|---|---|---|---|---|---|---|---|
Amazon Bedrock | Claude Sonnet 3.5 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
Amazon Bedrock | Claude Sonnet 3.7 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Great | Excellent |
Amazon Bedrock | Claude Sonnet 4 | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
OpenAI | GPT-4.1 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
Google Gemini | Gemini 2.0 Flash | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
Google Gemini | Gemini 2.5 Flash | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
Google Gemini | Gemini 2.5 Pro | Excellent | Great | Excellent | Excellent | Excellent | Good | Good | Excellent |
Models you can deploy and manage yourself.
Provider | Model | Alert questions | APM questions | Contextual insights | Documentation retrieval | Elasticsearch operations | ES\|QL generation | Execute connector | Knowledge retrieval |
---|---|---|---|---|---|---|---|---|---|
Meta | Llama-3.3-70B-Instruct | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
Mistral | Mistral-Small-3.2-24B-Instruct-2506 | Excellent | Poor | Great | Great | Excellent | Poor | Good | Excellent |
Llama-3.3-70B-Instruct is supported with simulated function calling.
You can run the Elastic AI Assistant for Observability and Search evaluation framework against any model, and use it to benchmark a custom or self-hosted model against the use cases in the matrix. Refer to the evaluation framework README for setup and usage details.
For consistency, all ratings in this matrix were generated using Gemini 2.5 Pro as the judge model (specified via the `--evaluateWith` flag). Use the same judge when evaluating your own model to ensure comparable results.
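As a rough sketch, a benchmarking run against your own model might look like the following. Only the `--evaluateWith` flag comes from this page; the script path and the `--connectorId` flag name are illustrative placeholders, so check the evaluation framework README for the actual invocation and options.

```shell
# Hypothetical invocation sketch — <path-to-evaluation-framework> and the
# --connectorId flag name are assumptions; only --evaluateWith is documented
# above. Consult the evaluation framework README for the real script path.
node <path-to-evaluation-framework>/index.js \
  --connectorId "my-model-connector" \
  --evaluateWith "gemini-2-5-pro-connector"
```

Pointing `--evaluateWith` at a Gemini 2.5 Pro connector keeps your results comparable with the ratings in the matrix above, since the same judge scores every model.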