Large language model performance matrix

This page summarizes internal test results comparing large language models (LLMs) across Elastic AI Assistant for Observability and Search use cases. To learn more about these use cases, refer to AI Assistant.

Important

Rating legend:

Excellent: Highly accurate and reliable for the use case.
Great: Strong performance with minor limitations.
Good: Possibly adequate for many use cases but with noticeable tradeoffs.
Poor: Significant issues; not recommended for production for the use case.

Recommended models are those rated Excellent or Great for the particular use case.

Models from third-party LLM providers.

| Provider | Model | Alert questions | APM questions | Contextual insights | Documentation retrieval | Elasticsearch operations | ES\|QL generation | Execute connector | Knowledge retrieval |
|---|---|---|---|---|---|---|---|---|---|
| Amazon Bedrock | Claude Sonnet 3.5 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Amazon Bedrock | Claude Sonnet 3.7 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Great | Excellent |
| Amazon Bedrock | Claude Sonnet 4 | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
| OpenAI | GPT-4.1 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Google Gemini | Gemini 2.0 Flash | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | Gemini 2.5 Flash | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | Gemini 2.5 Pro | Excellent | Great | Excellent | Excellent | Excellent | Good | Good | Excellent |

Models you can deploy and manage yourself.

| Provider | Model | Alert questions | APM questions | Contextual insights | Documentation retrieval | Elasticsearch operations | ES\|QL generation | Execute connector | Knowledge retrieval |
|---|---|---|---|---|---|---|---|---|---|
| Meta | Llama-3.3-70B-Instruct | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
| Mistral | Mistral-Small-3.2-24B-Instruct-2506 | Excellent | Poor | Great | Great | Excellent | Poor | Good | Excellent |
Note

Llama-3.3-70B-Instruct is supported with simulated function calling.

You can run the Elastic AI Assistant for Observability and Search evaluation framework against any model, including a custom or self-hosted one, to benchmark it on the use cases in this matrix. Refer to the evaluation framework README for setup and usage details.

For consistency, all ratings in this matrix were generated using Gemini 2.5 Pro as the judge model (specified via the --evaluateWith flag). Use the same judge when evaluating your own model to ensure comparable results.
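As a rough illustration, a benchmarking run against your own model might look like the following. This is a sketch only: the script path and the --connectorId flag are assumptions for illustration (only the --evaluateWith flag is referenced on this page), so check the evaluation framework README for the exact command, prerequisites, and options.

```bash
# Sketch only -- the script path and --connectorId are assumptions; see the
# evaluation framework README for the actual invocation. Run from a Kibana
# checkout against a Kibana instance with connectors configured for both the
# model under test and the judge model.
node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js \
  --connectorId "<connector-id-of-the-model-under-test>" \
  --evaluateWith "<connector-id-of-a-gemini-2.5-pro-connector>"  # same judge used for this matrix
```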