Large language model performance matrix

This page summarizes internal test results comparing large language models (LLMs) across Elastic AI Assistant for Observability and Search use cases. To learn more about these use cases, refer to AI Assistant.

Important

Rating legend:

Excellent: Highly accurate and reliable for the use case.
Great: Strong performance with minor limitations.
Good: Possibly adequate for many use cases but with noticeable tradeoffs.
Poor: Significant issues; not recommended for production for the use case.

Recommended models are those rated Excellent or Great for the particular use case.

Models from third-party LLM providers.

| Provider | Model | Alert questions | APM questions | Contextual insights | Documentation retrieval | Elasticsearch operations | ES\|QL generation | Execute connector | Knowledge retrieval |
|---|---|---|---|---|---|---|---|---|---|
| Amazon Bedrock | Claude Sonnet 3.5 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Amazon Bedrock | Claude Sonnet 3.7 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Great | Excellent |
| Amazon Bedrock | Claude Sonnet 4 | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Excellent |
| OpenAI | GPT-4.1 | Excellent | Excellent | Excellent | Excellent | Excellent | Great | Good | Excellent |
| Google Gemini | Gemini 2.0 Flash | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | Gemini 2.5 Flash | Excellent | Good | Excellent | Excellent | Excellent | Good | Good | Excellent |
| Google Gemini | Gemini 2.5 Pro | Excellent | Great | Excellent | Excellent | Excellent | Good | Good | Excellent |

Models you can deploy and manage yourself.

| Provider | Model | Alert questions | APM questions | Contextual insights | Documentation retrieval | Elasticsearch operations | ES\|QL generation | Execute connector | Knowledge retrieval |
|---|---|---|---|---|---|---|---|---|---|
| Meta | Llama-3.3-70B-Instruct | Excellent | Good | Great | Excellent | Excellent | Good | Good | Excellent |
| Mistral | Mistral-Small-3.2-24B-Instruct-2506 | Excellent | Poor | Great | Great | Excellent | Poor | Good | Excellent |
Note

Llama-3.3-70B-Instruct is supported with simulated function calling.

You can run the Elastic AI Assistant for Observability and Search evaluation framework against any model, including a custom or self-hosted one, to benchmark it on the use cases in this matrix. Refer to the evaluation framework README for setup and usage details.

For consistency, all ratings in this matrix were generated using Gemini 2.5 Pro as the judge model (specified via the --evaluateWith flag). Use the same judge when evaluating your own model to ensure comparable results.
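As a rough illustration, a benchmarking run against your own model might look like the following. This is a sketch only: the script path and the --connectorId flag are assumptions for illustration (only the --evaluateWith flag is referenced on this page), so check the evaluation framework README for the exact command, prerequisites, and options.

```bash
# Sketch only -- the script path and --connectorId are assumptions; see the
# evaluation framework README for the actual invocation. Run from a Kibana
# checkout against a Kibana instance with connectors configured for both the
# model under test and the judge model.
node x-pack/solutions/observability/plugins/observability_ai_assistant_app/scripts/evaluation/index.js \
  --connectorId "<connector-id-of-the-model-under-test>" \
  --evaluateWith "<connector-id-of-a-gemini-2.5-pro-connector>"  # same judge used for this matrix
```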