Our inference tooling at Elastic has had some major power-ups this year with the introduction of our GPU-powered Elastic Inference Service (EIS), which gives you streamlined access to LLMs, embedding models, and rerankers through an always-on, dedicated service.
Today, we’ll focus on how EIS can simplify the semantic search experience with our sparse embeddings model, ELSER (Elastic Learned Sparse EncodeR). With semantic search in place as a foundation, you can unlock many additional capabilities, including hybrid retrieval and providing high-quality context to LLMs in your agentic workflows.
Let’s get started!
Getting started with semantic search
You can now build end-to-end semantic search use cases on top of the inference endpoints powered by EIS.
1. Create a semantic text field using the new endpoint
First, let’s create a new index using the semantic_text field type and the EIS inference id .elser-2-elastic.
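Here’s a minimal sketch of what that looks like with the Python Elasticsearch client (the connection details are placeholders; swap in your own deployment URL and API key):

```python
from elasticsearch import Elasticsearch

# Placeholder connection details: point this at your own deployment.
client = Elasticsearch(
    "https://your-deployment.es.cloud.es.io",
    api_key="your-api-key",
)

# Create the index with a semantic_text field backed by the EIS ELSER endpoint.
client.indices.create(
    index="semantic-embeddings",
    mappings={
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elastic",
            }
        }
    },
)
```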
This sets up an index called semantic-embeddings where the content field automatically generates embeddings using the .elser-2-elastic inference endpoint.
In future versions, this inference id will be the default, so you won’t need to specify it explicitly.
2. Reindex your data
Next, let’s reindex data from an existing text index containing national parks information into the index we just created, so that the content flows through the new semantic_text field.
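A sketch using the same Python client as above (here, national-parks is an assumed name for the existing source index; substitute your own):

```python
# "national-parks" is an assumed name for the existing text index.
# `client` is the Elasticsearch client created in step 1.
client.reindex(
    source={"index": "national-parks"},
    dest={"index": "semantic-embeddings"},
    wait_for_completion=False,  # run the reindex as a background task
)
```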
This step is necessary if you want to bring existing data into an EIS-powered index, but it will be simplified in our next release, which will let you update an existing index mapping without reindexing.
3. Search
Finally, let’s query the index for information about national parks via semantic search.
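For example, with a semantic query against the content field (the query text is just an illustration):

```python
# `client` is the Elasticsearch client created in step 1.
response = client.search(
    index="semantic-embeddings",
    query={
        "semantic": {
            "field": "content",
            "query": "Which national parks are known for their glaciers?",
        }
    },
)

# Print the score and content of each matching document.
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"])
```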
The hits come back ranked by how well each document’s content semantically matches the query.
And that’s all you need to do! With an EIS-powered semantic search index you get superior ingest performance, easy token-based pricing, and a service that’s available when you need it.
For a detailed example and tutorial, please refer to the Elastic docs for ELSER on EIS and the tutorial.
So how does this work?
The Elastic Inference Service runs on Elastic’s infrastructure, providing access to machine-learning models on demand.
The team built the service with Ray, specifically using the Ray Serve library to run our machine learning models. It runs on top of Kubernetes and uses pools of NVIDIA GPUs in the Elastic Cloud Platform to perform model inference.
Ray gives us a lot of great benefits:
- It works out of the box with PyTorch, TensorFlow, and other ML libraries;
- It’s Python-native and supports custom business logic, allowing easy integration with models in the Python ecosystem;
- It’s robust and works across heterogeneous GPU pools, allowing fractional resource management;
- It supports response streaming, dynamic batching, and plenty of other useful features.
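To make that more concrete, here’s a minimal, illustrative sketch of what a Ray Serve deployment for a sparse-embedding model can look like. This isn’t Elastic’s actual service code: the SparseEmbedder deployment, the toy encoder, and the resource settings are assumptions for illustration, but the fractional GPU allocation and dynamic batching it shows are the Ray Serve features described above.

```python
from ray import serve


def toy_sparse_encode(texts: list[str]) -> list[dict[str, float]]:
    """Stand-in for a real sparse-embedding model (e.g. ELSER): it just builds
    a bag-of-words token -> weight map so the example runs end to end."""
    return [
        {tok: float(text.lower().split().count(tok)) for tok in set(text.lower().split())}
        for text in texts
    ]


@serve.deployment(
    ray_actor_options={"num_gpus": 0.5},  # fractional GPU share per replica (needs a GPU in the cluster)
    autoscaling_config={"min_replicas": 1, "max_replicas": 4},
)
class SparseEmbedder:
    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.01)
    async def embed(self, texts: list[str]) -> list[dict[str, float]]:
        # Dynamic batching: Ray Serve groups concurrent requests into one call,
        # which on a real model becomes a single batched GPU forward pass.
        return toy_sparse_encode(texts)

    async def __call__(self, request):
        payload = await request.json()
        return {"embedding": await self.embed(payload["text"])}


app = SparseEmbedder.bind()
# serve.run(app)  # deploys the replicas onto the Ray cluster
```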

When an inference API request for semantic search comes in, the ELSER model is already up and running, ready to generate your sparse embeddings. Running on GPUs lets us parallelize operations efficiently, which is particularly helpful when you have a lot of documents to ingest regularly.
What’s next?
In future versions of Elastic, ELSER via EIS will be the default for semantic_text fields, so you won’t even have to specify the inference id when you create the semantic index. We’ll be adding the ability to switch an existing index to EIS-powered semantic search without any reindexing. Finally, and most excitingly, we’ll be adding new models, including reranking support and Jina’s multilingual and multimodal embedding models.
Stay tuned for future blog posts with further technical details of how the service works and more news about models and features!