As large language models (LLMs) shift from static, training-bound knowledge recall to dynamic, inference-time reasoning, their supporting infrastructure must also evolve. Inference workloads are no longer just about throughput—they demand adaptive computation, modular scaling, and intelligent caching to deliver complex reasoning with real-world efficiency.
llm-d is a Kubernetes-native distributed inference stack purpose-built for this new wave of LLM applications. Designed by contributors to Kubernetes and vLLM, llm-d offers a production-grade path for teams deploying large models at scale. Whether you're a platform engineer or a DevOps practitioner, llm-d brings increased performance per dollar across a wide range of accelerators and model families.
But this isn't just another inference-serving solution. llm-d is designed for the future of AI inference, optimizing for long-running, multi-step prompts, retrieval-augmented generation, and agentic workflows. It integrates cutting-edge techniques like KV cache aware routing, disaggregated prefill/decode, and a vLLM-optimized inference scheduler, and builds on the Inference Gateway (IGW) for seamless Kubernetes-native operations.
Why llm-d is needed for efficient inference
The key innovation in llm-d is its focus: distributed model serving. LLM inference requests behave very differently from typical HTTP requests, so traditional Kubernetes load balancing and scaling mechanisms can be ineffective.
For example, LLM inference requests are stateful, expensive, and vary widely in shape (the ratio of input tokens to output tokens differs from request to request). To build a cost-efficient AI platform, it's critical that our infrastructure is used effectively, so let's look at what typically happens during inference.
Let's say a user prompts an LLM with a question, such as "customers who are up for renewal that we should reach out to". First, this request initiates a phase known as prefill, which processes all of the input tokens in parallel and builds the key/value (KV) cache. This phase is compute-intensive. Next, the decode phase consumes the cached keys/values to generate output tokens one at a time, which makes it memory bandwidth-bound. If both phases run on the same GPU, resources are used inefficiently, especially for long sequences.
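To make the difference in request shapes concrete, here is a minimal sketch of two requests against the OpenAI-compatible completions API exposed by the deployment we build later in this article (the $SVC_EP endpoint and the Qwen/Qwen3-0.6B model name are assumptions at this point; substitute your own):
# Prefill-heavy: a long input, a single output token (compute-bound prefill dominates)
curl http://$SVC_EP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "'"$(printf 'lorem ipsum %.0s' {1..200})"'", "max_tokens": 1}'

# Decode-heavy: a short input, many output tokens (memory bandwidth-bound decode dominates)
curl http://$SVC_EP/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "Write a short story about Kubernetes.", "max_tokens": 512}'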
llm-d improves this using disaggregation (separating workloads between specialized nodes or GPUs) and an inference gateway (kgateway) to evaluate the incoming prompt and intelligently route requests, dramatically improving both performance and cost efficiency. See Figure 1.

What are the main features of llm-d?
Before we look at how to deploy llm-d, let's explore the features that make it unique.
Smart load balancing for faster responses
llm-d includes a specialized load scheduler, built on Kubernetes' Gateway API inference extension, that routes each request to the most suitable model server. Instead of relying on generic metrics, its inference scheduler uses smart rules based on real-time performance data, such as system load, memory usage, and service-level goals, to decide where to send each prompt. Teams can also customize how decisions are made, while benefiting from built-in features like flow control and latency balancing. Think of it as traffic control for LLM requests, but with AI-powered smarts.
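If you're curious what the scheduler works with, the Gateway API inference extension adds its own custom resources that you can inspect once llm-d is installed. A quick sketch (the exact resource names depend on the extension version in your cluster):
# Discover which inference-related CRDs the Gateway API inference extension installed
oc api-resources | grep -i inference

# List the inference pools the scheduler routes across (resource name may vary by version)
oc get inferencepools -A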
Split-phase inference: Smarter use of compute
Instead of running everything on the same machine, llm-d splits the work:
- One set of servers handles understanding the prompt (prefill).
- Another set handles writing the response (decode).

This helps use GPUs more efficiently, like having one group of chefs prep ingredients while another handles the cooking. It's powered by vLLM and high-speed connections like the NVIDIA Inference Xfer Library (NIXL) or InfiniBand.
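In a disaggregated deployment, this split shows up directly in the Kubernetes objects as separate prefill and decode Deployments. A quick way to check once you have llm-d installed (the name pattern below matches the modelservice naming used later in this article; yours may differ):
# Look for separate prefill and decode Deployments in the llm-d namespace
oc get deployments -n ${NAMESPACE} | grep -E 'prefill|decode'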
Reusing past work with disaggregated caching
llm-d also helps models remember more efficiently by caching previously computed results (KV cache). It can store these results in two ways:
- Locally (on memory or disk) for low-cost, zero-maintenance savings.
- Across servers (using shared memory and storage) for faster reuse and better performance in larger systems.

This makes it easier to handle long or repeating prompts without redoing the same calculations.
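One simple way to observe the effect, once the example deployment later in this article is running, is to time the same long prompt twice; the repeated request can skip recomputing the shared prefix (the endpoint, model name, and prompt variable here all come from that later example and are assumptions at this point):
# Send the same long prompt twice and compare total request time;
# the second request can reuse the cached prefix from the first
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://$SVC_EP/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "'"$LONG_TEXT_200_WORDS"'", "max_tokens": 50}'
done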
Getting started with llm-d
If we look at the examples in the llm-d-infra repository, the prefill/decode disaggregation example is targeted at larger models like Llama-70B, deployed on high-end GPUs (NVIDIA H200s). For this article, we're going to focus on a smaller model, Qwen3-0.6B, running on smaller GPUs (NVIDIA L40S). The example we're going to deploy is Precise Prefix Cache Aware Routing, which is better suited to the hardware we have available and will let us demonstrate KV cache aware routing.
Prerequisites for llm-d
- Red Hat OpenShift Container Platform 4.17+.
- NVIDIA GPU Operator 25.3.
- Node Feature Discovery Operator 4.18.
- 2 NVIDIA L40S GPUs (e.g., AWS g6e.2xlarge instances).
- A Hugging Face token, with permissions to download your desired model.
- No service mesh or Istio installation, as Istio CRDs will conflict with the gateway.
- Cluster administrator privileges to install the llm-d cluster scoped resources.
Follow the steps in this repository for an example of how to install the prerequisites for llm-d on an OpenShift cluster running on AWS.
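Before moving on, it's worth confirming that your GPU nodes are visible to Kubernetes. A quick sanity check (the label below is set by GPU feature discovery; the operator namespace may differ in your cluster):
# Nodes that GPU feature discovery has labeled as having NVIDIA GPUs
oc get nodes -l nvidia.com/gpu.present=true

# GPU Operator pods (the namespace may differ in your cluster)
oc get pods -n nvidia-gpu-operator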
Installing llm-d
Once you've installed the prerequisites, you're ready to deploy llm-d. Follow these steps carefully.
Step 1: Clone the repository
First, pull down the llm-d-infra repository, which contains all of the deployment configurations:
git clone https://github.com/llm-d-incubation/llm-d-infra.git
Step 2: Install dependencies
Navigate into the quickstart folder inside the cloned repository and install the required CLI tools and dependencies:
cd llm-d-infra/quickstart
./dependencies/install-deps.sh
This script installs:
- Helmfile.
- Helm.
- Other utilities required for deployment and orchestration.
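You can confirm the tools are available before continuing (the reported versions will vary):
helm version --short
helmfile --version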
Step 3: Deploy the gateway and infrastructure
Switch into the gateway-control-plane-providers directory and run the setup:
cd gateway-control-plane-providers
./install-gateway-provider-dependencies.sh
helmfile apply -f istio.helmfile.yaml
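Before moving on, it's worth checking that the gateway control plane came up. A rough sanity check (the Istio namespace is an assumption and may differ depending on how the helmfile installs it):
# Gateway API classes now registered in the cluster
oc get gatewayclasses

# Istio control-plane pods (namespace may differ in your environment)
oc get pods -n istio-system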
Step 4: Deploy the example (precise prefix cache aware)
Move into the provided example deployment folder:
cd ../examples/precise-prefix-cache-aware
Create a namespace for your deployment:
export NAMESPACE=llm-d-precise
oc new-project ${NAMESPACE}
Set Hugging Face credentials (replace with your token):
export HF_TOKEN=
export HF_TOKEN_NAME=${HF_TOKEN_NAME:-llm-d-hf-token}
Create a Kubernetes secret for the Hugging Face token:
oc create secret generic ${HF_TOKEN_NAME} \
--from-literal="HF_TOKEN=${HF_TOKEN}" \
--namespace "${NAMESPACE}" \
--dry-run=client -o yaml | kubectl apply -f -
Apply Helmfile in the namespace:
helmfile apply -n ${NAMESPACE}
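It can take a few minutes for everything to start. To see what the Helmfile installed and watch the pods come up:
# Helm releases created by the Helmfile
helm list -n ${NAMESPACE}

# Watch the llm-d pods come up
oc get pods -n ${NAMESPACE} -w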
Step 5: Handle GPU node taints (if needed)
At this point, the ms-kv-events-llm-d-modelservice-decode pods might be stuck in a Pending state. This typically happens when the GPU nodes have taints applied, for example a taint with the value NVIDIA-L40S-PRIVATE (the GPUs we're using in this example).
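You can confirm the exact taint key and value on your GPU nodes with something like:
# Show the taints on the GPU nodes (the label is set by GPU feature discovery)
oc describe nodes -l nvidia.com/gpu.present=true | grep -A 3 Taints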
To allow scheduling, add the proper tolerations:
oc patch deployment ms-kv-events-llm-d-modelservice-decode \
-p '{"spec":{"template":{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Equal","value":"NVIDIA-L40S-PRIVATE","effect":"NoSchedule"}]}}}}'
Step 6: Verify the pods
After a few minutes, your pods should be running. Confirm with:
kubectl get pods -n ${NAMESPACE}
Example output:
NAME                                                       READY   STATUS    RESTARTS   AGE
gaie-kv-events-epp-5d4f98d6b6-sxf9w                        1/1     Running   0          25m
infra-kv-events-inference-gateway-istio-5f68d4f854-qpnq4   1/1     Running   0          25m
ms-kv-events-llm-d-modelservice-decode-648464d84b-2r58r    2/2     Running   0          22m
ms-kv-events-llm-d-modelservice-decode-648464d84b-lclzf    2/2     Running   0          18m
At this point, the deployment is complete. You can now send prompts through the inference gateway and test KV cache–aware routing.
Testing KV cache aware routing
Get the service URL:
export SVC_EP=$(oc get svc infra-kv-events-inference-gateway-istio -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}')
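If your cluster doesn't provision an external load balancer for the gateway Service, SVC_EP will come back empty. In that case, you can port-forward to the Service and use localhost as the endpoint instead (the Service port used here is an assumption; check oc get svc for the actual one):
# Forward a local port to the inference gateway Service
oc port-forward -n ${NAMESPACE} svc/infra-kv-events-inference-gateway-istio 8080:80

# In another terminal, point SVC_EP at the forwarded port
export SVC_EP=localhost:8080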
Once you can reach the gateway, run the following from a separate terminal:
curl http://$SVC_EP/v1/models
You should see a response listing the models served, for example:
{
  "data": [
    {
      "created": 1753453207,
      "id": "Qwen/Qwen3-0.6B",
      "max_model_len": 40960,
      "object": "model",
      "owned_by": "vllm",
      "parent": null,
      "permission": [
        {
          "allow_create_engine": false,
          "allow_fine_tuning": false,
          "allow_logprobs": true,
          "allow_sampling": true,
          "allow_search_indices": false,
          "allow_view": true,
          "created": 1753453207,
          "group": null,
          "id": "modelperm-28c695df952c46d1b6efac02e0edb62d",
          "is_blocking": false,
          "object": "model_permission",
          "organization": "*"
        }
      ],
      "root": "Qwen/Qwen3-0.6B"
    }
  ],
  "object": "list"
}
Now that we can connect to our llm-d deployment, we can test KV cache aware routing. To do this, we're going to use a longer prompt, one of more than 200 tokens, for example:
export LONG_TEXT_200_WORDS="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
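As a rough check on the prompt length (the token count will be somewhat higher than the word count):
echo "${LONG_TEXT_200_WORDS}" | wc -w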
Now, let's send this prompt to the llm-d gateway:
curl http://$SVC_EP/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"prompt": "'"$LONG_TEXT_200_WORDS"'",
"max_tokens": 50
}' | jq
How can we tell whether this request is being routed based on the prefix? We can look at the logs from the gaie-kv-events-epp pod, specifically for the text "Got pod scores":
oc logs deployment/gaie-kv-events-epp -n ${NAMESPACE} --follow | grep "Got pod scores"
From our initial request, we should see something like:
Got pod scores {"x-request-id": "57dbc51d-8b2f-46a4-a88f-5a9214ee2277", "model": "Qwen/Qwen3-0.6B", "resolvedTargetModel": "Qwen/Qwen3-0.6B", "criticality": "Sheddable", "scores": null}
This shows the prompt returned a null score, meaning no decode pod had a previously cached prefix for this prompt.
Now let's try the same prompt again, using the same curl command:
curl http://$SVC_EP/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"prompt": "'"$LONG_TEXT_200_WORDS"'",
"max_tokens": 50
}' | jq
This time we can see a pod score pointing to the IP address of the preferred decode pod, chosen based on its KV cache contents.
Got pod scores {"x-request-id": "585b72a7-71e4-4eaf-96bc-1642a74a9d8e", "model": "Qwen/Qwen3-0.6B", "resolvedTargetModel": "Qwen/Qwen3-0.6B", "criticality": "Sheddable", "scores": {"10.131.2.23":2}}
If we look at the individual logs of the decode pods, we should see activity in the same pod when we send this large prompt.
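For example (the pod name below comes from the earlier oc get pods output; yours will differ):
# Tail recent log lines from one of the decode pods
oc logs ms-kv-events-llm-d-modelservice-decode-648464d84b-2r58r \
  -n ${NAMESPACE} --all-containers --tail=20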
To summarize, we've demonstrated how llm-d's KV cache aware routing optimizes inference by intelligently directing requests to the most suitable decode pods and leveraging previously computed results. This approach significantly improves efficiency and reduces latency for repeated or similar prompts.
Community and contribution
llm-d is a collaborative effort with a strong emphasis on community-driven development. Launched with contributions from Red Hat, Google, IBM, NVIDIA, and AMD, it aims to foster a collaborative environment for defining and implementing best practices for LLM inference scaling. We encourage you to check out and engage with the project.
- Check out the llm-d project repository.
- Try out the llm-d quick start.
- Join the llm-d community on Slack to receive updates and chat with maintainers.
Wrapping up
llm-d represents a significant step forward in making large-scale LLM inference practical and efficient in production environments, building on the operational strengths of Kubernetes. By focusing on intelligent request routing, KV cache optimization, and prefill/decode disaggregation, it gives organizations a way to unlock the full potential of their LLM applications in a cost-effective and performant manner.
Try out llm-d and join the growing community today!