Getting started with llm-d for distributed AI inference

llm-d: Kubernetes-native distributed inference stack for large-scale LLM applications

August 19, 2025
Cedric Clyburn and Philip Hayes

    As large language models (LLMs) shift from static, training-bound knowledge recall to dynamic, inference-time reasoning, their supporting infrastructure must also evolve. Inference workloads are no longer just about throughput—they demand adaptive computation, modular scaling, and intelligent caching to deliver complex reasoning with real-world efficiency.

    llm-d is a Kubernetes-native distributed inference stack purpose-built for this new wave of LLM applications. Designed by contributors to Kubernetes and vLLM, llm-d offers a production-grade path for teams deploying large models at scale. Whether you're a platform engineer or a DevOps practitioner, llm-d brings increased performance per dollar across a wide range of accelerators and model families.

    But this isn't just another inference-serving solution. llm-d is designed for the future of AI inference—optimizing for long-running, multi-step prompts, retrieval-augmented generation, and agentic workflows. It integrates cutting-edge techniques like KV cache aware routing, disaggregated prefill/decode, and a vLLM-optimized inference scheduler, with Inference Gateway (IGW) for seamless Kubernetes-native operations.

    Why llm-d is needed for efficient inference

    The key innovation in llm-d is its focus on distributed model serving. LLM inference requests differ fundamentally from typical HTTP requests, so traditional Kubernetes load balancing and scaling mechanisms can be ineffective.

    For example, LLM inference requests are stateful, expensive, and vary widely in shape (the number of input and output tokens differs from request to request). Building a cost-efficient AI platform means using this infrastructure effectively, so let's look at what typically happens during inference.

    Let's say a user prompts an LLM with a question such as, "Which customers are up for renewal, and who should we reach out to?" First, the request goes through a phase known as prefill, which computes the attention keys and values (the KV cache) for all input tokens in parallel. This phase is compute-intensive. Next, the decode phase consumes those cached keys and values to generate output tokens one at a time, making it memory bandwidth-bound. If both phases run on a single GPU, resources are used inefficiently, especially for long sequences.

    llm-d improves this using disaggregation (separating workloads between specialized nodes or GPUs) and an inference gateway (kgateway) to evaluate the incoming prompt and intelligently route requests, dramatically improving both performance and cost efficiency. See Figure 1.

    Figure 1: llm-d's disaggregated prefill/decode architecture for efficient LLM inference. A user request is sent to the gateway, which dispatches prefill and decode tasks to separate nodes, then returns the final response to the user.

    What are the main features of llm-d?

    Before we look at how to deploy llm-d, let's explore the features that make it unique.

    Smart load balancing for faster responses

    llm-d includes an inference scheduler, built on the Kubernetes Gateway API Inference Extension, that routes each request to the most suitable model server. Instead of relying on generic metrics, it uses scheduling rules based on real-time performance data, such as system load, memory usage, and service-level objectives, to decide where to send each prompt. Teams can also customize how these decisions are made while benefiting from built-in features like flow control and latency balancing. Think of it as traffic control for LLM requests, but with AI-powered smarts.

    Split-phase inference: Smarter use of compute

    Instead of running everything on the same machine, llm-d splits the work:

    • One set of servers handles understanding the prompt (prefill).
    • Another set handles writing the response (decode).

    This helps use GPUs more efficiently, like having one group of chefs prep ingredients while another handles the cooking. It's powered by vLLM and high-speed connections like the NVIDIA Inference Xfer Library (NIXL) or InfiniBand.

    Reusing past work with disaggregated caching

    llm-d also helps models remember more efficiently by caching previously computed results (KV cache). It can store these results in two ways:

    • Locally (in memory or on disk) for low-cost, zero-maintenance savings.
    • Across servers (using shared memory and storage) for faster reuse and better performance in larger systems.

    This makes it easier to handle long or repeated prompts without redoing the same calculations.

    Getting started with llm-d

    If we look at the examples in the llm-d-infra repository, the prefill/decode disaggregation example targets larger models like Llama-70B deployed on high-end GPUs (NVIDIA H200s). For this article, we'll focus on a smaller model, Qwen3-0.6B, running on smaller GPUs (NVIDIA L40S). The example we'll deploy is Precise Prefix Cache Aware Routing, which is better suited to the hardware we have available and lets us demonstrate KV cache aware routing.

    Prerequisites for llm-d

    • Red Hat OpenShift Container Platform 4.17+.
    • NVIDIA GPU Operator 25.3.
    • Node Feature Discovery Operator 4.18.
    • 2 NVIDIA L40S GPUs (e.g., AWS g6e.2xlarge instances).
    • A Hugging Face token, with permissions to download your desired model.
    • No service mesh or Istio installation, as Istio CRDs will conflict with the gateway.
    • Cluster administrator privileges to install the llm-d cluster scoped resources.

    Follow the steps in this repository for an example of how to install the prerequisites for llm-d on an OpenShift cluster running on AWS.  
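
    Before installing, it's worth confirming that your GPU nodes are visible to the cluster. As a minimal sanity check, assuming the NVIDIA GPU Operator exposes the nvidia.com/gpu resource on your nodes, you can run:

    # List allocatable GPUs per node; nodes without GPUs show <none>
    oc get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'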

    Installing llm-d

    Once you've installed the prerequisites, you're ready to deploy llm-d. Follow these steps carefully.

    Step 1: Clone the repository

    First, pull down the llm-d-infra repository that contains all deployment configurations:

    git clone https://github.com/llm-d-incubation/llm-d-infra.git

    Step 2: Install dependencies

    Navigate into the quickstart folder and install the required CLI tools and dependencies:

    cd llm-d-infra/quickstart
    ./dependencies/install-deps.sh

    This script installs:

    • Helmfile.
    • Helm.
    • Other utilities required for deployment and orchestration.

    Step 3: Deploy the gateway and infrastructure

    Switch into the gateway-control-plane-providers directory and run the setup:

    cd gateway-control-plane-providers     
    ./install-gateway-provider-dependencies.sh
    helmfile apply -f istio.helmfile.yaml
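
    As an optional sanity check, and assuming the scripts above registered the Gateway API resources, you can confirm that a GatewayClass and the inference extension CRDs are present (exact names may vary between releases):

    # Gateway classes registered by the Istio provider
    oc get gatewayclass
    # CRDs from the Gateway API Inference Extension (InferencePool, etc.)
    oc get crd | grep -i inference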

    Step 4: Deploy the example (precise prefix cache aware)

    Move into the provided example deployment folder:

    cd ../examples/precise-prefix-cache-aware 

    Create a namespace for your deployment:

    export NAMESPACE=llm-d-precise
    oc new-project ${NAMESPACE}

    Set Hugging Face credentials (replace with your token):

    export HF_TOKEN=
    export HF_TOKEN_NAME=${HF_TOKEN_NAME:-llm-d-hf-token}

    Create a Kubernetes secret for the Hugging Face token:

    oc create secret generic ${HF_TOKEN_NAME} \
      --from-literal="HF_TOKEN=${HF_TOKEN}" \
      --namespace "${NAMESPACE}" \
      --dry-run=client -o yaml | oc apply -f -
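
    To double-check that the secret landed in the right namespace before continuing, list it (the name defaults to llm-d-hf-token, as set above):

    oc get secret ${HF_TOKEN_NAME} -n ${NAMESPACE}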

    Apply Helmfile in the namespace:

    helmfile apply -n ${NAMESPACE}

    Step 5: Handle GPU node taints (if needed)

    At this point, the ms-kv-events-llm-d-modelservice-decode pods might be stuck in a Pending state. This typically happens when the GPU nodes have taints applied; in our case, a taint with the value NVIDIA-L40S-PRIVATE (the GPUs we're using in this example).
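
    To see which taints are actually applied in your cluster (the key and value below are just what our example nodes use, so adjust the toleration accordingly), inspect the nodes:

    # Show taints on every node
    oc describe nodes | grep Taints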

    To allow scheduling, add the proper tolerations:

    oc patch deployment ms-kv-events-llm-d-modelservice-decode -n ${NAMESPACE} \
      -p '{"spec":{"template":{"spec":{"tolerations":[{"key":"nvidia.com/gpu","operator":"Equal","value":"NVIDIA-L40S-PRIVATE","effect":"NoSchedule"}]}}}}'

    Step 6: Verify the pods

    After a few minutes, your pods should be running. Confirm with:

    oc get pods -n ${NAMESPACE}

    Example output:

    NAME                                                       READY   STATUS    RESTARTS   AGE
    gaie-kv-events-epp-5d4f98d6b6-sxf9w                        1/1     Running   0          25m
    infra-kv-events-inference-gateway-istio-5f68d4f854-qpnq4   1/1     Running   0          25m
    ms-kv-events-llm-d-modelservice-decode-648464d84b-2r58r    2/2     Running   0          22m
    ms-kv-events-llm-d-modelservice-decode-648464d84b-lclzf    2/2     Running   0          18m
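
    If you prefer to block until the decode deployment finishes rolling out instead of polling, a command along these lines should work:

    # Wait for the decode deployment to become available
    oc rollout status deployment/ms-kv-events-llm-d-modelservice-decode -n ${NAMESPACE}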

    At this point, the deployment is complete. You can now send prompts through the inference gateway and test KV cache–aware routing.

    Testing KV cache aware routing

    Get the service URL:

    export SVC_EP=$(oc get svc infra-kv-events-inference-gateway-istio -o jsonpath='{.status.loadBalancer.ingress[0].ip}{.status.loadBalancer.ingress[0].hostname}')
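
    To confirm the endpoint resolved, echo it. If it comes back empty, your cluster may not have provisioned an external load balancer for the gateway, in which case inspect the gateway service to see how it is exposed:

    echo ${SVC_EP}
    oc get svc infra-kv-events-inference-gateway-istio -n ${NAMESPACE}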

    With the endpoint set, query the gateway's models API:

    curl http://$SVC_EP/v1/models

    You should see a response listing the models served, for example:

    {
      "data": [
        {
          "created": 1753453207,
          "id": "Qwen/Qwen3-0.6B",
          "max_model_len": 40960,
          "object": "model",
          "owned_by": "vllm",
          "parent": null,
          "permission": [
            {
              "allow_create_engine": false,
              "allow_fine_tuning": false,
              "allow_logprobs": true,
              "allow_sampling": true,
              "allow_search_indices": false,
              "allow_view": true,
              "created": 1753453207,
              "group": null,
              "id": "modelperm-28c695df952c46d1b6efac02e0edb62d",
              "is_blocking": false,
              "object": "model_permission",
              "organization": "*"
            }
          ],
          "root": "Qwen/Qwen3-0.6B"
        }
      ],
      "object": "list"
    }

    Now that we can connect to our llm-d deployment, we can test KV cache aware routing. To do this, we're going to use a longer prompt, one greater than 200 tokens, for example:

    export LONG_TEXT_200_WORDS="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

    Now, let's send this prompt to the llm-d gateway:

    curl http://$SVC_EP/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "'"$LONG_TEXT_200_WORDS"'",
        "max_tokens": 50
      }' | jq

    How can we tell if this request is being routed based on the prefix? We can look at the logs from the gaie-kv-events-epp pod, specifically for the text Got pod scores.

    oc logs deployment/gaie-kv-events-epp -n ${NAMESPACE} --follow | grep "Got pod scores"

    From our initial request, we should see something like:

    Got pod scores  {"x-request-id": "57dbc51d-8b2f-46a4-a88f-5a9214ee2277", "model": "Qwen/Qwen3-0.6B", "resolvedTargetModel": "Qwen/Qwen3-0.6B", "criticality": "Sheddable", "scores": null}

    This shows the prompt returned a null score, meaning it did not match a previously decoded prompt. 

    Now let's try the same prompt again. We're going to use the same curl command:

    curl http://$SVC_EP/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "'"$LONG_TEXT_200_WORDS"'",
        "max_tokens": 50
      }' | jq

    This time, the scores field points to the IP address of the preferred decode pod, selected based on its existing KV cache.

    Got pod scores  {"x-request-id": "585b72a7-71e4-4eaf-96bc-1642a74a9d8e", "model": "Qwen/Qwen3-0.6B", "resolvedTargetModel": "Qwen/Qwen3-0.6B", "criticality": "Sheddable", "scores": {"10.131.2.23":2}} 

    If we look at the individual logs of the decode pods, we should see activity in the same pod when we send this large prompt.
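
    For example, you can tail the decode pods from the earlier oc get pods output (pod names will differ in your cluster) and watch which one handles the repeated prompt:

    # Substitute one of your own decode pod names here
    oc logs ms-kv-events-llm-d-modelservice-decode-648464d84b-2r58r -n ${NAMESPACE} --all-containers --follow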

    To summarize, we've demonstrated how llm-d's KV cache aware routing optimizes inference by intelligently directing requests to the most suitable decode pods and leveraging previously computed results. This approach significantly improves efficiency and reduces latency for repeated or similar prompts.

    Community and contribution

    llm-d is a collaborative effort with a strong emphasis on community-driven development. Launched with contributions from Red Hat, Google, IBM, NVIDIA, and AMD, it aims to foster a collaborative environment for defining and implementing best practices for LLM inference scaling. We encourage you to check out and engage with the project.

    • Check out the llm-d project repository.
    • Try out the llm-d quick start.
    • Join the llm-d community on Slack to receive updates and chat with maintainers.

    Wrapping up

    llm-d is a significant step forward in making large-scale LLM inference practical and efficient in production environments, built on the operational strengths of Kubernetes. By focusing on intelligent request routing, KV cache optimization, and prefill/decode disaggregation, it gives organizations the ability to unlock the full potential of their LLM applications in a cost-effective and performant manner.

    Try out llm-d and join the growing community today!

    Last updated: September 11, 2025

