Scaling AI inference with open source ft. Brian Stevens
How is artificial intelligence truly being reimagined for the real world, moving beyond labs and into critical business environments? This episode of "Technically Speaking" explores the pivotal shift towards production-quality AI inference at scale and how open source is spearheading this transformation. Red Hat CTO Chris Wright is joined by Brian Stevens, Red Hat's SVP and AI CTO, who shares his unique journey and insights. They discuss the fascinating parallels between standardizing Linux decades ago and the current mission to create a common, efficient stack for AI inference. The conversation delves into the practicalities of making AI work, the evolution to GPU-focused inference with projects like vLLM, the complexities of model optimization, and why a collaborative open source ecosystem is crucial for realizing the full potential of enterprise AI.
00:00 — Chris Wright
Today I wanted to take some time to explore something fascinating that's happening right now. Enterprises are completely re-imagining how AI works in the real world, not in labs, but in actual business environments. And it all comes down to production quality inference at scale, exactly what open source can help us solve. And joining us for this topic today is Brian Stevens. For our listeners, Brian is the Senior Vice President and AI CTO for Red Hat. Welcome to "Technically Speaking", where we explore how open source is shaping the future of technology. I'm your host, Chris Wright.
00:37 — Chris Wright
You and I go way back, so you're not just here, but you're back at Red Hat. Fun fact, you originally hired me here. So, give us a little bit of your view of having been here and then left and done other things and come back. A lot's changed.
00:53 — Brian Stevens
Yeah.
00:54 — Chris Wright
What's transpired in that time?
00:56 — Brian Stevens
A lot's changed and some of it stayed the same, and it's great to be here and great to be especially working with you and the rest of the gang. I'd say there's a lot of parallels. The first time, we were trying to establish a Linux platform for enterprise customers, and the problem we solved for was that enterprises back then had to be committed to one vendor. And we built a common version of Linux across lots of hardware and lots of apps, because in the early days of Linux, right, the apps only worked on one version of Linux. And we kind of brought all that together and it really created reach for vendors, but it also created enterprise value. Believe it or not, I'm back here and I think we're gonna be working on the same thing this time, just in AI inference, where there's lots of great accelerators for running AI and there's lots of great models, but they don't all work on the same common stack, and that's really what I think the next journey is for AI inference to be successful.
01:54 — Chris Wright
The more things change, the more they stay the same. So you left Red Hat, you spent some time at Google, got a lot of exposure to the cloud world, then you went to Neural Magic as CEO, leading a startup. What drew you from cloud to AI and what was the initial impetus for-
02:16 — Brian Stevens
Yeah, I didn't really have an impetus. I was home writing code and using Google Cloud and enjoying myself, quite honestly. But I met Nir Shavit, the MIT professor who co-founded Neural Magic, and I fell in love with him, and I fell in love with the problem they were trying to solve. They were doing this great thing of trying to bring AI inference to CPUs. The initial premise was that the GPUs got it wrong, quite honestly, which was not a truth. Meaning that you could optimize models in such a way as to make them smaller, where they could run really well on CPUs. And then, it's a great dream. You can bring AI everywhere, really powerful AI everywhere, right?
03:11 — Chris Wright
Totally ubiquitous.
03:11 — Brian Stevens
So it was definitely an echo of what we did in our first chapter of Red Hat with Linux and x86, when nobody thought x86 could compete with a high-end processor.
03:20 — Chris Wright
Right, right.
03:21 — Brian Stevens
So we were skipping the step of the GPUs and going right to CPUs, which was the harder problem. And that worked really well, but it was really early for production-level inference, and then ChatGPT was born and that just changed everything for us.
03:41 — Chris Wright
Changed everything for everybody. It was an eye-opening moment for the world. I think putting an interface on a large language model that makes it accessible to everybody and here we are a few years later and Neural Magic's focused on inference. You've got a deep technology focus in a core project, vLLM. What was that pivot like going from CPUs to GPUs, from machine learning to generative AI?
04:11 — Brian Stevens
You always hear about startups pivoting and whatnot, like it's an easy thing, you just go over here. It was tough. It was tough. I mean, we were only 42 people, so I'm not gonna paint it that it was a massive city, it was a village. But culturally, it was hard, because all the engineers had invested so much in their code base for inference. And then, all of a sudden we're saying, well, that doesn't matter, we're gonna go over here. And so, I wouldn't call it disruptive, but it was definitely an emotional thing, right? But it had to happen, because the impetus was, we saw after ChatGPT, not right away, but the following year, that the bet we'd made on open source, that open source LLMs would exist, was starting to play out. And as they started to exist, we saw GPUs had trouble running them. They had just gotten big enough where even on GPUs, people were using multiple GPUs instead of one to run a model. We knew we weren't GPU experts, but we were high performance computing experts and machine learning experts. The team, everybody, got brave enough to say, let's go figure out this new world of GPUs and really help models run efficiently on GPUs.
05:35 — Chris Wright
And I like that, being brave. It takes a lot to jump into a new space. You've got domain expertise in one area, and almost your identity becomes defined that way. Thinking about what you learned in sparsification and making models smaller for CPUs, were there immediate lessons you could bring into the GPU world?
05:58 — Brian Stevens
Yeah. What's funny is, what I've learned in all this is that for the engineers, the ML researchers we had and the high performance computing people on the production side, whether it's a CPU or a GPU is just an implementation detail, but I don't think I appreciated that. You brand yourself a little bit and you think, I'm in this domain, but the reality is amazing engineers are just amazing engineers when you put them on other problems. So, yeah, we had to do a couple of things. We had to, on the model side, really kind of invent some new compression techniques to make the models smaller but just as accurate, techniques that would work on a GPU, which were, let's call them, 80% different than what we were doing on a CPU. Some of them still crossed over. And then, on the high performance computing side, which is what we call building the runtime inference stack, there we knew nothing about programming to CUDA, which is the API for NVIDIA GPUs. And so, we tried to hire people and couldn't find them, like the ones that are -
07:10 — Chris Wright
Yeah, rare commodity.
07:12 — Brian Stevens
Yeah and then the team just got great at it. They just got great at bringing their algorithms to CUDA and that got us off and running pretty quickly.
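To make "bringing their algorithms to CUDA" a bit more concrete, here is a minimal, illustrative GPU kernel written with Triton, a Python-embedded kernel language commonly used alongside hand-written CUDA in inference stacks. The kernel, the weight layout, and all names are hypothetical; it simply shows the flavor of the work: dequantize an int8 weight row with a per-row scale and multiply it against an activation vector, entirely on the GPU.

```python
# Illustrative sketch only, not taken from vLLM or Neural Magic.
import torch
import triton
import triton.language as tl

@triton.jit
def dequant_matvec_kernel(w_ptr, scale_ptr, x_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    # Each program instance handles one row of an int8 weight matrix:
    # dequantize the row with its per-row scale, then dot it with x.
    row = tl.program_id(axis=0)
    offs = tl.arange(0, BLOCK)
    mask = offs < n_cols
    w = tl.load(w_ptr + row * n_cols + offs, mask=mask, other=0).to(tl.float32)
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    scale = tl.load(scale_ptr + row)
    tl.store(out_ptr + row, tl.sum(w * scale * x, axis=0))

def dequant_matvec(w_int8: torch.Tensor, scales: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """out[i] = sum_j (w_int8[i, j] * scales[i]) * x[j], computed on the GPU."""
    rows, cols = w_int8.shape
    out = torch.empty(rows, device=x.device, dtype=torch.float32)
    block = triton.next_power_of_2(cols)          # tl.arange needs a power-of-2 extent
    dequant_matvec_kernel[(rows,)](w_int8, scales, x, out, cols, BLOCK=block)
    return out
```

Production kernels fuse far more than this, but the shape of the problem is the same: map a model-level optimization onto what each GPU generation can execute efficiently.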
07:23 — Chris Wright
And the focus there was bringing models to GPUs, and then there's still a performance optimization. So with sparsification, you're making the model fit well and perform well on a CPU. There's some compression techniques, there's just some raw performance and even flexibility of how you use the underlying hardware. What was your initial focus when you got into the GPU space?
07:46 — Brian Stevens
I'd say the first thing is, we had spent 15 months building a serving stack on CPUs for LLMs, because they process really differently than anything we'd seen in the past, and we learned a lot there. We learned all the techniques you need for serving gen AI large language models. But we knew for GPUs, we were probably gonna start from scratch. And so, the first part was, we picked an open source project. We wanted to go open source this time on that side, and there weren't really a lot of choices out there at that time for code bases. We obviously ended up picking vLLM, the Berkeley project that had just started. But the initial round of optimizations we focused on comes from the fact that compression techniques always map to a particular piece of hardware. With every generation of NVIDIA hardware out there, the NVIDIA engineers build more instruction capability inside the next GPUs. So you can see this lifecycle of hardware, and you wanna do quantization techniques that map to each generation of the hardware. We're working on Blackwell right now, which isn't really available mainstream yet. So we're figuring out how you quantize an LLM for Blackwell in a way that recovers 99% of the accuracy. And then at the same time, on the serving stack side, on the inference side, how do you write these, what are called kernels, which I know is kind of funny for ... How do you write these kernels that actually run those optimizations efficiently on that hardware? You can get these 50%, 2X, 3X, 4X utilization increases by doing it. So it's really important that enterprises have this capability.
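As a minimal sketch of what consuming those hardware-matched quantizations looks like from the user side, the snippet below loads a pre-quantized checkpoint with vLLM's offline Python API. The model name is a placeholder and the settings are illustrative, not a tuned recipe; the point is simply that the quantization scheme you pick should match what the target GPU generation supports.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",  # hypothetical quantized checkpoint
    quantization="fp8",           # usually auto-detected from the checkpoint; shown for clarity
    gpu_memory_utilization=0.90,  # how much of each GPU's memory vLLM may claim
)

outputs = llm.generate(
    ["Summarize why quantization helps GPU utilization."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```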
09:35 — Chris Wright
One of the things I find talking to customers is they're making investments in hardware, or even in reserved cloud instances with GPUs attached, and it's difficult for them to get the full utilization out of that hardware. So something like proven performance, optimizing the runtime to take advantage of the hardware, is super important. I know we'll get into that quite a bit, but I'm curious, in the very beginning, switching to GPUs and identifying what open source project to get involved with, what were the criteria? What were you looking at, and what led you to vLLM?
10:12 — Brian Stevens
I mean, we were proprietary before that on the model serving side. All of the optimizations we would do to models, we would publish the research. So we looked very much like a training organization, we were doing the algorithms for quantization. We'd always publish that research and then commit the code for the tools to open source. But the engine, the CPU engine, was proprietary, for reasons that are quite boring honestly. But this time we wanted to be all open, because as you know, the world of AI is just full of open. The ML people are picking open source tools. They just are. And so, we thought we'd have to start from scratch, but the Sky Computing Lab at Berkeley had, just a month or two before that, I think in June of 2023, started this thing called vLLM, so it's not even two years old now, and they implemented that one first algorithm that you need, that's important to run LLMs. We actually didn't think it would be very popular, but well then, why not just go invest in that one? And so, that's where we started, with the idea that it just makes it easier for people to use your technology.
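For readers who want a concrete picture of "that one first algorithm", here is a toy Python sketch of the memory-management idea behind PagedAttention, the technique vLLM is known for: the KV cache is carved into fixed-size blocks and each sequence keeps a block table, so memory is allocated on demand rather than reserved contiguously up front. This is an illustration of the concept only, not vLLM's actual implementation; all names are made up.

```python
class PagedKVCache:
    """Toy block manager: maps each sequence's tokens onto fixed-size physical blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # pool of physical KV-cache blocks
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: dict[int, int] = {}            # seq_id -> tokens stored so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve space for one new token; returns (physical_block, offset_in_block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:             # current block full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real scheduler would preempt or swap")
            table.append(self.free_blocks.pop())      # grab any free block; no large
                                                      # contiguous allocation is needed
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```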
11:25 — Chris Wright
The pivot into GPUs, the optimization, and the ease of use, I think part of this is about putting models into production. This is where that runtime environment, to me, is so important. And I think we're looking at that as the AI operating system, that next generation of technology. We built Linux, okay, you compile your application once, you run it in production many times. Now we're training a model effectively once, but then you put it into production for inference repeatedly, many times. So maybe we could dig into that view of the AI operating system.
12:11 — Brian Stevens
Yeah. And the thing you said before that I didn't really address is how optimization fits into that production world. Think about how much work your team does on Linux to make it super efficient. You gotta make it reliable, but you also have to make it really efficient and really make hardware utilization high. So I think vLLM is fundamentally focused on that part first, maximum efficiency, put your GPUs to work for you. But to solve for the next one, we found, even at Neural Magic in early work with clients, that to go to production there was a gap, quite honestly, in what they had to build. And this is whether they're a big company putting inference at scale, or a startup, or a traditional enterprise. We found they're all building their own thing. They're building their own scaffolding and framework, right? Unfortunately. And that again was another aha moment. You haven't solved the problem, right, just by building a runtime stack. And so, that began our journey, which started almost a year ago, but now at Red Hat is really starting in earnest, around how you bring inference into a Kubernetes world, right? 'Cause Kubernetes is where people are running production-level application services. So why wouldn't you bring inference into that world, for an enterprise?
13:40 — Chris Wright
Well, it makes sense because it's not a single production workload, it's a whole collection of production workloads. That means you have pooled resources, GPUs managed by some distributed system, Kubernetes. And so, how do we bring these two worlds together and maintain that optimal efficiency?
13:57 — Brian Stevens
Yeah. So, I don't know what the OS is for this whole stack. I guess it depends on whether you're looking up or looking down, but Kubernetes meets vLLM is kind of the first principle, 'cause you could see this other world where you'd be coming to an enterprise and saying, "Okay, we gotta go do something entirely different. Here's a cluster and here's another technology for scaling it." And that would be really disruptive and kind of unneeded. And so, I think the integration of scalable, secure, provisionable, multi-tenant inference, delivered to SLOs, inside of a Kubernetes environment is exactly what enterprises are gonna need.
14:42 — Chris Wright
Yeah, I mean vLLM, I think of that as the core runtime, but making it distributed is critical. And as you described, you could do that in a number of ways. One might be a bespoke mechanism that just does distributed inference, for example, or takes advantage of all of those hardware resources. But then we'd miss the buzzword du jour around agents and the agentic workflows, which bring application logic connected with LLMs, and having that orchestrated across a single platform, at least from my point of view, makes a ton of sense.
15:20 — Brian Stevens
Yeah, at the core, before we even talk about how vLLM has to change for this kind of scale-out world and operating system of AI, the core tenet of it is really powerful. What I think is not often realized is that it brings a common inference platform to multiple accelerators, which is not trivial, right? And I think that's why vLLM has flourished a lot, because all the accelerator and GPU builders out there are directly engaging to get integrated into vLLM.
16:00 — Chris Wright
Yeah.
16:00 — Brian Stevens
On top of that, every frontier model that comes out... we had Qwen just drop, and they've had four or five big major releases of different frontier models this year already, which is crazy. But with every one of them, we always hear about the weights and how they get bigger and smarter, but they also develop new structures and new techniques inside the model, right? And that's what gives them these new capabilities and makes them more efficient. All of those have to be implemented inside of vLLM every time, in a compatible way, of course, going forward. And so, there really is this nice relationship between all the frontier models that are coming out and vLLM, which works really hard to enable them, just like we did with Linux back then. We're still doing that with Linux for every top-performing application. But that's where you see this emergent kind of kernel, right, in your words, for the platform that is cross-hardware and cross-model. All the vendors win with that because they can reach the market, and then enterprises win 'cause they can consume everything really simply.
17:08 — Chris Wright
To me, that's the beauty of open source. It's creating these sort of de facto standard technology building blocks that, in this case, bring together the hardware world and the model world, where you can run optimized models on any hardware and choose your model. There's a lot of choice, especially as more and more models are coming out with open source or at least easy-access licenses, and that solves for any model on any hardware. So then, we switch gears into that. These are big models sometimes, and the core implementation of vLLM is around a KV cache and-
17:52 — Brian Stevens
You went there, not me .
17:55 — Chris Wright
We love technology. And so, I think the next view of vLLM is about really distributing the KV cache and then building all the optimizations that come along with that. How do you see that evolving?
18:11 — Brian Stevens
You wouldn't think it, because there are often these application-centric systems that you scale by replicating them and load balancing them, right? Just add N servers instead of one server and you get N-times scalability.
18:25 — Chris Wright
The web scale.
18:25 — Brian Stevens
Yeah, yeah, that kind of tier. As we've learned, LLM serving's different than that. You could do it that way, but you won't have maximum output in two different dimensions. One, it's really hard if you always have to serve one model on one system. It kind of limits the flexibility that you talked about, 'cause people run big models as well, right? And so, you really want the underlying system to be able to run a model across one GPU on one server, eight GPUs on one server, or GPUs across multiple servers, all participating in the same inference request. So that's one dimension, meeting the service level objectives of the users that you're serving, 'cause they have expectations. And then, the second is around efficiency at the hardware level. So it's really complex how vLLM has to think about all of that. And then, what's happening further, and what you just talked about, is that to support all that, you end up having distributed services in this KV cache, which is on behalf of one inference request. You now wanna replicate that across lots of servers, and that sets you up for this world where any server can respond to a token request, right? If any server can calculate the next token for you, because it has the state of the KV cache, you get all this flexibility in different ways of meeting service level objectives for customers while increasing the number of tokens you can produce per unit of infrastructure. So it kind of all comes together in terms of the distributed inference world.
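As a small illustration of the flexibility Brian describes, vLLM already exposes parts of this as configuration: the same engine can shard one model across several GPUs and reuse cached prefix state across requests. The model name and sizes below are placeholders, a sketch rather than a recommended setup.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder for a large model
    tensor_parallel_size=8,        # split the model's weights across 8 GPUs on one server
    enable_prefix_caching=True,    # reuse KV cache for shared prompt prefixes
)
# For models too big for one node, vLLM can also layer pipeline parallelism
# across servers (pipeline_parallel_size), which is the multi-server case above.
```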
20:11 — Chris Wright
I think, for one, it's super exciting. The building blocks, the OS vision. I'm obviously aligned to that with my Linux history. Distributing that, those are always the hardest problems in computer science. And then, thinking about what that enables. So it's not just technology for technology's sake, but this is building that multi-tenant, multi-model inference capability pooled on hardware. And if we go back to the potential of something like agentic, which has multiple models working together on a common workflow, it's clear you have to have scheduling and sharing of resources, and it starts to look like this whole distributed infrastructure and distributed application world that we understand pretty well. So I think there's a lot in this optimization space where vLLM plays a critical role, but there's probably more around coordinating all of that. How do you see that working together?
21:13 — Brian Stevens
I mean, you alluded to the first core piece, which is obviously vLLM and how vLLM has to change. It actually has to change to integrate into this distributed KV cache world. So that's the core project. It's also dragging in higher performance software for data distribution across servers, trying to leverage what's been out there in the past, but there are also some new, very specific things being built, and those also integrate with all the different hardware fabrics, which are gonna be very heterogeneous in enterprises. That's kind of the bottom half of the stack, and the top half is, with that, it's almost like a layering violation in my... you know where I'm going with this, is that now you want the routing of new requests coming in to be very aware of that distribution of KV cache for different users, which usually you wouldn't do, right? You'd be like, no, load balancers should just be about least loading and things like that. Your desire is that you send it to a system that has to do fewer calculations 'cause it's already got the cache. And then, that's another level of not just speed and performance, but it also increases what your infrastructure can produce, because it's not relying on the GPUs-
22:33 — Chris Wright
It's not repeating the same multiplications over and over.
22:35 — Brian Stevens
Exactly, exactly. So all this is coming together to be honest in kind of the next generation distributed AI inference capabilities.
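To make the idea of cache-aware routing concrete, here is a deliberately simplified Python sketch of a router that prefers the replica already holding the longest cached prefix of an incoming prompt, and only then considers load. Everything here is hypothetical; real inference gateways track cache state and SLOs with far richer signals than a set of strings.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    in_flight: int = 0                                # current outstanding requests
    cached_prefixes: set[str] = field(default_factory=set)

def shared_prefix_len(prompt: str, prefixes: set[str]) -> int:
    # Length of the longest cached prefix this replica could reuse for the prompt.
    return max((len(p) for p in prefixes if prompt.startswith(p)), default=0)

def route(prompt: str, replicas: list[Replica]) -> Replica:
    # Score = cached work we can skip, with queue depth as the tiebreaker.
    best = max(replicas, key=lambda r: (shared_prefix_len(prompt, r.cached_prefixes), -r.in_flight))
    best.in_flight += 1
    return best

# Example: the replica that already served this system prompt wins the request,
# even though it is busier, because it can skip recomputing the shared prefix.
a = Replica("pod-a", in_flight=3, cached_prefixes={"You are a support agent."})
b = Replica("pod-b", in_flight=1)
print(route("You are a support agent. Summarize ticket 1234.", [a, b]).name)  # pod-a
```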
22:45 — Chris Wright
First of all, it's super exciting, especially how quickly it all moves. And second, we touched a little bit on agentic, but if you look at other techniques for making models better, we have scaling limitations in data. We've talked a lot about the data wall in the industry and how much data you can put into a model to improve its outputs. Really, the current focus is moving more towards inference-time scaling, and that puts something like vLLM right in the middle of the picture. When you take inference-time scaling connected with reasoning capabilities, this is all about the importance of inference, that optimization, and that distributed picture you were describing.
23:38 — Brian Stevens
That one was kind of a surprise to me to be honest, because in the past, the whole thing is smaller models, right? Fine-tune them,
23:50 — Chris Wright
Quantize
23:51 — Brian Stevens
get the tokens right on the first shot. You know what I mean? Always get the answer with as few calculations as possible. And now, all of a sudden, we're building these models that produce all these intermediate-state tokens to get you to the final answer. So the final answer's coming from, call it a larger model, or at least a reasoning model, that's producing far more tokens than what you'll ever see. And when you think of what that requires on the platform side, all of a sudden the tax on the platform to deliver that, the cost to deliver each answer, has gone up, right? But we've already realized that it produces better, higher quality answers. So it puts even more of a burden on being as hyper-efficient as possible, right, for an enterprise and their infrastructure. And that's why I think they really need this kind of capability. And you just said it, there are all kinds of different agentic workloads, different size models, reasoning, different QPS rates, with different demands on each of those.
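A rough, entirely hypothetical back-of-envelope calculation shows why this matters: if a reasoning model emits many intermediate tokens for every token the user sees, the compute behind each answer multiplies, so any efficiency gain in the serving stack is multiplied along with it.

```python
# Back-of-envelope illustration with made-up numbers, not measured data.
visible_answer_tokens = 300
reasoning_overhead = 10          # hypothetical: 10x hidden "thinking" tokens per visible token
cost_per_1k_tokens = 0.002       # hypothetical serving cost, in dollars

plain_cost = visible_answer_tokens / 1000 * cost_per_1k_tokens
reasoning_cost = visible_answer_tokens * (1 + reasoning_overhead) / 1000 * cost_per_1k_tokens
print(f"plain answer:     ${plain_cost:.4f}")      # $0.0006
print(f"reasoning answer: ${reasoning_cost:.4f}")  # $0.0066

# A 3x utilization win from better kernels and quantization scales the whole bill:
print(f"reasoning answer on a 3x more efficient stack: ${reasoning_cost / 3:.4f}")  # $0.0022
```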
25:00 — Chris Wright
Yeah, which is why I think vLLM is such an exciting project and that inference space is so important and taking it from the core kernel to distributed and then solving all those hard problems along the way. You've got all the hardware vendors that you mentioned. You've got the different model providers coming to... We work together to build optimized models on day zero when they're released. How do you see that coming together at this broader community level?
25:27 — Brian Stevens
This is kind of ironic, but I've always been of the view that it's more about the ecosystem than the technology. And I know that's funny, as you and I work together on the technology part of it. What I mean is, the picture that we're painting here shouldn't just be... I don't think a one-vendor approach would ever be successful, whether that's proprietary or open source, quite honestly. I think it has to be one that is cross-industry: all the accelerators, all the model providers. Now we're talking about network interconnects, we're talking about server manufacturers. It's everybody, right? It's the agentic platforms and how they integrate. So there's one path, which is to land some open source project out there and hope people notice it and wanna build around it. And then, there's another approach that says, let's bring the ecosystem together, get like-minded people together, even if they compete in some aspects of their business, get everybody together around this common spec of what we've been talking about for distributed inference, and let them go build it shoulder to shoulder.
26:38 — Chris Wright
We've seen that work well in the Linux space, we've seen it work well in the Kubernetes space, this classic coopetition. We work together in communities, and we find different ways to compete in the market commercially. But it builds better technology into this robust ecosystem that you described. I don't always think of it that way. I think of a de facto standard technology, of bringing it outside in, all the people coming together to build the-
27:04 — Brian Stevens
Yeah. It is true, because they've got diverse opinions, right? And one thing I love about open... well, developers generally, but open source developers especially, is that they're used to the idea that their own opinion might not be the one that lands. It's shaped in some way, right? And I think tackling what we've just talked about is probably one of the biggest, not just projects in open source AI, but one that could have a really large impact on whether or not it works at all for enterprises. Because I just can't imagine this world where enterprises are in a DIY mode, with 20 different platforms to deploy 20 different accelerators and 20 different models, and they can't drive to the next level of efficiency, and therefore AI's not delivering the value they want. I think if we solve for this as a community of developers, it really sets up AI to have maximum impact. I really feel that way.
28:06 — Chris Wright
Well, I think that's an awesome place to leave it. We're building this technology together to make AI accessible to the enterprise through the broadest ecosystem, because that's how it'll be effective at creating value and realizing the potential of AI in businesses. And as technologists, I think it's a pretty awesome place to be.
28:33 — Brian Stevens
And what better place to do that than here?
28:36 — Chris Wright
Yeah, and in open source. Well, awesome. Really appreciate the time, Brian. Great conversation.
28:43 — Chris Wright
What a fascinating look under the hood of AI inference with Brian Stevens today. The big takeaway for me: as we push the boundaries of hardware and bring models to production, the community is helping transform the enterprise AI landscape. It's a story we've seen before, mirroring the open source journey of Linux: smart, collaborative problem-solving making powerful technology accessible. It shows that progress isn't just inevitable; it's built, often by passionate communities sharing their insights and expertise. That blend of ingenuity and accessibility keeps me truly optimistic for AI's next chapter.
29:23 — Chris Wright
Thanks for joining the conversation, I'm Chris Wright and I can't wait to see what we explore next on "Technically Speaking".
Keywords: AI, ML

Brian Stevens
Senior Vice President and AI Chief Technology Officer, Red Hat
Keep exploring
Understanding vLLM: Efficient LLM serving
vLLM is an open source library designed for fast and easy LLM inference and serving. Discover how it optimizes large language model performance on GPUs through innovative techniques like PagedAttention, helping to make advanced AI models more accessible and cost-effective.
Open Source communities driving AI forward
The rapid advancements in AI are significantly propelled by open source communities. Engaging with projects like vLLM on platforms like GitHub allows for collaborative development, shared innovation, and the creation of transparent, adaptable AI tools for everyone.
More like this

How open source can help with AI transparency
Delve into the complexities of achieving transparency in AI through open source practices. Explore the challenges around making AI more open, trustworthy, and accountable, and how projects like TrustyAI aim to address bias.

Building Trust in Enterprise AI
To realize the power of AI/ML in enterprise environments, users need an inference engine to run on their hardware. Two open toolkits from Intel do precisely that.

Bringing Deep Learning to Enterprise Applications
Like houseplants, machine learning models require some attention to thrive. That's where MLOps and ML pipelines come in.