
Integrate vLLM inference on macOS/iOS using OpenAI APIs

Part 3: A developer's guide to vLLM on macOS and iOS

June 25, 2025
Rich Naszcyniec
Related topics:
Artificial intelligence
Related products:
Red Hat AI


    This post is the third in a series about using vLLM inference in macOS and iOS applications. It will explain how to communicate with vLLM using the OpenAI specification as implemented by the SwiftOpenAI and MacPaw/OpenAI open source projects. The first article presented a business case for a vLLM-powered chatbot for macOS and iOS. The second article introduced a sample chatbot application communicating with vLLM via a Llama Stack server using the Llama Stack Swift SDK.

    Why use OpenAI to communicate with vLLM?

    The OpenAI API specification, governed and maintained by OpenAI (the creators of ChatGPT and earlier GPT models), provides a RESTful standard for interacting with inference servers. Its official documentation details endpoints, request/response formats, and overall structure.

    Developers often prefer the OpenAI API specification for communicating with inference servers such as vLLM due to its widespread adoption and ease of implementation and integration. Its simple design includes well-documented RESTful architecture, JSON-based requests, and clear endpoints for tasks like text generation, embeddings, and chat completions. This familiarity streamlines development and lowers the initial learning requirements, particularly for teams already accustomed to it.

    Why use an API wrapper for an API endpoint?

    To interact with a vLLM instance that uses the OpenAI endpoint, Swift code is necessary for making OpenAI API REST requests. This article explores reducing development time and coding complexity by leveraging an API wrapper approach. Specifically, using the SwiftOpenAI and MacPaw/OpenAI open source projects enables the sample code to perform REST calls with fewer lines of code compared to lower-level alternatives such as URLSession. Furthermore, adopting these open source projects benefits from the community's existing implementation and testing of the OpenAI specification, saving development time.
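
    To make the trade-off concrete, here is a rough sketch of what a single, non-streaming chat completion request looks like when written directly against URLSession. The request and response types, server URL, and model name below are assumptions for illustration only; they are not part of the sample project.

    import Foundation

    // Minimal request/response types matching the shape of the OpenAI chat completions API.
    // The endpoint URL and model name are placeholders, not values from the sample project.
    struct ChatMessage: Codable { let role: String; let content: String }
    struct ChatRequest: Codable { let model: String; let messages: [ChatMessage] }
    struct ChatChoice: Codable { let message: ChatMessage }
    struct ChatResponse: Codable { let choices: [ChatChoice] }

    func sendChatCompletion(prompt: String) async throws -> String {
        // Assumes a vLLM server exposing the OpenAI-compatible API at this base URL.
        var request = URLRequest(url: URL(string: "http://localhost:8000/v1/chat/completions")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(
            ChatRequest(model: "my-model", messages: [ChatMessage(role: "user", content: prompt)])
        )
        let (data, _) = try await URLSession.shared.data(for: request)
        return try JSONDecoder().decode(ChatResponse.self, from: data).choices.first?.message.content ?? ""
    }

    Even this non-streaming version needs its own request and response types; streaming adds server-sent event parsing on top, which is exactly the work the wrapper projects take on.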

    However, this approach involves a trade-off. Employing the API wrapper strategy means surrendering some granular control and detailed error handling capabilities. This is a reasonable compromise considering the sample chatbot focuses on fundamental text inference.

    Get the sample source code set up

    To support this article's learning objectives, sample source code is available in a GitHub repository. To get started, clone it to a macOS computer with Xcode 16.4 or later installed. We will use Xcode to review, edit, and compile the code.

    Here are the steps to follow to clone the repository using Xcode so you can start using the project:

    1. Make sure you have set your source code control preferences in Xcode.
    2. Select the Clone Git Repository option in the Xcode startup dialog, or the Clone… option under the Integrate menu in the main menu bar.
    3. In the dialog box that appears, enter “http://github.com/RichNasz/vLLMSwiftApp.git” into the search bar at the top of the dialog box.
    4. In the file dialog that appears, choose the folder you want to clone the repository into.

    Build the project and developer documentation

    Once the cloning process is complete, a new Xcode project will open and immediately start building. Stop that build using the Product → Stop menu option. Once that is done, check the following items to make sure you get a clean build:

    1. During your first build of the project, a dialog box will pop up asking you to Enable & Trust the OpenAPIGenerator extension. Select the option to “Trust & Enable” when prompted. If you don’t do this, code for the SDK can’t be generated, and the code won’t work.
    2. You need to have your Apple developer account set up in Xcode so that projects you work on can be signed and used on physical devices. You can create a free Apple Developer account, or use an existing account.
      1. First, make sure your developer account is set via Xcode Settings… → Accounts.
      2. Select the vLLMSwiftApp in the project navigator, and then set the values in the Signing & Capabilities section of the vLLMSwiftApp target.
        1. Select the checkbox for “Automatically manage signing”.
        2. Set the team to your personal team.
        3. Choose a unique bundle identifier for your build.

    Once you have verified the critical items identified above, go ahead and clean the build folder using the Product → Clean Build Folder… menu option. Then start a new build using the Product → Build menu option. The initial build will take a while to complete since all package dependencies (such as SwiftOpenAI and MacPaw/OpenAI) must be downloaded before a full build starts. Provided the build completes without error, you can now run the sample chatbot.

    In addition to building the source code, build the developer documentation by selecting the Product → Build Documentation menu item in Xcode. This build takes a minute or two to complete, and you can open the generated Developer Documentation using the Xcode Help → Developer Documentation menu item. Once the documentation is open, select the vLLMSwiftApp item, and then the vLLM Swift Application (vLLMSwiftApp) item in the sidebar on the left of the help screen. We’ll use the generated documentation to help simplify the code review process.

    Chatbot code walkthrough

    The chatbot code in this article builds upon the previous article in this series. Four new source code files provide the chatbot functionality: SwiftOpenAIChatViewModel.swift and MacPawOpenAIChatViewModel.swift in the Models folder handle remote connections to the vLLM server, while SwiftOpenAIChatView.swift and MacPawOpenAIChatView.swift in the Views folder present the SwiftUI-based user interface.

    Since the source code follows the Model-View-ViewModel (MVVM) pattern commonly used in SwiftUI development, the new SwiftOpenAI and MacPaw/OpenAI functionality was easily integrated into the existing source code, which previously demonstrated only Llama Stack functionality.

    Note on general sample code functionality

    If you need to review the basic structure of the chatbot application, see the “Understanding basic application structure” article included in the developer documentation you built earlier.

    Now let’s take a closer look at the SwiftOpenAI and MacPaw/OpenAI code required to make inference calls to vLLM.

    SwiftOpenAI implementation

    SwiftOpenAI is a dedicated OpenAI API wrapper that is growing in popularity, with over 520 GitHub stars and 14 active contributors, and it is available under the permissive MIT license.

    The project supports ten different OpenAI endpoints, including the chat completions endpoint used in the sample code. Getting started with the API was fairly simple, since documentation, a sample application, and of course the source code are available with the project.

    The project is geared towards communicating directly with OpenAI-hosted endpoints, so a little configuration work is required to set the API key and the URL of the vLLM server to connect to for the inference call. This is done using the service method of the OpenAIServiceFactory class.

    Generated developer documentation is available

    The creators of SwiftOpenAI included DocC-compliant comments in their source code, so you can look up the Classes, Structures, Type Aliases, and Enumerations associated with the project in the documentation you generated earlier!

    The code below, from the SwiftOpenAIChatViewModel.swift file, sets up the server connection to use for the inference request:

    // default value is 60 seconds, and if you need longer than that, set the durations here
    let configuration = URLSessionConfiguration.default
    configuration.timeoutIntervalForRequest = 60 // set same as default for now

    let inferenceService: OpenAIService // allowed due to deferred initialization
    if let apiKey = onServer.apiKey { // if we have an API key then use it
        inferenceService = OpenAIServiceFactory.service(apiKey: apiKey, overrideBaseURL: onServer.url, configuration: configuration)
    } else {
        inferenceService = OpenAIServiceFactory.service(apiKey: "", overrideBaseURL: onServer.url, configuration: configuration)
    }

    The first API method to be called is OpenAIServiceFactory.service, which takes three parameters:

    1. The first parameter is the API key value to send. The sample code checks whether an API key is defined for the server the app user selected for the chat and, if so, uses it. Otherwise, a blank value is used. This logic is controlled by the if statement.
    2. The second required parameter is the URL to use for connecting to the vLLM server. The URL needs to contain the protocol, server name, port, and base path. Since the user defines this in the server definition, we use that value directly (see the example after this list).
    3. The third parameter is a collection of configuration values for the connection. None of the default configuration values need to be changed for the sample code, so we can use the URLSessionConfiguration.default value.
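
    As an example of what that URL value might look like: vLLM’s OpenAI-compatible server defaults to port 8000 and serves its API under the /v1 base path, so a typical server definition could contain values like the following (the host name here is a placeholder):

    // Hypothetical server definition values for a local vLLM instance.
    // vLLM's OpenAI-compatible server defaults to port 8000 and the /v1 base path.
    let serverURL = "http://localhost:8000/v1"  // protocol + host + port + base path
    let apiKey: String? = nil                   // set only if the vLLM server requires a key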

    Once the call to OpenAIServiceFactory.service is made, we are ready to move on to setting up the inference request to send to the vLLM server. The code for this is simple:

    // Set the parameters that will be used to request inference.
    // Only two parameters are required, but there are many more available to enhance future code with.
    // Add a .system message in the future for prompt engineering.
    let inferenceParameters = ChatCompletionParameters(
        messages: [.init(role: .user, content: .text(prompt))],
        model: .custom(modelName)
    )

    As shown in the code, two parameters are required:

    • The text prompt that the user entered to submit for the inference request.
    • The model name associated with the server definition. The SwiftOpenAI source code for the Model enumeration provides multiple pre-defined options for model names; however, the sample code uses the .custom option, which allows a text value for the model name to be provided.
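
    One of the comments in the code above points to an easy extension: adding a .system message for prompt engineering. Here is a sketch of what that could look like using the same ChatCompletionParameters initializer; the system prompt text is just an example.

    // Sketch: prepend a .system message for prompt engineering.
    // The system prompt text below is an arbitrary example.
    let inferenceParameters = ChatCompletionParameters(
        messages: [
            .init(role: .system, content: .text("You are a concise, helpful assistant.")),
            .init(role: .user, content: .text(prompt))
        ],
        model: .custom(modelName)
    )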

    Now that the service to connect to is defined and the parameters are set, the call to the vLLM server can be made using the following code:

    // Start the stream
    do {
        llmResponse = "" // need to set the variable that holds the response to empty before inference call is made
        let stream = try await inferenceService.startStreamedChat(parameters: inferenceParameters)
        for try await result in stream {
            let content = result.choices?.first?.delta?.content ?? ""
            // Directly set the llmResponse in the for loop since the class is observable and we want updates
            self.llmResponse += content
        }
    }
    catch APIError.responseUnsuccessful(let description, let statusCode) {
        throw swiftOpenAIInferenceError.apiError(statusCode: statusCode, description: description)
    }
    catch {
        throw swiftOpenAIInferenceError.apiError(statusCode: -999, description: "Unexpected error: \(error.localizedDescription)")
    }

    Before each call to the vLLM server is made, the llmResponse variable used to store the inference response is cleared. Next, a streaming response request is made using Swift concurrency. Since the code doesn’t currently allow the user to cancel an inference request, wrapping the stream in a task isn’t required. (Save that for a future code feature!)
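
    If cancellation is added later, a common pattern is to hold the streaming work in a Task that the UI can cancel. The following rough sketch shows how that might look inside the view model; the inferenceTask property is hypothetical, and it assumes inferenceService and inferenceParameters are available as properties rather than locals.

    // Hypothetical: keep a handle to the streaming work so the UI can cancel it.
    private var inferenceTask: Task<Void, Never>?

    func startInference() {
        inferenceTask = Task {
            do {
                let stream = try await inferenceService.startStreamedChat(parameters: inferenceParameters)
                for try await result in stream {
                    try Task.checkCancellation() // stop promptly if the user cancelled
                    llmResponse += result.choices?.first?.delta?.content ?? ""
                }
            } catch is CancellationError {
                // The user cancelled; keep whatever partial response has streamed in.
            } catch {
                llmResponse = "Error: \(error.localizedDescription)"
            }
        }
    }

    func cancelInference() {
        inferenceTask?.cancel()
    }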

    Once the streaming request is made, response chunks are received in the for loop, with each chunk of the response added to the llmResponse variable. Since the class is marked @Observable, the SwiftUI view in SwiftOpenAIChatView.swift can observe the llmResponse variable and update the user interface as inference responses are received.
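
    The observation wiring itself is minimal. The following simplified sketch shows the pattern; the type and property names are illustrative rather than the exact declarations from the repository.

    import SwiftUI
    import Observation

    // Illustrative view model: because it is marked @Observable, any SwiftUI view
    // that reads llmResponse is re-rendered whenever a new chunk is appended.
    @Observable
    final class StreamingChatViewModel {
        var llmResponse: String = ""
    }

    // Illustrative view: displays the response as it streams in.
    struct StreamingChatView: View {
        let viewModel: StreamingChatViewModel

        var body: some View {
            ScrollView {
                Text(viewModel.llmResponse)
                    .frame(maxWidth: .infinity, alignment: .leading)
                    .padding()
            }
        }
    }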

    Finally, should an error occur in a vLLM request, the error is captured in the sendMessage method (not shown) and the llmResponse value is set to the error text. This isn’t a rich error notification or recovery implementation, but it does give the user feedback that an error occurred, along with insight into its cause.

    That’s it! That’s all the code needed to make a vLLM inference call using SwiftOpenAI, and usage was fairly easy. While the project documentation is somewhat concise, it covers all of the package’s capabilities. For a few details not covered in the project documentation, I referred to the developer documentation generated from the source code, and to the source code itself. This was straightforward, and I would recommend using SwiftOpenAI despite this minor point.

    More SwiftOpenAI functionality is available

    The current sample code only demonstrates a basic text inference request, but the SwiftOpenAI project offers far more functionality as defined in the OpenAI API specification.

    MacPaw/OpenAI implementation

    MacPaw/OpenAI is a popular OpenAI API Swift implementation with over 2,500 GitHub stars and 63 active contributors, and it is available under the permissive MIT license.

    The project supports major OpenAI endpoints, including the chat completions endpoint used in the sample code. To implement nonblocking code, developers can choose between closures, Apple's Combine framework, or Swift structured concurrency when using the API. I used the structured concurrency option, as it is the most current and “modern” approach to concurrency when developing in Swift. Also, with Swift 6 usage becoming more mainstream, structured concurrency is a must, since that language version makes strict concurrency checking part of the build process.

    The code discussed below is from the MacPawOpenAIChatViewModel.swift file.

    The initial challenge in using the API stemmed from limited project documentation and my current level of skill developing with Swift. The project's GitHub documentation provides a very basic framework for implementing chat streaming using closures, Combine, and structured concurrency.

    Given the scarcity of project documentation, I explored the developer documentation generated via Xcode. I was pleased to discover DocC comments within the code, enabling documentation generation. Initially, the OpenAI.chatsStream method documentation seemed to only cover closure usage, while I sought information on structured concurrency. However, a closer examination revealed documentation for the OpenAIAsynch implementations of the same method, which was great.

    func chatsStream(query: ChatQuery) -> AsyncThrowingStream<ChatStreamResult, Error>

    The snippet above shows the method signature for the async variant of the call, which is what we need in order to use structured concurrency.
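
    The chatsStream call shown below needs an OpenAI client and a ChatQuery to work with. Here is a rough sketch of that setup; the initializer labels have changed between MacPaw/OpenAI releases, so treat the parameter names here as assumptions and verify them against the generated documentation for the version your build resolves.

    // Rough sketch only: verify these initializer labels against the documentation
    // for the MacPaw/OpenAI version resolved in your build.
    let configuration = OpenAI.Configuration(
        token: apiKey ?? "",  // vLLM may not require a key; an empty token is assumed to be acceptable
        host: "localhost",    // placeholder host name for a local vLLM server
        port: 8000,           // vLLM's default OpenAI-compatible port
        scheme: "http"
    )
    let openAI = OpenAI(configuration: configuration)

    // A single user message; the message-construction syntax is also version dependent.
    let chatQuery = ChatQuery(
        messages: [.user(.init(content: .string(prompt)))],
        model: modelName
    )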

    Since chatsStream returns an AsyncThrowingStream, the sample code shown below uses for-await-in syntax to process the chunks of streaming responses from the vLLM server, and catch blocks to handle any errors that are thrown.

    llmResponse = "" // need to set the variable that holds the response to empty before inference call is made
      // use structured concurrency to make the inference request and process results
      do {
        for try await result in openAI.chatsStream(query: chatQuery) {
          // we need to check the result for errors first that are sent back as
          let content = result.choices.first?.delta.content ?? ""
          // Directly set the llmResponse in the for loop since the class is observable and we want updates
          self.llmResponse += content
          let finishReason = result.choices.first?.finishReason
          switch finishReason {
           case .stop:
            break // **TODO: implement any code for last chunk returned**
           case .length:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI cumulative tokens exceed max\_tokens")
            // **TODO: implement any code if the cumulative tokens exceed max\_tokens**
           case .contentFilter:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI safety filters were triggered")
            // **TODO: code for when OpenAI safety filters are triggered**
           case .error:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI Error: check prompt")
            // **TODO: find where error details are available**
           case .functionCall:
            break // **TODO: code for handling function calls**
           case .toolCalls:
            break // **TODO: code for handling tool calls**
           case .none:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI Error: response is incomplete or interrupted")
            // **TODO: implement code for cases where the response is incomplete or interrupted.**
          }
        }
      }
      catch let error as OpenAIError {
       let errorText = error.localizedDescription
       switch errorText {
        case let errorCheck where errorCheck.contains("400"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 400, description: "Bad request: Check model name")
        case let errorCheck where errorCheck.contains("401"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 401, description: "Invalid API key")
        case let errorCheck where errorCheck.contains("429"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 429, description: "Rate limit of quota exceeded")
        case let errorCheck where errorCheck.contains("503"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 429, description: "The engine is currently overloaded, please try again later")
        default:
         throw macPawOpenAIInferenceError.apiError(statusCode: -999, description: " Unkown API error: \\(errorText)")
       }
      }
      catch {
       // Print error type and details when debugging
       //   print("Error type: \\(type(of: error))")
       //   print("API error description: \\(error)")
       let errorText = error.localizedDescription
       switch errorText {
        case let errorCheck where errorCheck.contains("-1004"):
         throw macPawOpenAIInferenceError.apiError(statusCode: -1004, description: "Could not connect to the server")
        default:
         throw macPawOpenAIInferenceError.apiError(statusCode: -999, description: " \\(errorText)")
       }
      }

    Any low-level errors will cause an error to be thrown and caught in a catch block. This allows the loop to simply append each chunk value to the existing llmResponse contents.

    Logical errors reported by vLLM are detected inside the do block, while critical error handling is performed in the catch blocks. I find it good practice to create a catch block for each type of error that can be thrown; as shown in the code, one catch block is specific to OpenAIError, and a default catch block handles any unexpected errors.

    That’s it for the code needed to make a vLLM inference using MacPaw/OpenAI. The process was relatively straightforward after navigating the Swift concurrency documentation. While the extensive configuration options offer granular control, they could potentially be simplified into a URL string.

    Initially, the bridging functions used to adapt the closure-based asynchronous APIs in the project to Swift's concurrency model (returning an AsyncThrowingStream) raised performance concerns. However, research suggests any performance impact from this approach is minimal.
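
    That bridging pattern is a standard Swift concurrency technique: wrap the closure-based callbacks in an AsyncThrowingStream continuation. The generic sketch below illustrates the idea; legacyStream is a hypothetical callback-based function, not MacPaw/OpenAI’s internal code.

    // Generic sketch of bridging a closure-based streaming API into Swift concurrency.
    // `legacyStream` is a hypothetical callback-based function, not MacPaw's internals.
    func legacyStream(onChunk: @escaping (String) -> Void,
                      onCompletion: @escaping (Error?) -> Void) { /* calls back as data arrives */ }

    func chunks() -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream<String, Error> { continuation in
            legacyStream(
                onChunk: { chunk in
                    continuation.yield(chunk)            // forward each chunk to the async stream
                },
                onCompletion: { error in
                    continuation.finish(throwing: error) // a nil error finishes the stream normally
                }
            )
        }
    }

    Because the continuation only forwards values that the callbacks already produce, the overhead is small, which is consistent with the minimal performance impact noted above.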

    In conclusion, I would confidently use the MacPaw/OpenAI project again, and hope its documentation continues to evolve.

    More MacPaw/OpenAI functionality is available

    The current sample code only demonstrates a basic text inference request, but the MacPaw/OpenAI project offers far more functionality as defined in the OpenAI API specification.

    Conclusion

    This article explored integrating vLLM inference into macOS and iOS applications using SwiftOpenAI and MacPaw/OpenAI. Both libraries provide effective wrappers for the OpenAI API, simplifying the process of communicating with a vLLM server. SwiftOpenAI offers a straightforward setup with clear documentation, making it easy to implement basic text inference requests. MacPaw/OpenAI, despite some documentation limitations, provides robust functionality and excellent error handling with Swift's structured concurrency, albeit with a slightly steeper learning curve.

    While the sample code focused on basic text inference, both projects offer extensive capabilities aligned with the OpenAI API specification, allowing for future expansion and more complex interactions with vLLM. Apple developers can leverage these open source libraries to accelerate the development of vLLM-powered applications on Apple platforms, balancing simplicity with flexibility depending on their specific needs.

    Ready to build your own vLLM-powered macOS and iOS applications?

    Clone the GitHub repository with the sample code now and start experimenting with SwiftOpenAI and MacPaw/OpenAI to integrate powerful AI inference into your projects. Explore the provided sample code and build instructions, and consult the developer documentation to get started. When running the code, be sure to connect to an existing vLLM instance to get started quickly. If you need help setting up vLLM for the first time, take a look at the quickstart documentation for the project.

    Now go have some coding fun!
