
Integrate vLLM inference on macOS/iOS using OpenAI APIs

Part 3: A developer's guide to vLLM on macOS and iOS

June 25, 2025
Rich Naszcyniec
Related topics:
Artificial intelligence
Related products:
Red Hat AI


    This post is the third in a series about using vLLM inference in macOS and iOS applications. It will explain how to communicate with vLLM using the OpenAI specification as implemented by the SwiftOpenAI and MacPaw/OpenAI open source projects. The first article presented a business case for a vLLM-powered chatbot for macOS and iOS. The second article introduced a sample chatbot application communicating with vLLM via a Llama Stack server using the Llama Stack Swift SDK.

    Why use OpenAI to communicate with vLLM?

    The OpenAI API specification, governed and maintained by OpenAI (the creators of ChatGPT and earlier GPT models), provides a RESTful standard for interacting with inference servers. Its official documentation details endpoints, request/response formats, and overall structure.

    Developers often prefer the OpenAI API specification for communicating with inference servers such as vLLM due to its widespread adoption and ease of implementation and integration. Its simple design includes well-documented RESTful architecture, JSON-based requests, and clear endpoints for tasks like text generation, embeddings, and chat completions. This familiarity streamlines development and lowers the initial learning requirements, particularly for teams already accustomed to it.

    Why use an API wrapper for an API endpoint?

    To interact with a vLLM instance that uses the OpenAI endpoint, Swift code is necessary for making OpenAI API REST requests. This article explores reducing development time and coding complexity by leveraging an API wrapper approach. Specifically, using the SwiftOpenAI and MacPaw/OpenAI open source projects enables the sample code to perform REST calls with fewer lines of code compared to lower-level alternatives such as URLSession. Furthermore, adopting these open source projects benefits from the community's existing implementation and testing of the OpenAI specification, saving development time.
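
    To make the trade-off concrete, here is a rough sketch of what a single, non-streaming chat completion request looks like when written directly against URLSession. The request and response types, server URL, and model name below are assumptions for illustration only; they are not part of the sample project.

    import Foundation

    // Minimal request/response types matching the shape of the OpenAI chat completions API.
    // The endpoint URL and model name are placeholders, not values from the sample project.
    struct ChatMessage: Codable { let role: String; let content: String }
    struct ChatRequest: Codable { let model: String; let messages: [ChatMessage] }
    struct ChatChoice: Codable { let message: ChatMessage }
    struct ChatResponse: Codable { let choices: [ChatChoice] }

    func sendChatCompletion(prompt: String) async throws -> String {
        // Assumes a vLLM server exposing the OpenAI-compatible API at this base URL.
        var request = URLRequest(url: URL(string: "http://localhost:8000/v1/chat/completions")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")
        request.httpBody = try JSONEncoder().encode(
            ChatRequest(model: "my-model", messages: [ChatMessage(role: "user", content: prompt)])
        )
        let (data, _) = try await URLSession.shared.data(for: request)
        return try JSONDecoder().decode(ChatResponse.self, from: data).choices.first?.message.content ?? ""
    }

    Even this non-streaming version needs its own request and response types; streaming adds server-sent event parsing on top, which is exactly the work the wrapper projects take on.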

    However, this approach involves a trade-off. Employing the API wrapper strategy means surrendering some granular control and detailed error handling capabilities. This is a reasonable compromise considering the sample chatbot focuses on fundamental text inference.

    Get the sample source code set up

    To support this article's learning objectives, sample source code is available in a GitHub repository. To get started, clone it to a macOS computer with Xcode 16.4 or later installed. We will use Xcode to review, edit, and compile the code.

    Here are the steps to follow to clone the repository using Xcode so you can start using the project:

    1. Make sure you have set your source code control preferences in Xcode.
    2. Select the Clone Git Repository option in the Xcode startup dialog, or the Clone… option under the Integrate menu in the main menu bar.
    3. In the dialog box that appears, enter “http://github.com/RichNasz/vLLMSwiftApp.git” into the search bar at the top of the dialog box.
    4. In the file dialog that appears, choose the folder you want to clone the repository into.

    Build the project and developer documentation

    Once the cloning process is complete, a new Xcode project will open and immediately start building. Stop that build using the Product → Stop menu option. Once that is done, check the following items to make sure you get a clean build:

    1. During your first build of the project, a dialog box will pop up asking you to Enable & Trust the OpenAPIGenerator extension. Select the option to “Trust & Enable” when prompted. If you don’t do this, code for the SDK can’t be generated, and the code won’t work.
    2. You need to have your Apple developer account set up in Xcode so that projects you work on can be signed and used on physical devices. You can create a free Apple Developer account, or use an existing account.
      1. First, make sure your developer account is set via Xcode Settings… → Accounts.
      2. Select the vLLMSwiftApp in the project navigator, and then set the values in the Signing & Capabilities section of the vLLMSwiftApp target.
        1. Select the checkbox for “Automatically manage signing”.
        2. Set the team to your personal team.
        3. Choose a unique bundle identifier for your build.

    Once you have verified the critical items identified above, go ahead and clean the build folder using the Product → Clean Build Folder… menu option. Then start a new build using the Product → Build menu option. The initial build will take a while to complete since all package dependencies (such as SwiftOpenAI and MacPaw/OpenAI) must be downloaded before a full build starts. Provided the build completes without error, you can now run the sample chatbot.

    In addition to building the source code, build the developer documentation by selecting the Product → Build Documentation menu item in Xcode. This build takes a minute or two to complete, and you can open the generated Developer Documentation using the Xcode Help → Developer Documentation menu item. Once the documentation is open, select the vLLMSwiftApp item, and then the vLLM Swift Application (vLLMSwiftApp) item in the sidebar on the left of the help screen. We’ll use the generated documentation to help simplify the code review process.

    Chatbot code walkthrough

    The chatbot code in this article builds upon the previous article in this series. Four new source code files provide the chatbot functionality: SwiftOpenAIChatViewModel.swift and MacPawOpenAIChatViewModel.swift in the Models folder handle remote connections to the vLLM server, while SwiftOpenAIChatView.swift and MacPawOpenAIChatView.swift in the Views folder present the SwiftUI-based user interface.

    Since the source code follows the Model-View-ViewModel (MVVM) pattern commonly used in SwiftUI development, the new SwiftOpenAI and MacPaw/OpenAI functionality was easily integrated into the existing source code, which previously demonstrated only Llama Stack functionality.

    Note on general sample code functionality

    If you need to review the basic structure of the chatbot application, see the “Understanding basic application structure” article included in the developer documentation you built earlier.

    Now let’s take a closer look at the SwiftOpenAI and MacPaw/OpenAI code required to make inference calls to vLLM.

    SwiftOpenAI implementation

    SwiftOpenAI is a dedicated OpenAI API wrapper that is growing in popularity, with over 520 GitHub stars and 14 active contributors, and it is available under the permissive MIT license.

    The project supports ten different OpenAI endpoints, including the chat completions endpoint used in the sample code. Getting started with the API was fairly simple, since documentation, a sample application, and of course the source code are available with the project.

    The project is geared towards communicating directly with OpenAI-hosted endpoints, so a little configuration work is required to set the API key and the URL of the vLLM server to connect to for the inference call. This is done using the service method of the OpenAIServiceFactory class.

    Generated developer documentation is available

    The creators of SwiftOpenAI included DocC-compliant comments in their source code, so you can look up the Classes, Structures, Type Aliases, and Enumerations associated with the project in the documentation you generated earlier!

    The code below, from the SwiftOpenAIChatViewModel.swift file, sets up the server connection to use for the inference request:

    // default value is 60 seconds, and if you need longer than that, set the durations here
    let configuration = URLSessionConfiguration.default
    configuration.timeoutIntervalForRequest = 60 // set same as default for now

    let inferenceService: OpenAIService // allowed due to deferred initialization
    if let apiKey = onServer.apiKey { // if we have an API key then use it
        inferenceService = OpenAIServiceFactory.service(apiKey: apiKey, overrideBaseURL: onServer.url, configuration: configuration)
    } else {
        inferenceService = OpenAIServiceFactory.service(apiKey: "", overrideBaseURL: onServer.url, configuration: configuration)
    }

    The first API method to be called is OpenAIServiceFactory.service, which takes three parameters:

    1. The first parameter is the API key value to send. The sample code checks whether an API key is defined for the server the app user selected for the chat and, if so, uses it. Otherwise, a blank value is used. This logic is controlled by the if statement.
    2. The second required parameter is the URL to use for connecting to the vLLM server. The URL needs to contain the protocol, server name, port, and base path. Since the user defines this in the server definition, we use that value directly (see the example after this list).
    3. The third parameter is a collection of configuration values for the connection. None of the default configuration values need to be changed for the sample code, so we can use the URLSessionConfiguration.default value.
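
    As an example of what that URL value might look like: vLLM’s OpenAI-compatible server defaults to port 8000 and serves its API under the /v1 base path, so a typical server definition could contain values like the following (the host name here is a placeholder):

    // Hypothetical server definition values for a local vLLM instance.
    // vLLM's OpenAI-compatible server defaults to port 8000 and the /v1 base path.
    let serverURL = "http://localhost:8000/v1"  // protocol + host + port + base path
    let apiKey: String? = nil                   // set only if the vLLM server requires a key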

    Once the call to OpenAIServiceFactory.service is made, we are ready to move on to setting up the inference request to send to the vLLM server. The code for this is simple:

    // Set the parameters that will be used to request inference.
    // Only two parameters are required, but there are many more available to enhance future code with.
    // Add a .system message in the future for prompt engineering.
    let inferenceParameters = ChatCompletionParameters(
        messages: [.init(role: .user, content: .text(prompt))],
        model: .custom(modelName)
    )

    As shown in the code, two parameters are required:

    • The text prompt that the user entered to submit for the inference request.
    • The model name associated with the server definition. The SwiftOpenAI source code for the Model enumeration provides multiple pre-defined options for model names; however, the sample code uses the .custom option, which allows a text value for the model name to be provided.
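
    One of the comments in the code above points to an easy extension: adding a .system message for prompt engineering. Here is a sketch of what that could look like using the same ChatCompletionParameters initializer; the system prompt text is just an example.

    // Sketch: prepend a .system message for prompt engineering.
    // The system prompt text below is an arbitrary example.
    let inferenceParameters = ChatCompletionParameters(
        messages: [
            .init(role: .system, content: .text("You are a concise, helpful assistant.")),
            .init(role: .user, content: .text(prompt))
        ],
        model: .custom(modelName)
    )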

    Now that the service to connect to is defined and the parameters are set, the call to the vLLM server can be made using the following code:

    // Start the stream
    do {
        llmResponse = "" // need to set the variable that holds the response to empty before inference call is made
        let stream = try await inferenceService.startStreamedChat(parameters: inferenceParameters)
        for try await result in stream {
            let content = result.choices?.first?.delta?.content ?? ""
            // Directly set the llmResponse in the for loop since the class is observable and we want updates
            self.llmResponse += content
        }
    }
    catch APIError.responseUnsuccessful(let description, let statusCode) {
        throw swiftOpenAIInferenceError.apiError(statusCode: statusCode, description: description)
    }
    catch {
        throw swiftOpenAIInferenceError.apiError(statusCode: -999, description: "Unexpected error: \(error.localizedDescription)")
    }

    Before each call to the vLLM server is made, the llmResponse variable used to store the inference response is cleared. Next, a streaming response request is made using Swift concurrency. Since the code doesn’t currently allow the user to cancel an inference request, wrapping the stream in a task isn’t required. (Save that for a future code feature!)
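
    If cancellation is added later, a common pattern is to hold the streaming work in a Task that the UI can cancel. The following rough sketch shows how that might look inside the view model; the inferenceTask property is hypothetical, and it assumes inferenceService and inferenceParameters are available as properties rather than locals.

    // Hypothetical: keep a handle to the streaming work so the UI can cancel it.
    private var inferenceTask: Task<Void, Never>?

    func startInference() {
        inferenceTask = Task {
            do {
                let stream = try await inferenceService.startStreamedChat(parameters: inferenceParameters)
                for try await result in stream {
                    try Task.checkCancellation() // stop promptly if the user cancelled
                    llmResponse += result.choices?.first?.delta?.content ?? ""
                }
            } catch is CancellationError {
                // The user cancelled; keep whatever partial response has streamed in.
            } catch {
                llmResponse = "Error: \(error.localizedDescription)"
            }
        }
    }

    func cancelInference() {
        inferenceTask?.cancel()
    }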

    Once the streaming request is made, response chunks are received in the for loop, with each chunk of the response added to the llmResponse variable. Since the class is marked @Observable, the SwiftUI view in SwiftOpenAIChatView.swift can observe the llmResponse variable and update the user interface as inference responses are received.
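
    The observation wiring itself is minimal. The following simplified sketch shows the pattern; the type and property names are illustrative rather than the exact declarations from the repository.

    import SwiftUI
    import Observation

    // Illustrative view model: because it is marked @Observable, any SwiftUI view
    // that reads llmResponse is re-rendered whenever a new chunk is appended.
    @Observable
    final class StreamingChatViewModel {
        var llmResponse: String = ""
    }

    // Illustrative view: displays the response as it streams in.
    struct StreamingChatView: View {
        let viewModel: StreamingChatViewModel

        var body: some View {
            ScrollView {
                Text(viewModel.llmResponse)
                    .frame(maxWidth: .infinity, alignment: .leading)
                    .padding()
            }
        }
    }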

    Finally, should an error occur in a vLLM request, the error is captured in the sendMessage method (not shown) and the llmResponse value is set to the error text. This isn’t a rich error notification or recovery implementation, but it does give the user feedback that an error occurred, along with insight into its cause.

    That’s it! That’s all the code needed to make a vLLM inference call using SwiftOpenAI, and usage was fairly easy. While the project documentation is somewhat concise, it covers all of the package’s capabilities. For a few details not covered in the project documentation, I referred to the developer documentation generated from the source code, and to the source code itself. This was straightforward, and I would recommend using SwiftOpenAI despite this minor point.

    More SwiftOpenAI functionality is available

    The current sample code only demonstrates a basic text inference request, but the SwiftOpenAI project offers far more functionality as defined in the OpenAI API specification.

    MacPaw/OpenAI implementation

    MacPaw/OpenAI is a popular OpenAI API Swift implementation with over 2,500 GitHub stars and 63 active contributors, and it is available under the permissive MIT license.

    The project supports major OpenAI endpoints, including the chat completions endpoint used in the sample code. To implement nonblocking code, developers can choose between closures, Apple's Combine framework, or Swift structured concurrency when using the API. I used the structured concurrency option, as it is the most current and “modern” approach to concurrency when developing in Swift. Also, with Swift 6 usage becoming more mainstream, structured concurrency is a must, since that language version makes strict concurrency checking part of the build process.

    The code discussed below is from the MacPawOpenAIChatViewModel.swift file.

    The initial challenge in using the API stemmed from limited project documentation and my current level of skill developing with Swift. The project's GitHub documentation provides a very basic framework for implementing chat streaming using closures, Combine, and structured concurrency.

    Given the scarcity of project documentation, I explored the developer documentation generated via Xcode. I was pleased to discover DocC comments within the code, enabling documentation generation. Initially, the OpenAI.chatsStream method documentation seemed to only cover closure usage, while I sought information on structured concurrency. However, a closer examination revealed documentation for the OpenAIAsynch implementations of the same method, which was great.

    func chatsStream(query: ChatQuery) -> AsyncThrowingStream<ChatStreamResult, Error>

    The snippet above shows the method signature for the async variant of the call, which is what we need in order to use structured concurrency.
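
    The chatsStream call shown below needs an OpenAI client and a ChatQuery to work with. Here is a rough sketch of that setup; the initializer labels have changed between MacPaw/OpenAI releases, so treat the parameter names here as assumptions and verify them against the generated documentation for the version your build resolves.

    // Rough sketch only: verify these initializer labels against the documentation
    // for the MacPaw/OpenAI version resolved in your build.
    let configuration = OpenAI.Configuration(
        token: apiKey ?? "",  // vLLM may not require a key; an empty token is assumed to be acceptable
        host: "localhost",    // placeholder host name for a local vLLM server
        port: 8000,           // vLLM's default OpenAI-compatible port
        scheme: "http"
    )
    let openAI = OpenAI(configuration: configuration)

    // A single user message; the message-construction syntax is also version dependent.
    let chatQuery = ChatQuery(
        messages: [.user(.init(content: .string(prompt)))],
        model: modelName
    )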

    Since chatsStream returns an AsyncThrowingStream, the sample code shown below uses for-await-in syntax to process the chunks of streaming responses from the vLLM server, and catch blocks to handle any errors that are thrown.

    llmResponse = "" // need to set the variable that holds the response to empty before inference call is made
      // use structured concurrency to make the inference request and process results
      do {
        for try await result in openAI.chatsStream(query: chatQuery) {
          // we need to check the result for errors first that are sent back as
          let content = result.choices.first?.delta.content ?? ""
          // Directly set the llmResponse in the for loop since the class is observable and we want updates
          self.llmResponse += content
          let finishReason = result.choices.first?.finishReason
          switch finishReason {
           case .stop:
            break // **TODO: implement any code for last chunk returned**
           case .length:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI cumulative tokens exceed max\_tokens")
            // **TODO: implement any code if the cumulative tokens exceed max\_tokens**
           case .contentFilter:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI safety filters were triggered")
            // **TODO: code for when OpenAI safety filters are triggered**
           case .error:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI Error: check prompt")
            // **TODO: find where error details are available**
           case .functionCall:
            break // **TODO: code for handling function calls**
           case .toolCalls:
            break // **TODO: code for handling tool calls**
           case .none:
            throw macPawOpenAIInferenceError.apiError(statusCode: -1, description: "OpenAI Error: response is incomplete or interrupted")
            // **TODO: implement code for cases where the response is incomplete or interrupted.**
          }
        }
      }
      catch let error as OpenAIError {
       let errorText = error.localizedDescription
       switch errorText {
        case let errorCheck where errorCheck.contains("400"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 400, description: "Bad request: Check model name")
        case let errorCheck where errorCheck.contains("401"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 401, description: "Invalid API key")
        case let errorCheck where errorCheck.contains("429"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 429, description: "Rate limit of quota exceeded")
        case let errorCheck where errorCheck.contains("503"):
         throw macPawOpenAIInferenceError.apiError(statusCode: 429, description: "The engine is currently overloaded, please try again later")
        default:
         throw macPawOpenAIInferenceError.apiError(statusCode: -999, description: " Unkown API error: \\(errorText)")
       }
      }
      catch {
       // Print error type and details when debugging
       //   print("Error type: \\(type(of: error))")
       //   print("API error description: \\(error)")
       let errorText = error.localizedDescription
       switch errorText {
        case let errorCheck where errorCheck.contains("-1004"):
         throw macPawOpenAIInferenceError.apiError(statusCode: -1004, description: "Could not connect to the server")
        default:
         throw macPawOpenAIInferenceError.apiError(statusCode: -999, description: " \\(errorText)")
       }
      }

    Any low-level errors will cause an error to be thrown and caught in a catch block. This allows the loop to simply append each chunk value to the existing llmResponse contents.

    Logical errors reported by vLLM are detected inside the do block, while critical error handling is performed in the catch blocks. I find it good practice to create a catch block for each type of error that can be thrown; as shown in the code, one catch block is specific to OpenAIError, and a default catch block handles any unexpected errors.

    That’s it for the code needed to make a vLLM inference using MacPaw/OpenAI. The process was relatively straightforward after navigating the Swift concurrency documentation. While the extensive configuration options offer granular control, they could potentially be simplified into a URL string.

    Initially, the bridging functions used to adapt the closure-based asynchronous APIs in the project to Swift's concurrency model (returning an AsyncThrowingStream) raised performance concerns. However, research suggests any performance impact from this approach is minimal.
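
    That bridging pattern is a standard Swift concurrency technique: wrap the closure-based callbacks in an AsyncThrowingStream continuation. The generic sketch below illustrates the idea; legacyStream is a hypothetical callback-based function, not MacPaw/OpenAI’s internal code.

    // Generic sketch of bridging a closure-based streaming API into Swift concurrency.
    // `legacyStream` is a hypothetical callback-based function, not MacPaw's internals.
    func legacyStream(onChunk: @escaping (String) -> Void,
                      onCompletion: @escaping (Error?) -> Void) { /* calls back as data arrives */ }

    func chunks() -> AsyncThrowingStream<String, Error> {
        AsyncThrowingStream<String, Error> { continuation in
            legacyStream(
                onChunk: { chunk in
                    continuation.yield(chunk)            // forward each chunk to the async stream
                },
                onCompletion: { error in
                    continuation.finish(throwing: error) // a nil error finishes the stream normally
                }
            )
        }
    }

    Because the continuation only forwards values that the callbacks already produce, the overhead is small, which is consistent with the minimal performance impact noted above.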

    In conclusion, I would confidently use the MacPaw/OpenAI project again, and hope its documentation continues to evolve.

    More MacPaw/OpenAI functionality is available

    The current sample code only demonstrates a basic text inference request, but the MacPaw/OpenAI project offers far more functionality as defined in the OpenAI API specification.

    Conclusion

    This article explored integrating vLLM inference into macOS and iOS applications using SwiftOpenAI and MacPaw/OpenAI. Both libraries provide effective wrappers for the OpenAI API, simplifying the process of communicating with a vLLM server. SwiftOpenAI offers a straightforward setup with clear documentation, making it easy to implement basic text inference requests. MacPaw/OpenAI, despite some documentation limitations, provides robust functionality and excellent error handling with Swift's structured concurrency, albeit with a slightly steeper learning curve.

    While the sample code focused on basic text inference, both projects offer extensive capabilities aligned with the OpenAI API specification, allowing for future expansion and more complex interactions with vLLM. Apple developers can leverage these open source libraries to accelerate the development of vLLM-powered applications on Apple platforms, balancing simplicity with flexibility depending on their specific needs.

    Ready to build your own vLLM-powered macOS and iOS applications?

    Clone the GitHub repository with the sample code now and start experimenting with SwiftOpenAI and MacPaw/OpenAI to integrate powerful AI inference into your projects. Explore the provided sample code and build instructions, and consult the developer documentation to get started. When running the code, be sure to connect to an existing vLLM instance to get started quickly. If you need help setting up vLLM for the first time, take a look at the quickstart documentation for the project.

    Now go have some coding fun!
