Overview
This new feature allows VM disk I/O to be spread across multiple submission threads, which are also mapped to multiple disk queues inside the VM. The combination of these two options allows VMs to utilize both vcpus and host cpus more efficiently during periods of highly threaded I/O load, which can lead to greatly improved performance in many cases.
For more background on how this feature is implemented in KVM, see this article on IOThread Virtqueue Mapping as well as this companion article demonstrating performance improvements for database workloads in VMs running in a RHEL environment.
Feature Details
OpenShift Virtualization 4.19 exposes this feature to VMs by expanding the ioThreadsPolicy option to include a new supplementalPool policy. VM admins can also set supplementalPoolThreadCount to determine how many host threads should be utilized for the VM’s I/O submission. These IOthreads use adaptive polling, so they do not waste cpu cycles when no I/O is being performed; instead, they allow VMs with many vcpus performing I/O to utilize host resources more efficiently rather than bottlenecking behind a single submission thread.
How to Enable
Example YAML definition with this tuning enabled, for a VM with 16 vcpus and 4 IOthreads:
spec:
  domain:
    cpu:
      cores: 1
      sockets: 16
      threads: 1
    memory:
      guest: 16Gi
    ioThreadsPolicy: supplementalPool
    ioThreads:
      supplementalPoolThreadCount: 4
    devices:
      blockMultiQueue: true
      disks:
      - name: rootdisk
        disk:
          bus: virtio
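Assuming this fragment is part of a complete VirtualMachine manifest (saved here as vm-iothreads.yaml, an illustrative name), the VM can be created and started as usual:
oc apply -f vm-iothreads.yaml
virtctl start <vm_name>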
The key tuning options shown above are:
- ioThreadsPolicy
- ioThreads.supplementalPoolThreadCount
- blockMultiQueue
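Once the VM is running, a quick way to confirm these settings on the VirtualMachineInstance (the VM name is a placeholder):
oc get vmi <vm_name> -o jsonpath='{.spec.domain.ioThreadsPolicy}{" "}{.spec.domain.ioThreads.supplementalPoolThreadCount}{" "}{.spec.domain.devices.blockMultiQueue}{"\n"}'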
Tuning Tips
Using the tuning described here, we’ve measured VM storage performance improvements of 2X or more in some microbenchmarks, depending on the I/O workload pattern and the total amount of workload threading. We recommend considering this tuning option for many demanding storage workloads.
- Consider enabling this tuning for VMs with multiple vcpus (i.e. more than 4) where workload I/O is likely to be multi-threaded or multi-process (meaning multiple vcpus are performing I/O), especially when the workload drives a high iodepth.
- Keep in mind that the number of IOthreads may not need to be very high to handle the average workload; we’ve often seen that a supplementalPoolThreadCount of 4 can provide very significant performance improvements. In some cases, 8 or 16 threads could provide slightly higher performance in very fast storage environments.
- IOthreads can be overcommitted by default; however, configuring “full” cpu resources may significantly improve performance in some cases. See the CPU allocation ratio documentation for more information about configuring the default cluster behavior for automatic cpu request values, and see our Tuning and Scaling Guide for more information about explicitly tuning VM CPU resources. A minimal example of setting an explicit cpu request is shown below.
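As referenced above, a minimal sketch of setting an explicit cpu request in the VM definition (the value 20 simply matches the 16 vcpu + 4 IOthread example above; adjust to your own sizing):
spec:
  domain:
    resources:
      requests:
        cpu: "20"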
Deeper Dive
Technical Notes
- The new supplementalPool policy can serve as a replacement for the shared or auto policies (which control single-thread dedicatedIOThread behavior per disk) when the workload environment calls for more threads
- supplementalPool IOthreads are automatically shared across all VM disks; configure enough threads to handle multiple disks if all of them are very active
- The requested supplementalPoolThreadCount value is added to the total vcpu count; this influences how many cpu requests are automatically assigned to the VM (and its virt-launcher pod), as determined by the vmiCPUAllocationRatio (see the command after this list to check the resulting pod request)
  - For example, for a VM with 16 vcpus and a supplementalPoolThreadCount of 4:
    - The default CPU allocation ratio of 10 will configure (16+4) / 10 = 2 cpu requests
    - A CPU allocation ratio set to “1” will configure (16+4) / 1 = 20 cpu requests
  - Note that any explicit cpu request configured in the VM definition will not be overwritten by this default behavior
- A few qualifying conditions for this feature:
  - bus: virtio
    - this is the default type for non-hotplugged disks
    - note: upstream work is in progress for virtio-scsi support
  - blockMultiQueue: true
    - this allows the host-level IOthreads to be automatically mapped to multiple queues inside the guest, so work is properly spread among vcpus as well
    - note: queues are automatically set to the total number of vcpus
  - io: native
    - this IO mode is automatically configured for VM disks when using volumeMode: Block PVCs
    - if using Filesystem mode PVCs, this can be enabled by using preallocation
- Currently, supplementalPool policy should not be used in conjunction with isolateEmulatorThread or dedicatedCpuPlacement (bug tracker: CNV-64201)
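To see how the allocation ratio math above lands on the pod, one way is to check the cpu request of the compute container in the VM’s virt-launcher pod (the pod name is a placeholder, and this assumes the standard “compute” container name used by virt-launcher):
oc get pod virt-launcher-xx-xx -o jsonpath='{.spec.containers[?(@.name=="compute")].resources.requests.cpu}{"\n"}'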
Low Level Feature Details
To see the internals of how the threads are defined per VM disk, you can dump the libvirt XML from the virt-launcher pod, for example:
# oc exec -it -n <namespace> virt-launcher-xx-xx -- bash
bash-5.1$ virsh dumpxml 1
For disks that qualify, you should see both queues defined in the driver section and multiple iothreads listed within the driver element. For example, for 4 threads:
<driver name='qemu' type='raw' cache='none' error_policy='stop' io='native' discard='unmap' queues='$vcpus'>
  <iothreads>
    <iothread id='1'/>
    <iothread id='2'/>
    <iothread id='3'/>
    <iothread id='4'/>
  </iothreads>
</driver>
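You can also list the VM’s IOThreads and their host cpu affinity from the same virt-launcher shell (output details vary by libvirt version):
bash-5.1$ virsh iothreadinfo 1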
You can also check the actual utilization of the IOthreads during periods of multi-threaded I/O by viewing pidstat output for the VM’s qemu-kvm pid from the host worker node, for example:
oc debug node/<host_node>
## note: don’t chroot /host
pidstat -t -p <qemu-kvm pid> 5
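If you need to find the qemu-kvm pid first, one simple approach from the same debug shell is to search the process list (the VM name typically appears in the qemu-kvm command line as part of the guest name):
ps -ef | grep qemu-kvm | grep <vm_name>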
Example output with 4 iothreads ~70% utilized:

Conclusion
As you can see, enabling this feature for OpenShift VMs can significantly improve storage performance for multi-threaded workloads. Consider the guidelines above and evaluate whether this tuning option can give your VM storage performance a big boost!
Keep an eye out for more examples of how this feature can improve VM performance in future Performance and Scale Engineering Blogs!