Commit 6c3b9a7

DRA: device taints and tolerations
Initial documentation of this new feature in Kubernetes 1.33.
1 parent 9e160e8 commit 6c3b9a7

3 files changed

+127
-1
lines changed

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

Lines changed: 101 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -71,9 +71,13 @@ DeviceClass
7171
in a ResourceClaim must reference exactly one DeviceClass.
7272

7373
ResourceSlice
74-
: Used by DRA drivers to publish information about resources
74+
: Used by DRA drivers to publish information about resources (typically devices)
7575
that are available in the cluster.
7676

77+
DeviceTaintRule
78+
: Used by admins or control plane components to add device taints
79+
to the devices described in ResourceSlices.
80+
7781
All parameters that select devices are defined in the ResourceClaim and
7882
DeviceClass with in-tree types. Configuration parameters can be embedded there.
7983
Which configuration parameters are valid depends on the DRA driver -- Kubernetes
@@ -357,6 +361,95 @@ spec:
357361
value: 6Gi
358362
```
359363

364+
## Device taints and tolerations
365+
366+
{{< feature-state feature_gate_name="DRADeviceTaints" >}}
367+
368+
Device taints are similar to node taints: a taint has a string key, a string
369+
value, and an effect. The effect is applied to the ResourceClaim which is
370+
using a tainted device and to all Pods referencing that ResourceClaim.
371+
The "NoSchedule" effect prevents scheduling those Pods.
372+
Tainted devices are ignored when trying to allocate a ResourceClaim
373+
because using them would prevent scheduling of Pods.
374+
375+
The "NoExecute" effect implies "NoSchedule" and in addition causes eviction
376+
of all Pods which have been scheduled already. This eviction is implemented
377+
in the device taint eviction controller in kube-controller-manager by
378+
deleting affected Pods.
379+
380+
ResourceClaims can tolerate taints. If a taint is tolerated, its effect does
381+
not apply. An empty toleration matches all taints. A toleration can be limited to
382+
certain effects and/or match certain key/value pairs. A toleration can check
383+
that a certain key exists, regardless of which value it has, or it can check
384+
for specific values of a key.
385+
For more information on this matching see the
386+
[node taint concepts](/docs/concepts/scheduling-eviction/taint-and-toleration#concepts).
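
The matching rules above can be illustrated with a ResourceClaim that carries two tolerations. This is a minimal sketch, assuming the `resource.k8s.io/v1beta1` API version and a `tolerations` list on each device request that mirrors node tolerations (`key`, `operator`, `value`, `effect`). The claim name, DeviceClass name, and taint keys are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: tolerant-claim               # hypothetical name
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # hypothetical DeviceClass
      tolerations:
      # Tolerate any taint with this key, whatever its value or effect.
      - key: dra.example.com/unhealthy
        operator: Exists
      # Tolerate only NoSchedule taints with exactly this key/value pair.
      - key: dra.example.com/maintenance
        operator: Equal
        value: Planned
        effect: NoSchedule
```

Check the API reference of your cluster version before relying on these field names.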
387+
388+
Eviction can be delayed by tolerating a taint for a certain duration.
389+
That delay starts at the time when a taint gets added to a device, which is recorded in a field
390+
of the taint.
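
A delayed eviction might look like the following sketch. It assumes that the toleration accepts a `tolerationSeconds` field, as node tolerations do, and that the start time is taken from the taint's time field (believed to be `timeAdded`). All names are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: grace-period-claim           # hypothetical name
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # hypothetical DeviceClass
      tolerations:
      # Keep running for up to five minutes after the NoExecute taint
      # was added to the device, then allow eviction.
      - key: dra.example.com/unhealthy
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300
```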
391+
392+
Taints also apply, as described above, to ResourceClaims that allocate "all" devices on a node.
393+
All devices must be untainted or all of their taints must be tolerated.
394+
Allocating a device with admin access (described [above](#admin-access))
395+
is not exempt either. An admin using that mode must explicitly tolerate all taints
396+
to access tainted devices.
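
As a sketch of the "all devices" case, the request below tolerates every taint so that allocation can succeed even when some devices on the node are tainted. It assumes that a toleration with only `operator: Exists` (no key, value, or effect) is the "empty" toleration that matches all taints, as with node tolerations. All names are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: all-devices-claim            # hypothetical name
spec:
  devices:
    requests:
    - name: all-gpus
      deviceClassName: gpu.example.com   # hypothetical DeviceClass
      allocationMode: All
      tolerations:
      # No key, value, or effect: matches every taint (assumption).
      - operator: Exists
```

Tolerating everything is a blunt tool. It also lets Pods use devices that a driver or admin has marked as broken.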
397+
398+
Taints can be added to devices in two different ways:
399+
400+
### Taints set by the driver
401+
402+
A DRA driver can add taints to the device information that it publishes in ResourceSlices.
403+
Consult the documentation of a DRA driver to learn whether the driver uses taints and what
404+
their keys and values are.
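
For illustration only, a ResourceSlice published by a driver with one tainted device might look roughly like this. The sketch assumes the `resource.k8s.io/v1beta1` layout, where device details sit under `basic` and include a `taints` list. The driver name, node name, and taint key are hypothetical, and real drivers create these objects themselves.

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-1-dra-example-com       # hypothetical name
spec:
  driver: dra.example.com
  nodeName: node-1
  pool:
    name: node-1
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      taints:
      # Set by the driver: do not schedule new Pods onto this device.
      - key: dra.example.com/overheating   # hypothetical key
        value: "true"
        effect: NoSchedule
```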
405+
406+
### Taints set by an admin
407+
408+
An admin or a control plane component can taint devices without having to tell
409+
the DRA driver to include taints in its device information in ResourceSlices. They do that by
410+
creating DeviceTaintRules. Each DeviceTaintRule adds one taint to devices which
411+
match the device selector. Without such a selector, no devices are tainted. This
412+
makes it harder to accidentally evict all pods using ResourceClaims when leaving out
413+
the selector by mistake.
414+
415+
Devices can be selected by giving the name of a DeviceClass, driver, pool,
416+
and/or device. The DeviceClass selects all devices that are selected by the
417+
selectors in that DeviceClass. With just the driver name, an admin can taint
418+
all devices managed by that driver, for example while doing some kind of
419+
maintenance of that driver across the entire cluster. Adding a pool name can
420+
limit the taint to a single node, if the driver manages node-local devices.
421+
422+
Finally, adding the device name can select one specific device. The device name
423+
and pool name can also be used alone, if desired. For example, drivers for node-local
424+
devices are encouraged to use the node name as their pool name. Then tainting with
425+
that pool name automatically taints all devices on a node.
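
A rule that taints every device on one node could be sketched as follows, assuming the driver uses the node name as its pool name and that the selector fields are named `driver` and `pool`. The rule name, node name, and taint key are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: node-1-maintenance           # hypothetical name
spec:
  # Taint all devices that dra.example.com publishes in the pool "node-1".
  deviceSelector:
    driver: dra.example.com
    pool: node-1
  taint:
    key: dra.example.com/maintenance # hypothetical key
    value: Planned
    effect: NoSchedule
```

Adding `device: gpu-0` to the selector would, under the same assumptions, narrow the rule to a single device in that pool.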
426+
427+
Drivers might use stable names like "gpu-0" that hide which specific device is
428+
currently assigned to that name. To support tainting a specific hardware
429+
instance, CEL selectors can be used in a DeviceTaintRule to match a vendor-specific
430+
unique ID attribute, if the driver supports one for its hardware.
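
A CEL-based rule might look like the sketch below. It assumes that the device selector accepts a `selectors` list with `cel.expression` entries, as ResourceClaim and DeviceClass selectors do, and that the fictional driver publishes a `uuid` attribute. The attribute name and its value are hypothetical.

```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: faulty-gpu-by-uuid           # hypothetical name
spec:
  deviceSelector:
    driver: dra.example.com
    selectors:
    # Match one hardware instance regardless of its current device name.
    - cel:
        expression: device.attributes["dra.example.com"].uuid == "GPU-1234"
  taint:
    key: dra.example.com/unhealthy
    value: Broken
    effect: NoExecute
```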
431+
432+
The taint applies as long as the DeviceTaintRule exists. It can be modified and
433+
removed at any time. Here is one example of a DeviceTaintRule for a fictional
434+
DRA driver:
435+
436+
```yaml
apiVersion: resource.k8s.io/v1alpha3
kind: DeviceTaintRule
metadata:
  name: example
spec:
  # The entire hardware installation for this
  # particular driver is broken.
  # Evict all pods and don't schedule new ones.
  deviceSelector:
    driver: dra.example.com
  taint:
    key: dra.example.com/unhealthy
    value: Broken
    effect: NoExecute
```
452+
360453
## Enabling dynamic resource allocation
361454

362455
Dynamic resource allocation is a *beta feature* which is off by default and only enabled when the
@@ -426,6 +519,13 @@ and only enabled when the `DRAPartitionableDevices`
426519
[feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
427520
is enabled in the kube-apiserver and kube-scheduler.
428521

522+
### Enabling device taints and tolerations
523+
524+
[Device taints and tolerations](#device-taints-and-tolerations) is an *alpha feature* and only enabled when the
525+
`DRADeviceTaints` [feature gate](/docs/reference/command-line-tools-reference/feature-gates/)
526+
is enabled in the kube-apiserver, kube-controller-manager and kube-scheduler. To use DeviceTaintRules, the
527+
`resource.k8s.io/v1alpha3` API version must be enabled.
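
For a local test cluster, the gates and API versions could be switched on with a kind configuration such as the sketch below. Using kind is an assumption made for illustration. On other installations, the equivalent settings are the `--feature-gates` flag on each listed component and the `--runtime-config` flag on the kube-apiserver.

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  # DRA itself is off by default and must be enabled as well.
  DynamicResourceAllocation: true
  DRADeviceTaints: true
runtimeConfig:
  "resource.k8s.io/v1beta1": "true"
  # Required for DeviceTaintRule objects.
  "resource.k8s.io/v1alpha3": "true"
```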
528+
429529
## {{% heading "whatsnext" %}}
430530

431531
- For more information on the design, see the

content/en/docs/concepts/scheduling-eviction/taint-and-toleration.md

Lines changed: 9 additions & 0 deletions
@@ -322,9 +322,18 @@ tolerations to all daemons, to prevent DaemonSets from breaking.
322322
Adding these tolerations ensures backward compatibility. You can also add
323323
arbitrary tolerations to DaemonSets.
324324

325+
## Device taints and tolerations
326+
327+
Instead of tainting entire nodes, administrators can also [taint individual devices](/docs/concepts/scheduling-eviction/dynamic-resource-allocation#device-taints-and-tolerations)
328+
when the cluster uses [dynamic resource allocation](/docs/concepts/scheduling-eviction/dynamic-resource-allocation)
329+
to manage special hardware. The advantage is that tainting can be targeted towards exactly the hardware that
330+
is faulty or needs maintenance. Tolerations are also supported and can be specified when requesting
331+
devices. Like taints, they apply to all pods which share the same allocated device.
332+
325333
## {{% heading "whatsnext" %}}
326334

327335
* Read about [Node-pressure Eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/)
328336
and how you can configure it
329337
* Read about [Pod Priority](/docs/concepts/scheduling-eviction/pod-priority-preemption/)
338+
* Read about [device taints and tolerations](/docs/concepts/scheduling-eviction/dynamic-resource-allocation#device-taints-and-tolerations)
330339

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
---
title: DRADeviceTaints
content_type: feature_gate
_build:
  list: never
  render: false

stages:
  - stage: alpha
    defaultValue: false
    fromVersion: "1.33"
---
13+
Enables support for
14+
[tainting devices and selectively tolerating those taints](/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations)
15+
when using dynamic resource allocation to manage devices.
16+
17+
This feature gate has no effect unless you also enable the `DynamicResourceAllocation` feature gate.
