So I had a little time on my hands a couple weeks ago and decided to explore how to use eBPF to monitor container OOM kills.
Why?
Out Of Memory (OOM) kills are an early sign of saturation in Linux systems. They happen when a machine runs out of available memory and the kernel starts killing processes until it can breathe again.
In the context of containers, with things such as Kubernetes or Docker, OOMs happen when a container's memory usage exceeds its memory limit. The offending container processes get selected and OOM killed: in essence, they receive a SIGKILL signal and any ongoing task is unexpectedly terminated.
OOM kills can lead to unexpected service degradation and prompt a mild "How did that happen?" (or "wtf just happened", depending on the situation) from whoever is investigating the crime scene. The tricky part is that most of the evidence is lost to the void, since the offending process is gone. So yeah, back to the roots...
How do OOM kills happen in containers?
Container resource limits are enforced through cgroups. Neither Kubernetes nor Docker has any control whatsoever over which process gets killed. It is the underlying Linux kernel on the host that enforces the resource boundaries.
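If you want to see those boundaries for yourself, the cgroup files the kernel enforces are readable from the host. Here is a minimal sketch, assuming cgroup v2 and Docker's cgroupfs layout; adjust the path to your setup (cgroup v1 uses memory.limit_in_bytes instead of memory.max):

#!/usr/bin/env python3
# Minimal sketch: read the memory limit and current usage the kernel enforces
# for a container's cgroup. The path assumes Docker with the cgroupfs driver;
# with the systemd driver it looks like
# /sys/fs/cgroup/system.slice/docker-<CONTAINER_ID>.scope instead.
from pathlib import Path

cgroup = Path("/sys/fs/cgroup/docker/CONTAINER_ID")  # replace CONTAINER_ID

print("memory.max     =", (cgroup / "memory.max").read_text().strip())      # hard limit in bytes, or "max"
print("memory.current =", (cgroup / "memory.current").read_text().strip())  # current usage in bytes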
Journal kernel logs displaying a memory cgroup 'out of memory' message:
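The exact wording varies by kernel version, but such a line in journalctl -k (or dmesg) looks roughly like this (values illustrative):

kernel: Memory cgroup out of memory: Killed process 1516777 (python3) total-vm:265844kB, anon-rss:127048kB, file-rss:1704kB, shmem-rss:0kB, UID:0 pgtables:308kB oom_score_adj:0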
Excessive container memory usage, and hence OOM kills, can usually be attributed to:
1. Under-provisioned container memory
2. Faulty container application behavior
3. A bit of both 1. and 2.
Picture an unoptimized process that needs a lot of memory to handle inbound requests: it is naturally prone to OOM kills. Now imagine what happens at peak traffic when one container goes down and its load gets redistributed to the remaining containers while it restarts. That's a recipe for cascading failures.
If the process chosen by the kernel is the container's main process, that process exits, which stops the container and typically triggers a container restart. Restarting a container takes time, time not spent running its workload.
But the OOM killer could instead select a child of the container's main process: the container itself then outlives the OOM kill, and new child processes might eventually respawn. One could argue that this is even worse, because the symptom is much less obvious.
Unless you already know why the OOMs happen and are fine with it (e.g. you deliberately rely on OOM kills to restart containers), in most cases you really want to be able to tell what was happening in your system at the time of the OOM kills.
ℹ️ cAdvisor metrics can be used to track container memory usage and OOM kills to some extent. In this article we focus on exposing the Linux kernel's internal memory management stats instead.
That's where eBPF comes in.
eBPF enters the room
The extended Berkeley Packet Filter (eBPF) allows you to run sandboxed probe programs without changing the Linux kernel source code or loading kernel modules. Think of it as a way to hook into kernel functions and execute custom code. This opens up a plethora of possibilities, among which is troubleshooting OOM kills.
There are two main eBPF toolkits you can use: BCC and bpftrace. Both provide an 'oomkill' probe example: tools/oomkill.py for BCC and tools/oomkill.bt for bpftrace. As a side note, both were written by Brendan Gregg.
I found the bpftrace syntax more concise and easier to understand, so I tried running it locally. Here is an example output:
❯ sudo /usr/sbin/oomkill.bt
Attaching 2 probes...
Tracing oom_kill_process()... Hit Ctrl-C to end.
19:44:13 Triggered by PID 1516777 ("python3"), OOM kill of PID 1516777 ("python3"), 32768 pages, loadavg: 0.72 0.64 0.71 3/2230 1517322
The above shows the process PID, the total page count and the CPU loadavg, from the host's perspective. It doesn't give much container context; let's see if we can change that.
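As a quick sanity check, assuming the usual 4 KiB page size, the 32768 pages above translate to 128 MiB:

#!/usr/bin/env python3
# Convert the probe's page count into MiB, assuming 4 KiB pages.
PAGE_SIZE = 4096                       # bytes; check with os.sysconf("SC_PAGE_SIZE")
totalpages = 32768                     # from the oomkill.bt output above

print(totalpages * PAGE_SIZE / 2**20)  # -> 128.0 MiB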
Here is the original bpftrace oomkill probe code:
#ifndef BPFTRACE_HAVE_BTF
#include <linux/oom.h>
#endif
BEGIN
{
printf("Tracing oom_kill_process()... Hit Ctrl-C to end.\n");
}
kprobe:oom_kill_process
{
$oc = (struct oom_control *)arg0;
time("%H:%M:%S ");
printf("Triggered by PID %d (\"%s\"), ", pid, comm);
printf("OOM kill of PID %d (\"%s\"), %d pages, loadavg: ",
$oc->chosen->pid, $oc->chosen->comm, $oc->totalpages);
cat("/proc/loadavg");
}
Let's make sense of it, starting with this line:
kprobe:oom_kill_process
It defines a kernel probe (kprobe) that will hook onto the oom_kill_process() function. Makes sense.
Now let's look at the following line:
$oc = (struct oom_control *)arg0;
It takes the function's first argument, arg0, and casts it to a pointer to an oom_control struct. But wait, how do we know that oom_kill_process() receives an oom_control struct as its first argument? Well, let's take a look at the Linux kernel source code.
Short dive into the Linux kernel
Go to the torvalds/linux git repo and search for the oom_kill_process() function declaration, like this.
From mm/oom_kill.c you can see:
static void oom_kill_process(struct oom_control *oc, const char *message)
Not only have we confirmed where the oom_control argument comes from, but we now know that oom_kill_process() receives a second string argument, message.
Next, what's inside an oom_control struct? Might there be anything else useful in there? Let's have a look.
You know the drill: search for oom_control in the source code, like this.
From include/linux/oom.h you can see:
/*
 * Details of the page allocation that triggered the oom killer that are used to
 * determine what should be killed.
 */
struct oom_control {
    /* Used to determine cpuset */
    struct zonelist *zonelist;

    /* Used to determine mempolicy */
    nodemask_t *nodemask;

    /* Memory cgroup in which oom is invoked, or NULL for global oom */
    struct mem_cgroup *memcg;

    /* Used to determine cpuset and node locality requirement */
    const gfp_t gfp_mask;

    /*
     * order == -1 means the oom kill is required by sysrq, otherwise only
     * for display purposes.
     */
    const int order;

    /* Used by oom implementation, do not set */
    unsigned long totalpages;
    struct task_struct *chosen;
    long chosen_points;

    /* Used to print the constraint info. */
    enum oom_constraint constraint;
};
There's some interesting stuff in there. First of all, we now understand where the $oc->totalpages in the original oomkill probe's code comes from. And we can see a few other promising fields such as memcg, chosen and chosen_points. Let's unpack these in more detail.
- memcg is a mem_cgroup object that holds the container's cgroup memory usage and limits.
- chosen is a task_struct object that holds the chosen process's properties, among which you can find mm and nsproxy, which respectively hold the memory management information and the container's namespaces.
- chosen_points comes from the oom_badness() score calculation used by the OOM killer to select which process to kill.
The formula for the OOM badness score calculation ends up looking something like this:
chosen_points =
    mm_filepages +
    mm_anonpages +
    mm_shmempages +
    mm_swapents +
    mm_pgtables_bytes / PAGE_SIZE +
    oom_score_adj * totalpages / 1000
ℹ️ PAGE_SIZE is a constant that may vary by system but is generally 4096 bytes (4 KiB).
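To check the page size on your own machine, Python exposes it via os.sysconf:

import os

# Page size in bytes as reported by the kernel, typically 4096 on x86_64.
print(os.sysconf("SC_PAGE_SIZE"))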
The process resident set size (RSS) is the sum of mm_filepages, mm_anonpages and mm_shmempages, while mm_swapents counts the process pages that were swapped out to disk.
In essence, the score sums the process's RSS memory footprint, counted in pages, plus its swapped-out pages, plus some page table space reserved by the kernel, plus an adjustable per-process factor that can be set by the user. The main takeaway is that the process with the highest RSS, hence the highest number of OOM badness points, is the most likely to get OOM killed. And from the $oc->chosen->mm->rss_stat object you can access the various components of that process's RSS.
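To make that concrete, here is a small Python sketch of the badness calculation, mirroring the formula above with the counters our probe will print. It's an approximation for intuition, not the kernel's exact code path, and the example values are made up:

#!/usr/bin/env python3
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")  # usually 4096 bytes

def oom_badness(filepages, anonpages, shmempages, swapents,
                pgtables_bytes, oom_score_adj, totalpages):
    # RSS in pages, plus swapped-out pages and page table overhead.
    points = filepages + anonpages + shmempages
    points += swapents + pgtables_bytes // PAGE_SIZE
    # User-controlled adjustment, scaled by the cgroup's total pages.
    points += oom_score_adj * totalpages // 1000
    return points  # the task with the highest score gets killed first

# Made-up example: a process holding ~120 MiB of anonymous memory
# inside a 128 MiB cgroup (32768 pages of 4 KiB).
print(oom_badness(filepages=500, anonpages=30720, shmempages=0, swapents=0,
                  pgtables_bytes=300 * 1024, oom_score_adj=0, totalpages=32768))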
As these are the main variables that influence the OOM killer's process selection, let's see if we can include them in the output of our probe.
A revised oomkill probe, for containers
Let's mkdir container-oomkill-probe && cd container-oomkill-probe.
Here is the revised bpftrace script; let's name it container_oomkill.bt:
#!/usr/bin/env bpftrace
/*
* container_oomkill Trace OOM killer in containers.
* For Linux, uses bpftrace and eBPF.
*
* This traces the kernel out-of-memory killer by using kernel dynamic tracing of oom_kill_process().
* Prints the process host pid, container id, cgroup path, command and a few other stats.
* Note: There's no guarantee that the OOM killed process is within a "container", this script just assumes it is.
*
* Example of usage:
*
* # ./container_oomkill.bt
* Tracing oom_kill_process()... Ctrl-C to end.
*
* Adapted from the original bpftrace's tools/oomkill.bt by Brendan Gregg:
* -> http://github.com/bpftrace/bpftrace/blob/master/tools/oomkill.bt
*/
#ifndef BPFTRACE_HAVE_BTF
#include <linux/oom.h>
#endif
BEGIN
{
printf("Tracing oom_kill_process()... Hit Ctrl-C to end.\n");
}
// fn: static void oom_kill_process(struct oom_control *oc, const char *message)
// http://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L1017
kprobe:oom_kill_process
{
$oc = (struct oom_control *)arg0;
$message = str(arg1);
// print datetime with milliseconds precision
printf("%s", strftime("%Y-%m-%d %H:%M:%S", nsecs));
printf(",%03d", (nsecs % 1000000000) / 1000000);
// print labels
printf(" probe=\"%s\"\n",
probe);
printf(" message=\"%s\"\n",
$message);
printf(" host_pid=\"%d\" container_id=\"%s\" command=\"%s\"\n",
$oc->chosen->pid,
$oc->chosen->nsproxy->uts_ns->name.nodename,
$oc->chosen->comm);
// oom_control stats
printf(" oc_totalpages=\"%d\" oc_chosen_points=\"%d\"\n",
$oc->totalpages, // = mem + swap
$oc->chosen_points); // = filepages + anonpages + swapents + shmempages + pgtables_bytes / PAGE_SIZE + oom_score_adj * totalpages / 1000
// cgroup stats
printf(" memcg_memory_usage_pages=\"%d\" memcg_memory_max_pages=\"%d\" memcg_memory_low_pages=\"%d\"\n",
$oc->memcg->memory.usage.counter, // memory usage in pages
$oc->memcg->memory.max, // memory hard limit
$oc->memcg->memory.low); // memory request
printf(" memcg_swap_current_pages=\"%d\" memcg_swap_max_pages=\"%d\" memcg_swappiness=\"%d\"\n",
$oc->memcg->swap.usage.counter, // swap usage in pages
$oc->memcg->swap.max, // swap hard limit
$oc->memcg->swappiness);
// stats used in OOM badness calculation
printf(" mm_rss_filepages=\"%d\" mm_rss_anonpages=\"%d\" mm_rss_swapents=\"%d\" mm_rss_shmempages=\"%d\"\n",
$oc->chosen->mm->rss_stat[0].count,
$oc->chosen->mm->rss_stat[1].count,
$oc->chosen->mm->rss_stat[2].count,
$oc->chosen->mm->rss_stat[3].count);
// in case you get hit by
// "ERROR: The array index operator [] can only be used on arrays and pointers, found record..."
// "ERROR: Can not access field 'count' on expression of type 'none'..."
// prior to linux 6.2 $oc->chosen->mm->rss_stat is a mm_rss_stat struct
// http://github.com/torvalds/linux/commit/f1a7941243c102a44e8847e3b94ff4ff3ec56f25#diff-dc57f7b72015cf5f95444ec4f8a60f85d773f40b96ac59bf55b281cd63c06142
// you can use the version below instead
//printf(" mm_rss_filepages=\"%d\" mm_rss_anonpages=\"%d\" mm_rss_swapents=\"%d\" mm_rss_shmempages=\"%d\"\n",
// $oc->chosen->mm->rss_stat.count[0].counter,
// $oc->chosen->mm->rss_stat.count[1].counter,
// $oc->chosen->mm->rss_stat.count[2].counter,
// $oc->chosen->mm->rss_stat.count[3].counter);
printf(" mm_pgtables_bytes=\"%d\"\n",
$oc->chosen->mm->pgtables_bytes.counter);
printf(" proc_oom_score_adj=\"%d\"\n",
$oc->chosen->signal->oom_score_adj); // score adj used in oom_badness calculation
// minor and major page faults
printf(" proc_min_flt=\"%d\" proc_maj_flt=\"%d\"\n",
$oc->chosen->min_flt, // minor page faults
$oc->chosen->maj_flt); // major page faults
// calculated stats
printf(" uptime_ms=\"%lld\"\n",
(nsecs - $oc->chosen->start_time) / 1000000);
}
A few noteworthy bits:
- $oc->chosen->nsproxy->uts_ns->name.nodename gets the container id (the UTS nodename, i.e. the hostname, which Docker sets to the container id by default)
- $oc->memcg->memory contains the cgroup memory usage and limits; $oc->memcg->memory.max is what you would call the "resource limit" in Kubernetes, while $oc->memcg->memory.low would be the "resource request"
- $oc->memcg->swap contains the cgroup swap usage and limits, which might be relevant as we saw that swapents is one of the factors that influence the OOM killer's selection
- $oc->chosen->mm->rss_stat contains the process RSS stats, ordered as defined in the kernel's MM counters enum (the one terminated by NR_MM_COUNTERS): MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMEMPAGES
- $oc->chosen->signal->oom_score_adj and $oc->chosen->mm->pgtables_bytes.counter are among the variables we saw being used in the OOM badness calculation, so we display them as well
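One nice side effect of the key="value" output format is that it is trivial to post-process. As a minimal sketch (the script name and the forwarding destination are up to you), here is a small Python filter that turns each OOM event into a JSON record, e.g. to ship it to a log pipeline:

#!/usr/bin/env python3
# Minimal sketch: read the probe's output from stdin, collect the key="value"
# pairs of each OOM kill event and emit one JSON object per event.
import json
import re
import sys

PAIR = re.compile(r'(\w+)="([^"]*)"')

event = {}
for line in sys.stdin:
    pairs = dict(PAIR.findall(line))  # the timestamp prefix is simply ignored
    if not pairs:
        continue                      # e.g. the "Attaching probes..." banner
    event.update(pairs)
    if "uptime_ms" in pairs:          # last field the probe prints per event
        print(json.dumps(event), flush=True)
        event = {}

You could run it as bpftrace container_oomkill.bt | python3 oomkill_to_json.py (the file name here is just an example).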
Now let's write a Dockerfile to build an image with bpftrace installed and the updated container_oomkill.bt, so that the probe runs from within a container:
FROM debian:stable-slim
RUN apt-get update && \
apt-get install -y \
linux-headers-generic \
bpftrace \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY *.bt /app/
CMD ["bpftrace", "container_oomkill.bt"]
Build command:
docker build -t container-oomkill-probe .
Run command:
docker run --privileged --pid=host -v /sys:/sys:ro container-oomkill-probe
You should see something like this:
❯ docker run --privileged --pid=host -v /sys:/sys:ro container-oomkill-probe
Attaching 2 probes...
Tracing oom_kill_process()... Hit Ctrl-C to end.
Looks fine and all, but how do we trigger the probe?
Test or it didn't happen
Here is a simple Python script that hogs memory in 4 MiB increments:
#!/usr/bin/env python3
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")

i = 0
memory = []
while True:
    i += 1
    logging.info(f"Allocating {i * 4} MiB...")
    memory.append(bytearray(4 * 1024 * 1024))
    time.sleep(0.1)
Save it in a new stress-mem/ folder as stress-mem/main.py.
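Your project layout should now look roughly like this (the docker-compose.yml comes next):

container-oomkill-probe/
├── Dockerfile
├── container_oomkill.bt
├── docker-compose.yml
└── stress-mem/
    └── main.py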
For testing purposes, you can run the script in a 128 MiB memory-constrained Docker container, with the container oomkill probe running alongside it.
docker-compose.yml file:
services:
  probe:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - /sys:/sys:ro # required to access /sys/fs/cgroup/docker/CONTAINER_ID
    privileged: true # required to run bpftrace, as an alternative could try setting CAP_BPF, CAP_PERFMON
    pid: host # required to trace processes from host
  stress-mem:
    image: python:3-slim
    volumes:
      - ./stress-mem:/app
    working_dir: /app
    command: python3 main.py
    restart: on-failure
    deploy:
      replicas: 1
      resources:
        limits:
          memory: 128MiB
        reservations:
          memory: 64MiB
Run command:
docker compose up --build
But wait, if we defined a 128MiB limit, how come our stress-mem container is able to allocate up to 252MiB?
Surely our probe is completely wrong and we've wasted our valuable time on something useless. Unless...
Notice the mm_rss_swapents="32549" in the probe's output? Well, that is swap, and swap isn't disabled by default in Docker. Memory gets "swapped to disk" when memory pressure forces the kernel to do so. You can confirm that the system is struggling by looking at the proc_maj_flt count: major page faults happen when the system has to read pages back from disk instead of RAM.
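This also explains the 252 MiB: when a memory limit is set but memory-swap is not, Docker defaults the combined memory + swap ceiling to twice the memory limit. A quick back-of-the-envelope check with our 4 MiB increments (ignoring the interpreter's own overhead):

# Why the script reached ~252 MiB before being OOM killed.
MIB = 1024 * 1024
mem_limit = 128 * MIB           # the compose file's memory limit
memswap_limit = 2 * mem_limit   # Docker default when memory-swap is left unset
step = 4 * MIB                  # our stress script's allocation increment

max_steps = memswap_limit // step     # 64 increments would hit the ceiling
print((max_steps - 1) * step // MIB)  # -> 252, the last successful allocation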
Do we want swap? No we don't, so let's disable it. In our docker-compose.yml we can do that by setting memswap_limit equal to the memory limit, which is admittedly a bit confusing.
  stress-mem:
    # ...
    memswap_limit: 128MiB # disable swap by setting memswap_limit = mem_limit
Rerun the docker compose up --build command and check the probe's output again.
Well, that's much better: no major page faults, no swap, and the process got OOM killed at 124MiB, the last 4MiB increment before it could reach the 128MiB limit. Perfect, our probe works, yay!
Wrapping up
Taking a step back, here is what we did with our new container oomkill probe:
- Show container-contextualized information from the process's memory cgroup and namespaces
- Expose the kernel's internal memory management stats, providing additional insight into why a process might have been OOM killed
It's not much, but you are now fully equipped with the tools and knowledge to dig further, build your own eBPF probes, and understand how the Linux kernel works.
References and further reading
- Linux Extended BPF (eBPF) Tracing Tools by Brendan Gregg
- Memory Management in Linux - Concepts overview (kernel docs)
- Chapter 3 - Memory Management (LDP)
- cgroups v2 (kernel docs)
- man namespaces
TL;DR:
glhf 😉
Top comments (4)
Cool stuff. Thanks for sharing.
Will try it out myself for sure
I can probably figure it out myself, but how do you calculate the 252 or 124 MiB?
Given i the iteration counter and 4 * 1024 * 1024 the size of our byte array increments, the total allocated after i iterations is i * 4 * 1024 * 1024 bytes. As 1 MiB = 1024 KiB = 1024 * 1024 bytes, that's i * 4 MiB. Hence the code: logging.info(f"Allocating {i * 4} MiB...").
4 KiB btw is the page size on most systems, hence why I originally chose 4 * 1024, but the OOMs weren't happening fast enough so I went for 4 MiB increments x)
Awesome thanks
I was actually wondering how you figured out from the bpftrace script's output that the OOM happened at 124MiB, but I played a bit with it and then asked Copilot and it gave me the answer. Again related to the famous 4KiB "usual" page size.
Btw I tried to replicate your Python script in Rust but had a hard time actually getting the memory allocated due to some optimizations. I am also new to Rust.
Also, as you might be aware, it is possible to run the bpftrace script without a container. But running it in a container might be more portable.
Ah gotcha! What you want to look at mostly are the mm_rss_* stats, which give you the total resident set size. mm_rss_anonpages in particular typically gives you the stack + heap space allocated to a running program.
You can indeed run bpftrace directly, if you are on Linux x)
Running it from a container might also be convenient for running it ad hoc on a Kubernetes cluster when conducting an investigation, for example, instead of having to exec into the node.