So I had a little time on my hands a couple weeks ago and decided to explore how to use eBPF to monitor container OOM kills.
Why?
Out Of Memory (OOM) kills are an early sign of saturation in Linux systems. They happen when a machine runs out of available memory and the kernel starts killing processes until it can breathe again.
In the context of containers, with things such as Kubernetes or Docker, OOMs happen when a container's memory usage exceeds its memory limit. The offending container processes get selected and OOM killed: in essence, they receive a SIGKILL signal and any ongoing task is unexpectedly terminated.
OOM kills can lead to unexpected service degradation and prompt a mild "How did that happen?" (or "wtf just happened", depending on the situation) from whoever is investigating the crime scene. The tricky part is that most of the evidence is lost to the void, since the offending process is gone. So yeah, back to the roots...
How do OOM kills happen in containers?
Container resource limits are enforced through cgroups. Neither Kubernetes nor Docker has any control whatsoever over which process gets killed. It is the underlying Linux kernel on the host that enforces the resource boundaries.
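If you want to see those boundaries for yourself, the cgroup files the kernel enforces are readable from the host. Here is a minimal sketch, assuming cgroup v2 and Docker's cgroupfs layout; adjust the path to your setup (cgroup v1 uses memory.limit_in_bytes instead of memory.max):

#!/usr/bin/env python3
# Minimal sketch: read the memory limit and current usage the kernel enforces
# for a container's cgroup. The path assumes Docker with the cgroupfs driver;
# with the systemd driver it looks like
# /sys/fs/cgroup/system.slice/docker-<CONTAINER_ID>.scope instead.
from pathlib import Path

cgroup = Path("/sys/fs/cgroup/docker/CONTAINER_ID")  # replace CONTAINER_ID

print("memory.max     =", (cgroup / "memory.max").read_text().strip())      # hard limit in bytes, or "max"
print("memory.current =", (cgroup / "memory.current").read_text().strip())  # current usage in bytes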
Journal kernel logs displaying a memory cgroup 'out of memory' message:
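The exact wording varies by kernel version, but such a line in journalctl -k (or dmesg) looks roughly like this (values illustrative):

kernel: Memory cgroup out of memory: Killed process 1516777 (python3) total-vm:265844kB, anon-rss:127048kB, file-rss:1704kB, shmem-rss:0kB, UID:0 pgtables:308kB oom_score_adj:0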
Excessive container memory usage, and hence OOM kills, can usually be attributed to:
1. Under-provisioned container memory
2. Faulty container application behavior
3. A bit of both 1. and 2.
Picture an unoptimized process that needs a lot of memory to handle inbound requests: it is naturally prone to OOM kills. Now imagine what happens at peak traffic when one container goes down and its load gets redistributed to the remaining containers while it restarts. That's a recipe for cascading failures.
If the process chosen by the kernel is the container's main process, that process exits, which stops the container and typically triggers a container restart. Restarting a container takes time, time not spent running its workload.
But the OOM killer could instead select a child of the container's main process: the container itself then outlives the OOM kill, and new child processes might eventually respawn. One could argue that this is even worse, because the symptom is much less obvious.
Unless you already know why the OOMs happen and are fine with it (e.g. you deliberately rely on OOM kills to restart containers), in most cases you really want to be able to tell what was happening in your system at the time of the OOM kills.
ℹ️ cAdvisor metrics can be used to track container memory usage and OOM kills to some extent. In this article we focus on exposing the Linux kernel's internal memory management stats instead.
That's where eBPF comes in.
eBPF enters the room
The extended Berkeley Packet Filter (eBPF) allows you to run sandboxed probe programs without changing the Linux kernel source code or loading kernel modules. Think of it as a way to hook into kernel functions and execute custom code. This opens up a plethora of possibilities, among which is troubleshooting OOM kills.
There are two main eBPF toolkits you can use: BCC and bpftrace. Both provide an 'oomkill' probe example: tools/oomkill.py for BCC and tools/oomkill.bt for bpftrace. As a side note, both were written by Brendan Gregg.
I found the bpftrace syntax more concise and easier to understand, so I tried running it locally. Here is an example output:
❯ sudo /usr/sbin/oomkill.bt
Attaching 2 probes...
Tracing oom_kill_process()... Hit Ctrl-C to end.
19:44:13 Triggered by PID 1516777 ("python3"), OOM kill of PID 1516777 ("python3"), 32768 pages, loadavg: 0.72 0.64 0.71 3/2230 1517322
The above shows the process PID, the total page count and the CPU loadavg, from the host's perspective. It doesn't give much container context; let's see if we can change that.
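As a quick sanity check, assuming the usual 4 KiB page size, the 32768 pages above translate to 128 MiB:

#!/usr/bin/env python3
# Convert the probe's page count into MiB, assuming 4 KiB pages.
PAGE_SIZE = 4096                       # bytes; check with os.sysconf("SC_PAGE_SIZE")
totalpages = 32768                     # from the oomkill.bt output above

print(totalpages * PAGE_SIZE / 2**20)  # -> 128.0 MiB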
Here is the original bpftrace oomkill probe code:
#ifndef BPFTRACE_HAVE_BTF
#include <linux/oom.h>
#endif
BEGIN
{
printf("Tracing oom_kill_process()... Hit Ctrl-C to end.\n");
}
kprobe:oom_kill_process
{
$oc = (struct oom_control *)arg0;
time("%H:%M:%S ");
printf("Triggered by PID %d (\"%s\"), ", pid, comm);
printf("OOM kill of PID %d (\"%s\"), %d pages, loadavg: ",
$oc->chosen->pid, $oc->chosen->comm, $oc->totalpages);
cat("/proc/loadavg");
}
Let's make sense of it, starting with this line:
kprobe:oom_kill_process
It defines a kernel probe (kprobe) that will hook onto the oom_kill_process() function. Makes sense.
Now let's look at the following line:
$oc = (struct oom_control *)arg0;
It takes the function's first argument, arg0, and casts it to a pointer to an oom_control struct. But wait, how do we know that oom_kill_process() receives an oom_control struct as its first argument? Well, let's take a look at the Linux kernel source code.
Short dive into the Linux kernel
Go to the torvalds/linux git repo and search for the oom_kill_process() function declaration, like this.
From mm/oom_kill.c you can see:
static void oom_kill_process(struct oom_control *oc, const char *message)
Not only have we confirmed where the oom_control argument comes from, but we now know that oom_kill_process() receives a second string argument, message.
Next, what's inside an oom_control struct? Might there be anything else useful in there? Let's have a look.
You know the drill: search for oom_control in the source code, like this.
From include/linux/oom.h you can see:
/*
 * Details of the page allocation that triggered the oom killer that are used to
 * determine what should be killed.
 */
struct oom_control {
    /* Used to determine cpuset */
    struct zonelist *zonelist;

    /* Used to determine mempolicy */
    nodemask_t *nodemask;

    /* Memory cgroup in which oom is invoked, or NULL for global oom */
    struct mem_cgroup *memcg;

    /* Used to determine cpuset and node locality requirement */
    const gfp_t gfp_mask;

    /*
     * order == -1 means the oom kill is required by sysrq, otherwise only
     * for display purposes.
     */
    const int order;

    /* Used by oom implementation, do not set */
    unsigned long totalpages;
    struct task_struct *chosen;
    long chosen_points;

    /* Used to print the constraint info. */
    enum oom_constraint constraint;
};
There's some interesting stuff in there. First of all, we now understand where the $oc->totalpages in the original oomkill probe's code comes from. And we can see a few other promising fields such as memcg, chosen and chosen_points. Let's unpack these in more detail.
- memcg is a mem_cgroup object that holds the container's cgroup memory usage and limits.
- chosen is a task_struct object that holds the chosen process's properties, among which you can find mm and nsproxy, which respectively hold the memory management information and the container's namespaces.
- chosen_points comes from the oom_badness() score calculation used by the OOM killer to select which process to kill.
The formula for the OOM badness score calculation ends up looking something like this:
chosen_points =
    mm_filepages +
    mm_anonpages +
    mm_shmempages +
    mm_swapents +
    mm_pgtables_bytes / PAGE_SIZE +
    oom_score_adj * totalpages / 1000
ℹ️ PAGE_SIZE is a constant that may vary by system but is generally 4096 bytes (4 KiB).
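To check the page size on your own machine, Python exposes it via os.sysconf:

import os

# Page size in bytes as reported by the kernel, typically 4096 on x86_64.
print(os.sysconf("SC_PAGE_SIZE"))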
The process resident set size (RSS) is the sum of mm_filepages, mm_anonpages and mm_shmempages, while mm_swapents counts the process pages that were swapped out to disk.
In essence, the score sums the process's RSS memory footprint, counted in pages, plus its swapped-out pages, plus some page table space reserved by the kernel, plus an adjustable per-process factor that can be set by the user. The main takeaway is that the process with the highest RSS, hence the highest number of OOM badness points, is the most likely to get OOM killed. And from the $oc->chosen->mm->rss_stat object you can access the various components of that process's RSS.
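To make that concrete, here is a small Python sketch of the badness calculation, mirroring the formula above with the counters our probe will print. It's an approximation for intuition, not the kernel's exact code path, and the example values are made up:

#!/usr/bin/env python3
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")  # usually 4096 bytes

def oom_badness(filepages, anonpages, shmempages, swapents,
                pgtables_bytes, oom_score_adj, totalpages):
    # RSS in pages, plus swapped-out pages and page table overhead.
    points = filepages + anonpages + shmempages
    points += swapents + pgtables_bytes // PAGE_SIZE
    # User-controlled adjustment, scaled by the cgroup's total pages.
    points += oom_score_adj * totalpages // 1000
    return points  # the task with the highest score gets killed first

# Made-up example: a process holding ~120 MiB of anonymous memory
# inside a 128 MiB cgroup (32768 pages of 4 KiB).
print(oom_badness(filepages=500, anonpages=30720, shmempages=0, swapents=0,
                  pgtables_bytes=300 * 1024, oom_score_adj=0, totalpages=32768))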
As these are the main variables that influence the OOM killer's process selection, let's see if we can include them in the output of our probe.
A revised oomkill probe, for containers
Let's mkdir container-oomkill-probe && cd container-oomkill-probe.
Here is the revised bpftrace script; let's name it container_oomkill.bt:
#!/usr/bin/env bpftrace
/*
* container_oomkill Trace OOM killer in containers.
* For Linux, uses bpftrace and eBPF.
*
* This traces the kernel out-of-memory killer by using kernel dynamic tracing of oom_kill_process().
* Prints the process host pid, container id, cgroup path, command and a few other stats.
* Note: There's no guarantee that the OOM killed process is within a "container", this script just assumes it is.
*
* Example of usage:
*
* # ./container_oomkill.bt
* Tracing oom_kill_process()... Ctrl-C to end.
*
* Adapted from the original bpftrace's tools/oomkill.bt by Brendan Gregg:
* -> http://github.com/bpftrace/bpftrace/blob/master/tools/oomkill.bt
*/
#ifndef BPFTRACE_HAVE_BTF
#include <linux/oom.h>
#endif
BEGIN
{
printf("Tracing oom_kill_process()... Hit Ctrl-C to end.\n");
}
// fn: static void oom_kill_process(struct oom_control *oc, const char *message)
// http://github.com/torvalds/linux/blob/master/mm/oom_kill.c#L1017
kprobe:oom_kill_process
{
$oc = (struct oom_control *)arg0;
$message = str(arg1);
// print datetime with milliseconds precision
printf("%s", strftime("%Y-%m-%d %H:%M:%S", nsecs));
printf(",%03d", (nsecs % 1000000000) / 1000000);
// print labels
printf(" probe=\"%s\"\n",
probe);
printf(" message=\"%s\"\n",
$message);
printf(" host_pid=\"%d\" container_id=\"%s\" command=\"%s\"\n",
$oc->chosen->pid,
$oc->chosen->nsproxy->uts_ns->name.nodename,
$oc->chosen->comm);
// oom_control stats
printf(" oc_totalpages=\"%d\" oc_chosen_points=\"%d\"\n",
$oc->totalpages, // = mem + swap
$oc->chosen_points); // = filepages + anonpages + swapents + shmempages + pgtables_bytes / PAGE_SIZE + oom_score_adj * totalpages / 1000
// cgroup stats
printf(" memcg_memory_usage_pages=\"%d\" memcg_memory_max_pages=\"%d\" memcg_memory_low_pages=\"%d\"\n",
$oc->memcg->memory.usage.counter, // memory usage in pages
$oc->memcg->memory.max, // memory hard limit
$oc->memcg->memory.low); // memory request
printf(" memcg_swap_current_pages=\"%d\" memcg_swap_max_pages=\"%d\" memcg_swappiness=\"%d\"\n",
$oc->memcg->swap.usage.counter, // swap usage in pages
$oc->memcg->swap.max, // swap hard limit
$oc->memcg->swappiness);
// stats used in OOM badness calculation
printf(" mm_rss_filepages=\"%d\" mm_rss_anonpages=\"%d\" mm_rss_swapents=\"%d\" mm_rss_shmempages=\"%d\"\n",
$oc->chosen->mm->rss_stat[0].count,
$oc->chosen->mm->rss_stat[1].count,
$oc->chosen->mm->rss_stat[2].count,
$oc->chosen->mm->rss_stat[3].count);
// in case you get hit by
// "ERROR: The array index operator [] can only be used on arrays and pointers, found record..."
// "ERROR: Can not access field 'count' on expression of type 'none'..."
// prior to linux 6.2 $oc->chosen->mm->rss_stat is a mm_rss_stat struct
// http://github.com/torvalds/linux/commit/f1a7941243c102a44e8847e3b94ff4ff3ec56f25#diff-dc57f7b72015cf5f95444ec4f8a60f85d773f40b96ac59bf55b281cd63c06142
// you can use the version below instead
//printf(" mm_rss_filepages=\"%d\" mm_rss_anonpages=\"%d\" mm_rss_swapents=\"%d\" mm_rss_shmempages=\"%d\"\n",
// $oc->chosen->mm->rss_stat.count[0].counter,
// $oc->chosen->mm->rss_stat.count[1].counter,
// $oc->chosen->mm->rss_stat.count[2].counter,
// $oc->chosen->mm->rss_stat.count[3].counter);
printf(" mm_pgtables_bytes=\"%d\"\n",
$oc->chosen->mm->pgtables_bytes.counter);
printf(" proc_oom_score_adj=\"%d\"\n",
$oc->chosen->signal->oom_score_adj); // score adj used in oom_badness calculation
// minor and major page faults
printf(" proc_min_flt=\"%d\" proc_maj_flt=\"%d\"\n",
$oc->chosen->min_flt, // minor page faults
$oc->chosen->maj_flt); // major page faults
// calculated stats
printf(" uptime_ms=\"%lld\"\n",
(nsecs - $oc->chosen->start_time) / 1000000);
}
A few noteworthy bits:
- $oc->chosen->nsproxy->uts_ns->name.nodename gets the container id (the UTS nodename, i.e. the hostname, which Docker sets to the container id by default)
- $oc->memcg->memory contains the cgroup memory usage and limits; $oc->memcg->memory.max is what you would call the "resource limit" in Kubernetes, while $oc->memcg->memory.low would be the "resource request"
- $oc->memcg->swap contains the cgroup swap usage and limits, which might be relevant as we saw that swapents is one of the factors that influence the OOM killer's selection
- $oc->chosen->mm->rss_stat contains the process RSS stats, ordered as defined in the kernel's MM counters enum (the one terminated by NR_MM_COUNTERS): MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS, MM_SHMEMPAGES
- $oc->chosen->signal->oom_score_adj and $oc->chosen->mm->pgtables_bytes.counter are among the variables we saw being used in the OOM badness calculation, so we display them as well
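One nice side effect of the key="value" output format is that it is trivial to post-process. As a minimal sketch (the script name and the forwarding destination are up to you), here is a small Python filter that turns each OOM event into a JSON record, e.g. to ship it to a log pipeline:

#!/usr/bin/env python3
# Minimal sketch: read the probe's output from stdin, collect the key="value"
# pairs of each OOM kill event and emit one JSON object per event.
import json
import re
import sys

PAIR = re.compile(r'(\w+)="([^"]*)"')

event = {}
for line in sys.stdin:
    pairs = dict(PAIR.findall(line))  # the timestamp prefix is simply ignored
    if not pairs:
        continue                      # e.g. the "Attaching probes..." banner
    event.update(pairs)
    if "uptime_ms" in pairs:          # last field the probe prints per event
        print(json.dumps(event), flush=True)
        event = {}

You could run it as bpftrace container_oomkill.bt | python3 oomkill_to_json.py (the file name here is just an example).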
Now let's write a Dockerfile to build an image with bpftrace installed and the updated container_oomkill.bt, so that the probe runs from within a container:
FROM debian:stable-slim
RUN apt-get update && \
apt-get install -y \
linux-headers-generic \
bpftrace \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY *.bt /app/
CMD ["bpftrace", "container_oomkill.bt"]
Build command:
docker build -t container-oomkill-probe .
Run command:
docker run --privileged --pid=host -v /sys:/sys:ro container-oomkill-probe
You should see something like this:
❯ docker run --privileged --pid=host -v /sys:/sys:ro container-oomkill-probe
Attaching 2 probes...
Tracing oom_kill_process()... Hit Ctrl-C to end.
Looks fine and all, but how do we trigger the probe?
Test or it didn't happen
Here is a simple Python script that hogs memory in 4 MiB increments:
#!/usr/bin/env python3
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s: %(message)s")

i = 0
memory = []
while True:
    i += 1
    logging.info(f"Allocating {i * 4} MiB...")
    memory.append(bytearray(4 * 1024 * 1024))
    time.sleep(0.1)
Save it in a new stress-mem/ folder as stress-mem/main.py.
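Your project layout should now look roughly like this (the docker-compose.yml comes next):

container-oomkill-probe/
├── Dockerfile
├── container_oomkill.bt
├── docker-compose.yml
└── stress-mem/
    └── main.py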
For testing purposes, you can run the script in a 128 MiB memory-constrained Docker container, with the container oomkill probe running alongside it.
docker-compose.yml file:
services:
  probe:
    build:
      context: .
      dockerfile: Dockerfile
    volumes:
      - /sys:/sys:ro # required to access /sys/fs/cgroup/docker/CONTAINER_ID
    privileged: true # required to run bpftrace, as an alternative could try setting CAP_BPF, CAP_PERFMON
    pid: host # required to trace processes from host
  stress-mem:
    image: python:3-slim
    volumes:
      - ./stress-mem:/app
    working_dir: /app
    command: python3 main.py
    restart: on-failure
    deploy:
      replicas: 1
      resources:
        limits:
          memory: 128MiB
        reservations:
          memory: 64MiB
Run command:
docker compose up --build
But wait, if we defined a 128MiB limit, how come our stress-mem container is able to allocate up to 252MiB?
Surely our probe is completely wrong and we've wasted our valuable time on something useless. Unless...
Notice the mm_rss_swapents="32549" in the probe's output? Well, that is swap, and swap isn't disabled by default in Docker. Memory gets "swapped to disk" when memory pressure forces the kernel to do so. You can confirm that the system is struggling by looking at the proc_maj_flt count: major page faults happen when the system has to read pages back from disk instead of RAM.
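This also explains the 252 MiB: when a memory limit is set but memory-swap is not, Docker defaults the combined memory + swap ceiling to twice the memory limit. A quick back-of-the-envelope check with our 4 MiB increments (ignoring the interpreter's own overhead):

# Why the script reached ~252 MiB before being OOM killed.
MIB = 1024 * 1024
mem_limit = 128 * MIB           # the compose file's memory limit
memswap_limit = 2 * mem_limit   # Docker default when memory-swap is left unset
step = 4 * MIB                  # our stress script's allocation increment

max_steps = memswap_limit // step     # 64 increments would hit the ceiling
print((max_steps - 1) * step // MIB)  # -> 252, the last successful allocation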
Do we want swap? No we don't, so let's disable it. In our docker-compose.yml we can do that by setting memswap_limit equal to the memory limit, which is admittedly a bit confusing.
  stress-mem:
    # ...
    memswap_limit: 128MiB # disable swap by setting memswap_limit = mem_limit
Rerun the docker compose up --build command and check the probe's output again.
Well, that's much better: no major page faults, no swap, and the process got OOM killed at 124MiB, the last 4MiB increment before it could reach the 128MiB limit. Perfect, our probe works, yay!
Wrapping up
Taking a step back, here is what we did with our new container oomkill probe:
- Show container-contextualized information from the process's memory cgroup and namespaces
- Expose the kernel's internal memory management stats, providing additional insight into why a process might have been OOM killed
It's not much, but you are now fully equipped with the tools and knowledge to dig further, build your own eBPF probes, and understand how the Linux kernel works.
References and further reading
- Linux Extended BPF (eBPF) Tracing Tools by Brendan Gregg
- Memory Management in Linux - Concepts overview (kernel docs)
- Chapter 3 - Memory Management (LDP)
- cgroups v2 (kernel docs)
- man namespaces
TL;DR:
glhf 😉
Top comments (4)
Cool stuff. Thanks for sharing.
Will try it out myself for sure
I can probably figure it out myself, but how do you calculate the 252 or 124 MiB?
Given i the iteration counter and 4 * 1024 * 1024 the size of our byte array increments, the total allocated after i iterations is i * 4 * 1024 * 1024 bytes. As 1 MiB = 1024 KiB = 1024 * 1024 bytes, that's i * 4 MiB. Hence the code: logging.info(f"Allocating {i * 4} MiB...").
4 KiB btw is the page size on most systems, hence why I originally chose 4 * 1024, but the OOMs weren't happening fast enough so I went for 4 MiB increments x)
Awesome thanks
I was actually wondering how you figured out from the bpftrace script's output that the OOM happened at 124MiB, but I played a bit with it and then asked Copilot and it gave me the answer. Again related to the famous 4KiB "usual" page size.
Btw I tried to replicate your Python script in Rust but had a hard time actually getting the memory allocated due to some optimizations. I am also new to Rust.
Also, as you might be aware, it is possible to run the bpftrace script without a container. But running it in a container might be more portable.
Ah gotcha! What you want to look at mostly are the mm_rss_* stats, which give you the total resident set size. mm_rss_anonpages in particular typically gives you the stack + heap space allocated to a running program.
You can indeed run bpftrace directly, if you are on Linux x)
Running it from a container might also be convenient for running it ad hoc on a Kubernetes cluster when conducting an investigation, for example, instead of having to exec into the node.