BIPI

Container Escape Techniques and Defenses That Hold

Container escapes are not exotic. Privileged flags, mounted Docker sockets, and CAP_SYS_ADMIN show up in real workloads. Here is what we exploit and what to put between attackers and the host.

By Arjun Raghavan, Security & Systems Lead, BIPI · February 1, 2025 · 8 min read

#container #escape #linux-security #pentest

Container escape is rarely about a kernel zero-day. It is almost always about a misconfigured runtime: a flag set for convenience, a mount made for debugging, a capability granted because a tutorial said to. Once we have RCE inside a container, we look for these shortcuts before reaching for kernel exploits.

How attackers find this

First we read /proc/self/status for capabilities, /proc/1/cgroup for cgroup membership, and /proc/self/mounts for mount points. kdigger automates this with a checklist that flags privileged mode, exposed sockets, mountable devices, and capability sets in a single run. Then we look outward: is /var/run/docker.sock mounted? Is a kubelet API reachable on port 10250? Are host paths mounted read-write?
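As a rough illustration of the first step, the sketch below decodes the CapEff mask from /proc/self/status into capability names. It is not how kdigger works internally. The function names and the short list of capabilities treated as risky are assumptions made for this example; the bit positions come from linux/capability.h.

```python
# Subset of capabilities that commonly matter for escapes (an assumption for
# this sketch, not an exhaustive list). Values are bit positions in CapEff.
RISKY_CAPS = {
    "CAP_DAC_READ_SEARCH": 2,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_MODULE": 16,
    "CAP_SYS_RAWIO": 17,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
}


def parse_cap_eff(status_text):
    """Return the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("CapEff line not found")


def risky_caps(cap_eff):
    """Return the sorted names of risky capabilities set in the mask."""
    return sorted(
        name for name, bit in RISKY_CAPS.items() if cap_eff & (1 << bit)
    )


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as fh:
            print(risky_caps(parse_cap_eff(fh.read())))
    except OSError:
        print("no /proc/self/status; not a Linux host")
```

An empty list means none of the capabilities in this subset are present, which is what the Docker default capability set should produce. A list containing CAP_SYS_ADMIN is the signal to look at the cgroup and mount-based paths next.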

Privileged flag (--privileged): effectively root on the host; mount the host's block devices from /dev and pivot.
CAP_SYS_ADMIN without --privileged: still enough for the cgroup v1 release_agent escape on older kernels.
Docker socket mounted into the container: running docker run --privileged against the host daemon from inside the container is trivial.
Kernel exploits: Dirty Pipe (CVE-2022-0847), OverlayFS bugs, and more recently the nf_tables flaw CVE-2024-1086.
Cgroup v1 release_agent: write a release_agent path and trigger a notification so the kernel executes it on the host.
Host /proc or host / mounted into the container: read host secrets, write to the host crontab.
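Two of the items above, the mounted runtime socket and host paths mounted read-write, can be spotted from the text of /proc/self/mounts. The following is a minimal sketch of that check, assuming a simple heuristic: any mount point ending in a known runtime socket name, or any read-write mount backed by a /dev block device, is worth a closer look. The function name, the finding labels, and the list of routine bind targets are hypothetical.

```python
# Socket file names used by common container runtimes.
RUNTIME_SOCKETS = ("docker.sock", "containerd.sock", "crio.sock")

# Files most runtimes bind-mount into every container; ignoring them is an
# assumption made to cut noise in this sketch.
ROUTINE_BIND_TARGETS = {"/etc/hosts", "/etc/hostname", "/etc/resolv.conf"}


def flag_mounts(mounts_text):
    """Return (label, mountpoint) pairs for mounts that deserve review."""
    findings = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        device, mountpoint, _fstype, options = fields[:4]
        opts = options.split(",")
        if mountpoint.endswith(RUNTIME_SOCKETS):
            findings.append(("runtime-socket", mountpoint))
        elif (
            device.startswith("/dev/")
            and "rw" in opts
            and mountpoint not in ROUTINE_BIND_TARGETS
        ):
            findings.append(("rw-host-backed-mount", mountpoint))
    return findings


if __name__ == "__main__":
    try:
        with open("/proc/self/mounts") as fh:
            for label, mountpoint in flag_mounts(fh.read()):
                print(label, mountpoint)
    except OSError:
        print("no /proc/self/mounts; not a Linux host")
```

The heuristic will also flag ordinary data volumes, so treat the output as a list of things to inspect rather than confirmed escape paths.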

Methodology in practice

We try the cheapest things first because escape paths differ in effort and footprint. Privileged mode plus a Docker socket means we can spawn a sibling container that mounts the host root and pivot in seconds. Capability-only escapes are noisier; they involve writing files under /sys/fs/cgroup. Kernel exploits come last because they are loud and fragile.

Detection

Falco is the de facto standard for this. The default ruleset catches container escape signatures: writes to /proc/sysrq-trigger, modification of release_agent, suspicious mounts, and shells spawned in containers with elevated capabilities. eBPF-based agents (Tetragon, Tracee) give richer telemetry on syscall sequences. CloudTrail and EKS audit logs catch the higher-altitude effects, such as a node suddenly registering many new pods.
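To make the idea of a signature concrete, here is a small sketch that matches file-write events against two of the paths named above. This is not Falco's rule engine or rule syntax. The event shape (a dict with op, path, and container_id keys) and the function name are hypothetical, chosen only to show the matching logic.

```python
import re

# Path signatures taken from the escape techniques discussed in this post.
SIGNATURES = [
    ("release_agent modified", re.compile(r"(^|/)release_agent$")),
    ("write to sysrq-trigger", re.compile(r"^/proc/sysrq-trigger$")),
]


def match_events(events):
    """Return (container_id, label, path) for container writes that match."""
    alerts = []
    for event in events:
        # Only writes that originate inside a container are of interest here.
        if event.get("op") != "write" or not event.get("container_id"):
            continue
        for label, pattern in SIGNATURES:
            if pattern.search(event.get("path", "")):
                alerts.append((event["container_id"], label, event["path"]))
    return alerts


if __name__ == "__main__":
    sample = [
        {"op": "write", "path": "/tmp/app.log", "container_id": "abc123"},
        {
            "op": "write",
            "path": "/sys/fs/cgroup/rdma/release_agent",
            "container_id": "abc123",
        },
    ]
    for alert in match_events(sample):
        print(alert)
```

In a real deployment the events would come from a kernel-level source such as the agents mentioned above, and the rules would be maintained in that tool's own format.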

Remediation

Run rootless: containerd, Podman, and Docker all support rootless modes that remove the host-root dependency.
Drop all capabilities by default and add back only what the workload requires; never grant CAP_SYS_ADMIN to application containers.
Apply seccomp profiles; the Docker default profile blocks dozens of syscalls used in escapes.
Apply AppArmor or SELinux MAC profiles; even the default Docker AppArmor profile prevents writes to many sensitive paths.
Never mount the Docker socket into a workload container; if a container needs to build images, use rootless BuildKit or kaniko.
Patch host kernels promptly; subscribe to distribution security feeds and have a defined SLA for kernel CVEs.
In Kubernetes, enforce Pod Security Admission 'restricted' on application namespaces.
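Several of these items can be checked before a workload ever reaches the cluster. The sketch below audits a Kubernetes pod spec, already parsed into a Python dict, against a subset of the settings that the 'restricted' profile expects. It is an approximation for illustration, not a replacement for Pod Security Admission: the function name is hypothetical, and it looks only at containers and volumes, not initContainers, ephemeral containers, or host namespaces.

```python
def audit_pod_spec(spec):
    """Return a list of human-readable problems found in a pod spec dict."""
    problems = []
    pod_sc = spec.get("securityContext") or {}

    # hostPath volumes are how the Docker socket and host root get mounted.
    for volume in spec.get("volumes") or []:
        host_path = volume.get("hostPath")
        if host_path:
            problems.append(
                f"volume {volume.get('name')}: hostPath {host_path.get('path')}"
            )

    for container in spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}
        caps = sc.get("capabilities") or {}

        if sc.get("privileged"):
            problems.append(f"{name}: privileged is true")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation not false")
        if "ALL" not in (caps.get("drop") or []):
            problems.append(f"{name}: capabilities.drop lacks ALL")
        extra = sorted(set(caps.get("add") or []) - {"NET_BIND_SERVICE"})
        if extra:
            problems.append(f"{name}: adds capabilities {extra}")

        # Container-level settings override pod-level ones when present.
        if sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot not true")
        seccomp = sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile not set")
    return problems
```

An empty list means the spec passed these particular checks. Running something like this in CI gives earlier feedback than an admission rejection, but the admission controller remains the enforcement point.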

The defensive principle is defense in depth: the runtime should not trust the container, the host should not trust the runtime, and the network should not trust the host. A failure in any one layer should not be enough to compromise the next.
