BIPI

Kubernetes Incident Response: kubectl Forensics, Falco Findings, and Container Snapshots

A practitioner Kubernetes IR playbook covering kubectl logs and events, audit log analysis, container snapshotting with CRIU, Falco runtime findings, network policy isolation, and the evidence package before destruction.

By Arjun Raghavan, Security & Systems Lead, BIPI · June 15, 2024 · 9 min read

Kubernetes incidents start with a runtime detection or an anomalous audit log entry and end with a question about whether the cluster is salvageable. The runbook below assumes Falco or a similar runtime tool is deployed, audit logging is enabled at the Metadata level or higher, and you have a forensics namespace ready.

1. The opening triage

When a pod is suspect, the temptation is to kubectl delete it. Resist. A deleted pod is a deleted crime scene. Start with two commands that gather context without changing state.

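kubectl describe pod <pod> -n <ns> > pod-describe-IR.txt && kubectl logs <pod> -n <ns> --all-containers --timestamps > pod-logs-IR.txt

If a container has restarted, run the logs command again with --previous and a separate output file. On its own, --previous returns only the prior instance's logs and fails when there is no prior instance.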

Then pull the audit log entries for the pod's service account over the last 24 hours. If audit logs route to a log aggregator, query there; otherwise the API server's audit log file (the path set with --audit-log-path) is your source.

2. Audit log scoping

Kubernetes audit logs are structured JSON events with verb, objectRef, user, sourceIPs, and responseStatus fields. The verbs that matter during IR are create, update, patch, and delete. There is no exec verb: a kubectl exec session is recorded as a create (or a get, when the client connects over WebSocket) on the pods resource with the exec subresource. If your alleged attacker shelled into a pod, this is where you confirm it.

jq 'select(.objectRef.resource=="pods" and .objectRef.subresource=="exec")' /var/log/kubernetes/audit.log

For service account compromise, filter by user.username equals system:serviceaccount:<ns>:<name> and look at what API calls it made beyond its normal pattern. A service account that suddenly lists secrets across namespaces is the textbook lateral movement signal.
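The "beyond its normal pattern" comparison can be sketched offline as a diff between two audit log slices. This is a minimal sketch, assuming the audit log is JSON lines in the standard audit event shape; the function names and the choice of a (verb, resource, namespace) tuple as the unit of comparison are illustrative assumptions, not part of any Kubernetes tooling.

```python
import json
from collections import Counter


def load_events(lines):
    """Parse JSON-lines audit events, skipping blank or malformed lines."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(event, dict):
            events.append(event)
    return events


def call_profile(events, username):
    """Count (verb, resource, namespace) tuples issued by one user."""
    profile = Counter()
    for event in events:
        if (event.get("user") or {}).get("username") != username:
            continue
        ref = event.get("objectRef") or {}
        key = (
            event.get("verb") or "",
            ref.get("resource") or "",
            ref.get("namespace") or "",  # empty for cluster-wide requests
        )
        profile[key] += 1
    return profile


def new_calls(baseline, incident):
    """Return calls seen in the incident window but never in the baseline."""
    return sorted(key for key in incident if key not in baseline)
```

Feed it a baseline slice (for example, the week before the alert) and the incident slice for the same system:serviceaccount:<ns>:<name> user. A tuple such as ("list", "secrets", "kube-system") appearing only in the incident window is the kind of entry to chase first.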

3. Falco and runtime evidence

Falco's default ruleset catches the high-value events: shell in container, sensitive file open (/etc/shadow, kubeconfig), unexpected network connections, write to root filesystem, and crypto miner indicators. During an active incident, the Falco JSON output is your timeline of what the container actually did.

Stream Falco events for the suspect pod and look for the first event that breaks the container's normal behavior pattern. That is your patient-zero moment.
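Finding that first deviation can be sketched against a saved Falco JSON output file. This is a rough sketch, assuming Falco runs with JSON output and Kubernetes metadata enrichment so that each event carries time, rule, and output_fields with k8s.ns.name and k8s.pod.name; the function names and the expected_rules allowlist are hypothetical helpers for illustration.

```python
import json


def pod_timeline(lines, namespace, pod):
    """Return Falco events for one pod, ordered by the event's time field."""
    events = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        fields = event.get("output_fields") or {}
        if fields.get("k8s.ns.name") == namespace and fields.get("k8s.pod.name") == pod:
            events.append(event)
    # Assumes uniform ISO 8601 UTC timestamps, which sort correctly as strings.
    return sorted(events, key=lambda event: event.get("time", ""))


def first_deviation(timeline, expected_rules=frozenset()):
    """Return the earliest event whose rule is not in the expected set, or None."""
    for event in timeline:
        if event.get("rule") not in expected_rules:
            return event
    return None
```

The expected_rules set is where you encode the container's normal noise, if it has any; with an empty set the function simply returns the earliest Falco event recorded for the pod.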

4. Isolate without deleting

Containment is two steps. First, network-isolate the pod with a NetworkPolicy that denies all ingress and egress except your forensics access path. Second, prevent further scheduling on the affected node by cordoning.

kubectl label pod <pod> -n <ns> quarantine=true && kubectl apply -f netpol-quarantine.yaml && kubectl cordon <node>

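The NetworkPolicy's podSelector matches quarantine=true. Provided the cluster's CNI plugin enforces NetworkPolicy and no other policy selecting the pod still allows traffic, the pod is now alive, accessible to you, and cannot reach the rest of the cluster or the internet. Falco continues to record anything the attacker tries to do.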
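The article does not show the quarantine manifest itself, so here is one possible shape for it, generated as JSON (which kubectl apply -f also accepts). This is a sketch under stated assumptions: the policy name quarantine, the forensics namespace name, and the choice to allow ingress only from that namespace are all illustrative, not the author's actual netpol-quarantine.yaml.

```python
import json


def quarantine_policy(namespace, forensics_namespace="forensics"):
    """Build a NetworkPolicy that isolates pods labeled quarantine=true."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "quarantine", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"quarantine": "true"}},
            # Listing both types with no egress rules denies all egress.
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [
                {
                    "from": [
                        {
                            # Assumed forensics access path: pods in one namespace,
                            # matched by the label Kubernetes sets on every namespace.
                            "namespaceSelector": {
                                "matchLabels": {
                                    "kubernetes.io/metadata.name": forensics_namespace
                                }
                            }
                        }
                    ]
                }
            ],
            "egress": [],
        },
    }


# "shop" is a hypothetical namespace; redirect the output to a file and apply it.
print(json.dumps(quarantine_policy("shop"), indent=2))
```

NetworkPolicies are additive, so this policy does not override another policy that already allows traffic to or from the same pod; check for those before relying on the quarantine. kubectl exec and kubectl logs go through the kubelet rather than the pod network, so they keep working regardless.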

5. Container snapshotting with CRIU

For evidence preservation, snapshot the running container's memory and disk state. CRIU (Checkpoint/Restore in Userspace) can checkpoint a running container if the container runtime supports it; containerd has experimental support, podman has stable support.

crictl checkpoint --export=/forensics/<pod>-checkpoint.tar <containerID> dumps process state, open files, and memory.
Copy the container's writable layer: ctr -n k8s.io snapshots mounts /forensics/<pod>-layer/ <snapshot> prints the mount command for the snapshot; run it, then archive the mounted directory.
Capture environment and command line via kubectl get pod <pod> -n <ns> -o yaml > <pod>-spec.yaml

If CRIU is not available, the minimum forensic capture is the writable layer, the pod manifest, the runtime container logs, the Falco events, and the audit log entries for the pod's service account.

6. Secret rotation and recovery

Every secret mounted into the compromised pod is treated as exposed. That includes service account tokens, image pull secrets, mounted configmaps with credentials, environment variables sourced from secrets, and any external secret manager bindings. Rotate before re-deploying.

List the pod's mounts and env: kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].env[*].valueFrom}{.spec.containers[*].envFrom}{.spec.volumes[*]}'
For each secret, rotate the underlying credential at the source (DB password, cloud key, API token).
Update the Kubernetes Secret, then restart consumers via rollout.
Audit RoleBindings and ClusterRoleBindings referencing the compromised service account.
If the attacker gained privileged access to the API server (cluster-admin rights or a foothold on a control-plane node), assume etcd is compromised; rotate all certificates and re-issue join tokens.
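The first step of the list above, working out which Secrets the pod actually referenced, can be sketched against the saved pod manifest. This is a minimal sketch, assuming the manifest was exported as JSON (kubectl get pod <pod> -n <ns> -o json); referenced_secrets is a hypothetical helper, and it covers only Secret objects, not the projected service account token, configmaps holding credentials, or external secret manager bindings.

```python
import json


def referenced_secrets(pod):
    """Return the sorted names of Secrets referenced by a pod manifest (dict)."""
    spec = pod.get("spec") or {}
    names = set()

    for ref in spec.get("imagePullSecrets") or []:
        names.add(ref["name"])

    for volume in spec.get("volumes") or []:
        if "secret" in volume:
            names.add(volume["secret"]["secretName"])
        # Projected volumes can bundle several sources, including Secrets.
        for source in (volume.get("projected") or {}).get("sources") or []:
            if "secret" in source:
                names.add(source["secret"]["name"])

    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        for env in container.get("env") or []:
            key_ref = (env.get("valueFrom") or {}).get("secretKeyRef")
            if key_ref:
                names.add(key_ref["name"])
        for source in container.get("envFrom") or []:
            if "secretRef" in source:
                names.add(source["secretRef"]["name"])

    return sorted(names)
```

Load the manifest with json.load and treat the returned names as the rotation worklist; each one still has to be traced back to the credential it wraps at the source.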

7. The evidence package before destruction

When the investigation reaches the point where the pod, the node, or the cluster needs to be rebuilt, the evidence package is what you keep. Our standard package: pod manifest, container layer tarball, container logs, CRIU checkpoint if available, Falco event JSONL, audit log slice, network policy YAML at time of incident, and a written timeline correlating the above. Store it in object storage with retention lock and a SHA-256 manifest.
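The SHA-256 manifest can be produced with a few lines before upload. This is a sketch, assuming the evidence package is staged in a single local directory; the SHA256SUMS file name and the helper names are illustrative choices, and the line layout follows the common "hash, two spaces, relative path" convention so the result can be checked with sha256sum -c.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoint tarballs do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir, manifest_name="SHA256SUMS"):
    """Write one 'hash  relative/path' line per file and return the manifest path."""
    root = Path(evidence_dir)
    manifest = root / manifest_name
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path == manifest:
            continue  # never hash a manifest left over from an earlier run
        lines.append(f"{sha256_file(path)}  {path.relative_to(root).as_posix()}")
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

Generate the manifest after the last artifact is added and before upload, and record the manifest's own hash in the written timeline so that later tampering with either the files or the manifest is detectable.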

~2 GB/day per 100 nodes: Kubernetes audit log size at metadata level

5 hrs: median K8s IR with Falco deployed

1-3 days: median K8s IR without runtime tooling

Kubernetes IR is one of the harder cloud-native runbooks because the abstractions move fast. The constants are audit logs, runtime detection, and evidence discipline. Build those into the cluster before the first incident, not in the middle of one.
