BIPI

Service Mesh mTLS: East-West Zero Trust That Actually Works

Cloud Security

Service meshes get sold as the foundation of zero trust, but the value is narrower than the pitch. mTLS plus identity-based authz is real. Most other mesh features are operational debt.

By Arjun Raghavan, Security & Systems Lead, BIPI · February 13, 2024 · 7 min read


Zero trust for east-west traffic is the most defensible reason to deploy a service mesh. Without a mesh, services in a Kubernetes cluster typically talk over plaintext HTTP and authorize based on IP addresses or shared secrets baked into config maps. With a mesh, every service-to-service call uses mutually authenticated TLS with workload-level identity, and authorization can reference 'service A may call service B method C' rather than IP ranges.

That is real value. Most of the other features sold with service meshes (traffic shifting, retries, circuit breaking) can be done elsewhere and are not worth the operational cost of running Istio.

mTLS via SPIFFE identity

Both Istio and Linkerd implement SPIFFE-style workload identity. Each pod gets a short-lived certificate (valid for 24 hours by default in both Istio and Linkerd, and rotated automatically before expiry) tied to its Kubernetes service account. Certificates are issued by the mesh control plane's CA, which can itself chain to an external CA (AWS Private CA, HashiCorp Vault) for organizations that require it.

Pod-to-pod traffic is automatically encrypted and authenticated. No code changes. The sidecar (or, in Istio Ambient mode, the per-node ztunnel) handles it. Applications continue to call http://service-name:8080 and the mesh transparently upgrades the connection to mTLS.

Authorization policies based on identity

The second leg of east-west zero trust is allow-listing which services can call which. In Istio, an AuthorizationPolicy expresses a rule such as:

Allow GET /api/v1/orders from spiffe://cluster.local/ns/storefront/sa/web. Deny everything else.
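
A minimal sketch of that rule as an Istio manifest is shown here. The policy name, the orders namespace, and the app: orders label are hypothetical placeholders; only the caller identity, method, and path come from the rule above. Note that Istio's principals field takes the identity without the spiffe:// prefix, and that matching on principals only works when the connection uses mTLS.

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-allow-web      # hypothetical policy name
  namespace: orders           # hypothetical namespace of the called service
spec:
  selector:
    matchLabels:
      app: orders             # hypothetical label on the called workload
  action: ALLOW
  rules:
  - from:
    - source:
        # Caller identity, written without the spiffe:// prefix
        principals: ["cluster.local/ns/storefront/sa/web"]
    to:
    - operation:
        methods: ["GET"]
        paths: ["/api/v1/orders"]
```

Once any ALLOW policy selects a workload, requests to that workload that match no ALLOW rule are rejected, which is what provides the "deny everything else" half of the rule.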

Authorization is enforced in the sidecar, not the application. An attacker who compromises one pod cannot pivot to call other services that the compromised workload does not have explicit authorization for. This is dramatically stronger than network-policy-based authorization, which trusts that the source IP equates to the source identity.

Certificate lifecycle matters

Mesh CAs are sensitive. If the CA is compromised, the attacker can issue valid mesh certificates for any service. We recommend:

External CA integration (AWS Private CA or Vault PKI) so the mesh root key does not live in the cluster.
Short-lived workload certificates (default 24h is fine, do not extend).
Separate trust domains per cluster, with cross-cluster trust established via SPIFFE federation rather than a shared CA.
Audit and alert on mesh CA configuration changes.
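
As one illustration of the per-cluster trust domain recommendation, the sketch below sets a distinct trust domain at install time through the IstioOperator API. The domain cluster-a.example.com is a hypothetical value, and this fragment covers only the trust domain; external CA integration and federation need additional configuration that is not shown.

```yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    # Hypothetical per-cluster trust domain. Workload identities become
    # spiffe://cluster-a.example.com/ns/<namespace>/sa/<service-account>
    trustDomain: cluster-a.example.com
```

Authorization policies that name principals need to use the matching trust domain, so it is worth choosing it before policies are written.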

Performance overhead

Istio sidecar mode adds 1-3ms of latency per hop and consumes 50-200MB of memory per sidecar. For a mesh with 200 pods, that is 10-40GB of memory just for sidecars (200 × 50MB at the low end, 200 × 200MB at the high end). Linkerd's Rust-based proxy is meaningfully lighter (10-30MB per proxy). Istio Ambient mode (still alpha in early 2024) removes the per-pod sidecar in favor of per-node ztunnels and per-namespace waypoint proxies, cutting overhead substantially.

For latency-sensitive workloads (sub-10ms p99 SLOs), the overhead matters. For most application workloads, it is invisible.

When mesh is overkill

Service mesh is overkill if:

Your cluster has under 20 services and they rarely change.
All inter-service authorization is already handled at the application layer (e.g., OAuth tokens with scopes).
You do not have operators dedicated to mesh operations. Istio in particular is not a fire-and-forget tool.
Your workloads are not in Kubernetes (mesh value drops outside Kubernetes orchestration).

For small clusters, network policies plus application-layer auth (mTLS terminated at ingress, JWT validation in services) get you most of the security benefit without the operational cost.

Rollout pattern that works

We never enable strict mTLS on day one. The pattern:

Install mesh in permissive mode. Both mTLS and plaintext are accepted.
Inject sidecars per-namespace, validating traffic still flows.
Monitor metrics for plaintext connections. Track them down service by service.
Switch namespace-by-namespace to strict mTLS once plaintext drops to zero.
Add authorization policies incrementally, starting with allow-all in audit mode, then deny-all defaults.
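
In Istio, the permissive and strict steps above map onto PeerAuthentication resources. The sketch below assumes the default root namespace istio-system and reuses the storefront namespace from the earlier example identity as the first namespace to be switched; adjust both to your environment.

```yaml
# Step 1: mesh-wide default. Sidecars accept both mTLS and plaintext.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system     # assumes the default Istio root namespace
spec:
  mtls:
    mode: PERMISSIVE
---
# Step 4: once plaintext traffic into this namespace has dropped to zero,
# a namespace-level policy overrides the mesh-wide default.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: storefront       # example namespace, switched first
spec:
  mtls:
    mode: STRICT
```

Because the namespace-level resource takes precedence over the mesh-wide one, each namespace can be moved to STRICT, and rolled back by deleting its policy, without touching the rest of the cluster.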

Cluster-wide strict mTLS on day one breaks every legacy workload that talks HTTP from outside the mesh, and the resulting incident sets the mesh project back six months.
