Database Connection Pooling: The Math, the Modes, and the Serverless Problem
Digital Engineering
Connection pools fail in interesting ways. PgBouncer transaction mode versus session mode, the right pool size math, and why serverless changed the rules. Real production tuning, not theory.
By Arjun Raghavan, Security & Systems Lead, BIPI · July 16, 2024 · 8 min read
We have spent more time tuning PgBouncer in the last two years than in the previous five. Not because PgBouncer got harder, but because the workload mix changed. Serverless functions, edge runtimes, and containerized workloads with high churn all stress connection pooling in ways the original design did not anticipate. Here is the production reality.
The math: how big should the pool be?
There is a real formula and there is a useful formula. The real one comes from the PostgreSQL wiki: connections = ((core_count * 2) + effective_spindle_count). For an 8-core RDS instance with SSD, that is 16 to 20 connections of useful concurrency at the database. Beyond that, additional connections fight for CPU and lock contention overwhelms throughput.
The useful formula: start at 2x your CPU count for the database tier, measure, adjust. We routinely see clients with pool sizes of 200 against an 8-core database, wondering why p99 latency spikes when traffic doubles. The pool is fine. The database is at 100 percent CPU because 200 concurrent queries are fighting for 8 cores.
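To make the arithmetic concrete, here is the back-of-envelope in code (a sketch; the 8-core figures mirror the RDS example above):

```python
def useful_db_concurrency(core_count: int, effective_spindle_count: int = 0) -> int:
    """The wiki formula: roughly how many queries the database can
    truly run at once. On SSDs the spindle term contributes little."""
    return (core_count * 2) + effective_spindle_count

def starting_pool_size(db_core_count: int) -> int:
    """The useful formula: start at 2x the database's CPU count,
    then measure and adjust."""
    return db_core_count * 2

# 8-core RDS instance with SSD storage
print(useful_db_concurrency(8))  # 16 -- a 200-connection pool just queues on CPU
print(starting_pool_size(8))     # 16 -- the starting point, not the answer
```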
PgBouncer modes: transaction beats session for most workloads
Three modes:
- Session mode: client gets a server connection for the life of their connection. Most permissive, least efficient
- Transaction mode: client gets a server connection for the life of a transaction. Multiplexes well, breaks features that span transactions
- Statement mode: per-statement multiplexing. Maximum density, breaks more features
Transaction mode is the right default for web applications. The features it breaks (session-level temp tables, prepared statements that span transactions, advisory locks held across transactions) are things most applications do not actually need. The features it enables (10x or more multiplexing of clients to server connections) are nearly always worth the tradeoff.
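Switching modes is one line in pgbouncer.ini. A minimal sketch (names and sizes are illustrative, not a recommendation):

```ini
[databases]
appdb = host=db.internal port=5432 dbname=appdb

[pgbouncer]
pool_mode = transaction
default_pool_size = 16    ; match the useful concurrency at the database
max_client_conn = 2000    ; clients multiplex onto the 16 server connections
```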
The gotcha: prepared statements. Most ORMs use prepared statements by default. In PgBouncer transaction mode, those prepared statements live on the server connection, which is shared across clients. Either disable them at the driver (statement_cache_size=0 for asyncpg, prepareThreshold=0 for the PostgreSQL JDBC driver) or use PgBouncer 1.21+, which supports prepared statements in transaction mode when max_prepared_statements is set above zero.
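For asyncpg that looks like this (a sketch; the DSN is a placeholder for your PgBouncer endpoint):

```python
import asyncio
import asyncpg

async def main() -> None:
    # Pool pointed at PgBouncer, not at Postgres directly.
    # statement_cache_size=0 stops asyncpg from reusing named prepared
    # statements, which would otherwise leak across pooled server
    # connections in transaction mode. (Unnecessary on PgBouncer 1.21+
    # with max_prepared_statements enabled.)
    pool = await asyncpg.create_pool(
        dsn="postgresql://app@pgbouncer.internal:6432/appdb",  # placeholder
        min_size=2,
        max_size=16,
        statement_cache_size=0,
    )
    async with pool.acquire() as conn:
        print(await conn.fetchval("SELECT 1"))
    await pool.close()

asyncio.run(main())
```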
The connection storm pattern
Production incident pattern we see twice a year: an upstream cache fails, traffic shifts to the database, every application instance opens its full pool against PgBouncer simultaneously, PgBouncer in turn tries to open every server connection at once, and the database CPU pegs at 100 percent answering authentication requests instead of queries. The site is down for 10 minutes while everything settles.
The mitigations:
- Configure server_login_retry and server_connect_timeout aggressively at PgBouncer (see the config sketch after this list)
- Set max_db_connections at PgBouncer to less than the database's actual capacity (account for replication, maintenance, monitoring tools)
- Use connection ramp-up at the application layer (do not open the full pool at startup)
- Add a circuit breaker on the cache layer that does not let a cache failure send 100 percent of traffic to the database
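A starting point for the first two knobs in pgbouncer.ini (a sketch; the values are illustrative and depend on your database's actual max_connections):

```ini
[pgbouncer]
; Fail fast instead of hammering a struggling database
server_connect_timeout = 3    ; seconds to wait for a new server connection
server_login_retry = 5        ; seconds to pause before retrying a failed login

; Hard ceiling below the database's max_connections, leaving headroom
; for replication, maintenance, and monitoring tools
max_db_connections = 40
```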
Serverless changed everything
Lambda, Cloud Run, and Fargate spin up containers on demand, and each container wants its own pool. With 200 concurrent functions, you have 200 mini-pools, each opening 5 connections, each tearing down at scale-in: 1,000 connections fighting for the 16-connection useful concurrency at your database. The math does not work.
The patterns that work for serverless:
- A persistent PgBouncer (or RDS Proxy, or Supabase Pooler) layer that handles the pooling. Functions hold one or two connections, not pools (sketched after this list)
- Connection-per-invocation models with aggressive connection lifetimes (closing after one query is wasteful, but sometimes correct)
- Data API patterns (RDS Data API, Neon serverless driver) that bypass connection pooling entirely with HTTP-based query execution
- Edge runtimes typically must use HTTP-based access. Workers cannot maintain persistent TCP connections to your database in any sensible way
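The first pattern in code, as a minimal Lambda-style sketch (the pooler endpoint, credentials, and query are placeholders). The connection is created once per container at cold start, reused across invocations, and points at the pooler rather than the database:

```python
import os
import psycopg2

# One connection per container, not a pool: the persistent pooler
# (PgBouncer, RDS Proxy, ...) does the multiplexing.
_conn = None

def _get_conn():
    global _conn
    if _conn is None or _conn.closed:
        _conn = psycopg2.connect(
            host=os.environ["POOLER_HOST"],   # placeholder pooler endpoint
            port=6432,
            dbname="appdb",
            user="app",
            password=os.environ["DB_PASSWORD"],
            connect_timeout=3,                # fail fast during storms
        )
        _conn.autocommit = True
    return _conn

def handler(event, context):
    with _get_conn().cursor() as cur:
        cur.execute("SELECT count(*) FROM orders")  # illustrative query
        (n,) = cur.fetchone()
    return {"order_count": n}
```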
What to actually monitor
The dashboard for connection pool health:
- Pool occupancy from PgBouncer's SHOW POOLS (cl_active, cl_waiting, sv_active, sv_idle)
- Wait and duration metrics from SHOW STATS (avg_wait_time, avg_xact_time, avg_query_time)
- Database connection count, tracked against max_connections
- Long-running queries (over 30s) at the database (pg_stat_activity)
- Authentication failures (a leading indicator of misconfigured app instances)
If cl_waiting is consistently above zero, your pool is undersized or your database is too slow. If sv_idle is consistently the bulk of your pool, your pool is oversized. If average wait time spikes during traffic surges, you have a connection storm waiting to happen.
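A sketch of pulling those counters from PgBouncer's admin console (assumes a user listed in stats_users in pgbouncer.ini; the admin console speaks the simple query protocol, so psycopg2 with autocommit works here):

```python
import psycopg2

# PgBouncer exposes a virtual "pgbouncer" database for SHOW POOLS / SHOW STATS.
conn = psycopg2.connect(
    host="pgbouncer.internal", port=6432,   # placeholder endpoint
    dbname="pgbouncer", user="stats", password="...",  # placeholder credentials
)
conn.autocommit = True  # the admin console does not support transactions

with conn.cursor() as cur:
    cur.execute("SHOW POOLS")
    cols = [d.name for d in cur.description]
    for row in cur.fetchall():
        pool = dict(zip(cols, row))
        # cl_waiting > 0 means clients are queued for a server connection
        if pool["cl_waiting"] > 0:
            print(f"{pool['database']}: {pool['cl_waiting']} clients waiting")
```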
Connection pooling is not glamorous infrastructure. It is also where 30 percent of the production incidents we investigate end up rooted. The math is simple, the modes are well-documented, and the serverless story is tractable. Most teams just have not invested the hour required to set it up correctly for their actual workload.