A deploy rolled out clean, every pod reported ready, and we still dropped requests for two minutes. The readiness probe was lying to us.

The probe checked the wrong thing

Our probe only hit /healthz, which returned 200 as soon as the HTTP server bound to its port — before the process had finished warming its connection pool to the database. Ready to accept traffic and ready to serve it turned out to be two different things.

The fix

We split the check into /healthz (process alive) and /readyz (dependencies warm, pool primed), and pointed the readiness probe at the latter. Liveness and readiness answering the same question was the actual bug.

The lesson

“The pod is ready” should mean “this pod won’t make things worse,” not “the process started.” Worth auditing every readiness probe in a cluster against that bar.