A deadlock that never once reproduced in staging showed up in production the day we doubled write throughput. This is the postmortem.

The setup

Two code paths touching the same account took row locks on the same two tables in opposite orders: a payout job locked accounts then ledger_entries, while a refund job locked ledger_entries then accounts. This is a classic lock-order inversion, where each job holds its first lock while waiting for the one the other job already holds. It stayed invisible until enough concurrent traffic made that interleaving likely.
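To make the interleaving concrete, here is a minimal sketch in Python rather than SQL. It assumes in-process threading locks as stand-ins for the row locks, and the names simulate_inversion and worker are hypothetical, not taken from our codebase. A barrier forces the bad timing that production traffic produced by chance, and a lock timeout stands in for Postgres's deadlock detector so the demo finishes instead of hanging.

```python
import threading


def simulate_inversion(timeout=0.2):
    """Run a payout and a refund worker that take two locks in opposite orders.

    Returns {worker_name: True if it obtained its second lock}.
    """
    # Hypothetical in-process stand-ins for row locks on the two tables.
    locks = {"accounts": threading.Lock(), "ledger_entries": threading.Lock()}
    barrier = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with locks[first]:
            barrier.wait()  # both workers now hold their first lock
            ok = locks[second].acquire(timeout=timeout)
            if ok:
                locks[second].release()
            got_second[name] = ok
            barrier.wait()  # hold the first lock until both attempts are over

    threads = [
        threading.Thread(target=worker, args=("payout", "accounts", "ledger_entries")),
        threading.Thread(target=worker, args=("refund", "ledger_entries", "accounts")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second
```

Calling simulate_inversion() reports False for both workers: neither can get its second lock while the other still holds it. Real Postgres behaves differently at this point, since its deadlock detector aborts one of the transactions so the other can proceed.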

Why staging never caught it

Staging traffic was serial enough that the two code paths never truly overlapped. Production concurrency was the missing ingredient, and it is hard to fake in a load test without modeling real request timing.

The fix

We enforced a single lock order, alphabetical by table name, across every code path that touches more than one of these tables. We also added a Postgres deadlock counter to our dashboards so the next one pages someone instead of getting rediscovered a week later.
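The ordering rule can be sketched as a small helper, again using Python threading locks as stand-ins for row locks. TABLE_LOCKS, lock_order, locked, and run_both are hypothetical names for illustration, not our production code. Callers can list the tables in any order; the helper always acquires them alphabetically.

```python
import threading
from contextlib import ExitStack, contextmanager

# Hypothetical in-process stand-ins for row locks on the two tables.
TABLE_LOCKS = {
    "accounts": threading.Lock(),
    "ledger_entries": threading.Lock(),
}


def lock_order(tables):
    """Return the table names in the single canonical (alphabetical) order."""
    return sorted(set(tables))


@contextmanager
def locked(tables):
    """Acquire the locks for `tables` in canonical order; release in reverse."""
    with ExitStack() as stack:
        for name in lock_order(tables):
            stack.enter_context(TABLE_LOCKS[name])
        yield


def run_both():
    """Run payout and refund concurrently; return the names that finished."""
    done = []

    def job(name, tables):
        with locked(tables):
            done.append(name)

    threads = [
        threading.Thread(target=job, args=("payout", ["accounts", "ledger_entries"])),
        threading.Thread(target=job, args=("refund", ["ledger_entries", "accounts"])),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return sorted(done)
```

Because both jobs now take accounts first, whichever one loses the race simply waits for the other to finish, and no cycle can form. For the dashboard side, Postgres exposes a cumulative per-database deadlock count in the deadlocks column of pg_stat_database, which is one natural source for such a counter.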