We had forty-plus cron jobs feeding a nightly batch monolith. Migrating to an event-driven pipeline without a big-bang rewrite took most of a quarter.

Strangling the monolith

Rather than rewrite everything at once, we let new event types flow through SQS while old cron jobs kept running untouched. Each job got migrated only when its downstream consumer was ready, so the two systems ran side by side for weeks at a time.
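The per-job cutover can be pictured as a small routing table, sketched below under some assumptions. The names `JobRoute`, `consumer_ready`, `runner_for`, and `split_jobs` are hypothetical and not taken from our codebase; the point is only that each job flips independently once its consumer is live.

```python
from dataclasses import dataclass


@dataclass
class JobRoute:
    """Where a single job currently runs (hypothetical record)."""
    name: str
    consumer_ready: bool = False  # flipped once the downstream consumer is live


def runner_for(route: JobRoute) -> str:
    """Return 'events' for a migrated job and 'cron' for everything else."""
    return "events" if route.consumer_ready else "cron"


def split_jobs(routes):
    """Partition job names into (still on cron, moved to the event path)."""
    cron = [r.name for r in routes if runner_for(r) == "cron"]
    events = [r.name for r in routes if runner_for(r) == "events"]
    return cron, events
```

With a table like this, a cron wrapper could skip any job whose route says `events`, so flipping one flag moves one job and nothing else changes.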

Dead-letter handling from day one

Every queue got a dead-letter queue and an alert on it from the start. Retrofitting DLQs after the first stuck message is much more painful than building them in up front.
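As a rough illustration of that pairing, the sketch below builds the two pieces of configuration involved: the `RedrivePolicy` attribute that points a source queue at its DLQ, and parameters for a CloudWatch alarm that fires when the DLQ is non-empty. The helper names, the default of 5 receives, the alarm naming scheme, and the 300-second period are all assumptions for the example, not our actual settings.

```python
import json


def main_queue_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    """Build SQS queue attributes that redirect repeatedly failing messages.

    The result is meant to be passed as Attributes= to boto3's
    sqs.create_queue or sqs.set_queue_attributes (not called here).
    """
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,  # assumed default, tune per queue
    }
    # SQS expects RedrivePolicy as a JSON-encoded string, not a nested dict.
    return {"RedrivePolicy": json.dumps(redrive_policy)}


def dlq_alarm_params(dlq_name: str) -> dict:
    """Build keyword arguments for cloudwatch.put_metric_alarm (not called here).

    The alarm triggers as soon as any message is visible in the DLQ.
    """
    return {
        "AlarmName": f"{dlq_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": dlq_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

Generating both dictionaries from one place makes it harder to create a queue without its DLQ and alarm, which is the habit this section argues for.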

The payoff

Nightly batch latency dropped from hours to minutes for the jobs that moved. The ones that stayed on cron kept working exactly as before, with no coordination required between the two systems.