We ran self-hosted Kafka for two years before moving to a managed service. Here’s what would have saved us the most pain.
Partition count is a one-way door
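Kafka lets you increase the partition count on a topic but never decrease it; shrinking means recreating the topic. Increasing it after the fact is not free either: the default partitioner picks a partition by hashing the key modulo the partition count, so existing keys can start landing on a different partition, which breaks key-based ordering for any consumer relying on it. We should have over-provisioned partitions from day one instead of “right-sizing” early.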
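To see why adding partitions disturbs key-based ordering, here is a minimal sketch of hash-modulo partitioning. It uses CRC32 as a stand-in hash (Kafka's default partitioner uses murmur2, so the actual partition numbers will differ), and the key names and the 6-to-8 change are made up for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing modulo the partition count.

    Assumption: CRC32 stands in for Kafka's murmur2 hash. The modulo
    step is the part that matters for this illustration.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys whose partition changes when the count changes."""
    return [
        k for k in keys
        if partition_for(k, old_count) != partition_for(k, new_count)
    ]


# Hypothetical keys; any stable set of keys shows the same effect.
keys = [f"order-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=6, new_count=8)
print(f"{len(moved)} of {len(keys)} keys change partition going from 6 to 8")
```

Any key in `moved` has older records on one partition and newer records on another, so a consumer can no longer count on seeing that key's records in order.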
Consumer lag is the metric that matters
CPU and memory on the brokers looked fine right up until checkout started timing out. Consumer lag was the metric that actually predicted the incident, and we didn’t have it on a dashboard yet.
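For reference, consumer lag per partition is the gap between the newest offset in the log and the offset the group has committed. The sketch below computes it from two plain dictionaries; how you fetch those offsets depends on your client or tooling, and the function names, the threshold, and the choice to treat a missing commit as offset 0 are assumptions made for illustration.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int],
                 committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus committed offset.

    Assumption: a partition with no committed offset is treated as
    committed at 0, which overstates lag for brand-new groups.
    """
    lag = {}
    for partition, end in end_offsets.items():
        done = committed.get(partition, 0)
        # Clamp at zero in case the two snapshots were taken at
        # slightly different times.
        lag[partition] = max(end - done, 0)
    return lag


def should_alert(lag: Dict[int, int], threshold: int) -> bool:
    """Alert when total lag across partitions exceeds the threshold."""
    return sum(lag.values()) > threshold


# Hypothetical snapshot for a three-partition topic.
lag = consumer_lag(
    end_offsets={0: 1200, 1: 950, 2: 400},
    committed={0: 1150, 1: 950},
)
print(lag, should_alert(lag, threshold=300))
```

Total lag is a coarse signal; in practice the trend (lag growing over several minutes) tends to be a better alert condition than any single threshold.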
Rebalances can cascade
A single slow consumer set off a rebalance storm that took down checkout for six minutes. Every consumer in the group paused processing during the rebalance, and our timeouts were tuned tight enough that a slow rebalance looked like a full outage downstream. Static membership fixed it.
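Static membership is enabled on the consumer by giving each instance a stable `group.instance.id`; a member that comes back with the same id within `session.timeout.ms` rejoins without forcing a group-wide rebalance. The sketch below builds such a config as a plain dictionary. The helper name, the group and instance ids, and the timeout value are illustrative assumptions, not the settings we ran.

```python
def static_member_config(group_id: str,
                         instance_id: str,
                         session_timeout_ms: int = 30000) -> dict:
    """Build consumer settings for static group membership.

    The keys are standard Kafka consumer config names; the values
    are placeholders to tune for your own workload.
    """
    return {
        "group.id": group_id,
        # Must be unique per consumer instance and stable across
        # restarts (for example, derived from a pod or host name).
        "group.instance.id": instance_id,
        # How long the broker waits for a missing member before it
        # rebalances the group.
        "session.timeout.ms": session_timeout_ms,
    }


# Hypothetical ids for one checkout consumer instance.
config = static_member_config("checkout-consumers", "checkout-consumer-0")
print(config)
```

The trade-off is that a consumer that has actually died keeps its partitions until the session timeout expires, so a longer timeout means fewer rebalances but slower failover.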