It started as a routine schema migration: add a column, backfill, carry on. Instead, our ALTER TABLE queued behind a long-running transaction holding row-level locks on the rides table, and because ALTER TABLE requests an ACCESS EXCLUSIVE lock, every subsequent read and write queued up behind it in turn. (Postgres never escalates row locks to table locks; the table-level lock came from the DDL itself.) For three full hours the rides table was effectively frozen, and with it our ride-hailing platform.
What went wrong
We hadn't set lock_timeout or statement_timeout, so the migration waited indefinitely for its lock, and every query behind it waited too. The pending lock also starved autovacuum: workers kept cancelling themselves to yield, dead tuples piled up, and without monitoring on pg_stat_user_tables we stayed blind until connections maxed out. To make things worse, the migration opened its lock during the lunch rush. With thousands of active rides, the backlog ballooned in minutes.
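The dead-tuple blindness is the easiest part to fix: pg_stat_user_tables already exposes the counters, we just weren't looking. A minimal sketch of the kind of query we now alert on; the threshold is illustrative, not a recommendation.

```sql
-- Surface tables whose dead tuples are accumulating and whose autovacuum
-- has not run recently. n_dead_tup, n_live_tup, and last_autovacuum all
-- come straight from the pg_stat_user_tables view.
SELECT relname,
       n_dead_tup,
       n_live_tup,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 100000        -- alert threshold: an assumption, tune per table
ORDER BY n_dead_tup DESC;
```

Feeding this into Grafana on a short interval would have shown the rides table going stale within minutes rather than hours.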
What we changed
We added safe defaults — lock_timeout = '10s' and statement_timeout = '60s' on every service connection. We set up alerts on autovacuum stalls and dead-tuple thresholds via Grafana. We adopted online migration patterns, rolling out schema changes in phases: ADD NULLABLE, then batched backfill, then SET NOT NULL. And we introduced a migration window — all DDL changes happen during the lowest-traffic hour, with a traffic guard that aborts if concurrency spikes.
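The phased pattern above can be sketched in plain Postgres SQL. This is a sketch, not our actual migration: the column name surge_fee and the batch size are illustrative assumptions.

```sql
-- Safe session defaults: fail fast instead of queueing forever.
SET lock_timeout = '10s';
SET statement_timeout = '60s';

-- Phase 1: add the column as nullable. This needs only a brief
-- ACCESS EXCLUSIVE lock and (on Postgres 11+) no table rewrite.
ALTER TABLE rides ADD COLUMN surge_fee numeric NULL;  -- column name is illustrative

-- Phase 2: backfill in small batches, each in its own transaction,
-- so no single statement holds row locks for long. Repeat until
-- the UPDATE reports 0 rows affected.
UPDATE rides
SET surge_fee = 0
WHERE id IN (
    SELECT id FROM rides
    WHERE surge_fee IS NULL
    LIMIT 5000                   -- batch size is an assumption, tune to workload
);

-- Phase 3: only once the backfill is complete, add the constraint.
ALTER TABLE rides ALTER COLUMN surge_fee SET NOT NULL;
</imports>
```

One caveat worth knowing: SET NOT NULL still scans the table to validate unless a matching CHECK constraint already exists, so on very large tables adding a CHECK (surge_fee IS NOT NULL) NOT VALID constraint and validating it separately keeps the ACCESS EXCLUSIVE hold short.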
The cultural lesson
Instead of pointing fingers, we gathered everyone involved for a blameless post-mortem. We asked "what in our process allowed this to happen?" rather than "who caused it?" That mindset kept the team open and honest, surfaced hidden assumptions, and turned a painful outage into a shared learning experience. From then on, "fix the system, not the person" became a standing principle in our incident reviews.
Three hours of silence from the database felt like an eternity. But the incident burned these habits into muscle memory — time-box your locks, monitor what matters, treat migrations like production releases, and learn without blame.