A brutal outage once took us 90 minutes to diagnose. We were flying blind — no clear dashboards, alerts that cried wolf, and a team scrambling through logs with no shared playbook.
So we instrumented everything, added Grafana dashboards that actually pointed to root causes, and pruned the noisy alerts that had trained everyone to ignore them.
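To make "instrumented everything" concrete, here's a minimal sketch of the kind of metrics that feed a root-cause dashboard, assuming a Python service and the prometheus_client library; the metric names and the /work handler are illustrative, not our actual code.

```python
# A sketch of per-endpoint latency and error instrumentation; names are
# illustrative, assuming a Python service exporting Prometheus metrics.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "app_request_duration_seconds",
    "Request latency, labeled by endpoint so a dashboard can isolate the slow path.",
    ["endpoint"],
)
REQUEST_ERRORS = Counter(
    "app_request_errors_total",
    "Request failures, labeled by error type to point at a root cause.",
    ["endpoint", "error_type"],
)

def handle_work() -> None:
    """A stand-in request handler wrapped with the two metrics above."""
    with REQUEST_LATENCY.labels(endpoint="/work").time():
        try:
            if random.random() < 0.05:  # simulate an occasional upstream failure
                raise TimeoutError("upstream timed out")
            time.sleep(random.uniform(0.01, 0.1))  # simulate real work
        except TimeoutError:
            REQUEST_ERRORS.labels(endpoint="/work", error_type="timeout").inc()
            raise

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes metrics from :8000/metrics
    while True:
        try:
            handle_work()
        except TimeoutError:
            pass
```

The labels are what make dashboards point at causes instead of symptoms: a latency spike scoped to one endpoint, or an error counter scoped to "timeout", narrows the search before anyone opens a log file.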
Months later, a similar bug surfaced. This time we fixed it in under 10 minutes. Same team, same systems, just better observability. Turns out chaos becomes a checklist when you can actually see what's happening.