Our Gojek mobile app talks to a single backend-for-frontend (BFF), and that BFF fans out to half a dozen downstream services: Driver-Location, Pricing, Promotions, Payments, you name it. When even one of those services sneezed, the whole ride-booking flow caught a cold. Loaders spun, customers retried, and on-call phones buzzed.
The fix that made us wonder why we didn't do it sooner
Most failures were momentary: GC pauses, brief network drops. A simple retry with jittered back-off, where each attempt waits a little longer than the last plus a bit of random delay so clients don't all retry in lockstep, succeeded 80-90% of the time. For persistent failures, we added a circuit breaker per dependency. After N consecutive failures, we opened the breaker, stopped hammering the sick service, and served a graceful fallback instead. Healthy services kept answering. The whole app no longer froze because one spoke in the wheel locked up.
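For readers who want to see the shape of it, here is a minimal Python sketch of the two patterns. It is not our production code. The names (retry_with_jitter, CircuitBreaker, fetch_price, the fallback payload) and all the defaults are made up for illustration, and the cooldown that lets a trial call through after the breaker opens is an assumption on my part; the story above only says the breaker opened after N consecutive failures.

```python
import random
import time


def retry_with_jitter(call, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Run call() up to `attempts` times, pausing a random, growing delay between tries."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Jitter: pick a random delay below an exponentially growing cap.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


class CircuitBreaker:
    """Opens after `threshold` consecutive failures (the "N" in the story).

    Assumption: once `cooldown` seconds have passed, one trial call is let through.
    Not thread-safe; a real service would guard the counters with a lock.
    """

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        still_cooling = (
            self.opened_at is not None
            and self.clock() - self.opened_at < self.cooldown
        )
        if still_cooling:
            return fallback()  # fail fast: do not touch the sick service
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success: reset the failure count and close the breaker.
        self.failures = 0
        self.opened_at = None
        return result


# One breaker per dependency, with the retry nested inside it.
pricing_breaker = CircuitBreaker(threshold=3, cooldown=10.0)


def get_price(fetch_price):
    """`fetch_price` is a hypothetical zero-argument callable that hits Pricing."""
    return pricing_breaker.call(
        lambda: retry_with_jitter(fetch_price),
        fallback=lambda: {"price": None, "degraded": True},
    )
```

Nesting the retry inside the breaker means a request only counts as a failure once its retries are exhausted, so a one-off blip never moves the breaker toward open. Whether that is the right ordering depends on how expensive the downstream call is; treat it as one reasonable choice, not the only one.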
What stuck with me
Retries turn flakes into non-events. Two or three attempts rescue most transient blips. Circuit breakers protect the herd — fail fast, fall back, let downstreams recover. But the real lesson was about user trust. A single spinner feels like an eternity to someone trying to book a ride. Silent resilience feels like magic.
A handful of resilience patterns — less code than the promo banner widget — gave millions of riders a smoother experience and gave on-call engineers their weekends back.