Retry & Circuit Breakers: Keeping the BFF Breathing

April 7, 2021

Our Gojek mobile app talks to a single BFF (backend-for-frontend), and that BFF fans out to half a dozen downstream services: Driver-Location, Pricing, Promotions, Payments, you name it. When even one of those services sneezed, the whole ride-booking flow caught a cold. Loaders spun, customers retried, and on-call phones buzzed.

The fix that made us wonder why we didn't do it sooner

Most failures were momentary: GC pauses, brief network drops. A simple retry with jittered back-off, where each wait includes a bit of random delay so retries don't arrive in lockstep, succeeded 80-90% of the time. For the persistent failures, we added a circuit breaker per dependency. After N consecutive failures, we opened the breaker, stopped hammering the sick service, and served a graceful fallback instead. Healthy services kept answering. The app no longer froze whenever one spoke in the wheel locked up.
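To make the two patterns concrete, here is a minimal Python sketch of how they might fit together. It is not our production code. Every name in it (CircuitBreaker, retry_with_jitter, fetch_with_fallback) and every default (three attempts, a 30-second cool-down, exponentially growing delays with full jitter) is an assumption chosen for illustration. The sketch is also single-threaded; a real BFF would need locking or an async-aware breaker.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures (hypothetical defaults)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # cool-down in seconds before a trial call
        self.clock = clock              # injectable so tests can control time
        self.failures = 0
        self.opened_at = None           # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the sick dependency at all.
                raise CircuitOpenError("breaker open")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and resets the failure streak.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_jitter(fn, attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call fn up to `attempts` times, sleeping a random delay between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Assumed policy: the delay cap doubles per attempt, and the
            # actual wait is drawn uniformly from [0, cap] ("full jitter").
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))


def fetch_with_fallback(breaker, fn, fallback, **retry_kwargs):
    """Retry inside the breaker; return `fallback` if the call cannot succeed."""
    try:
        return breaker.call(lambda: retry_with_jitter(fn, **retry_kwargs))
    except Exception:
        # Covers both an open breaker and retries that were exhausted.
        return fallback
```

In this arrangement the breaker wraps the retried call, so it counts one failure only after all attempts for a request are used up. Usage would be one breaker per dependency, for example a hypothetical pricing_breaker = CircuitBreaker() passed to fetch_with_fallback together with the pricing call and a cached or default quote as the fallback.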

What stuck with me

Retries turn flakes into non-events. Two or three attempts rescue most transient blips. Circuit breakers protect the herd — fail fast, fall back, let downstreams recover. But the real lesson was about user trust. A single spinner feels like an eternity to someone trying to book a ride. Silent resilience feels like magic.

A handful of resilience patterns — less code than the promo banner widget — gave millions of riders a smoother experience and gave on-call engineers their weekends back.