You’re right. Panicking can be a lazy way to do error handling. However, that’s not what I'm advocating for here. Panicking in specific, unrecoverable situations is OK.
To be clear, panicking does not skip deferred code; that is why we use a defer to perform a recover(). As mentioned in my reply to Dele, you must have a recover at the top level of the request (or item of work) so that failures do not have any effect on other requests, jobs, etc. And certainly not on the health of the service itself.
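To make that concrete, here is a minimal sketch of the pattern I mean. The names (`handleRequest`, `safeHandle`) and the simulated failure are hypothetical, just to show that one panicking request does not take down the others:

```go
package main

import (
	"fmt"
	"log"
)

// handleRequest is a hypothetical unit of work that panics on an
// unrecoverable failure (here simulated for request 2).
func handleRequest(id int) {
	if id == 2 {
		panic(fmt.Sprintf("request %d: database timed out", id))
	}
	log.Printf("request %d: done", id)
}

// safeHandle is the top-level boundary: the deferred recover runs even
// when handleRequest panics, so the failure is contained to this one
// request and reported as an ordinary failure.
func safeHandle(id int) (failed bool) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("request failed: %v", r)
			failed = true
		}
	}()
	handleRequest(id)
	return false
}

func main() {
	for id := 1; id <= 3; id++ {
		safeHandle(id)
	}
	fmt.Println("service still alive")
}
```

Request 2 panics, its recover fires, and requests 1 and 3 (and the service) carry on.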
I understand what you’re saying about how there’s always a way to handle an error. In theory, yes. In practice, usually no. Going back to the core of my argument: if you have a way to recover, great, do that. However, by doing so you have just made it an expected error, so panicking is no longer acceptable in those situations.
To me, things like exponential backoff, API retries, etc. inside the work itself make for more brittle, unpredictable, and hard-to-test production code. Let’s use the example of a flaky database. Say 1 in 10 requests time out because the database is under too much load. A retry might sound reasonable, but it introduces a few new problems (it's whack-a-mole):
- If the database is under too much load, retrying doesn’t alleviate that; it might just retry indefinitely and fail every time. So what do you do after 3 failures? 10 failures? Eventually you have to decide on a point where the work cannot be done, which puts you right back in the same spot.
- Let’s say you use exponential backoff to throttle database load. The problem hasn’t simply gone away; it has moved. For example, your service might accumulate a huge spike of parallel jobs, all waiting to retry. Eventually your service will snowball and die for other reasons.
- Clients may not want (or be able) to wait forever. If the request cannot be done now, fail, respond quickly, and move on.
- If a SQL statement fails, retrying may not even be possible. If that part of the request is running in a transaction, the database will immediately fail all new SQL statements until the end of the transaction anyway.
It’s absolutely OK to use these techniques (retries, backoff) to make your software more resilient. However, that code needs to sit a tier above the code performing the work. Within the request/work itself, you must fail fast and move on.
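Here is a rough sketch of what I mean by "a tier above". `flakyWork` stands in for the actual work and has no retry logic of its own; `withRetry` is a separate, bounded wrapper that the caller applies. All of the names and the failure counter are made up for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// flakyWork stands in for the unit of work (e.g. a database call).
// It fails fast and contains no retry logic; the counter simulates
// a database that times out a few times before recovering.
func flakyWork(failures *int) error {
	if *failures > 0 {
		*failures--
		return errors.New("database timeout")
	}
	return nil
}

// withRetry sits a tier above the work: a bounded retry with
// exponential backoff. Crucially, it gives up after a fixed number
// of attempts rather than retrying forever.
func withRetry(attempts int, base time.Duration, work func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = work(); err == nil {
			return nil
		}
		time.Sleep(base << i) // backoff: base, 2*base, 4*base, ...
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	failures := 2
	err := withRetry(3, time.Millisecond, func() error {
		return flakyWork(&failures)
	})
	fmt.Println("err:", err)
}
```

The work stays simple and testable on its own, and the retry policy (attempt count, backoff) lives in one place at the outer tier where it can be tuned or removed without touching the work itself.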
On a final note: logs, metrics, call stacks, etc. will survive a panic because they are sent somewhere outside the scope of the work.
I hope that provides you with some more context on the idea I was trying to convey.