Saturday 13 December 2008

Failure Modes

Whenever we're building a product, we've got to keep in mind what might go wrong, rather than just catering mindlessly to the functional spec. That's because specifications are largely written for a parallel universe where everything goes as planned and nothing ever breaks, while we must write software that runs right here in this universe, with all its unpredictability, unintended consequences, and poorly behaved users.

Oh, and as I've said in the past, if you have a business owner who actually specs for failure modes, kiss them passionately now and never let them go. But for the rest of us, maybe it would help keep failure modes at the forefront of our minds as we work if we came up with some simplified categories to keep track of. How about internal, external, and human?

I'm not going to say too much about internal failure modes, because they are both the most commonly considered type and the one with the most existing solutions out there.

You could sum up internal failures by imagining your code operating autonomously in a closed environment. What might go wrong? You are essentially catering for quality here, and we have all sorts of test environments and unit tests to combat defects we might accidentally introduce through our own artifacts.

The key difference between external and internal failure modes is precisely what I said above - with internal failures, you are imagining that your code is operating in perfect isolation. If you are reading this, then I sincerely hope you rolled your eyes at that thought.

Let's assume that integration is part of internal, and we only start talking external forces when our product is out there running online. What might go wrong?

Occasionally I meet teams that are pretty good at detecting and reacting to external failures, and it pleases me greatly. Let's consider some examples: what if an external price list that your system refers to goes down? How about if a service intended to validate addresses becomes a black hole? What if you lose your entire internet connection?

Those examples are all about blackouts - total and obvious removal of service - so things are conspicuous by their absence. For bonus points, how are you at spotting brownouts? That's when things are 'up' but still broken in a very critical way, and the results can sometimes cost you far more than a blackout, as they can go undetected for a while...

Easy example - you subscribe to a feed for up-to-the-minute foreign exchange rates. For performance reasons, you probably store the most recent value for each currency you use in a cache or database, and read it from there per transaction. What happens if you stop receiving the feed? You could keep transacting for a very long time before you notice, and you will have disadvantaged either yourself or your customers by using out-of-date rates - neither of which is desirable.

Perhaps the feed didn't even stop. Perhaps the schema changed, in which case you'd still see a regular drop of data if you were monitoring the consuming interface, but you'd have unusable data - or worse - be inserting the wrong values against each currency.
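One way to catch this kind of brownout is to validate every message before it gets anywhere near the cache. Here is a minimal Python sketch of that idea. The message shape (a dict with "currency" and "rate" keys), the currency list, and the 10% sanity band are all assumptions I've made up for illustration, not anything a real feed provider defines.

```python
from __future__ import annotations

import math

# Assumed for illustration: the currencies we trade and how far a rate may
# plausibly move between two consecutive updates.
KNOWN_CURRENCIES = frozenset({"EUR", "GBP", "JPY"})
MAX_RELATIVE_MOVE = 0.10


class FeedValidationError(Exception):
    """Raised when a feed message does not look like what we expect."""


def validate_rate_message(message, last_rate: float | None = None) -> tuple[str, float]:
    """Return (currency, rate) if the message looks sane, otherwise raise."""
    try:
        currency = message["currency"]
        rate = float(message["rate"])
    except (KeyError, TypeError, ValueError) as exc:
        # Missing or renamed fields are the classic sign of a schema change.
        raise FeedValidationError(f"unexpected message shape: {message!r}") from exc

    if not isinstance(currency, str) or currency not in KNOWN_CURRENCIES:
        raise FeedValidationError(f"unknown currency: {currency!r}")
    if not math.isfinite(rate) or rate <= 0:
        raise FeedValidationError(f"implausible rate for {currency}: {rate!r}")
    if last_rate is not None and abs(rate - last_rate) / last_rate > MAX_RELATIVE_MOVE:
        # A huge jump may mean values are landing against the wrong currency.
        raise FeedValidationError(f"{currency} moved too far: {last_rate} -> {rate}")
    return currency, rate
```

A rejected message should raise an alert rather than be dropped silently, otherwise you've just swapped one brownout for another.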

Human failure modes are the least catered for in our profession, regardless of the fact they're just as inevitable and just as expensive. You could argue that 'human' is just another type of external failure, but I consider it fundamentally different due to one simple word - "oops".

To err is human and all that junk. We do stuff like set parameters incorrectly, turn off the wrong server, pull out the wrong disk, plug in the wrong cable, ignore system requirements, etc. - all with the best of intentions.

So what would happen if, say, a live application server is misconfigured to use a development database and then you unknowingly unleash real users upon it? You could spend a very long time troubleshooting it, or worse still it might actually work - and thinking about brownouts - how long will it be before you notice? For users who'd attached to that node, where will all their changes be, and how will you merge them back into the 'real' live data?

Humans can also accidentally fail to do things, and that has consequences for our system too. Consider our feed example - perhaps we just forgot to renew the subscription, and so we're getting stale or no data even though the system has done everything it was designed to do. Hang on, who was in charge of updating those SSL certificates?

Perhaps we don't think about maintenance mistakes up front because whenever we build something, we always picture ourselves performing the operational tasks. And to us, the steps are obvious and we're performing them, in a simplified world in our heads, without any other distractions competing for our attention. Again - not real life.

And so...
All of these things can be monitored, tested for, and caught. In our forex example, you might check the age of the data every time you read the exchange rate value in preparation for a transaction, and fail it if it exceeds a certain threshold (or just watch the age in a separate process).
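To make the age check concrete, here is a rough Python sketch. The five-minute threshold, the CachedRate record, and the function name are all hypothetical choices of mine; the right threshold depends on how often your feed is supposed to update. It also assumes the cache stores timezone-aware UTC timestamps alongside each rate.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumed threshold for illustration only.
MAX_RATE_AGE = timedelta(minutes=5)


class StaleRateError(Exception):
    """Raised when the cached rate is too old to transact against."""


@dataclass(frozen=True)
class CachedRate:
    currency: str
    rate: float
    received_at: datetime  # timezone-aware, UTC


def rate_for_transaction(
    cached: CachedRate,
    now: datetime | None = None,
    max_age: timedelta = MAX_RATE_AGE,
) -> float:
    """Return the cached rate, or fail the transaction if it is too old."""
    now = now or datetime.now(timezone.utc)
    age = now - cached.received_at
    if age > max_age:
        raise StaleRateError(
            f"{cached.currency} rate is {age} old, limit is {max_age}"
        )
    return cached.rate
```

The same comparison works just as well in a separate watchdog process that looks at the newest timestamp in the cache and raises an alert, if you'd rather not fail transactions outright.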

In our live server with test data example, you might mandate that systems and data sources state what mode they're in (test, live, demo, etc.) in their connection string - better yet, generate an alert if there is a state mismatch in the stack (or segment your network so communication is not possible).
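As a sketch of the mismatch check, assume a home-grown convention where every connection string carries a "mode=" entry among its semicolon-separated key=value pairs. That key is hypothetical; it is not a feature of any particular database driver, and the function names here are mine too.

```python
from __future__ import annotations


class ModeMismatchError(Exception):
    """Raised when a server and its data source disagree about their mode."""


def parse_connection_string(conn: str) -> dict[str, str]:
    """Split 'key=value;key=value' into a dict with lower-cased keys."""
    settings = {}
    for part in conn.split(";"):
        if "=" not in part:
            continue  # skip empty or malformed segments
        key, value = part.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


def check_mode(server_mode: str, conn: str) -> None:
    """Refuse to continue unless the data source declares the same mode."""
    declared = parse_connection_string(conn).get("mode")
    if declared is None:
        raise ModeMismatchError("data source does not declare a mode")
    if declared.lower() != server_mode.lower():
        raise ModeMismatchError(
            f"server is '{server_mode}' but data source is '{declared}'"
        )
```

Run at startup, before any real users are let in, a call like check_mode("live", "host=db01;database=orders;mode=test") raises immediately instead of quietly writing live changes into a test database.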

The question isn't whether there are solutions; the question is how far is far enough.

As long as you think about failure modes in whatever way works for you, and make a pragmatic judgement on each risk using likelihood and impact to determine how many monitors and fail-safes it's worth building in, then you'll have done your job significantly better than the vast majority of engineers out there - and your customers will thank you for it with their business.
