Thursday 2 October 2008

Scheduled Reboots and Nature's Way

One of the basic aspects of a biological computing mindset is the appreciation that nothing lasts forever. Everything degrades, corrupts, and dies over time - and that is perfectly normal, because it's duly replaced by a fresh-faced youngster, eager to service the rest of the organism [system] from a nice, fresh cellular structure [empty memory space].

This applies to systems in exactly the same way as it does to organisms. How many issues can you recall where memory leaks, counter errors, and freaky edge conditions appeared only after a server had been running for exactly X long, or after a service had processed more than Y connections? I'm sure we could swap tales of woe late into the evening.
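To put a number on the "exactly X long" kind of failure: a fixed-width counter ticking at a fixed rate wraps after a predictable interval. The sketch below is purely illustrative. The function name and the tick rates are assumptions chosen for the example, not taken from any particular system.

```python
SECONDS_PER_DAY = 86_400


def days_until_wrap(counter_bits: int, ticks_per_second: int) -> float:
    """Days of uptime before an unsigned counter of this width wraps to zero.

    Hypothetical helper: assumes the counter starts at zero at boot and
    increments at a constant rate.
    """
    return (2 ** counter_bits) / ticks_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # A 32-bit counter of milliseconds, and one of 10 ms ticks.
    for rate in (1000, 100):
        print(f"32-bit counter at {rate} ticks/s wraps after "
              f"{days_until_wrap(32, rate):.1f} days")
```

A box that gets recycled every few weeks never reaches either limit, so the bug never gets the chance to fire.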

This being the case, why do we feel this rottweiler-like dedication to keeping individual devices going for the longest possible duration? I think there are two sources: a kind of point-scoring pride engendered by the output of the "uptime" command, and good old-fashioned poor system design. Perhaps one even leads to the other...

So - we design systems poorly. If we want a product to be available, why do we build it in a way that its availability depends upon a piece of tin that we accept is inherently unreliable? Build it like that and the application is the server. This means the only way we can increase its availability is by increasing the availability of the underlying hardware. Not only is this expensive, it's doomed to failure because, as we accepted, servers grow old. So we spend a lot of time and money trying to achieve something we have already decided we cannot. No wonder we're so excited when that uptime counter rolls over to a nice big number!

Do you know what would be better? Accepting that product availability - the uptime of the whole system overall - is what we're really reaching for, and besides, it's how our customers will measure us. Next we need to apply this philosophy to how we design systems, let go of our attachment to keeping individual servers on life support, and put together services that don't rely on any one node, network, or storage device in order to serve our customers.
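A rough way to see why this pays off is to compare one box against a handful of interchangeable ones. This is a minimal sketch under loud assumptions: nodes fail independently of each other, and any single surviving node can carry the whole service. Neither is guaranteed in a real system, and the function name is made up for the example.

```python
def service_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` interchangeable nodes is up.

    Assumes independent failures and that one node is enough to serve.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if nodes < 1:
        raise ValueError("need at least one node")
    # The service is down only when every node is down at the same time.
    return 1.0 - (1.0 - node_availability) ** nodes


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} node(s) at 99% each -> {service_availability(0.99, n):.6f}")
```

Under those assumptions, the cheap route to a more available product is more ordinary nodes, not one heroic node.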

If you can master that arcane art, then you'll be able to recycle resources arbitrarily, at any time, even when there is absolutely nothing wrong - because doing so helps keep it that way.
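As a sketch of what "arbitrarily recycle resources" might look like in practice, here is a rolling recycle that takes one node out at a time and refuses to drop below a minimum number in service. Everything here is hypothetical: the node names, the `restart` callback (which stands in for whatever actually reboots a node and reports whether it came back), and the `min_in_service` threshold.

```python
from typing import Callable, List


def rolling_recycle(nodes: List[str], min_in_service: int,
                    restart: Callable[[str], bool]) -> List[str]:
    """Restart nodes one at a time; return the nodes that did not come back.

    `restart` is a caller-supplied function returning True when the node
    rejoined the pool after its restart.
    """
    if len(nodes) - 1 < min_in_service:
        raise ValueError("not enough spare capacity to take a node out")
    in_service = set(nodes)
    failed: List[str] = []
    for node in nodes:
        # Stop if removing another node would breach the minimum.
        if len(in_service) - 1 < min_in_service:
            break
        in_service.discard(node)
        if restart(node):
            in_service.add(node)
        else:
            failed.append(node)
    return failed


if __name__ == "__main__":
    pool = ["web1", "web2", "web3"]
    # Pretend every restart succeeds.
    print(rolling_recycle(pool, min_in_service=2, restart=lambda n: True))
```

The point of the capacity check is that a node failing to return is an inconvenience to investigate, not an outage.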

Oh, and you'll never be that guy with the box that's been going so long he's scared to reboot it just in case it doesn't come back!
