The Fletcher Project: Hot Maintaining Communication Systems

Thursday, 7 August 2008

Hot Maintaining Communication Systems

A while ago Joe Armstrong posted a few simple ideas for hot maintenance on his blog (and I heart hot maintenance). It starts from a fair few assumptions, such as your clustering service being OK with nodes arbitrarily joining and leaving, and any central data or state storage being OK with heterogeneous nodes connecting to it. But hey - good design principles anyway, so for argument's sake let's just call those assumptions validated...

Someone posted a very relevant question in the comments - what about upgrades that require changes to the protocol between nodes? This is a vital issue to address, because if you can't come up with a good solution for hot maintaining your communications glue, there is a limit to how much overall hot maintenance you can practically achieve.

The answers to this question are as varied as design patterns themselves, but using the same set of assumptions we read into Joe's original post, here are three techniques that should be fairly portable:

Dual-headed. If you're changing transports you can introduce nodes that communicate using both protocols. You duplicate traffic (if you do it fairly unscientifically) but it's viable for a rolling upgrade.
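
As a rough illustration of the dual-headed idea, here is a minimal Python sketch. Everything in it is hypothetical: the `Transport` callable, the `DualHeadedSender` and `DedupingReceiver` names, and the assumption that every message carries a unique id. It shows a sender that writes to both transports and a receiver that drops the duplicate copy.

```python
from typing import Callable, List, Set

# Hypothetical transport signature: (message_id, payload) -> None
Transport = Callable[[str, bytes], None]


class DualHeadedSender:
    """Publishes every message on both the old and the new transport."""

    def __init__(self, old: Transport, new: Transport) -> None:
        self.transports = [old, new]

    def send(self, message_id: str, payload: bytes) -> None:
        # Naive duplication: every message goes out twice.
        for transport in self.transports:
            transport(message_id, payload)


class DedupingReceiver:
    """Accepts messages from either transport and delivers each id once."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()       # assumption: ids are unique per message
        self.delivered: List[bytes] = []

    def receive(self, message_id: str, payload: bytes) -> None:
        if message_id in self.seen:
            return  # already handled via the other transport
        self.seen.add(message_id)
        self.delivered.append(payload)


# Both "transports" end at the same receiver here, standing in for a node
# that listens on the old and the new protocol at the same time.
receiver = DedupingReceiver()
sender = DualHeadedSender(old=receiver.receive, new=receiver.receive)
sender.send("msg-1", b"hello")
sender.send("msg-2", b"world")
```

In a real system the `seen` set would need some form of expiry, and the duplication could be switched off node by node once the old transport has no listeners left.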

Versioned interfaces. If you're changing the message format, version the messages. This will let you gradually move nodes onto the new build and give you some added benefits like A/B testing, faster future upgrades and rollbacks.
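
One way to picture versioned messages is an envelope with a version number and a handler per version. The sketch below is only an assumption about how that might look: the JSON envelope, the `version` and `body` field names, and the two message shapes are all invented for illustration.

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[Dict[str, Any]], str]
HANDLERS: Dict[int, Handler] = {}


def handles(version: int) -> Callable[[Handler], Handler]:
    """Register a handler for one message version."""
    def register(fn: Handler) -> Handler:
        HANDLERS[version] = fn
        return fn
    return register


@handles(1)
def handle_v1(body: Dict[str, Any]) -> str:
    # Hypothetical v1 format: a single "name" field.
    return body["name"]


@handles(2)
def handle_v2(body: Dict[str, Any]) -> str:
    # Hypothetical v2 format: the name split into two fields.
    return f'{body["first"]} {body["last"]}'


def encode(version: int, body: Dict[str, Any]) -> bytes:
    """Wrap a body in a versioned envelope."""
    return json.dumps({"version": version, "body": body}).encode("utf-8")


def dispatch(raw: bytes) -> str:
    """Route a message to the handler for its declared version."""
    envelope = json.loads(raw)
    version = envelope.get("version", 1)  # assumption: unversioned means v1
    handler = HANDLERS.get(version)
    if handler is None:
        raise ValueError(f"unsupported message version: {version}")
    return handler(envelope["body"])
```

A node running the new build keeps the v1 handler registered until no v1 senders remain, which is also what makes a rollback cheap: the old format never stopped being understood.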

Translation gateway. So far we've assumed a group of nodes in a single location. For a distributed system with collections of nodes in distinct places, a gateway that speaks protocol 'old' on one side and protocol 'new' on the other might work best, letting you upgrade cluster-by-cluster without taking down the whole system.
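
A translation gateway can be sketched as a pair of pure translation functions plus a thin forwarding layer. The message shapes here (an 'old' format with one `name` field, a 'new' format with `first` and `last`) and the `TranslationGateway` class are made up for the example; real protocols will rarely map onto each other this cleanly.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]
Send = Callable[[Message], None]


def old_to_new(msg: Message) -> Message:
    """Translate the hypothetical 'old' format into the 'new' one."""
    first, _, last = msg["name"].partition(" ")
    return {"first": first, "last": last}


def new_to_old(msg: Message) -> Message:
    """Translate the hypothetical 'new' format back into the 'old' one."""
    return {"name": f'{msg["first"]} {msg["last"]}'.strip()}


class TranslationGateway:
    """Speaks 'old' on one side and 'new' on the other."""

    def __init__(self, send_to_old: Send, send_to_new: Send) -> None:
        self.send_to_old = send_to_old
        self.send_to_new = send_to_new

    def from_old(self, msg: Message) -> None:
        # Traffic arriving from a not-yet-upgraded cluster.
        self.send_to_new(old_to_new(msg))

    def from_new(self, msg: Message) -> None:
        # Traffic arriving from an upgraded cluster.
        self.send_to_old(new_to_old(msg))


# Two lists stand in for the clusters on either side of the gateway.
old_cluster: List[Message] = []
new_cluster: List[Message] = []
gateway = TranslationGateway(old_cluster.append, new_cluster.append)
gateway.from_old({"name": "Ada Lovelace"})
gateway.from_new({"first": "Joe", "last": "Armstrong"})
```

Keeping the translations as plain functions makes them easy to test on their own, and once every cluster has moved to the new protocol the gateway can simply be removed.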

To get the right answer for any given system, the first thing to do is study the communication pattern - how much is management vs. service traffic, what is node-to-node vs. node-to-DB, and how much crosses geographical locations? This gives you your architectural options, beyond which you just need to keep your state tracking heterogeneous, your data partitionable, and (to make your own life easier) your inconsistency window as long as the business rules allow.