Sunday 7 March 2010

More Meaningful SLAs

Establishing internal service levels with the rest of the business is a difficult process - there are so many variables that can be measured and, as we all know, you change what you measure by measuring it. For example, if you express your SLA exclusively in terms of system uptime, then you improve all the activities around keeping your system available. The flipside of this is that you often discourage the activities around effecting change in the system - after all, any release, upgrade or new feature carries some risk to availability, and availability is what the team is being measured on.

The place to start is to work out what's important to the organisation. Performance and availability are critical to us (a latency-sensitive transactional platform with variable usage patterns) but so is change (a content-driven web application correlated with events in the real world). We decided that performance, availability, change, and support response were the key metrics for us. Nothing unique so far; next we had to interpret each of these in a way that was relevant to our various systems.

A basic principle here is recognising that it isn't just the raw numbers that should be appropriate to each individual product, but what is being measured too. Throwing an overall value at the problem (for example 99% availability across the board) makes the job of putting together your SLA easier, but is it a true reflection of your infrastructure? Whenever I've seen this coarse-grained approach used it has always led to less than acceptable uptime for the most critical applications and wasted investment propping up others that are realistically less important.

Another way to make sure your SLAs closely match business need is to introduce the dimension of time. In many systems and many organisations demand - and the cost of downtime - varies over time. For example, how many accounts and payroll systems are used around the clock? If you can trade off to 'best endeavours' over weekends and evenings then you shouldn't have too much trouble meeting a five nines commitment during business hours between Monday and Friday.
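To make the time dimension concrete, here is a minimal sketch of how availability might be scored when only business hours count. The 09:00-17:00 Monday to Friday window, the function names and the sample outages are all assumptions for illustration, not our actual definitions, and it assumes the outages passed in do not overlap one another.

```python
from datetime import datetime, time, timedelta

# Assumed business window for illustration: 09:00-17:00, Monday to Friday.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


def business_seconds(start, end):
    """Seconds between start and end that fall inside the business window."""
    total = 0.0
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            window_start = datetime.combine(day, BUSINESS_START)
            window_end = datetime.combine(day, BUSINESS_END)
            overlap_start = max(start, window_start)
            overlap_end = min(end, window_end)
            if overlap_end > overlap_start:
                total += (overlap_end - overlap_start).total_seconds()
        day += timedelta(days=1)
    return total


def business_hours_availability(period_start, period_end, outages):
    """Availability (%) counting only downtime inside the business window.

    outages is a list of (start, end) datetime pairs, assumed non-overlapping.
    """
    in_scope = business_seconds(period_start, period_end)
    if in_scope == 0:
        raise ValueError("reporting period contains no business hours")
    down = sum(
        business_seconds(max(s, period_start), min(e, period_end))
        for s, e in outages
    )
    return 100.0 * (1 - down / in_scope)


# Hypothetical week: one long Saturday outage, one hour-long outage on Tuesday
# of which only 30 minutes falls inside business hours.
week_start = datetime(2010, 3, 1)  # a Monday
week_end = datetime(2010, 3, 8)
outages = [
    (datetime(2010, 3, 6, 0, 0), datetime(2010, 3, 6, 12, 0)),
    (datetime(2010, 3, 2, 16, 30), datetime(2010, 3, 2, 17, 30)),
]
print(business_hours_availability(week_start, week_end, outages))
```

In this made-up week the twelve-hour Saturday outage costs nothing against the target, while the half hour on Tuesday afternoon is what actually counts.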

For our website we have a flat availability target (such is the nature of a 24x7 site), and we interpreted performance as a latency metric for price publishing and order placement. For reporting systems - which do not experience the same round-the-clock demands - we have different availability targets during business hours and after hours. Performance in the context of those systems is interpreted as a certain set of daily reports delivered by a fixed time each morning, plus a message delivery SLA for alerts on certain events. SLAs around change and product delivery are much more complicated and fraught with subjective measures. We've gone with measuring development projects iteration by iteration: what got delivered vs. what was committed during that sprint's planning. It's objective, and it encourages good estimation and strict control of scope creep during a sprint.
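As a rough sketch of the delivered-vs-committed measure, something like the following would do. The function name, the item identifiers and the idea of reporting unplanned work separately are illustrative assumptions rather than a description of our actual tooling.

```python
def sprint_delivery(committed, delivered):
    """Compare what was delivered in a sprint against what was committed.

    Both arguments are iterables of work item identifiers (hypothetical here).
    """
    committed, delivered = set(committed), set(delivered)
    if not committed:
        raise ValueError("nothing was committed for this sprint")
    done = committed & delivered
    return {
        # Share of the committed items that were actually delivered.
        "delivery_ratio": len(done) / len(committed),
        # Committed at planning but not delivered.
        "missed": sorted(committed - delivered),
        # Delivered without being committed: a possible sign of scope creep.
        "unplanned": sorted(delivered - committed),
    }


# Hypothetical sprint: four items committed, three of them delivered,
# plus one item that crept in after planning.
result = sprint_delivery(
    committed=["A", "B", "C", "D"],
    delivered=["A", "B", "C", "E"],
)
print(result)
```

Counting items rather than estimated effort keeps the sketch simple; weighting by story points would be an obvious variation.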

Making SLAs commensurate with what the business genuinely demands from a given piece of technology is important. Setting your sights high can seem like a good idea on the surface but, when you consider the frightening magnitude of difference in cost between 99.5% and 99.9% uptime, those extra few tenths of a percentage point can only ever be described as waste if they are not intimately linked to the organisation's success.
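To put numbers on that gap, the arithmetic below converts an availability target into the downtime it permits. It assumes a flat 365-day, round-the-clock year; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year measured around the clock


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return (1 - availability_pct / 100) * period_hours


for target in (99.5, 99.9):
    hours = allowed_downtime_hours(target)
    print(f"{target}% uptime allows {hours:.2f} hours of downtime per year")
```

That works out at roughly 43.8 hours a year at 99.5% against about 8.76 hours at 99.9%, so the tighter target leaves a fifth of the downtime budget.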
