Wednesday, 2 July 2008

Teach Your Computer to Say No

As good engineers we always try to build systems that have enough capacity to service all our customers' requests, all the time, anytime.  It's an admirable goal and very commercially sensible, but there is no such thing as unlimited capacity and sooner or later we'll get our scalability wrong, suffer partial failure or maybe it won't be cost effective to scale up to those rare usage peaks - what will the user experience be then?

By default it will be things like: page cannot be displayed, connection timed out or our staple favorites 404 and 500 - basically the middle finger of the internet.  You have to think about how your systems will behave under unexpected load and what you'd want your customers to see when it happens.  There are a few things you can keep in mind when building your products that, if properly addressed, will lead to a very different user experience when things get tough:

Customise your error messages

Those unexpected peaks or partial failures are always going to happen sooner or later, so why not serve up a nicely formatted (but lightweight) page with alternative links or customer service contact details rather than a nasty, cryptic error message?  It is so easy and cheap to do and makes such a huge difference to user perception that there really is no reason not to.
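To give a feel for how little is involved, here is a minimal sketch of a lightweight error response.  The function name friendly_error_page, the wording of the page and the 30 second Retry-After value are all my own assumptions for illustration, not anything from a particular framework; in practice you would plug something like this into whatever error hook your web server or framework provides.

```python
from html import escape


def friendly_error_page(status: int, contact: str, links: dict[str, str]):
    """Build a small, static error response (hypothetical helper).

    Returns (status, headers, body). The body is plain HTML with no images,
    scripts or database lookups, so it stays cheap to serve when the system
    is already struggling.
    """
    items = "".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for label, url in links.items()
    )
    body = (
        "<!doctype html><html><head><title>We're a bit busy</title></head><body>"
        "<h1>Sorry, we can't serve that page right now</h1>"
        f"<p>Please try again shortly, or contact us: {escape(contact)}</p>"
        f"<ul>{items}</ul></body></html>"
    )
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Retry-After": "30",        # assumed value: hint to clients when to retry
        "Cache-Control": "no-store",  # don't let caches keep serving the error
    }
    return status, headers, body
```

Called as friendly_error_page(503, "help@example.com", {"Status": "/status"}), it hands back a status, a handful of headers and a page small enough to serve from almost anywhere in the stack.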

Queue or throttle excess traffic

What's better than a nicer error message?  Some kind of service - even if it doesn't match our ideals.  Queuing is a nice solution: it protects your in-flight users just like denial would and provides a better experience for those new arrivals who would otherwise push you over the edge and get an error.

A holding page, an alternative URL or maybe a nice countdown with an automatic refresh is all you need.  These things can be a little more time-consuming to build because your application needs to have some awareness of its environment if you expect it to make decisions about what to serve up based on remaining capacity.  There are some very easy ways to buy this back in hardware if you're using load balancers like F5, Netscaler, or Redline - a little experience will tell you how many users/connections/Mbps each node can tolerate, and you can configure an alternative page or redirect for anything above that threshold.  Depending on what you have installed this can even be served directly from the device cache, making it even lower impact for your over-busy system.  Queuing is just like its real-life namesake: "line up here and wait for access to the system".
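If you don't have a load balancer to do this for you, the application-side version can be sketched roughly as below.  AdmissionGate, handle and the 15 second refresh are names and numbers I've made up for the example, and strictly speaking this is a holding page with automatic retry rather than a fair first-come-first-served queue; it only shows the idea of counting in-flight requests against a known threshold.

```python
import threading


class AdmissionGate:
    """Admit at most max_active concurrent requests (hypothetical sketch)."""

    def __init__(self, max_active: int, refresh_seconds: int = 15):
        self.max_active = max_active
        self.refresh_seconds = refresh_seconds
        self._active = 0
        self._lock = threading.Lock()

    def try_enter(self) -> bool:
        """Claim a slot if one is free; never blocks."""
        with self._lock:
            if self._active >= self.max_active:
                return False
            self._active += 1
            return True

    def leave(self) -> None:
        """Release a slot claimed by try_enter."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    def holding_page(self) -> str:
        """Tiny page that asks the browser to retry by itself."""
        return (
            "<html><head>"
            f'<meta http-equiv="refresh" content="{self.refresh_seconds}">'
            "</head><body><p>We're busy right now. "
            f"This page will retry in {self.refresh_seconds} seconds.</p>"
            "</body></html>"
        )


def handle(gate: AdmissionGate, serve):
    """Serve the request if there is capacity, else return the holding page."""
    if not gate.try_enter():
        return 503, gate.holding_page()
    try:
        return 200, serve()
    finally:
        gate.leave()  # always free the slot, even if serve() raises
```

The in-flight users already holding a slot are untouched; only new arrivals above the threshold see the holding page.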

Throttling might suit certain applications better - particularly APIs with a heavy non-human user population, as client applications might not understand queuing messages, holding pages or redirects returned from the oversubscribed feature but they'll be less sensitive to a general slowdown.  Throttling, as I use it here, is about reducing the maximum share of total system capacity any individual can consume in the hope that this will allow more total individuals to use the system concurrently.  The theory is that once we hit a certain threshold, if we flatten out everyone's usage, the big consumers will be held back a little (but still served) and the rest of us can still have a turn too.  Like most ideas here, this can be implemented in a variety of places in your stack: queries/sec or concurrent data sets searchable in the database, concurrent logons in the middle tier, TPS in the application, GETs/POSTs on the front end or even number of concurrent connections per unique client IP address at the network edge.
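One common way to cap each client's share is a token bucket per client, sketched below.  This is a technique I'm swapping in for illustration rather than the only way to throttle; the class name PerClientThrottle, the rate and burst parameters and the injectable clock are all assumptions of the example.  Note that this version applies the cap all the time, so you would only switch it on once your overall load passes the threshold you care about.

```python
import time


class PerClientThrottle:
    """Token bucket per client ID (hypothetical sketch).

    Each client accrues rate_per_sec tokens per second up to burst, and
    every allowed request spends one token.
    """

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock          # injectable so the logic can be tested
        self._buckets = {}          # client_id -> (tokens, last_seen)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        # New clients start with a full bucket.
        tokens, last = self._buckets.get(client_id, (self.burst, now))
        # Top up for the time elapsed since we last saw this client.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._buckets[client_id] = (tokens - 1.0, now)
            return True
        self._buckets[client_id] = (tokens, now)
        return False
```

When allow returns False the caller can delay the request rather than reject it, which gives the "held back a little, but still served" behaviour; heavy consumers run out of tokens first while everyone else is unaffected.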

Where is the right place to impose planned limits?  I'd suggest multipoint coverage, but to start with think about your most critical constraint (what dies first as usage climbs?) and then do it at the layer above that one!

Fail gracefully

This is one of the toughest things to do with a distributed system; your features need to be aware of their environment both locally and on remote instances and they need a way to bring up/take down/recycle themselves and each other.  If your system suffers a partial failure, or load is climbing towards the point marked "instability", then you need to make a decision.  Do you bring up or repurpose more nodes to handle the same feature?  Redirect a percentage of requests to another cluster/DC or trigger any of the other techniques we talked about above?  Great if you can, but if you can't, it might be time to use those [refactored to be friendly] error messages.

Turning away users by conscious decision may rub us up the wrong way at first glance, but depending on how your system behaves under excessive load it might actually be for the best.  If you are at maximum capacity and serving all currently connected users, would new user connections result in degradation of service to those already in the system?  Maybe saying no to additional connections might be frustrating for those users being turned away, but you have to weigh that against kicking off users midway through transactions, or worse still, crashing the entire system!

Graceful failure is essentially the art of predicting the immediate future of your system and handling what are likely to be excess users/transactions/connections in a premeditated fashion.  First you need to know what your thresholds are; then you need very fine resolution instrumentation to tell you how close you are to them on a second-by-second basis, and finally, you need an automated way to respond to impending trouble - because humans are too slow to prevent poor customer service in busy systems; we're better at cleaning up once the trouble has been and gone.
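As a rough illustration of "predicting the immediate future", the sketch below extrapolates recent measurements in a straight line to estimate how long until a threshold is hit, then picks a response.  The function names, the action labels and the 30 second reaction window are all invented for the example, and a straight line is about the crudest predictor there is; a real system would want smoothing at the very least.

```python
def seconds_until_threshold(samples, threshold):
    """Estimate seconds until threshold is reached.

    samples is a list of (timestamp_seconds, value) pairs, oldest first.
    Returns 0.0 if the latest value is already at or over the threshold,
    None if there is too little data or the value is not rising.
    """
    if not samples:
        return None
    t_last, v_last = samples[-1]
    if v_last >= threshold:
        return 0.0
    if len(samples) < 2:
        return None
    t_first, v_first = samples[0]
    elapsed = t_last - t_first
    if elapsed <= 0:
        return None
    slope = (v_last - v_first) / elapsed  # units per second
    if slope <= 0:
        return None
    return (threshold - v_last) / slope


def choose_action(samples, threshold, react_within=30.0):
    """Map the estimate to an action label (labels are hypothetical)."""
    eta = seconds_until_threshold(samples, threshold)
    if eta is None:
        return "serve"
    if eta == 0.0:
        return "deny_new"       # already at the limit: protect in-flight users
    if eta <= react_within:
        return "holding_page"   # trouble is close: start holding new arrivals
    return "serve"
```

With users climbing from 700 to 800 over ten seconds against a limit of 1000, the estimate is 20 seconds, which falls inside the reaction window and triggers the holding page before the limit is actually reached.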

For example, let's say you're aware of a memory utilization threshold or a maximum number of users/web node - wouldn't it be better to actively restart a service or show a holding page before you lost control of the stack?  If you haven't managed to build in some kind of queuing or throttling then you might be showing an error or denying service, but isn't that better than losing the whole system?

The last part of graceful failure somewhat overlaps with recovery-oriented computing; make sure that, upon death, the last thing you ever do is take a snapshot of what you were doing and what the environment was like.  If your processes do this, then you are able to have watchdog processes (or monitoring systems) that know whether or not it's safe to restart that failed instance (or take an alternative action based on the data), you'll have an easier time diagnosing faults, and the data generated will help you keep a rolling benchmark of the thresholds in a system with a high rate of change.
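A minimal sketch of that "last thing you ever do" idea follows.  The function names, the snapshot fields (memory_mb in particular) and the restart rule are assumptions made up for the example, and it only covers failures that surface as Python exceptions; a process that is killed outright never gets the chance to write anything, which is one reason the watchdog treats a missing snapshot as "not safe".

```python
import json
import os
import time


def write_snapshot(path, state):
    """Write state to path as JSON, atomically."""
    snapshot = {"written_at": time.time(), "pid": os.getpid(), **state}
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh)
    os.replace(tmp, path)  # rename, so readers never see a half-written file


def run_with_snapshot(path, work, get_state):
    """Run work(); if it raises, record what we were doing, then re-raise."""
    try:
        work()
    except Exception as exc:
        write_snapshot(path, {"error": repr(exc), **get_state()})
        raise


def safe_to_restart(path, memory_limit_mb):
    """Watchdog-side check (hypothetical rule).

    Only restart if a readable snapshot exists and memory use at the time
    of death was under the limit.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            snap = json.load(fh)
    except (OSError, ValueError):
        return False  # no snapshot or unreadable: let a human decide
    return snap.get("memory_mb", float("inf")) < memory_limit_mb
```

The same snapshot files double as diagnostic data and as raw material for keeping your thresholds up to date.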

Wednesday, 25 June 2008

Of Multicultural Offices

Have a look at this interview with Horacio Falcao from INSEAD on cross-cultural negotiation.

I found the points about overestimating and underestimating proximity in relationships particularly interesting - Horacio says that we often make too many assumptions when dealing with people from similar backgrounds and nationalities and this can end up costing us.  When dealing with those we perceive to be obviously different we take extra care to ensure we explicitly state everything up front, a valuable practice we can take for granted when we strongly relate with the other party from the start.

Saturday, 21 June 2008

The World's Biggest Marketing Deadline

I am never a fan of deadline-oriented architecture and sometimes you get a whopper; this is most certainly my biggest one to date.


Well, it was get our Euro 2008 features out in time or mow the world's largest lawn. Marketing, eh?

Wednesday, 18 June 2008

CAP

A couple of months ago I wrote a little about the architectural concepts ACID and BASE, two descriptions of two very different systems.  In a company like ours, the business (and its pseudo-techie product managers) fail to recognize the mutual exclusivity involved in various combinations of ACID and BASE, desiring the benefits of both concurrently.  This is a pretty vast comprehension chasm to cross without a good tool to help us explain the tradeoffs - enter Eric Brewer's CAP theorem.

CAP stands for Consistency, Availability and tolerance to network Partitions and works a little like the great software triangle (scope, cost, time) in that you may only have 2 of the 3 properties in any given implementation.  Note that we talk about an implementation here because it is perfectly valid, and in many cases quite sensible, to build different features within a single system to different CAP tradeoffs.

Consider a system with high availability requirements.  From this starting point you may choose to design in strong consistency (the data is always the same from any perspective) but you will not be able to distribute the system across any network boundary.  Your other choice would be network tolerance (it will run nicely geographically separated) but you will have to accept a window of inconsistency in both normal and failure modes.  If you have the option of doing away with your availability requirement then you might build something partitioned and consistent but you'll always have to fail to guarantee consistency through any network event.
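The tradeoff during a network partition can be shown with a toy model: two copies of a single value, and a switch for which property you give up when the copies can't talk to each other.  Everything here (the class name, the "CP"/"AP" mode strings, the two-replica setup) is invented for illustration and is nowhere near a real replication protocol.

```python
class ReplicatedRegister:
    """Two replicas of one value (toy model, hypothetical names).

    mode "CP": refuse writes during a partition (gives up availability).
    mode "AP": accept writes locally during a partition (gives up consistency).
    """

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [None, None]
        self.partitioned = False

    def write(self, replica: int, value) -> bool:
        if not self.partitioned:
            # Healthy network: both copies are updated together.
            self.replicas = [value, value]
            return True
        if self.mode == "CP":
            return False  # stay consistent by failing the request
        self.replicas[replica] = value  # stay available, copies now diverge
        return True

    def read(self, replica: int):
        return self.replicas[replica]

    def consistent(self) -> bool:
        return self.replicas[0] == self.replicas[1]
```

Set partitioned to True and try a write: the "CP" register turns the request away but both copies still agree, while the "AP" register happily accepts it and leaves the two sides holding different answers until something reconciles them.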

Trying to keep a widely distributed data set highly available and 100% consistent at any given moment will bring you up against certain laws of physics.  Good luck with that.

Saturday, 14 June 2008

S**t Happens

I talk a lot about failure, how to build for it and recover from it.  Of all the things that will happen to your system during its lifetime, failure of some sort is one of the few inevitable events.

A lot can go wrong with computers, but surely their best-known weakness has to be their fundamental incompatibility with water.

Focusing on building systems that survive individual node failure is an excellent discipline, but as you can see from that clip, you can't count on your datacenter to always be there.  That means distributing your system across servers in the same location will protect you from a number of (the most common) failure scenarios but if it's really, really important that you are always up then it needs to be in more than one place.

Think electricity.  Think connectivity.  Think geography.

Thursday, 12 June 2008

Rocket Powered Horse Trials

Pictured here in flight at our secret testing centre in Royal Ascot. I'm sure Bert would back this, he's always been a fan of new tech.

Stopped time - part 5

Monday, 9 June 2008

Process Improvement

I'm hearing a lot of talk about continuous improvement these days.  I'm all for it, but there are 2 really common shortcomings in most people's implementations:

  1. It isn't only about adding steps/gates/processes.  Sometimes a process can be improved by removing a step, or perhaps the organisation can be better served by abandoning the process altogether.
  2. Improve your process improvement process.  It's part of your organisation just like any other process, and as such, should be subject to a bit of continuous improvement.

Making sure you do a bit of number 2 gives you some controls to ensure enough of number 1 happens.

Friday, 6 June 2008

Rack Mount 1, Technician 0

Failure is coming to get you, but we're getting better at predicting the scenarios and coding for them.  We think a lot about servers dying, losing network connectivity, power cuts, and how to respond to critical bugs.  These things are essentially unexpected technical events, but there is a whole other category at play in real life - human error.

Imagine this server has your data on it...

What will your customers see while that gets put back together?  How are you going to get the data back?

Despite the comedy value of that clip, this is exactly the sort of thing that happens in real life - people make mistakes.  But even when this kind of maintenance is less clownishly executed, it still needs to happen - and you need to decide what effect you're going to let planned maintenance events have on your revenue stream.

Wednesday, 4 June 2008

The Rules About Rules

What we do as technical teams needs to have some rules.  It's necessary for any group of people who need to work together to have a common frame of reference that gives them an idea of how to interact with one another and what to expect from each other.

But let's not forget that, in many cases, that's all rules are - a framework to get started with.  There are the odd few that are a little more material, for example things that deal with regulatory compliance or safety, but generally (in our industry) they're in the minority and they're obvious with a little experience.

It can be easy to get hung up on the wrong stuff with rules and the key is to always think about the why.  Why did we make that rule in the first place?  What was the reason - the principle - behind the rule?  If you know why then you can weigh the rule against the benefit of the action you want to take.  For example, we [used to - we're braver now] have a rule that we don't make changes on a Friday.  The principles behind this one are risk management and practicality; the weekends are the busiest times for our trading exchange and we have skeleton coverage Saturday and Sunday.  Now let's say we've got a really nifty feature finished that we're sure will give us a reasonable revenue uplift over the weekend, but we just finished it on Friday.  So do we just forgo the benefits entirely because of change control?  The right thing to do is consider the system impact of the release and, if it doesn't change any core components, why not do it?  After all, controlling the risks was what we were after in the first place.

My "rule on rules" is to be rigid about the principles behind rules but flexible on the rules themselves, and remember, you shouldn't be in charge of enforcing rules if you don't know when it's best not to!

Tuesday, 3 June 2008

SAI 25

A while back the SAI 25 was published on Silicon Alley Insider and we're sitting at number 4.  There is a reasonably scientific approach to how these companies are measured and, based on that, I can see how we're climbing the list.  Growth, margin and market share are all strengths of ours.  Kick ass.