Wednesday 30 July 2008

My Demotivational Poster

I just had to make one - and what better subject than my favorite topic (credit to whoever photoshopped Tetris):

[image: my demotivational poster]

Helmuth von Moltke (a Prussian/German military strategist in the 1800s) once wisely observed that "no battle plan survives contact with the enemy".  I think you can say the same thing about software - no matter how elegant or simple your feature is, it will never survive contact with users.

That's why response and recovery plans are just as important as the code they are in place to support - you will need them sooner or later...

Monday 28 July 2008

Tweet Tweet

Today I decided I'm going to play with Twitter, so I signed up.  I can't really picture myself having the time to twit (or tweet?) every few hours; I signed up for slightly less conventional reasons...

It all stems from how I use this blog.  When I started blogging, I foresaw a channel for my original thoughts, a way for me to share my experience in the industry - real-life problems and solutions from the world of running webscale engineering.  It was my opinion and what has worked for me; I didn't want to get into simply posting links to other people's opinions unless I could substantially build on them and thus add some value to the idea being discussed.  I wanted there to be some substance, some usefulness and some kind of conclusion to what you read here.

I have stayed true to this vision, but these days I am increasingly coming across content I want to share in a briefer 'check this link out' kind of format.  This of course poses the question: do I dilute the purpose of this blog by posting shorter, less meaningful messages with links or embedded external content, or do I find another way (or, that oft-forgotten option we always have, do nothing)?

A tool that seemed fit for purpose to me was micro-blogging.  Small snippets of text pushed out as regularly as you see fit, and a whole culture which prohibits verbosity (how will I cope?).

So I'm going to play whack-a-link on Twitter.  Any time I see something I like or agree with (but don't want to expand on more formally), I'm going to demonstrate my support for it by posting the link into my feed.

For me this is one of those experimental things - start using the technology and see what value emerges.

While we're talking about Twitter, I want to give the fail whale an honorable mention for achieving the pinnacle of error message accomplishment - being a popular sight.  No kidding.

Failure joins death and taxes in the hallowed halls of unavoidable inevitability.  I spend a lot of time working out how to detect it, avoid it, and recover from it, but sooner or later it gets us all.  This is where architecture stops and operations and customer service start; personally I consider the likes of web server 404s and 500s to be the middle finger of the internet - if you can turn these into disarming, apologetic messages then you'll at least have a chance to keep your customers on your side while you work out your issues.

The fail whale is almost too good at this.  It's grown into some kind of phenomenon of its own.  People have made fail whale models, you can buy fail whale t-shirts and mugs, and there is even a fan club.  Remember: this is a holding page they show when their site is down!

Thursday 24 July 2008

Eachan's Famous Interview Questions

Building the right team must be among the primary concerns of any senior management role - without high-quality people backing you up, it doesn't matter how brilliant your strategy is; you'll always struggle. Recruitment is important.

It isn't possible to see every individual engineer joining the company, but I always personally see any candidates applying for leadership positions, roles with a big impact on the team, or roles with a lot of decision-making responsibility.

Set your recruitment process up so that you are only ever choosing from the subset of individuals who can actually do the job - your value is in picking the person who will have the right influence on the team, not in testing their knowledge of Java or Ruby.

Once you have someone with the right technical skills sitting in front of you, how do you predict the impact that person will have on the organisation? The long-forgotten and much-undervalued art of conversation will tell you. Here are a few of my favorite discussion-starters that I consider useful in working out how someone thinks:

  1. For any given problem, do you prefer the best possible solution or the solution the team knows the best? This will give you a measure of where on the 'most reliable delivery vs stretching/growing the team' continuum they feel comfortable.
  2. What is the difference between a good engineer and a great engineer? This will let you know how they judge talent and what qualities they look for in their teammates.
  3. Assuming you are the successful applicant, what would you expect from us as management to help you make a success of it? This gives you a feel for how well they'd represent their team and what their upward management would be like.
  4. Role play - it sounds a bit 'tired' these days, but another thing I like to do is ask them to swap places and try to hire me. Get them to explain why I'd like working here, what the company is like, what the products are, etc. This will tell you how much homework they've done before meeting you and give you some insight into how they'd go about recruiting.
  5. [updated] Play a little word game. I get a lot of insight into how people think by picking a handful of related (but different) words and asking them to arrange them in order. One of my favourite sets is: potential, qualification, skill, experience, and wisdom. There are no right or wrong answers to this, but (in this example) I can tell if someone values accreditation over practical experience, or the potential for greater performance tomorrow over lesser performance today.
  6. Sell to them. Not really a question but a useful idea nonetheless. My theory is that anyone who walks in should walk out wanting the job, whether or not you want to offer it to them. This turns unsuccessful candidates into mini marketing agents and gets the word out about what you do.

Attracting and retaining the right talent is critical, and you should never underestimate the impact on the team of a bad recruitment decision. You must make these activities a high priority and, if you rely on agencies for any part of the process, monitor their performance very closely - they don't have to live with the mistakes!

Sunday 13 July 2008

What Makes a Good Leader?

I recently had a bit of "homework" to do as part of our professional development process.  Nothing profound, but nonetheless a few very interesting questions to address.  My favorite this time around was: what are the 5 most important qualities of a leader?  Here is what I thought:

1.  Vision – they have to know the goal, know the results and be able to effectively communicate this and get buy-in.

2.  Foresight – they have to be able to see the future coming, plan for it and know the key activities that, over time, will realize the vision.

3.  Respect – they must set the example and engender confidence from the team; people must want to seek their opinion and emulate their behavior.

4.  Impact – they must be seen to make a difference, to make tough decisions and stand by them, to wield whatever power they have to effect visible change.

5.  Determination – they must have the grit and the resolve to show the team that they stick by their principles and stay cool and clear even when the environment is at its harshest.

There are as many [correct] answers to this as there are leaders in action; drop me a line with yours.

Wednesday 2 July 2008

Teach Your Computer to Say No

As good engineers we always try to build systems that have enough capacity to service all our customers' requests, all the time, anytime.  It's an admirable goal and very commercially sensible, but there is no such thing as unlimited capacity and sooner or later we'll get our scalability wrong, suffer a partial failure, or find it isn't cost-effective to scale up to those rare usage peaks - what will the user experience be then?

By default it will be things like: page cannot be displayed, connection timed out, or our staple favorites 404 and 500 - basically the middle finger of the internet.  You have to think about how your systems will behave under unexpected load and what you'd want your customers to see when it happens.  There are a few things you can keep in mind when building your products that, if properly addressed, will lead to a very different user experience when things get tough:

Customise your error messages

Those unexpected peaks or partial failures are always going to happen sooner or later, so why not serve up a nicely formatted (but lightweight) page with alternative links or customer service contact details rather than a nasty, cryptic error message?  It is so easy and cheap to do and makes such a huge difference to user perception that there really is no reason not to.
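
To make the idea concrete, here's a minimal sketch of a friendly error handler.  It uses Flask purely as an illustration (not necessarily what your stack looks like), and the support address and links are made up - the point is a lightweight, apologetic page with somewhere useful to go.

    # Minimal sketch: serve a friendly, lightweight page instead of the stock
    # 404/500.  Flask is just an illustration; the contact details are made up.
    from flask import Flask

    app = Flask(__name__)

    FRIENDLY_PAGE = """
    <html><body>
      <h1>Sorry - we're having a moment</h1>
      <p>Something went wrong at our end.  Please try again in a minute,
         or email support@example.com if it keeps happening.</p>
      <p><a href="/">Back to the home page</a></p>
    </body></html>
    """

    @app.errorhandler(404)
    def not_found(error):
        # Apologetic page with somewhere useful to go, not a bare 404.
        return FRIENDLY_PAGE, 404

    @app.errorhandler(500)
    def server_error(error):
        # Same idea for unexpected failures; keep the page small and static
        # so it costs almost nothing to serve when the system is struggling.
        return FRIENDLY_PAGE, 500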

Queue or throttle excess traffic

What's better than a nicer error message?  Some kind of service - even if it doesn't match our ideals.  Queuing is a nice solution: it protects your in-flight users just like denial would, and provides a better experience for those new arrivals who would otherwise push you over the edge and get an error.

A holding page, an alternative URL or maybe a nice countdown with an automatic refresh is all you need.  These things can be a little more time consuming to build because your application needs to have some awareness of its environment if you expect it to make decisions about what to serve up based on remaining capacity.  There are some very easy ways to buy this back in hardware if you're using load balancers like F5, Netscaler, or Redline - a little experience will tell you how many users/connections/Mbps each node can tolerate, and you can configure an alternative page or redirect for anything above that threshold.  Depending on what you have installed this can even be served directly from the device cache, making it even lower impact for your over-busy system.  Queuing is just like its real-life namesake: "line up here and wait for access to the system".
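
If you'd rather do it in software than on the load balancer, the same threshold idea only takes a few lines.  The sketch below assumes a Python/WSGI front end and a made-up per-node limit: it counts in-flight requests and serves a lightweight holding page with an automatic refresh once the node is at capacity.

    # Sketch of a capacity-aware holding page as WSGI middleware.  The
    # MAX_IN_FLIGHT number is hypothetical - in practice you'd set it from
    # what load testing tells you one node can tolerate.
    import threading

    MAX_IN_FLIGHT = 200   # hypothetical per-node limit

    HOLDING_PAGE = (
        b"<html><head><meta http-equiv='refresh' content='10'></head>"
        b"<body><h1>We're a little busy right now</h1>"
        b"<p>You're in the queue - this page will retry for you in 10 seconds."
        b"</p></body></html>"
    )

    class HoldingPageMiddleware:
        def __init__(self, app):
            self.app = app
            self.in_flight = 0
            self.lock = threading.Lock()

        def __call__(self, environ, start_response):
            with self.lock:
                if self.in_flight >= MAX_IN_FLIGHT:
                    # Over capacity: send the lightweight holding page instead
                    # of letting the new arrival push the node over the edge.
                    start_response("503 Service Unavailable",
                                   [("Content-Type", "text/html"),
                                    ("Retry-After", "10")])
                    return [HOLDING_PAGE]
                self.in_flight += 1
            try:
                # Fully consume the wrapped app's response so the counter is
                # accurate; fine for a sketch, though it buffers the body.
                return list(self.app(environ, start_response))
            finally:
                with self.lock:
                    self.in_flight -= 1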

Throttling might suit certain applications better - particularly APIs with a heavy non-human user population, as client applications might not understand queuing messages, holding pages or redirects returned from the oversubscribed feature, but they'll be less sensitive to a general slowdown.  Throttling, as I use it here, is about reducing the maximum share of total system capacity any individual can consume, in the hope that this will allow more individuals to use the system concurrently.  The theory is that once we hit a certain threshold, if we flatten out everyone's usage, the big consumers will be held back a little (but still served) and the rest of us can still have a turn too.  Like most ideas here, this can be implemented in a variety of places in your stack: queries/sec or concurrent data sets searchable in the database, concurrent logons in the middle tier, TPS in the application, GETs/POSTs on the front end, or even the number of concurrent connections per unique client IP address at the network edge.
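
As a sketch of what per-client throttling can look like (the rates are illustrative numbers, and the token-bucket approach is just one common way to flatten out usage):

    # Sketch of per-client throttling: a token bucket keyed on client IP.
    # Refill rate and burst size are illustrative, not recommendations.
    import time
    from collections import defaultdict

    RATE = 5.0     # tokens (requests) added per second, per client
    BURST = 20.0   # maximum bucket size

    class TokenBucketThrottle:
        def __init__(self, rate=RATE, burst=BURST):
            self.rate = rate
            self.burst = burst
            # client ip -> (tokens remaining, timestamp of last update)
            self.buckets = defaultdict(lambda: (burst, time.monotonic()))

        def allow(self, client_ip):
            tokens, last = self.buckets[client_ip]
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at the burst size.
            tokens = min(self.burst, tokens + (now - last) * self.rate)
            if tokens >= 1.0:
                self.buckets[client_ip] = (tokens - 1.0, now)
                return True    # serve the request
            self.buckets[client_ip] = (tokens, now)
            return False       # hold it back (slow down, queue, or 503)

    # Usage: big consumers drain their bucket and get held back a little,
    # everyone else still gets a turn.
    throttle = TokenBucketThrottle()
    if not throttle.allow("203.0.113.7"):
        pass  # delay the request, or return a 'slow down' response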

Where is the right place to impose planned limits?  I'd suggest multipoint coverage, but to start with, think about your most critical constraint (what dies first as usage climbs?) and then do it at the layer above that one!

Fail gracefully

This is one of the toughest things to do with a distributed system; your features need to be aware of their environment both locally and on remote instances, and they need a way to bring up/take down/recycle themselves and each other.  If your system suffers a partial failure, or load is climbing towards the point marked "instability", then you need to make a decision.  Do you bring up or repurpose more nodes to handle the same feature?  Redirect a percentage of requests to another cluster/DC, or trigger any of the other techniques we talked about above?  Great if you can, but if you can't, it might be time to use those [refactored to be friendly] error messages.

Turning away users by conscious decision may rub us up the wrong way at first glance, but depending on how your system behaves under excessive load it might actually be for the best.  If you are at maximum capacity and serving all currently connected users, would new user connections result in a degradation of service to those already in the system?  Saying no to additional connections might be frustrating for those users being turned away, but you have to weigh that against kicking off users midway through transactions, or worse still, crashing the entire system!

Graceful failure is essentially the art of predicting the immediate future of your system and handling what are likely to be excess users/transactions/connections in a premeditated fashion.  First you need to know what your thresholds are; then you need very fine resolution instrumentation to tell you how close you are to them on a second-by-second basis, and finally, you need an automated way to respond to impending trouble - because humans are too slow to prevent poor customer service in busy systems; we're better at cleaning up once the trouble has been and gone.

For example, let's say you're aware of a memory utilization threshold or a maximum number of users per web node - wouldn't it be better to actively restart a service or show a holding page before you lose control of the stack?  If you haven't managed to build in some kind of queuing or throttling then you might be showing an error or denying service, but isn't that better than losing the whole system?
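
As an illustration of that kind of premeditated response, here's a very small watchdog sketch.  The psutil library, the 85% memory mark, and the service name are all assumptions on my part - the point is that the machine reacts to the threshold, not a human.

    # Sketch of a tiny watchdog that acts before the node loses control.
    # psutil, the thresholds and the restart command are all assumptions;
    # what matters is the premeditated, automated response.
    import subprocess
    import time

    import psutil

    MEMORY_LIMIT_PCT = 85      # hypothetical "instability starts here" mark
    CHECK_INTERVAL_SECS = 5

    def enable_holding_page():
        # Placeholder: flip whatever switch serves the lightweight holding
        # page (load balancer rule, feature flag, a sentinel file, ...).
        print("capacity threshold crossed - serving holding page")

    def watchdog():
        while True:
            used_pct = psutil.virtual_memory().percent
            if used_pct >= MEMORY_LIMIT_PCT:
                # Degrade deliberately instead of waiting for the crash.
                enable_holding_page()
                # Hypothetical service name; restart while we still can.
                subprocess.call(["systemctl", "restart", "my-web-app"])
            time.sleep(CHECK_INTERVAL_SECS)

    if __name__ == "__main__":
        watchdog()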

The last part of graceful failure somewhat overlaps with recovery-oriented computing: make sure that, upon death, the last thing you ever do is take a snapshot of what you were doing and what the environment was like.  If your processes do this, then you can have watchdog processes (or monitoring systems) that know whether or not it's safe to restart that failed instance (or take an alternative action based on the data), you'll have an easier time diagnosing faults, and the data generated will help you keep a rolling benchmark of the thresholds in a system with a high rate of change.
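
A minimal version of that "last gasp" snapshot, assuming a Python process (the file path and the fields captured are just illustrative choices):

    # Sketch: write a snapshot of what we were doing and what the environment
    # looked like as the very last act before dying, so a watchdog or
    # monitoring system can decide what to do next.
    import json
    import os
    import sys
    import time
    import traceback

    import psutil   # assumption: used only to capture environment stats

    SNAPSHOT_PATH = "/var/tmp/last_gasp.json"   # hypothetical location

    def last_gasp(exc_type, exc_value, exc_tb):
        snapshot = {
            "time": time.time(),
            "exception": "".join(
                traceback.format_exception(exc_type, exc_value, exc_tb)),
            "memory_pct": psutil.virtual_memory().percent,
            "load_avg": os.getloadavg(),   # Unix only
        }
        try:
            with open(SNAPSHOT_PATH, "w") as fh:
                json.dump(snapshot, fh)
        finally:
            # Hand back to the default handler so the crash stays visible.
            sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = last_gasp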