Saturday 21 February 2009

Scalability is not just a technical problem

There is so much content out there about how to scale out web sites, platforms, and databases – but it all focuses on the production system architecture.  Do a Google search for scalability – go on, I’ll wait for you...  See what I mean?

Now I know that’s the fun bit to talk about, but being practical for a minute, if you’re starting to scale out systems using techniques like those used at Digg, Flickr, Xbox Live, or the usual suspects like Amazon, Google, and eBay, then chances are that you’ve got a whole lot more scalability challenges than just the product.

If you’re dealing with at least hundreds of thousands of daily actives, then we can probably deduce a bunch of other stuff about your circumstances.  We can guess that you’re after reasonably frequent feature drops, have a significant amount of horizontal distribution, a healthy-sized engineering organisation, and a strong bias towards availability.  And if even half of that stuff is true, then what other scalability challenges will you be up against?

How about multiple concurrent projects?  If your estate is divided into more than one product then you will more than likely be working on more than one new feature in parallel.  This gives rise to all sorts of version control and regression test problems, which demand process and infrastructure quite different to a single effort.

What about disparate teams?  You might have people in a number of locations branching, or depending on, the same codebase.  That’s a communication barrier which can be tough to solve.  Large enough organisations also tend to sprout specialist disciplines, such as user experience and information architecture (IA) – this changes the nature of how teams engage and how work is specified, estimated, and delivered.

And how do you manage environments, tools, and documentation?  A complex production architecture begets a complex development infrastructure, as there is a lot more interoperability to test for.  Don't forget that with more teams working concurrently, managing contention for these expensive environments also becomes a tricky balancing act.  As your products increase in popularity (a good proxy for profitability on the web), non-functional requirements (NFRs) like performance and capacity will become more important and will require specialised tools to measure.

I’d like to see us sharing a little more about our experiences with this side of highly scalable systems – it might not be as sexy as memcached, CAP, and Gossip, but the reality is that it is just as important a part of the solution.
