Tuesday 20 May 2008

Isolating Failure

A few weeks ago we launched our Sportsbook, our flagship risk-taking product, into the Italian market. It's a strategic product, both because of how it fits into our international expansion plans and because of how it's built. We intended for this system to get big fast, so we built it wide: distributed, message-based and API-driven.

Just days after launch, the one thing we can always guarantee will happen did happen: part of the system failed. Our bet placement engine hung [bad news], but the rest of the system continued to work [good news]. So while our customers couldn't place any bets, they could still hit the site, view the markets, register, log in, deposit, manage risk... everything else, basically. The same failure would see a lot of more monolithic web systems firing out 404s or 500s in the blink of an eye.
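To make that behaviour concrete, here is a minimal sketch of one way to keep a hung component from taking its neighbours down with it. It is an illustration under assumptions, not how our Sportsbook is actually built: call_isolated, hung_bet_engine and the feature names are all hypothetical. Each feature gets its own small thread pool and a timeout, so a hang in one feature can only use up that feature's threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# One small pool per feature (hypothetical names), so a hung feature
# can only exhaust its own worker threads, never another feature's.
_pools = {}


def call_isolated(feature, fn, timeout_s, fallback):
    """Run fn in the pool owned by `feature`; return fallback if it hangs or fails."""
    pool = _pools.setdefault(feature, ThreadPoolExecutor(max_workers=2))
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The call is still stuck in its own pool; the caller moves on.
        return fallback
    except Exception:
        # Any other failure also degrades only this feature.
        return fallback


def hung_bet_engine():
    """Hypothetical stand-in for a bet placement call that hangs."""
    time.sleep(0.3)
    return "bet accepted"


def view_markets():
    """Hypothetical stand-in for a healthy feature."""
    return "markets page"
```

With this in place, call_isolated("bets", hung_bet_engine, 0.05, "betting unavailable") gives up after the timeout and returns the fallback, while call_isolated("markets", view_markets, 0.5, "markets unavailable") carries on as normal.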

These are exactly the benefits you look for when you build decoupled functionality and minimize dependencies, consciously seeking to isolate features from each other. The next step for us is automatic detection and repair, working to minimize human intervention.
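As a rough idea of what automatic detection and repair could look like, here is a small watchdog sketch. Everything in it is an assumption for illustration: the Watchdog class, the probe and repair callables and the failure threshold are hypothetical, and in a real system the repair step might restart a process or fail over to a standby.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Watchdog:
    """Probe a component and trigger a repair after repeated failures."""

    probe: Callable[[], bool]    # hypothetical health check; True means healthy
    repair: Callable[[], None]   # hypothetical repair action, e.g. a restart
    max_failures: int = 3        # consecutive failed probes before repairing
    failures: int = 0

    def tick(self) -> str:
        """Run one health check and report what happened."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that blows up counts as a failed probe.
            healthy = False

        if healthy:
            self.failures = 0
            return "healthy"

        self.failures += 1
        if self.failures >= self.max_failures:
            self.repair()
            self.failures = 0
            return "repaired"
        return "degraded"
```

Calling tick() on a schedule, once per monitored component, keeps each repair as isolated as the feature it looks after. Waiting for several consecutive failures before repairing is a design choice that avoids restarting a component over a single slow response.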
