
Saturday, 21 February 2009

Scalability is not just a technical problem

There is so much content out there about how to scale out web sites, platforms, and databases – but it all focuses on the production system architecture.  Do a Google search for scalability – go on, I’ll wait for you...  See what I mean?

Now I know that’s the fun bit to talk about, but being practical for a minute, if you’re starting to scale out systems using the techniques of Digg, Flickr, Xbox Live, or the usual suspects like Amazon, Google, and eBay, then chances are that you’ve got a whole lot more scalability challenges than just the product.

If you’re dealing with at least hundreds of thousands of daily actives, then we can probably deduce a bunch of other stuff about your circumstances.  We can guess that you’re after reasonably frequent feature drops, have a significant amount of horizontal distribution, a healthy-sized engineering organisation, and a strong bias towards availability.  And if even half of that stuff is true, then what other scalability challenges will you be up against?

How about multiple concurrent projects?  If your estate is divided into more than one product then you will more than likely be working on more than one new feature in parallel.  This gives rise to all sorts of version control and regression test problems, which demand process and infrastructure quite different to a single effort.

What about disparate teams?  You might have people in a number of locations branching, or depending on, the same codebase.  That’s a communication barrier which can be tough to solve.  Large enough organisations also tend to sprout specialist disciplines, such as user experience and IA – this changes the nature of how teams engage and how work is specified, estimated, and delivered.

And how do you manage environments, tools, and documentation?  A complex production architecture begets a complex development infrastructure, as there is a lot more interoperability to test for.  Don't forget that with more teams working concurrently, managing contention for these expensive environments also becomes a tricky balancing act.  As your products increase in popularity (a good proxy for profitability on the web) NFRs like performance and capacity will become more important and will require specialised tools to measure.

I’d like to see us sharing a little more about our experiences with this side of highly scalable systems – it might not be as sexy as memcached, CAP, and Gossip, but it is just as important a part of the solution.

Friday, 7 November 2008

Cost in the Cloud

Cost is slated as benefit number 1 in most of the cloud fanboy buzz, and they're mostly right: usage-based and CPU-time billing models do mean you don't have tons of up-front capital assets to buy - but that's not the same thing as saying all your cost problems are magically solved. You should still be concerned about cost - except now you're thinking about expensive operations and excess load.

Code efficiency sometimes isn't as acute a concern on a traditional hardware platform because you have to buy all the computers you'll need to meet peak load, and keep them running even when you're not at peak. This way you usually have an amount of free capacity floating around to absorb less-than-efficient code, and of course when you're at capacity there is a natural ceiling right there anyway.

Not so in the cloud. That runaway process is no longer hidden away inside a fixed cost, it is now directly costing you, for example, 40c an hour. If that doesn't scare you, then consider it as $3504 per year - and that's for one instance; how about a bigger system of 10 or 15 instances? Now you're easily looking at $35K to $52K for a process that isn't adding proportionate (or at worst, any) value to your business.
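If it helps, the back-of-the-envelope arithmetic is easy to sketch out (the 40c rate and instance counts are just the illustrative figures from above):

HOURS_PER_YEAR = 24 * 365  # 8760 hours in a (non-leap) year

def annual_cost(hourly_rate_usd, instances=1):
    # always-on cost: hourly rate x hours in a year x number of instances
    return hourly_rate_usd * HOURS_PER_YEAR * instances

print(annual_cost(0.40))       # 3504.0  - one instance at 40c/hr
print(annual_cost(0.40, 10))   # 35040.0 - ten instances
print(annual_cost(0.40, 15))   # 52560.0 - fifteen instances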

Yikes. So stay on guard against rogue processes, think carefully about regularly scheduled jobs, and don't create expensive operations that are triggered by cheap events (like multiple reads from multiple databases for a simple page view) if you can avoid it. When you are designing a system to run on a cloud platform, your decisions will have a significant impact on the cost of running the software.

Thursday, 16 October 2008

What Your Network Guy Knows

So you're getting into distributed systems; maybe you've got some real scalability issues on the horizon, or perhaps you want to better isolate failure, or be able to cope with more concurrent change in the system. So how do you do this webscale thing then?

Time for some homework. Listening to some vendor pitches, maybe reading some books, or getting an expensive consultant or two in for a while (I'll take your money from you if that's what you want) might possibly do it. But before all this gets out of hand, did you realize you're probably sitting right next to a distributed systems fountain of knowledge? You already have someone, right there in your team, who has spent their entire career working with the largest eventually consistent multi-master distributed systems in the world - the trick is they might not even know it themselves - and that someone is your network guy.

Let's test this assertion against a couple of technologies that network engineers deal with every day, and look at what we can take from them into our distributed systems thinking.

How about something fundamental - routing protocols. Networking gurus have a small army of acronyms at their disposal here: OSPF, EIGRP, IS-IS, BGP, and the sinister-sounding RIP. These are essentially applications that run on network devices (and sometimes hosts themselves), map out the network topology, and provide data for devices to make packet forwarding decisions.

So what can we import from this technology?
1. Partitioning - networks are broken down into manageable chunks (subnetworks) which scope load (broadcasts), ringfence groups of systems for security, and limit traffic across slow and expensive links.
2. Scalability - routing protocols allow massive global networks to be established by summarizing contiguous groups of networks again and again, and ensuring any node can establish end-to-end connectivity without having to understand every single path in the network (just a default route).
3. Failure isolation - subnets are bordered by routing protocols, which form a natural boundary to most forms of network malarkey. In the event that a network becomes unpredictable (flapping), some routing protocols are able to mark it down for a predetermined time, which aids local stabilization and prevents issues spilling over into healthy networks.
4. Self healing - when a failure in a network or a link between networks occurs, routing protocols observe the problem (by missing hellos or interfaces going down) and take action to reestablish reachability (working around the problem using alternate paths, etc.). Each node will recompute its understanding of the networks it knows how to reach, learn who its neighbors are and the networks they can reach, and then return to business as usual via a process called convergence (a really simple study in eventual consistency and variable consistency windows) - see the sketch after this list.
5. Management - for the most part, networks separate their control messages from the data they transport. A good practice, especially when combined with techniques like QoS, because it significantly reduces the risk of losing control of the infrastructure under exceptional load conditions.
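To make point 4 a little more concrete, here's a toy sketch of distance-vector style convergence - the three-node topology and link costs are invented purely for illustration, and real protocols are vastly more sophisticated:

# Toy distance-vector convergence: nodes repeatedly merge their neighbours'
# route advertisements until no table changes - a crude picture of how
# routing protocols eventually agree after a topology change.

links = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 5}   # hypothetical link costs
nodes = {"A", "B", "C"}
INF = float("inf")

def cost(a, b):
    return links.get((a, b)) or links.get((b, a)) or INF

# each node starts out knowing only itself and its direct neighbours
tables = {n: {m: (0 if m == n else cost(n, m)) for m in nodes} for n in nodes}

changed = True
rounds = 0
while changed:                          # keep exchanging updates until stable
    changed = False
    rounds += 1
    for n in nodes:
        for neighbour in nodes:
            if neighbour == n or cost(n, neighbour) == INF:
                continue
            for dest, d in tables[neighbour].items():
                via = cost(n, neighbour) + d
                if via < tables[n][dest]:   # better path learned from a neighbour
                    tables[n][dest] = via
                    changed = True

print(rounds, tables["A"])   # A learns to reach C via B for cost 3, not 5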

Now let's look at something application layer - DNS. This should be a somewhat more familiar tool (or you're kind of reading the wrong blog) and we touch it quite regularly but probably don't appreciate what goes on in the background. At its most basic level, DNS is a client/server system for providing a mapping between human-readable hostnames and machine-friendly IP addresses. Oh but it's so much more...

So what can we import from this technology?
1. Partitioning - DNS is hideously, frighteningly big; there are hundreds of thousands of nodes in this system, from the dozen or so root servers all the way down to the corporate internet access edge servers. It is a good example of dividing up a problem: to find us you'd work right to left through a fully qualified domain name, starting with the "." (root); we're in the "com" container (hosted by a registrar), then the "betfair" container (hosted by us), and finally you'd get back some data from a record matching "www" and arrive at our place.
2. Scalability - did I mention DNS is big? DNS uses a classic combination of master/slave nodes and caching on the client and server side to scale out. At the corporate edge, DNS proxies resolve addresses on behalf of internal clients and keep answers in a local cache; ISPs and those who run their own zones keep a number of slaves (many read only) and spread queries out amongst them; and finally an expiry timestamp (TTL) is set on query results, permitting client-side caching (see the sketch after this list).
3. Resilience - clients can be configured with a list of servers, which they will cycle through should they receive no answer. Additionally, the DNS protocol is stateless, making it easy to move servers around hot and load balance using simple, lightweight algorithms.
4. CAP - DNS definitely prefers availability over consistency, the window for an updated record to be propagated around the internet being ~24hrs in most cases. It's also highly tolerant to network segmentation, individual servers being happy to live separated from the rest of the DNS infrastructure for long periods of time, answering queries, and then catching up with all the changes in the zones they host once connectivity is reestablished.
5. Operations - the hierarchical way the namespace is organized is perfectly matched to how authority is delegated. If you're going to have a massive system spread around the globe, you've got to think about how you're going to operate it, and the DNS model for this is based on allocating administration with ownership. This gives complete flexibility and control to namespace owners without risking the integrity of the system as a whole, and lets us operate the biggest distributed system in the world without employing the biggest IT team in the world.
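And here's a minimal sketch of the client-side TTL caching mentioned in point 2 - resolve_upstream and the example name are made-up stand-ins for a real recursive lookup:

import time

# Answers are kept until their TTL expires; after that the next lookup
# goes back "upstream", just like a caching resolver or stub client.

def resolve_upstream(name):
    # pretend this asked an upstream DNS server
    return {"www.example.com": "192.0.2.10"}.get(name)

_cache = {}   # name -> (address, expiry_timestamp)

def lookup(name, ttl=300):
    address, expires = _cache.get(name, (None, 0))
    if time.time() < expires:           # cache hit, answer is still fresh
        return address
    address = resolve_upstream(name)    # cache miss or expired: ask upstream
    _cache[name] = (address, time.time() + ttl)
    return address

print(lookup("www.example.com"))   # first call goes "upstream"
print(lookup("www.example.com"))   # second call is served from the local cache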

So buy your network guy a coffee. Ask him how his world works. If you can draw the philosophical parallels, it might be the most valuable couple of hours you've spent in ages.

Oh and by the way - distributed systems are all about the network, so you're going to need a friend here anyway...

Sunday, 28 September 2008

Parallel vs Distributed

The difference between parallel computing and distributed computing is another important piece of theory to keep in mind when designing a system. The concepts are significantly different, but far from mutually exclusive - for example you can run a number of parallel computing tasks on different nodes inside a distributed system.

The confusion, if it exists, arises from what the parallel and distributed concepts share in common - the division of a problem into multiple smaller units of work that can be independently solved with a degree of autonomy.

So what makes distributed distributed and parallel parallel? Both involve doing smaller units of processing on multiple separate CPUs, thus contributing to a larger overall job. The key difference is where those CPUs reside (and note that we'll treat "CPU" and "core" as synonymous for our purposes today). Simple answer:

Parallel is work divided amongst CPUs within a single host.

Distributed is work divided amongst CPUs in separate hosts.

How you break down work so that parts of it can be done concurrently, whether parallel or distributed, is largely governed by a single constraint - data dependency. Way back in the day a systems architect at IBM, Gene Amdahl, came up with a set of guidelines for assessing the degree to which this can be achieved, and a way to estimate the maximum benefit it will deliver. This simple rule bears his name today - Amdahl's law.
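For reference, Amdahl's law is easy enough to write down - a minimal sketch with made-up example figures:

# Amdahl's law: maximum speedup from running the parallelisable fraction p
# of a job on n workers (the remaining 1-p stays serial).

def max_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

print(max_speedup(0.90, 8))      # ~4.7x - 90% parallelisable, 8 cores
print(max_speedup(0.90, 1000))   # ~9.9x - the serial 10% caps the gain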

The key design considerations around parallel or distributed processing are in how you tackle this data dependency. In parallel computing, you need to use synchronization and blocking techniques to manage the access to common memory by the various threads you've split your problem up amongst. Solving the same issue with distributed computing simplifies your memory/thread management within each host, but you put the complexity back into state tracking, cluster management, and data storage.
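On the parallel side, that synchronisation often boils down to something as mundane as a lock around shared memory - a minimal sketch:

import threading

# Several threads update a shared counter; the lock serialises access to the
# common memory so no updates are lost.

counter = 0
lock = threading.Lock()

def work(iterations):
    global counter
    for _ in range(iterations):
        with lock:                 # block other threads while we touch shared state
            counter += 1

threads = [threading.Thread(target=work, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)   # 400000 - deterministic because access was synchronised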

It's arguably fair to say that, as a rule, parallel computing is more performant and distributed computing is more scalable. When crunching through a lot of work via many threads in one box, everything is done at silicon speeds, your only physical throttle being memory bandwidth and the pins between cores. The downside here is a hard limit to the amount of work you can do concurrently, which pretty much maps to the number of cores you can fit into your system - and scaling that up gets pricey. Doing the same work in a distributed system faces only theoretical constraints on how much work can be done concurrently, the question being how scalable your network and cluster management is, and it's usually cheap to add more systems and hence cores. The downside here is latency, as messages need to traverse networks many times slower than internal system buses, and of course you need a process to collect and reassemble results from all your nodes before you can confidently write your answers down to disk.

Like most technology, there are problems to which one is more suitable than the other, and also like most technology, there are many times when it is simply a matter of taste. Some of us are from big box school and feel more comfortable managing threads and memory space within a vast, single environment. Some of us are from cloud school, at rest amongst a dynamic mesh of cheap, disposable nodes, investing ourselves in the communications fabric between them.

Wednesday, 28 May 2008

Definition of Scalable

We talk a lot about scalability but what is it that we really mean when we refer to a system or service as scalable?

"A service is said to be scalable if an increase in system resources results in a proportional increase in performance."

In webscale computing, increased performance typically means serving more units of work (pages, TPS), but it can also mean handling larger units of work (bigger datasets, queries with many where-clauses).

The main reason I like this definition of scalability is it separates the scaled from the scalable.  I've seen plenty of really ugly systems go big - but vertically and at ludicrous capital costs.  Yes, you managed to squeeze some scale out of that thing, but that doesn't automatically earn you the right to refer to it as scalable.

To me, scalability is an economic thing as much as it is a technical thing.  You have to build wide and grow complex IP across a commodity platform - but if you can't maintain (reduce!!!) marginal cost while you're at it then you still haven't earned the right to call your system truly scalable.

Thursday, 15 May 2008

A Better Way to Say 'State'

I wrote a little bit about state in this post and recently came up with a much simpler description.  Here goes:

Tracking state is necessary when a part of my application needs to make a decision based on your previous activity.  That previous activity could be a trail of things you did along the way (collected items in your basket) or a more binary prerequisite test you'll either pass or fail (logged in or not).

Something closely related to state is session, and I picked this up from Jeff Atwood while I was trawling around for something totally unrelated.  I think it's one of the best plain-English explanations I've read on the topic and, since he explains it better than me, I'd encourage you to take a look.

So that's all pretty simple but, as I always say, the basics are what everything's built on - and where this starts to get interesting is when you're building distributed systems.  It's easy to partition stateless functionality, but what do you do when you need to track state or keep persistent session information?  Well, that's easy: you share your state from a centralised place.  That'll see you through for a little while, but what about when you need to scale that horizontally?
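A minimal sketch of what "sharing state from a centralised place" might look like - SessionStore and the basket example are invented for illustration:

# Stateless front-end workers all read and write sessions through one shared store.

class SessionStore:
    def __init__(self):
        self._sessions = {}            # session_id -> state dict

    def get(self, session_id):
        return self._sessions.setdefault(session_id, {"basket": [], "logged_in": False})

    def put(self, session_id, state):
        self._sessions[session_id] = state

store = SessionStore()                 # the centralised bit

def add_to_basket(session_id, item):   # could run on any front-end node
    state = store.get(session_id)
    state["basket"].append(item)
    store.put(session_id, state)

add_to_basket("abc123", "book")
print(store.get("abc123"))             # {'basket': ['book'], 'logged_in': False}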

There are a lot of systems already doing this (AFS, DNS, LDAP, NFS) but there are no standard solutions for distributed state management; these systems all implement their own unique consistency and conflict resolution methods.  We're now seeing a lot of webscale businesses hitting these scalability walls - requirements are forcing the production of customized infrastructure services like Google's Bigtable + Chubby and Amazon's S3.  We're balancing on the edge of this wall ourselves and given what a difficult but rewarding challenge it is, I feel fear and anticipation in equal measure!

Thursday, 8 May 2008

Nice Threads

Last month I saw this post on codinghorror.  I wanted to wait and see what sort of discussion it kicked off among developers before I picked it up - and it looks like it's got a fair trail of comments now.

Firstly, let me just say I think it is good (and long overdue) to see software engineers getting this interested in hardware.  Abstract all you want but the systems we build all eventually run on hardware.  A fundamental understanding of IO, of how computers store and retrieve data, perform logical operations and access memory will make the difference between good software and great software.

Secondly, I'm not sure I agree with the statement "dual-core CPUs protect you from badly written software" but I think the right sentiment is there if you add a "certain failure conditions" qualifier.  There is a central topic this post skirts around without really nailing, and that is how vital resource management in a system is.

What we're talking about is good threading behavior.  I was hoping to see a lot more discussion about limiting the lifetime of threads, dynamically managing thread count according to capacity, creating affinity (making threads sticky to certain cores) and assumptions about environment (how much of a CPU/core can you safely say is yours?).  This kind of discussion usually leaks into good memory and disk management - which is excellent - these are the 2 other key physical constraints your systems are bound by.
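As a rough illustration of sizing thread count to capacity, here's a minimal sketch - the half-a-box share is an arbitrary assumption standing in for whatever slice of the machine you can genuinely call yours:

import os
from concurrent.futures import ThreadPoolExecutor

# Size the pool from the cores actually available rather than assuming
# the whole box belongs to this process.

cores = os.cpu_count() or 1
my_share = max(1, int(cores * 0.5))    # assume we're only entitled to half the box

def handle(job):
    return job * job                   # stand-in for real work

with ThreadPoolExecutor(max_workers=my_share) as pool:
    results = list(pool.map(handle, range(10)))

print(my_share, results)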

As a final point I think it's worth noting the difference between a core and a CPU.  Most people will tell you the answer is nothing, but there is one key difference - pins.  In most architectures a core is essentially a CPU, with all the right components present and correct (CU, ALU, FPU, registers, etc.), but a core always shares pins (the little copper legs that connect the CPU to the motherboard) with all the other cores on that die.  As a CPU, your pins are how you get data on and off the die - so any time you reach out to memory, non-integrated L2 cache, disk, or network, you do so in contention with all the other cores resident on the same die.  Subtle, but critical when the bus becomes an issue - and maybe you'll need to think more about affinity when your application gets to that scale.

Friday, 14 March 2008

Epistle to the Vendors

Before doing what I do now I was always in system integrators/consultancies.  Now that I'm on the other side of the boardroom table I've seen the whole picture and I have to say - we're playing right into the hands of the resellers, integrators and consultancies.  If we're going to do this anyway we may as well be a bit more open about it.  Let's go in up front and do it properly in a more formal, planned manner rather than stumbling from one bad decision to the next.

Therefore I have written this open letter to vendors everywhere outlining exactly what they need to do to take us to the cleaners:

 

Dear Sir/Madam

We're a web enterprise just starting out and we're teetering on the brink of profitability - we have a popular product and the business is expanding at a fearsome rate.  We're running around like madmen trying to work out how to get the right balance between scaling our system, shipping more features and protecting our time to market.  You've come in to meet us disguised as help - nice touch - and you're trying to work out how you can squeeze the largest possible cross-section of what you sell into our requirements over the longest possible timeline.  Using these 3 big phases you can really stretch it out.

Phase 1 - hardware (AKA brute force)

Popular product, lots of customers, increasing load - get some bigger computers, we'll buy those.  Next sell us some faster network, we'll need gigabit everywhere, some bigger firewalls and then how about load balancers?  Let's go brand crazy and throw together some F5, Netscaler, Cisco and Foundry.  Load increasing a bit more?  Next size up computers please.  Now we're a little scared too (evil hackers!) so let's have some Top Layer, maybe Netscreen and Juniper too.  Uh-oh load is up again so more memory, more CPU please.  Wait a minute, we're now out of bigger boxes in the x86 space?  Can't address any more memory in a single node?  I guess we'd better move into something like SPARC, jumping a whole class of computer - ka-ching!

Phase 2 - middleware (AKA short sighted)

OK we're still hitting capacity ceilings but now we worry equally about uptime.  How can you help [yourself to our cash] here?  The best tactic is going to be some really expensively licensed software that contorts scale and availability from products that just weren't built for it.  Ideally we're looking for some second-rate results by forcing artificial distribution using things like GemStone, Tangosol, Veritas, CA, TimesTen.  It'll also get us some failover (yay) but without delivering too much more actual availability (boo) since our application is blissfully unaware of the state of its infrastructure.  Tricky but lucrative times.

Phase 3 - consultancy (AKA told-you-so)

Now's the best bit - you get to sell us your most expensive, highest margin product - consultancy [meanwhile in the real world we finally agree on something; we're in a bind we can't buy our way out of].  So what's your advice?  Decouple and distribute, scale horizontally, version data and services, isolate features and automate the hell out of the whole thing.  Thanks for that (and ka-ching again).  The most important thing to remember is not to ever let on that you knew this all along.  Oh and if you're clever you can usually also work in some sort of licenses or something in this phase as you never really want to completely wean us off you.

So there you have it: think of this as your 12-step programme compressed into 3 easy stages - just do your bit (turn up with PDFs and nicely ironed shirts) and we'll do ours (make appalling decisions in our early product development) and together we'll milk us dry.

Yours insincerely

N. E. webscale business

 

Hang on, this all sounds pretty sweet - why did I ever leave consultancy?  But it's not too late!  You don't have to put Capgemini, Dimension Data or Logica's children through college if you don't want to!  You just have to think differently now.  You're making a product you intend to be successful, right?  You intend for it to be popular, right?  Well, decouple, distribute, scale horizontally, version data and services, isolate features and automate the hell out of the whole thing now, today!  You will end up doing that (or die trying) anyway, it's just a matter of how much it costs you on the way.