Sunday 28 September 2008

Parallel vs Distributed

The difference between parallel computing and distributed computing is another important piece of theory to keep in mind when designing a system. The concepts are significantly different, but far from mutually exclusive - for example you can run a number of parallel computing tasks on different nodes inside a distributed system.

The confusion, if it exists, arises from what the two concepts have in common - the division of a problem into smaller units of work, each of which can be solved with a degree of independence.

So what makes distributed distributed and parallel parallel? Both involve doing smaller units of processing on multiple separate CPUs, each contributing to a larger overall job. The key difference is in where those CPUs reside (and note that we'll treat "CPU" and "core" as synonymous for our purposes today). Simple answer:

Parallel is work divided amongst CPUs within a single host.

Distributed is work divided amongst CPUs in separate hosts.
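
To make that concrete, here's a minimal sketch of the parallel case in Python (the crunch function and chunk sizes are just placeholders); a distributed version would ship the same chunks to workers on other hosts rather than to the cores of this one.

    # a minimal sketch of the parallel case: divide a job amongst the CPUs
    # of a single host; the work function and inputs are hypothetical
    from concurrent.futures import ProcessPoolExecutor

    def crunch(chunk):
        # stand-in for a real unit of work
        return sum(x * x for x in chunk)

    chunks = [range(i * 1000, (i + 1) * 1000) for i in range(8)]

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:         # one process per core, same host
            total = sum(pool.map(crunch, chunks))   # a distributed version would send
        print(total)                                # each chunk to a separate host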

How you break down work so that parts of it can be done concurrently, whether parallel or distributed, is largely governed by a single constraint - data dependency. Way back in the day, Gene Amdahl, a systems architect at IBM, came up with a way to assess the degree to which a job can be split up, and to estimate the maximum benefit that splitting will deliver. The simple rule bears his name today: Amdahl's Law.
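
In its simplest form, if a fraction p of a job can be done concurrently and the rest is stubbornly serial, Amdahl's Law says the best speedup you can hope for on n CPUs is 1 / ((1 - p) + p / n). A quick sketch, with made-up numbers:

    # Amdahl's Law: maximum speedup for a job where fraction p is
    # parallelisable, spread across n CPUs (0.9 and 16 are just examples)
    def max_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    print(max_speedup(0.9, 16))    # ~6.4x  - the serial 10% already dominates
    print(max_speedup(0.9, 1000))  # ~9.9x  - piling on CPUs barely helps now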

The key design considerations around parallel or distributed processing are in how you tackle this data dependency. In parallel computing, you need to use synchronization and blocking techniques to manage the access to common memory by the various threads you've split your problem up amongst. Solving the same issue with distributed computing simplifies your memory/thread management within each host, but you put the complexity back into state tracking, cluster management, and data storage.
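
On the parallel side, that boils down to something like the toy below - several threads updating shared state, serialised through a lock so they don't trample each other's writes (the counter is just a stand-in for whatever common memory your real problem shares):

    # shared-memory coordination within one host: threads must synchronise
    # their access to common state (here a simple counter) with a lock
    import threading

    counter = 0
    lock = threading.Lock()

    def work(iterations):
        global counter
        for _ in range(iterations):
            with lock:            # blocking: only one thread mutates at a time
                counter += 1

    threads = [threading.Thread(target=work, args=(10000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                # always 40000: the lock serialises the updates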

It's arguably fair to say that, as a rule, parallel computing is more performant and distributed computing is more scalable. When you crunch through a lot of work via many threads in one box, everything is done at silicon speeds; your only physical throttles are memory bandwidth and the pins between cores. The downside is a hard limit on the amount of work you can do concurrently, which pretty much maps to the number of cores you can fit into your system - and scaling that up gets pricey. Doing the same work in a distributed system faces only theoretical constraints on how much work can be done concurrently - the question is how scalable your network and cluster management are - and it's usually cheap to add more systems, and hence cores. The downside is latency, as messages traverse networks many times slower than internal system buses, and of course you need a process to collect and reassemble results from all your nodes before you can confidently write your answers down to disk.
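
A rough sketch of that distributed scatter/gather step - the node addresses and the sum_range method are assumptions, standing in for whatever RPC or messaging layer your cluster actually uses:

    # distributed scatter/gather: each chunk of work is sent to a separate
    # host, and partial results are collected and reassembled afterwards;
    # the node addresses and the sum_range method are hypothetical
    from xmlrpc.client import ServerProxy

    NODES = ["http://node1:8000", "http://node2:8000", "http://node3:8000"]

    def gather(total_items):
        chunk = total_items // len(NODES)
        partials = []
        for i, url in enumerate(NODES):
            proxy = ServerProxy(url)
            # every call here crosses the network, so latency - not memory
            # bandwidth - is the throttle
            partials.append(proxy.sum_range(i * chunk, (i + 1) * chunk))
        return sum(partials)      # reassemble before writing the answer to disk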

Like most technology, there are problems to which one is more suitable than the other, and also like most technology, there are many times when it is simply a matter of taste. Some of us are from big box school and feel more comfortable managing threads and memory space within a vast, single environment. Some of us are from cloud school, at rest amongst a dynamic mesh of cheap, disposable nodes, investing ourselves in the communications fabric between them.

1 comment:

PetrolHead said...

Parallel starts to look quite a lot more like distributed when your big box is a NUMA architecture because memory access speed is no longer uniform.

Factor in support in modern OS'en for disabling broken CPUs and you're now dealing with dynamic re-balancing of load, etc.