The things I should have learned from years of putting together people and technology to get successful product online. We’ll probably talk about strategy, distributed systems, agile development, webscale computing and of course how to manage those most complicated of all machines – the human being – in our quest to expose the most business value in the least expensive way.
Saturday, 30 August 2008
There's something about Darwin
I'm going to argue that there are 2 fundamental tenets upon which systems can be designed and built: mechanical and biological. Both start from a different mindset and both exhibit different basic capabilities that constrain what sort of things can be achieved within a given effort. I'm also going to postulate (triple word score!) that a system exhibiting biological characteristics is much more suitable to web-scale computing than its mechanical counterpart.
Before we get into the meat of it, let's do some simple word association:
Forgetting completely about software for a minute (we'll deal with the analogies later), what comes to mind when you think about a mechanical system? Probably things like gears, cogs meshed together, reliability, tight integration, close supervision, a single contiguous chain, unthinking automation, dependency, predictability, and control.
Again leaving software out of the picture, what comes to mind when you think about a biological system? Probably things like change, iteration, flexibility, loose relationships between units, unpredictability, variance, the hive over the individual, response to environment, awareness, and cooperation.
The mechanical analogy is how we've all traditionally built computer systems. It kind of makes sense: we've been building mechanisms for a very long time and the logical, mathematical basis of computer science lends itself to this kind of thinking - representing in software a chain of cogs, each coupled to the next, powered by and dependent upon its neighbor. Look inside an old watch. Now take something small and seemingly insignificant out. Still working? Thought not.
A mechanical type of system has a number of 'moving parts', all of which must be functioning for any of them to function. The system is very easy to observe and very easy to predict - you'll pretty much be able to guarantee what state it will be in under any given circumstances. Information will flow through this 'production line' from one step to the next, operated on with monotonous exactitude. You'll have to keep a close eye on it though, because no step in the chain will ever be aware of any other, and therefore can't make a decision about whether or not it's safe to pass its output on to the next step (will it be lost?) or whether it can trust the input of the previous step (is it corrupt?). Luckily it's pretty easy to observe, but unluckily it will more than likely need manual intervention to 'realign the teeth' when a gear goes bad. Scaling is tough too: you can always use bigger cogs, but everything will only ever go as fast as the slowest wheel.
Recently we've recognized some patterns of behavior that occur in the natural world that would be of benefit in running a large scale computer system, and at the cost of increased complexity, we can replicate these in software just as we replicated a chain of hardwired moving parts.
The biological analogy is a way to build systems that are aware of their environments and their fellows, possess a degree of self-governance, and can respond 'intelligently' to changing circumstances. Look at a line of ants. Block their path, take some away, move their food source. Still working? You bet.
There are a couple of equally valid analogies to draw with the natural world. Firstly, observe how a single organism works on the cellular level. What matters is that the whole animal functions competitively; individual cells are irrelevant, but they all work together to keep the 'system' alive. They have a way to replicate themselves, kill off corruption, heal 'faults' and modify their role as requirements change. They don't need to be externally monitored or controlled by a central point - they all have a little bit of this distributed amongst their number. Now imagine that animal is your system and the cells are servers, nodes, units of functionality - whatever it is that makes up your system. The other way to think about biological computing is to study social insects, cooperative hunters, and hives/swarms. You end up in pretty much the same place: a group of individuals which make up a whole, with the whole being more valuable than the individuals. I think which one you prefer depends on where you went to school (or how much Animal Planet you've watched). It can be easier to think about bees making conscious (instinctive?) choices, communicating and reaching a consensus, than trying to divine meaning in the more mysterious chemical reactions that govern the cellular world.
A biological type of system is made up of a collection of disposable nodes, the sum of which is the whole system. Nothing is more important than the survival of the system, and these 'cells' must change purpose and even 'die' to keep the system going. There is no central government and there are no master nodes; all the 'cells' are peers and the administrative tasks in the system are distributed amongst them. The system is capable of making simple decisions for itself, responding to environmental stimuli like load, failure conditions and available upgrades. Many of these decisions are reached by consensus, using protocols which create a 'hive mind' pulling together the opinion of every relevant individual and answering questions such as: is a feature down? Where am I? Am I the right version? What are my neighbors doing? Who will service the next request? Is this safe to eat?
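To make the 'hive mind' idea a little more concrete, here is a minimal sketch of how a node might pool its peers' opinions on a question like "is a feature down?". The function name hive_opinion, the quorum parameter and the node names are all made up for illustration, and a plain majority vote is only a stand-in for whatever consensus protocol a real system would use.

```python
from collections import Counter


def hive_opinion(votes, quorum):
    """Return the strict-majority answer among the votes.

    votes  -- mapping of node name -> that node's answer (hypothetical shape)
    quorum -- minimum number of nodes that must have answered

    Returns None when too few nodes answered or no answer has a strict
    majority, so the caller can decide to wait and ask again.
    """
    if len(votes) < quorum:
        return None
    answer, count = Counter(votes.values()).most_common(1)[0]
    return answer if count > len(votes) / 2 else None


# Three peers report what they can see of a feature.
votes = {"node-a": "down", "node-b": "down", "node-c": "up"}
print(hive_opinion(votes, quorum=3))  # prints: down
```

Returning None rather than guessing is deliberate: a node acting on a thin or split opinion is exactly how a system like this can run away with itself.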
Biological systems are more difficult to observe and their behavior becomes harder to predict the further into the future you look. Their decisions are only as good as the data they have and the rules you give them to assess it with - so you can end up worse off than if humans manually configured everything. A cellular system is composed of a number of small, totally independent pieces of functionality, and the upsides to this are scalability and partial failure. Scale comes through being able to arbitrarily add more 'cells' to your 'organism' as it needs to get bigger, and bottlenecks are easier to solve as every individual component is able to operate as fast as its legs will carry it (these designs are usually asynchronous). Partial failure means that even if all the nodes that make up a certain feature are down, your system as a whole will still work sans that one bit of functionality. Self healing is when a system is aware of what it should look like (as a gecko I should have a tail) and is able to recognize when it does not (ouch I don't anymore) and take some corrective action (better grow it back). This itself is a double-edged sword: imagine you're intentionally taking something down because of a security flaw; you have to give yourself some way to prevent the page springing back up somewhere else. The 'split brain' problem becomes even more significant; usually in a split brain you face potential inconsistency with data being written in more than one place. With a system designed to repair itself, you might just end up with 2 complete copies working independently - that may not be all bad (depending on how your system works) but the ability to kill off duplication once connectivity is returned is something that needs to be addressed.
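The gecko's reasoning can be sketched as a small reconciliation step: compare what the system should look like with what it actually looks like, and work out the corrective action. Everything here is an assumption made for illustration (the plan_repairs name, the idea of counting instances per feature, and the suppressed set), not a description of any particular product. The suppressed set is one possible answer to the security-flaw problem above, and the 'stop' action is one way to kill off duplicates after a split brain heals.

```python
def plan_repairs(desired, observed, suppressed=frozenset()):
    """Work out which instances to start or stop.

    desired    -- feature name -> number of instances we should have
    observed   -- feature name -> number of instances actually running
    suppressed -- features an operator has deliberately taken down;
                  these are treated as 'want zero' so they don't regrow

    Only features listed in `desired` are considered.
    """
    actions = []
    for feature, want in sorted(desired.items()):
        if feature in suppressed:
            want = 0
        have = observed.get(feature, 0)
        if have < want:
            actions.append(("start", feature, want - have))  # grow it back
        elif have > want:
            actions.append(("stop", feature, have - want))   # cull the surplus
    return actions


desired = {"page": 2, "tail": 1}
observed = {"page": 3}  # one page too many, and the tail is missing
print(plan_repairs(desired, observed))
# prints: [('stop', 'page', 1), ('start', 'tail', 1)]
```

Passing suppressed={"tail"} to the same call would leave the tail off instead of regrowing it, which is the sort of override you need before intentionally taking something down.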
So now the previously promised postulation (I am good at scrabble). We know we can build systems whose behavior is easy to predict and easy to observe, but they are rigid and require a lot of manual oversight and intervention. The mechanical way. We know we can build systems that are flexible, automatically resilient to environmental change and inherently scalable, but they are difficult to accurately predict and can run away with themselves if not given the right information from which to make deductions about state. The biological way. In a web-scale environment where availability, scalability, and the capability to ship frequently are such vital attributes of any product, it seems to me that we benefit more from thinking in a loosely coupled, compartmentalized, organic way than an interlocked, highly dependent, production line way.
Monday, 25 August 2008
The Law of Conservation of Complexity
A couple of real life examples:
Something customer facing. Let's say we're selling car parts online and we need to check availability from manufacturers before we accept orders, and we'd quite like to do this in real time so we don't set our customers' expectations wrongly. We could do this in the back end by making complementary calls from our application to the manufacturer's inventory system as we receive requests from our customers. This might be a smart solution because we'd be able to do some short-term caching of results for popularly looked-up items. We could do this in the front end by getting our users' browsers to fetch the stock levels directly from the manufacturer's website or (if we're lucky) API. This might be a smart solution because it would reduce the load our application takes.
So we can simplify the front end at the expense of the back end, or we can simplify the back end at the expense of the front end. Either way we can't escape the need to make that request somehow without changing the functionality of the system [being aware of stock levels].
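For the back end option, the short-term caching might look something like the sketch below. The StockCache class, its fetch callback and the 30 second lifetime are all assumptions for illustration; fetch stands in for whatever call really reaches the manufacturer's inventory system.

```python
import time


class StockCache:
    """Remember stock levels for a short time to spare the manufacturer's system."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch        # hypothetical: part_id -> stock level
        self._ttl = ttl_seconds    # how long an answer is considered fresh
        self._clock = clock        # injectable so it can be tested without waiting
        self._entries = {}         # part_id -> (time fetched, stock level)

    def get(self, part_id):
        now = self._clock()
        hit = self._entries.get(part_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]          # still fresh, no remote call needed
        level = self._fetch(part_id)
        self._entries[part_id] = (now, level)
        return level


# Usage with a stand-in lookup that always reports 7 in stock.
cache = StockCache(fetch=lambda part_id: 7)
print(cache.get("brake-pad"))  # first call goes to the manufacturer
print(cache.get("brake-pad"))  # second call is served from the cache
```

Note that the complexity hasn't gone anywhere: the front end stays simple, but the back end now owns freshness rules and has to decide how stale a stock level is allowed to be.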
Something platform. Let's say we've got a distributed application that we need to provide strongly consistent data storage for. Since consistency is one of our requirements, and assuming we don't want to centralize the system, we need a way to make sure that when we write to a piece of data in one place, a contradictory update is not being made in another. We could do this on the storage side, by using a distributed locking algorithm or electing certain partitions to be the designated 'writer' for certain data items. This might be a smart solution because it's more portable. We could do this in the application by making the storage client responsible for locking every copy of a datum, or sending a message to all its peers advising them of updates it wants to make. This might be a smart solution because it simplifies our data administration.
So we can simplify the application at the expense of the storage system, or we can simplify the storage system at the expense of the application. Either way we can't escape the need to govern writes somehow without changing the functionality of the system [keeping consistent data].
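As a rough illustration of the storage-side option, here is one simple way to pick a designated 'writer' for each data item: hash the item's key and use the result to choose a partition, so every node that runs the same function agrees on the answer without asking anyone. The designated_writer name and the partition names are hypothetical, and hashing by modulo is a swapped-in technique chosen for brevity; it reshuffles most keys whenever the partition list changes, which a real system would need to handle.

```python
import hashlib


def designated_writer(key, partitions):
    """Pick the single partition allowed to accept writes for `key`.

    The choice depends only on the key and the ordered partition list,
    so any node can compute it locally and reach the same answer.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(partitions)
    return partitions[index]


partitions = ["storage-1", "storage-2", "storage-3"]
writer = designated_writer("account:42", partitions)
print(writer)  # always the same partition for this key and partition list
```

The application gets simpler (it just sends writes where it is told) while the storage layer takes on the job of agreeing what the partition list is.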
Just as energy is never lost, the minimum amount of complexity an entire system must have to achieve its goals can never be reduced; it can only be moved around.
Thursday, 21 August 2008
Are you getting the most out of Quality?
Testing != Quality.
Now that we've gotten that out the way, let's talk about why. Testing is an essential activity in any quality system, but it should never be the focus. Testing should be used to verify that your quality is working, not to verify that your product is working - because if you have a quality process fueled by quality input then you can predict a certain level of quality product.
There is something to be learnt from looking at the history of quality, long before we needed to apply it to software. Before the industrial revolution which saw mass manufacturing on production lines, all we had was testing. Quality of output was a function of the individual craftsman - pieces were all made and inspected manually. Clearly this process couldn't scale to outputs in the tens of thousands of units per day, as some production lines were capable of, and so we needed a way to ensure the quality of our output without having to individually inspect every single unit. And thus, quality was born.
The key was focusing on the process; ensuring that materials, practices, and tools adhered to a certain standard to create consistent output. This way, inspecting the output became the way we tested the process, not the units, and we could afford to take samples instead of reviewing every individual item.
I am surprised by how many organisations still have a 'testing the output' approach to quality; they could be spending their time so much more wisely. There are a whole bunch of things that make up quality in software projects: the architecture, the specifications, the documentation, the environments, the project management, and the testing which proves it was all done adequately.
Invest in quality, not in testing, inspect samples only, and automate a complete exercise of all your functionality (what it should do and what it should not do) before you ship.
Monday, 18 August 2008
Leading by Example
Another basic forgotten by so many. Early in a switch to management, it slowly dawns on you that you have a very different job to that of your team. That's exactly as it should be, but just because you don't write [much] code anymore, don't get sucked into thinking you can't still set the example! You no longer perform the same daily tasks, but there are some basic professional conducts your team will always observe in you.
As a leader, your actions tell everyone what’s important to you and what you expect from everyone else. It doesn’t matter if what you say is in direct contradiction with what you do – people will follow the walk you walk, not the talk you talk (risking an old cliche).
Think about the way you conduct yourself. The times you arrive and leave, the dress code you follow, the type and tone of emails you send and your use of company resources. If you’d be happy with your team behaving in exactly the same way then great, you nailed it. If not, then think carefully about what you’re telling them...
Friday, 15 August 2008
Specialisation vs. Pooling
A while back I posted about ways to structure teams that will get the most appropriate input into your product development inside the smallest possible communication loop. My argument was essentially for organizing by product or service - deconstructing your domain to the smallest units of functionality that can be owned end-to-end by a multidisciplinary team of 5-10 engineers. I haven't yet come across a webscale system that this can't be done with.
Of course there are benefits to organizing by skill (Java team, DBA team, QA team etc.) such as ensuring consistency in how tasks are performed and creating a sense of community among engineers in similar roles. Grouping people by the role they perform tends to make you really good at executing those particular skillsets. Grouping by the product or service you take to market tends to make you really good at the subset of those skillsets that make the most difference to your product - what particular part of system administration is most critical to keeping a trading exchange online? What particular part of testing finds the most critical issues in a deposit and withdrawal service?
Like most things one isn't the 'right' answer; it's really two different sets of benefits and you need to ask yourself which are more valuable to your company.
If you use off-the-shelf commodity systems to support traditional business processes then you're probably better off organizing by trade - chances are you'll get the most competitive advantage by keeping JD Edwards online, making sure your file servers don't run out of disk space and your CRM system is always up. If you're a technology-led company then you really should consider organizing by product or service. You'll still always have the business of running the business (you'll need some corporate systems) but chances are you will create the most competitive advantage by being the best at those particular engineering tasks that are specific to your product.
Monday, 11 August 2008
The Business Value of the Back End
In my experience, most companies are good at recognizing the value of customer facing, revenue generating features and not so good at recognizing the value of back end, platform features. Which is a real shame, because without the platform there can be no customer facing features, and beyond that, if you invest in platform you can squeeze even more value from it through quicker time to market, more availability, and growth at better marginal cost.
I can see why we're in this situation - it's easy to judge the business value of customer facing features; you can count revenue, wallet share, registrations/conversion, ARPU and stickiness quite simply. Capability, which is the primary deliverable of platform, is tougher to describe in measurable ways.
Perhaps at this point we could just agree to disagree, but it's not that simple. As engineers, we know that there is a certain amount of effort we have to spend on platform just to keep the business running. We also know there is a certain amount of effort we should spend on platform, which would result in tools that lead to better quality. And, to be fair, we also know there is time we could spend on platform that would take us into diminishing returns. Ensuring the business commits to the 'have to' effort and helping to get better trade-offs for the 'should do' investment is why we can't settle for mutual incomprehension - we need to get better at selling this.
It should be easy, because there is business value in the platform things that we, as engineers, want to do - after all, if there wasn't, we wouldn't be proposing them. To me, the answer is in how we describe what we want to do. If we can articulate the business value better then we level the playing field between features and platform. This enables decision makers to compare apples with apples when deciding how to spend the organisation's resources.
Some simplified examples:
- Refactoring some old products into loosely coupled services = a more granular failure model = more consistent service to customers (you can't take revenue through a 404).
- Improving the throughput of a database server = quicker market settlements = customer funds returned sooner (they'll trade again more times that day).
- Separating UI components and content management = more front end flexibility and editorial control = more marketing and promotional opportunities without software changes.
- Providing a simple, abstracted data storage and retrieval interface = less complexity for engineers building new products = quicker time to market for new features.
- Implementing a messaging infrastructure which presents core components as APIs = more reuse and isolation of features = more concurrent workstreams and easier outsourcing (because of the natural boundaries created).
We can never guarantee people will make the decisions we consider "right" but if we use the language of the business when we're describing the platform capabilities we want to build we'll at least have removed the primary barrier to understanding.
For any web business to work, everything has to be a balance of customer facing features and back end capability - the key is that balance being determined by a true comparison of value. I think we'll know that we're on equal footing when capability gains are celebrated when released, and pursued when missed, with the same ferocity and dedication customer facing features are.
Thursday, 7 August 2008
Hot Maintaining Communication Systems
A while ago Joe Armstrong posted a few simple ideas for hot maintenance on his blog (and I heart hot maintenance). It starts from a position of a fair few assumptions, such as your clustering service being OK with nodes arbitrarily joining and leaving, and any central data or state storage being OK with heterogeneous nodes connecting to it. But hey - good design principles anyway, so for argument's sake let's just call those assumptions validated...
Someone posted a very relevant question in the comments - what about upgrades that require changes to the protocol between nodes? This is a vital issue to address, because if you can't come up with a good solution for hot maintaining your communications glue, there is a limit to how much overall hot maintenance you can practically achieve.
The answers to this question are as varied as there are design patterns, but using the same set of assumptions we read into Joe's original post, here are 3 techniques that should be fairly portable:
Dual-headed. If you're changing transports you can introduce nodes that communicate using both protocols. You duplicate traffic (if you do it fairly unscientifically) but it's viable for a rolling upgrade.
Versioned interfaces. If you're changing the message format, version the messages. This will let you gradually move nodes onto the new build and give you some added benefits like A/B testing, faster future upgrades and rollbacks.
Translation gateway. So far we've assumed a group of nodes in a single location. For a distributed system with collections of nodes in distinct places, a gateway that speaks protocol 'old' on one side and protocol 'new' on the other might work best, letting you upgrade cluster-by-cluster without taking down the whole system.
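Of the three techniques above, versioned interfaces is the easiest to show in a few lines. The sketch below assumes a made-up JSON envelope with a "v" field and a "body", and two invented message formats reporting a node's load; none of these names come from Joe's post. The point is only that a node on the new build can still understand its older neighbors during a rolling upgrade.

```python
import json


def _from_v1(body):
    # Hypothetical old format: load is a single number.
    return {"node": body["node"], "cpu": body["load"], "queue_depth": None}


def _from_v2(body):
    # Hypothetical new format: load is broken out into separate fields.
    load = body["load"]
    return {"node": body["node"], "cpu": load["cpu"], "queue_depth": load["queue"]}


DECODERS = {1: _from_v1, 2: _from_v2}


def decode(raw):
    """Turn any supported message version into one internal shape."""
    envelope = json.loads(raw)
    version = envelope.get("v", 1)  # assume unversioned messages are the oldest
    decoder = DECODERS.get(version)
    if decoder is None:
        raise ValueError(f"unsupported message version: {version}")
    return decoder(envelope["body"])


old = '{"v": 1, "body": {"node": "a", "load": 0.5}}'
new = '{"v": 2, "body": {"node": "b", "load": {"cpu": 0.7, "queue": 3}}}'
print(decode(old))
print(decode(new))
```

Rolling back is then a matter of keeping the old decoder (and encoder) around until no node still speaks that version.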
To get the right answer for any given system the first thing to do is study the communication pattern - how much is management vs. service, what is node to node vs. node to DB, and how much crosses geographical locations? This gives you your architectural options, beyond which you just need to keep your state tracking heterogeneous, your data partitionable, and (to make your own life easier) your inconsistency window as long as the business rules allow.
Monday, 4 August 2008
Why is it that after so many software projects we have that familiar "discussion" about scope creep and deadlines? It took longer than we estimated, we say requirements changed or were added, stakeholders say they weren't, rinse and repeat. I really drilled into this on a few recent projects and I'm starting to think the answer is more about semantics and less about passing the buck.
So, into the field.
"Tell me why you believe so strongly that requirements did not change" I asked the stakeholders. They told me that they asked for, say, a gambling website at the start of the project and at the end of the project they still just wanted a gambling website. The label 'gambling website' has some basic defining properties; you can sign up, you can deposit funds, you can gamble these funds and you can withdraw winnings. We didn't add anything else to that. WTF?
"Tell me why you believe so strongly that requirements did change" I asked the engineers. They told me that we started off needing to register accounts, but then added age verification, geolocation, and integration to another product. We needed to build an odds manager, but then we added price feeds, client side push, and autopopulation of recurring markets. WTF again?
What's interesting here is that both sides are answering a slightly different question - and both are, in fact, correct. The stakeholders are actually saying that their goals didn't change and the vision didn't change. True. The engineers are actually saying the details changed and the specifications grew. Also true. So which one is synonymous with "requirements"? When we say requirements, do we mean goals or details? Do we mean vision or specification? It's become clear to me that business-types mean vision and goals and engineering-types mean details and specifications. Now we're onto something...
Goals and vision are what stakeholders use to assess the commercial viability of a project and to communicate their high level idea to a delivery team. This is all quite correct, but where we go wrong is in trying to live up to delivery guesstimates made from such information. Because, unfortunately, what dictates how much something actually costs to build is the detailed functional and architectural specifications - not high level vision statements.
A working example.
Payments system - pretty easy to describe in a vision; money in, money out, a variety of deposit and withdrawal channels and a bit of diligence around anti-money laundering etc. You can make a bunch of quite valid assumptions about what you need to build to put this together, but when your product management starts to expand on exactly how the system will perform in real life, you can't forget to also expand on exactly what it takes to build it - these things evolve in unison.
Imagine you're building a house. You can take a guess about what that's going to cost. Then you might talk about the materials you want to use (bricks, wood, stone), but you'd need to update your cost estimate. How about the number of bedrooms, bathrooms, en suites etc.? Update again. What about the number of levels you want to build, gradient of the land, and distance from utility trunks? Again you'd need to rethink cost to avoid surprises later. This is such common sense in the practical, tactile world of construction, so why does it escape us in other engineering disciplines?