Friday 21 November 2008

Root Cause Analysis

To help me kill some time at an airport (which seems to be my second job these days), let me reach into my wardrobe of soap-box issues and pick something out. Ah, root cause analysis, here we go.

In my opinion, proper root cause analysis is the most important part of any operational support process.

Having a professional, predictable response and the skills to restore service quickly are critical - but you have to ensure that your support processes don't stop there. If they do, then you're simply doomed to let history repeat itself, and this means more downtime, more reactive firefighting, and less satisfied customers.


Good root cause analysis takes into account the entire event - systemwide conditions, the team's response, the available data, the policies applied - not just the technical issue which triggered the fault, and looks for ways to reduce the likelihood of recurrence.

Doing root cause analysis properly can be expensive, because getting to the bottom of why it happened this time isn't enough - it's why it keeps happening, and why the system was susceptible to the issue in the first place, that you need to uncover to really add future value. Think of the time spent on it as an investment in availability, in freeing up your team to work more strategically (as well as enjoy their jobs more), and in happier users (which oddly seems to make happier engineers).

But what you learn by doing this isn't really worth the time you spend on it without the organizational discipline to follow up with real changes. If you're truly tracing issues back to their root, you'd be surprised how many are the result of a chain of events that could stretch right back to the earliest phases in projects. This needs commitment.

If you make money out of responding to problems then you'll probably want to ignore my advice. There is a whole industry of IT suppliers whose core business lives here, and while it's an admirable pursuit, don't take the habit with you when you join an internal team!

Wednesday 19 November 2008

The Confidence-o-Meter

A while back we had a project that just seemed destined to face every type of adversity right from the outset (we've all had at least one). Being a new line of business for us, we didn't even have a customer team when work needed to begin! It was going downhill, and with strict regulatory deadlines to meet, we needed to get it back on track. Additional complications arose because the team was new, and as such were still gelling together as they tackled the work. Let's throw in an unusually high dosage of the usual incomplete specifications, changing requirements, and unclear ownership that regularly plague software projects and you have a recipe for a pretty epic train smash.

They say necessity is the mother of invention (Plato?) and there was certainly no shortage of necessity here, so we got on with some serious invention.

The Problem

We needed a way to bring the issues out in the open, in a way that a newly forming team can take ownership of them without accusation and defensiveness creeping in. More urgently still, we had conflicting views on the status of each of the workstreams.

It seemed sensible to start by simply getting in the same room and getting on the same page. To give the time some basic structure, we borrowed some concepts from the planning poker process. There were some basic ideas that made sense for us - getting the team together around a problem, gathering opinion in a way that prevents strong personalities dominating the outcome, and using outliers to zero in on hidden information. As an added bonus, the quasi-familiarity of the process gave the team a sense of purpose and went some way to dispel hostility in a high pressure environment.

The Solution

We started by scheduling a weekly session and sticking to it. Sounds simple, but when the world is coming down around your ears, it is too easy to get caught up in all the reactivity and not make the space to think through what you're doing.

We set aside some time at the end of each week, and our format for the session was fairly simple:
• All the members of the delivery team state their level of confidence that the project will hit its deadline by calling out a number between 1 and 10, 1 being definitely not, 10 being definitely so.
• We record the numbers, then give the lowest and highest numbers an opportunity to speak, uncovering what they know that makes them so much more or less confident than the rest of us. This way the whole group learned about (or dispelled) issues and opportunities they may have been unaware of.
• In light of the new information from the outliers, everyone has an opportunity to revise their confidence estimate. This is recorded in a spreadsheet which lends this post its title.
• Finally we took some time to talk over the most painful issues or obvious opportunities in the project and the team, then picked the most critical one and committed to changing it over the coming week. We also reviewed what we promised ourselves we'd change last week.
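For anyone who would rather script it than keep a spreadsheet, the bookkeeping behind a session like this can be sketched roughly as follows. This is only an illustration, not the spreadsheet we actually used: the names (Vote, outliers, session_index, dips) are made up for the example, and using the mean of the revised numbers as the weekly index is an assumption on my part.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Vote:
    member: str      # hypothetical team member name
    confidence: int  # 1 = definitely not, 10 = definitely so

    def __post_init__(self):
        if not 1 <= self.confidence <= 10:
            raise ValueError("confidence must be between 1 and 10")


def outliers(votes):
    """Return the lowest and highest votes: the two people who speak first."""
    lowest = min(votes, key=lambda v: v.confidence)
    highest = max(votes, key=lambda v: v.confidence)
    return lowest, highest


def session_index(votes):
    """Weekly index: the mean of the (revised) confidence numbers (assumed)."""
    return mean(v.confidence for v in votes)


def dips(weekly_index):
    """1-based week numbers where the index fell compared with the week before."""
    return [
        week + 1
        for week in range(1, len(weekly_index))
        if weekly_index[week] < weekly_index[week - 1]
    ]


# Example session with invented numbers.
first_round = [Vote("Ana", 3), Vote("Ben", 6), Vote("Cy", 8)]
lowest, highest = outliers(first_round)
print(f"{lowest.member} and {highest.member} speak first")
```

The dips helper is there because the trend matters as much as any single week's number: a drop is the cue to ask what the team has just run into.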

Through this discipline the team made, in real time, a bunch of very positive changes to themselves and how they work together without having to stop working and reorganize. We also had a very important index that we could use to gauge exactly how worried we should be - which is pretty important from a risk mitigation perspective. The trend was important too - we were able to observe confidence rise over time, and respond rapidly to the dips which indicated that the team had encountered something that they were worried would prevent them from delivering.

The Outcome

This exercise gained us a much more stable view of a chaotic process, and let us start picking off improvements as we worked. By making the time to do this and being transparent with the discussions and decisions, the team felt more confident and in control of their work - which always helps morale.

Because we were able to give the business a coherent, accurate assessment of where we were at, we were able to confidently ask for the right support - it was easy to show where the rest of the organization could help, and demonstrate the impact if they didn't meet their obligations to us.

In summary, we got our issues out in the open and got our project back on the rails. And by the time we got together for our post-implementation retrospective, we were pleasantly surprised by how many of our most critical problems we'd already identified and fixed. If you're in a tough spot with a significant piece of work and a fixed deadline, consider giving something like this a try - I think it will work alongside any development methodology.

Monday 17 November 2008

Cloud Computing Isn't...

Thought for the day - when does a new idea gain enough structure to graduate from meme to tangible concept? Is there some quorum of 'experts' that need to agree on its shape? Perhaps we need a minimum number of books written on the topic, or a certain number of vendors packaging the idea up for sale? Or maybe it is as simple as all of us trying it our own way for long enough for observable patterns to emerge?

We might have already crossed this bridge with cloud computing thanks to the accelerated uptake of robust platforms such as EC2 and App Engine (and the adherence to theme of newer offerings like Azure), but there is still a lot of residual confusion that we might start to mop up if we were so inclined.

The first thing we might stop doing is retrospectively claiming any sort of non-local activity as cloud computing. What's that? You've been using Gmail or Hotmail for years? No, sorry. You are not an ahead-of-the-curve early adopter, you are just a guy who has been using a free web based email system for a while.

Before the inevitable torrent of angry emails rains down upon my inbox, let's pause to think about what we're trying to achieve here. Does classifying the likes of Hotmail and - well, insert your favorite SaaS here - as cloud computing help or hinder the adoption and development of cloud technology? I think we probably establish these analogies because we believe that the familiarity we create by associating a trusted old favorite with a radical new concept may add comfort to perceived risks. But what about the downside of such a broad classification?

These systems are typically associated with a very narrow band of functionality (for example, sending and receiving email or storing and displaying photos) and are freely available (supported by advertising or other 2nd order revenue). They tend to lack the flexibility, identity, and SLA that an enterprise demands. This analogy may well be restricting adoption in the popular mind. Besides, where do you draw the line? Reading this blog? Clicking 'submit' on a web form? Accessing a resource in another office on your corporate WAN? I'm not knocking anyone's SaaS, in fact the noble pursuits that are our traditional online hosted email and storage systems have been significant contributing forces in the development of the platforms that made the whole cloud computing idea possible.

So, if a lot of our common everyday garden variety SaaS != a good way to talk about cloud computing, then what is?

Let's consider cloud computing from the perspective of the paradigm shift we're trying to create. How about cloud computing as taking resources (compute power and data) typically executed and stored in the corporate-owned datacenter, and deploying them into a shared (but not necessarily public) platform which abstracts access to, and responsibility for, the lower layers of computing.

That may well be the winner of Most Cumbersome Sentence 2008, but I feel like it captures the essence to a certain degree. Let's test our monster sentence against some of the other attributes of cloud computing - again from the perspective of what you're actually doing in a commercial and operational sense when you build a system on a cloud platform:

• Outsourcing concern over cooling, power, bandwidth, and all the other computer room primitives.
• Outsourcing the basic maintenance of the underlying operating systems and hardware.
• Converting a fixed capital outlay into a variable operational expense.
• Moving to a designed ignorance of the infrastructure (from a software environment perspective).
• Leveraging someone else's existing investment in capacity, reach, availability, bandwidth, and CPU power.
• Running a system in which cost of ownership can grow and shrink in line with its popularity.

I think talking about the cloud in this way not only tells us what it is, but also a little about what we can do with it and why we'd want to. If you read this far and still disagree, then enable your caps lock and fire away!

Tuesday 11 November 2008

My agile construction yard

I have been successfully practicing agile for quite a while now, and I've always believed that, given a pragmatic application of the principles behind it, it can be used to manage any process. Mind you, having only ever tried to deliver software with agile, this remained personally unproven. Gauntlet?

So I figured that if I am going to keep going on about it, I am going to have to put my money (and my house) where my mouth is. So I did, and this is the story...

The requirements

I wanted a bunch of work done on my house - extending a room, replacing the whole fence line, building a new retaining wall, laying a stonework patio, new roof drainage, building a new BBQ area, and some interior layout changes - and I thought it's now or never. I spoke to the builder and told him all about agile, lean thinking, project management practices like SCRUM and XP, and how it can benefit both of us in delivering House 2.0. He asked if he could speak to the responsible adult who looks after me. "Great, a waterfall builder" I say to myself as I try not to be offended by the 'responsible adult' quip.

But we strike a deal; he's going to do big up front drawings and a quote, and we'll proceed my way at my own risk and responsibility. The game is on.


The first thing we do is run through all the things I want done, which ones are most important to me, and roughly how I want them all to look. I guess you could call this the vision setting. Then my contractor asks me the few big questions; the materials I want to use, budget, and when I want it ready by. He makes some high level estimations on time and cost, based on which I rearrange my priorities to get a few quick wins. We have a backlog.

The project

Our first 'sprint' is the fence line. We come across our first unknown already - the uprights that hold the fence into the ground are concreted in and we either have to take twice as long to tear them out, or build the new fence on the old posts. Direct contact throughout the process and transparency of information ensures that we make the decision together, as customer and delivery team, so that neither of us is left with the consequences of unforeseen situations. I want the benefit of new posts so I'm quite happy to eat the costs put forward.

Next we do the retaining wall, and we have a quick standup to go over the details - we need to decide on a few things, such as the exact height and the type of plants growing across the top. Since the fence has been done I go with some sandy bricks that match the uprights and the wall is constructed without incident. The next thing we're going to tackle is the BBQ area; however, beyond that the roadmap calls for the room extension and so we need to apply for planning consent in order to get the approval in time. Agile doesn't mean no paperwork and no planning, it means doing just enough just in time for when you need it.

Now we hit our first dependency - the patio must be laid first before we can build the BBQ area. That's cool, and through our brief daily catchups, we come up with an ideal layout and pick out some nifty blocks. A bit of bad weather slows down the cementing phase slightly, but we're expecting that - this is England. We use the opportunity to draft some drawings for the room extension and get the consent application lodged.

It's BBQ building time. I've been thinking about it since we started the project, and I decided I wanted to change it. The grill was originally going up against one wall, but wouldn't it be much more fun if it was right in the middle so everyone could stand 360 around it and grill their own meat? You bet it would. We built a couple of examples out of loose bricks (prototypes?) and then settled on a final design. It takes a bit more stone than the original idea, but it's way more awesome.

Then our project suffered its first major setback - the planning consent process uncovers that a whole lot of structural reinforcement will be needed if they're going to approve the extension. That pretty much triples the cost of adding the extra space. Is it still worth it at triple price? Not to me. Lucky we didn't invest in a lot of architect's drawings and interior design ideas, they'd be wasted now (specifications are inventory and inventory is a liability). So we start talking about alternatives, and come up with a plan to create the new space as storage and wardrobes - not exactly what I had in mind up front, but at less than half the original cost it still delivers 'business value'.

The retrospective

So how did it all turn out? Well, as a customer, I felt involved and informed throughout the whole process, and the immediate future was usually quite predictable. Throughout the project I had the opportunity to adjust and refine my requirements as I saw the work progress, and I always made informed tradeoffs whenever issues arose. I am happy, the builder is happy, and I got exactly what I wanted - even though what I really wanted ended up quite different to what I thought I wanted when we started.

Oh and if anyone wants a good building contractor in Surrey...

Friday 7 November 2008

Cost in the Cloud

Cost is slated as benefit number 1 in most of the cloud fanboy buzz, and they're mostly right. Usage-based and CPU-time billing models do mean you don't have tons of up front capital assets to buy - but that's not the same thing as saying all your cost problems are magically solved. You should still be concerned about cost - except now you're thinking about expensive operations and excess load.

Code efficiency sometimes isn't as acute a concern on a traditional hardware platform because you have to buy all the computers you'll need to meet peak load, and keep them running even when you're not at peak. This way you usually have an amount of free capacity floating around to absorb less-than-efficient code, and of course when you're at capacity there is a natural ceiling right there anyway.

Not so in the cloud. That runaway process is no longer hidden away inside a fixed cost, it is now directly costing you, for example, 40c an hour. If that doesn't scare you, then consider it as $3504 per year - that's for one instance, how about a bigger system of 10 or 15 instances? Now you're easily besting $35K and $52K for a process that isn't adding proportionate (or at worst, any) value to your business.
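If you want to check that arithmetic, or plug in your own rate, here is a minimal sketch. It assumes an instance that runs around the clock for a 365-day year at a flat hourly rate, and the function name annual_cost_dollars is purely illustrative rather than any provider's API. Working in whole cents keeps the sums exact.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_cost_dollars(rate_cents_per_hour, instances=1):
    """Yearly cost in dollars for instances that never stop running."""
    total_cents = rate_cents_per_hour * HOURS_PER_YEAR * instances
    return total_cents / 100


# The 40c-an-hour example, for 1, 10 and 15 instances.
for count in (1, 10, 15):
    print(count, annual_cost_dollars(40, count))
```

At 40c an hour that works out to $3,504 for one instance, $35,040 for 10, and $52,560 for 15.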

Yikes. So stay on guard against rogue processes, think carefully about regularly scheduled jobs, and don't create expensive operations that are triggered by cheap events (like multiple reads from multiple databases for a simple page view) if you can avoid it. When you are designing a system to run on a cloud platform, your decisions will have a significant impact on the cost of running the software.

Monday 3 November 2008

Eachan's Famous Interview Questions - Part II

A while back I posted some engineering manager/leader interview questions that I frequently use - designed to test how someone thinks, what their priorities are, and how they'd approach the job - rather than whether or not they can do the job at all. As I said back then, if you're at a senior level and you're testing to see if someone is capable of the basic job at all, then you're doing it wrong (rely on robust screening at the widest point - your time is valuable).

Like everything else, this is subject to continuous improvement (agile interviewing - we're getting better at it by doing it) and with more repetition you tend to develop more ways of sizing people up in a 1 hour meeting. So here is iteration 2:

1. What is the role of a project leader? Depending on your favorite SDLC, that might be a project manager, SCRUM master, or team leader - but what you're looking for is a distinction between line management (the maintenance of the team) and project management (the delivery of the work).

[You might not make such distinctions in your organization. It is important to note that all these questions are intended to highlight what an individual's natural style is, not to outline a 'right' way to do it.]

2. Walk through the key events in the SDLC and explain the importance of each step. It is unlikely any candidate (except maybe an internal applicant) is going to nail down every detail of your SDLC, but what you're hoping to see is a solid, basic understanding of how ideas are converted into working software. It seems overly simple, but you'd be surprised how many people, even those who have been in the industry many years, are really uneasy about this. Award extra credit for 'importances' that benefit the team as well as the business (for example - product demos are good for the team's morale etc).

3. Who is in a team? Another dead simple one, and what you are testing for is engagement, inclusion, and transparency. Everyone will tell you developers, and usually testers, but do they include NFRs like architects? Supporting trades like business analysts and project managers? How about the customer him/herself?

4. What is velocity, how would you calculate it, and why would you want to know? Their ability to judge what their team is capable of is the key factual basis for the promises they'll make, and for how they'll monitor the team's performance and help it improve over iterations.

5. Explain the software triangle. This is another one of my favorites - because the fundamental relationship between time, scope, and cost is as real a law as gravity yet so many engineering professionals still seem to live in some kind of weird denial. Perhaps afraid of falling off the edge of the earth? Nonetheless, someone who won't get swept along on a romanticized story of One Man's Heroic Triumph Over Project X will make sure you keep a sustainable team and not fall into the sarlacc pit of over-promising and under-delivering. You can also use this question as a springboard to explore how they'd negotiate tradeoffs with customers and how they'd make the costs of decisions visible.

6. How would you handle a team coming off a failed project? No one will ever preside over a flawless team that never drops anything, so being able to handle this effectively is a critical skill. For me, the ideal candidates have some answers to both 'what can we do to recover morale and re-motivate the team?' and 'what went wrong and how can we sidestep it next time?'

7. What's the definition of done? You need your own definition of done, but I'm always looking for people who include testing, documentation, successful build and integration, failure scenarios, maintenance plans etc in their definitions. How about as far as commercial success? You can easily wander into estimation from here - protecting time to build sustainable software is a vital prerequisite to actually doing it.

8. Who are your stakeholders? Another one that varies terrifically from place to place. Don't let them get away with 'the business' because remember, you're testing for a depth of understanding of How It's Done. Do they include system administrators? How about operators? Customers themselves? Do they prefer to work in a close, personal way with these individuals, or to abstract them away behind business analysts and product managers? It is all valuable decision making data for you.

9. Imagine you could wave a magic wand and either make your products recover from failure 50% quicker, or make them 50% less likely to fail in the first place - which would you choose? A bit of a wily question, but one that will expose their strategic vs operational bias. More interesting is the discussion around why they chose the way they chose.

10. Imagine you have a black box which fails regularly. You may choose to have basic observation in real time or vastly detailed statistics on a 24 hour delay - which would you choose? Alternatively, you can ask this one in a less coarse way by looking for examples of different types of system and the circumstances in which each choice might be appropriate. This type of question, along with number 9, can also demonstrate their ability to theorize and generalize (while appreciating that they're doing so) without studying the details of a specific example. This is usually indicative of experience.

There are no 'right' or 'wrong' answers to most of these questions (although I would argue there are 'better' answers to some), just answers that will suit you well, and answers that are less compatible. Ultimately, exploring people in this way will help you predict how they'll perform given autonomy - and why give yourself more governance than you need to?