Monday 22 December 2008

Christmas, Ad Rev, and Charity

It's nearly Christmas (again!) and I hope you're way too busy enjoying the best this season has to offer to be reading this! With any luck, the systems and people you look after will all be humming along just fine, leaving you able to spend plenty of time with family and friends.

At times like this, it can be easy to forget that there are a lot of people out there who don't have it as sweet as we do. In places where running water is a luxury, we'd find it difficult to get much sympathy for our lengthy outstanding bugs list or early morning release woes!

I'm no Bono, but I like to contribute somehow whenever I come across a cause I believe in - and doing something specifically at this time of year has the extra bonus of reminding you exactly how lucky you are, during what can be a time of excess.

Now, this blog has always been about the free exchange of ideas - a place where I can capture my original thoughts and experiences from the work I do, in the hope that it will help others with the work that they do. Whatever else it is, it's never been a money making exercise - and that's why I've never before tried to monetize the content, despite a growing readership (hello to both of you).

However, as of today, keener-eyed observers will notice a Google Ads box lurking surreptitiously in the right hand panel, ready to corrupt my benevolent posts with its raw, unbridled, marketing vivacity. But don't worry - I haven't sold out yet! Here is the plan...

I'll run Google AdSense for the coming year, and around this time next year, donate all the revenue to charity. Every cent, plus a moderate top-up from myself.

It is impractical to pick a charity 12 months ahead, but the money will be donated via Global Giving or Just Giving (or maybe both) so that you can keep me honest. I'm thinking about supporting something which helps to establish education in third-world countries - that appeals to my "help those who are prepared to help themselves" philosophy - but I'll take suggestions from the community.

As well as cheques being posted next December, you can expect a running total, say, quarterly - and heck, if things go really well maybe we'll have a mid-year checkpoint and make a donation then too.

So, Merry Christmas and all the best for 2009, and remember, if you see something that interests you on the right hand side then give it a click - it's for a good cause!

See you next year
Eachan

Sunday 21 December 2008

When is the right time to launch?

Ignoring commercial concerns such as marketing campaigns, complementary events, and market conditions, this essentially comes down to weighing the quality/completeness of the product against getting to market sooner.

The quality-over-time-to-market side has been advocated a few times in recent episodes of Paul Boag's podcast, so I figured I would speak up for the other side. But before I do, let me just say that I don't think there is a right answer to this and, as Paul also conceded, it depends on whether your application's success requires a land grab.

I am, by nature, an "early and often" man - and that's kind of a vote for time to market over quality. I say kind of because I think it would be more accurate to say that it's a vote for time to market over perfection.

For me, the "often" part is inextricably linked to the "early" part. If you can show a pattern of frequent, regular improvement and feature releases, then you can afford to ship with less on day one. Users can often be more forgiving when they see things turned around quickly, and new things regularly appearing can even become a reason in itself to return to the site more often.

Quality is still a factor, but it isn't as black and white as I've been hearing it presented. In the early days of a new product, I think where you spend your time is more important than overall quality. You should be very confident in anything that deals with accounts, real money, or the custody of users' data before shipping. I would argue that getting those areas right for launch at the expense of other parts of the system is better than a more even systemwide standard of quality. You always have limited resources. Spend the most where it counts the most.

And finally, openness. Be truthful and transparent with your users. Start a blog about development progress and the issues you've run into. Provide feedback mechanisms, and actually get back to people who've taken the time to share their thoughts - with something material too, not just an automated thank-you. Send out proactive notifications ahead of impactful changes and after unplanned events. Stick 'beta' labels on things you're shipping early - it'll keep users' blood pressure down, and you might be surprised by how much of the community is prepared to help.

I am aware that I haven't actually answered the question that lends this post its title. I don't know if there even is an off-the-shelf answer, but I hope that I've at least given you some more ideas on how to make the right decision for yourself.

Friday 19 December 2008

Constraints and Creativity

Does the greatest creativity come from unlimited choice - a totally blank canvas with every possible option and no restrictions - or from difficult circumstances, where pressure is high, the problems are acute, and all the easy avenues are closed?

History has proven necessity to be the mother of invention time and time again, and I've personally seen a lot of otherwise-excellent people become paralyzed when faced with too much opportunity.

The current economic circumstances are bringing this into sharp relief. From time to time I get approached by other entrepreneurs with an idea they want to develop - usually only a couple of times a year, but in the last couple of quarters I've seen half a dozen. There is something about this credit crunch...

Conditions are right for the sorts of things I've seen: internet-based, self-service, person-to-person applications that look for efficiency by cutting out middlemen and using smarter fulfillment. They're all great ideas and I really wish I had the time to help them all out.

Returning to the point, would you get this kind of thinking in times of plenty? Who knows, maybe something good will come out of this credit crunch after all.

Wednesday 17 December 2008

Tradeoffs - You're Doing It Wrong

Our time is always our scarcest resource, and most of the web companies I know are perpetually bristling with more ideas than they could ever manage to ship. We always wish we could do them all, but at the same time we know deep down that it isn't realistic. Under these circumstances, tradeoffs are an inevitable and perfectly normal part of life.

The secret to doing this well is in what things you weigh against each other when deciding how to spend the resources you have.


It's so easy to start down the slippery slope of trading features off against the underlying technologies that support them. Fall into this trap and you'll get a small surge of customer-facing things out quickly, but stay with the pattern (easy to do - as it does reward you well in the short term) and you'll soon find yourself paralyzed by technical debt; struggling against continuously increasing time to market and building on top of a shaky, unreliable platform.

Instead, try trading off a feature and its supporting technology against another feature and its supporting technology. This isn't a carte blanche ticket for building hopelessly over-provisioned systems - it's simply about setting yourself up in the most appropriate way for future requirements and scalability, rather than forgoing the platform work that would have been diligent and starting off behind the curve.

Trading features off against their underlying technology is not just short sighted, it's also unrealistic. You see, the fact is that all customer-facing features have to be maintained and supported anyway, no matter how badly we wish we could spend the time on other things. And right up front, when we first build each feature, we have the most control we're ever going to have over how difficult and expensive that's going to be for us.

This is a philosophy that has to make it all the way back to the start of the decision making process. When considering options, knowing the true cost of ownership is as vital as the revenue opportunity and the launch cost. Remember - you build something once, but you maintain it for life.

Saturday 13 December 2008

Failure Modes

Whenever we're building a product, we've got to keep in mind what might go wrong, rather than just catering mindlessly to the functional spec. That's because specifications are largely written for a parallel universe where everything goes as planned and nothing ever breaks; while we must write software that runs right here in this universe, with all its unpredictability, unintended consequences, and poorly behaved users.

Oh and as I've said in the past, if you have a business owner who actually specs for failure modes, kiss them passionately now and never let them go. But for the rest of us, it might help to keep failure modes at the forefront of our minds as we work if we come up with some simplified categories to keep track of. How about internal, external, and human?

Internal
I'm not going to say too much about internal failure modes, because they are both the most commonly considered type and the best served by existing solutions.

You could sum up internal failures by imagining your code operating autonomously in a closed environment. What might go wrong? You are essentially catering for quality here, and we have all sorts of test environments and unit tests to combat defects we might accidentally introduce through our own artifacts.

External
The key difference between external and internal failure modes is precisely the assumption I made above - that your code is operating in perfect isolation. If you are reading this, then I sincerely hope you rolled your eyes at that thought.

Let's assume that integration is part of internal, and we only start talking external forces when our product is out there running online. What might go wrong?

Occasionally I meet teams that are pretty good at detecting and reacting to external failures, and it pleases me greatly. Let's consider some examples: what if an external price list that your system refers to goes down? How about if a service intended to validate addresses becomes a black hole? What if you lose your entire internet connection?

Those examples are all about blackouts - total and obvious removal of service - so things are conspicuous by their absence. For bonus points, how are you at spotting brownouts? That's when things are 'up' but still broken in a very critical way, and the results can sometimes cost you far more than a blackout, as they can go undetected for a while...

Easy example - you subscribe to a feed for up-to-the-minute foreign exchange rates. For performance reasons, you probably store the most recent values for each currency you use in a cache or database, and read it from there per transaction. What happens if you stop receiving the feed? You could keep transacting for a very long time before you notice, and you will have either disadvantaged yourself or your customers by using out of date rates - neither of which is desirable.

Perhaps the feed didn't even stop. Perhaps the schema changed, in which case you'd still see a regular drop of data if you were monitoring the consuming interface, but you'd have unusable data - or worse - be inserting the wrong values against each currency.

Human
Human failure modes are the least catered for in our profession, despite being just as inevitable and just as expensive. You could argue that 'human' is just another type of external failure, but I consider it fundamentally different due to one simple word - "oops".

To err is human and all that junk. We do stuff like set parameters incorrectly, turn off the wrong server, pull out the wrong disk, plug in the wrong cable, ignore system requirements etc - all with the best of intentions.

So what would happen if, say, a live application server is misconfigured to use a development database and then you unknowingly unleash real users upon it? You could spend a very long time troubleshooting it, or worse still it might actually work - and thinking about brownouts - how long will it be before you noticed? For users who'd attached to that node, where will all their changes be, and how will you merge that back into the 'real' live data?

Humans can also accidentally not do things which have consequences for our system too. Consider our feed example - perhaps we just forgot to renew the subscription, and so we're getting stale or no data even though the system has done everything it was designed to do. Hang on, who was in charge of updating those SSL certificates?

Perhaps we don't think about maintenance mistakes up front because whenever we build something, we always picture ourselves performing the operational tasks. And to us, the steps are obvious and we're performing them, in a simplified world in our heads, without any other distractions competing for our attention. Again - not real life.

And so...
All of these things can be monitored, tested for, and caught. In our forex example, you might check the age of the data every time you read the exchange rate value in preparation for a transaction, and fail it if it exceeds a certain threshold (or just watch the age in a separate process).
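
To make that concrete, here's a minimal sketch of such a staleness check in Python - the threshold, the cache shape, and all the names are hypothetical:

    import time

    STALENESS_THRESHOLD = 300  # hypothetical tolerance for rate age, in seconds

    # rates_cache maps currency -> (rate, unix time the feed last updated it)
    rates_cache = {"EUR": (1.2345, time.time())}

    class StaleRateError(Exception):
        pass

    def get_rate(currency):
        rate, updated_at = rates_cache[currency]
        age = time.time() - updated_at
        if age > STALENESS_THRESHOLD:
            # refuse the transaction rather than trade on out-of-date rates
            raise StaleRateError("%s rate is %d seconds old" % (currency, age))
        return rate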

In our live server with test data example, you might mandate that systems and data sources state what mode they're in (test, live, demo, etc) in their connection string - better yet generate an alert if there is a state mismatch in the stack (or segment your network so communication is not possible).
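
In the same spirit, a minimal sketch of the mode-mismatch guard, assuming the (invented) convention of a ';mode=' tag in every connection string:

    APP_MODE = "live"  # the mode this node believes it is running in

    def assert_mode_match(connection_string):
        # parse key=value tags out of the connection string
        tags = dict(p.split("=", 1) for p in connection_string.split(";") if "=" in p)
        declared = tags.get("mode", "unknown")
        if declared != APP_MODE:
            # fail loudly at startup - a silent mismatch is exactly the brownout we fear
            raise RuntimeError("app is '%s' but data source is '%s'" % (APP_MODE, declared))

    assert_mode_match("host=db01;mode=live")  # passes; 'mode=test' would refuse to start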

The question isn't whether there are solutions; the question is how far is far enough?

As long as you think about failure modes in whatever way works for you, and make a pragmatic judgement on each risk using likelihood and impact to determine how many monitors and fail-safes it's worth building in, then you'll have done your job significantly better than the vast majority of engineers out there - and your customers will thank you for it with their business.

Tuesday 2 December 2008

Speed is in the Eye of the Beholder

Yesterday I was coming through Heathrow again, and while I was waiting for my luggage to surprise me by actually turning up, I was ruminating on performance, user experience, and the iris scanner (old BBC link). For those who haven't had the pleasure, we're talking about a biometric border security system based on the unique patterns in an individual's eye, implemented at certain UK borders as an optional, self-service alternative to passport control.


Flying internationally at least once a month for the last couple of years, I've been a regular user of the iris scanner since its early trials. Like all new technology, it went through a period of unreliability and rapid change in its early days, but it's been pretty good for a while now. Step into booth. Have picture of eyes taken. Enter the UK. Awesome.

The only problem is that...
It...
Is...
Just...
Too...
Slow...

Don't get me wrong - you don't have to queue up for it (yet) and you do spend less time overall at the border - it's just that, for a high tech biometric solution, it seems to take an awfully long time to look me up once it snaps what it considers to be a usable image of my retinas. You know, in real terms it isn't even that expensive - probably an average of about 5 or 6 seconds - but it's just long enough for you to want to keep moving, hesitate, grumble to yourself, briefly wonder if it's broken, and then the gates open up.

It occurred to me that maybe my expectations as a user were perhaps unfair, but then quick as a flash I realized - hey, I'm the user here, so don't I set the standard?

So that's when I started to think about how the performance of this system could be improved, and caught myself falling straight into rookie trap number 1:

I started making assumptions about the implementation and thinking about what solutions could be put in place. I figured there must be some processing time incurred when the retinal image is captured as the machine picks out the patterns in my eyes. That's a local CPU power constraint. Once my retinal patterns are converted into a machine-readable representation, there must be some kind of DB lookup to find a match. Once I've been uniquely identified, I imagine there will be some other data sources to be consulted - checking against the various lists we use to keep terrorists and whatnot at bay.

Well this all sounds very read intensive, so it's a good case for having lots of local replicas (the iris machines are spread out over a number of terminals). Each unique 'eye print' is rarely going to come through more often than days or weeks apart, so most forms of request caching won't help us much with that. Of course there is also write load - we've got to keep track of who crossed the border when and what the result was - but we can delay those without much penalty, as it is the reads that keep us standing in that booth. Maybe we could even periodically import our border security lists to a local service if we observe a significant network cost in checking against them for each individual scanned through the gate (assuming they are currently maintained remotely by other agencies).
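
Just to illustrate the deferred-write part of that guesswork (and it is guesswork - see below), here's a minimal write-behind sketch in Python, with every name invented:

    import queue
    import threading

    crossings = queue.Queue()  # audit records waiting to be persisted centrally

    def persist_centrally(record):
        print("stored", record)  # stand-in for the real (slower) central write

    def writer():
        # a background thread drains the queue at its own pace
        while True:
            persist_centrally(crossings.get())

    threading.Thread(target=writer, daemon=True).start()

    def record_crossing(traveller, result):
        # enqueue and return immediately - nobody stands in the booth waiting on this
        crossings.put((traveller, result))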

Ignoring the fact that I'm making all sorts of horrible guesses about how this system currently works, these seem like reasonably sensible patterns to me, so what's the rookie trap?

Simply that I didn't start by understanding the basic problem I was trying to solve; instead I rolled up my sleeves and dived straight into the technical complexities. In doing so, I might have overlooked a plain, simple solution, and that's usually bad - why solve a problem in a complex, expensive way when a simple, cheap way will do just as well?

In this example, the problem is not that 5 seconds is an unacceptable time to process the image, compare the data against the registered users, and then pass all the additional border security checks - the problem is that 5 seconds feels like a long time when you're standing in a glass box waiting for it all to happen!

So what else might we have done instead of a major re-architecture of the iris back end? How about just lengthening the booth to form slightly more of a corridor, with the scanner at one end and the gates at the other? With this simple trick, it still takes 5 seconds before the gates can open, but it doesn't feel like it - you haven't been standing still waiting, there is a sensation of progress.

Same level of system performance, very different user experience.

General engineering lesson 1 - customer experience is king. Monitoring and other data driven metrics are important, but how it really looks and feels to your users matters way more than whatever you can prove with data. They'll judge you by their own experiences of your system, not by your reports.

General engineering lesson 2 - you don't have to solve every problem. Sometimes it's better/cheaper/faster to neutralize or work around the issue instead. You might not be able to build a different floor layout, but you can do things like transfer data in the background (not locking out the UI) and show progress 'holding' pages during long searches etc.
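
As a trivial sketch of that last idea - do the work on a background thread and keep showing signs of life (the timings here are made up):

    import threading
    import time

    def long_search():
        time.sleep(5)  # stand-in for the slow back-end operation
        return "results"

    outcome = {}
    worker = threading.Thread(target=lambda: outcome.update(value=long_search()))
    worker.start()

    while worker.is_alive():
        print("Still searching...")  # a sensation of progress, not a frozen screen
        time.sleep(1)

    print(outcome["value"])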

Oh and just for the record, I really have no idea how the back end of the iris system works - you are supposed to be seeing the analogy here...

Monday 1 December 2008

Change Control

With another December rolling around already, we head into that risky territory that we must navigate once a year - seasonal trading for many companies is picking up, yet now is the time when most support staff are trying as hard as possible to be on holiday. A tricky predicament - and what better time to talk about change control?

A lot of people - particularly fellow agilists - regard change control as a pointless, work-creationist, bureaucratic impediment to doing actual work. If it's irresponsibly applied, then I'd have to agree with them, but there are ways to implement change control that will add value to what you do without progress grinding to a halt amid kilometers of red tape.

Firstly, let's talk about why we'd bother in the first place. What's in it for us, and what's in it for the organization, to have some form of change control in place? Talking about it from this perspective (i.e. what we want to get out of it) means that whatever you do for change control is much more likely to deliver the benefits - because you have a goal in mind.

Here's what I look for in a change control process:
• The discipline of documenting a plan, even in rough steps, forces people to think through what they're doing and can uncover gotchas before they bite.
• Making the proposed change visible to other teams exposes any dependencies and technical/resource conflicts with parallel work.
• Making the proposed changes visible to the business makes sure the true impact to customers is taken into consideration and appropriate communication planned.
• Keeping simple records (such as plan vs actual steps taken) can contribute significantly to knowledge bases about the system and how to own it.
• Capturing basic information about the proposed change and circulating it to stakeholders makes sure balanced risk assessments are made when we need to decide when and how to implement something, and how much to spend on mitigations.

Ultimately, this all adds up to confidence in the activities the team are undertaking, and over time, will lead to fewer late nights and less reactive work.

And here are my rules of thumb for how change control should be implemented:
• Never let any process get in the way of doing the bloody obvious. If someone's on fire, you don't go and get the first aid manual and look up 'F' for fire.
• Change control can be granular, with stricter controls on more critical elements (like settlements), and a more flexible approach on lower impact or easier to restore elements (like content and feeds).
• Don't just take something off the shelf or copy another organization's process verbatim - this is the kind of thing that got change control the reputation it has - think about what you need and do something appropriate.
• Start small and grow up - it's easy to add more diligence where it proves necessary, but much more difficult to relax controls on areas where progress is pointlessly restricted.

So what do you actually do? As I said above, start off lightweight and cheap - a spreadsheet should do it; there isn't always the need for a huge workflow management database. Make a simple template and make sure you circulate it the way information is best disseminated in your organization (email, intranet, pinned on the wall - whatever gets it seen). Borrow ideas from your industry peers, but keep in mind the outcomes that best serve your circumstances. Most of all, identify the right stakeholders for each area of the system, appreciate the different requirements the applications under your stewardship have, and get into the habit of weighting risk and thinking before you act.
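
For illustration, here's the sort of lightweight record I mean - every field is only a suggestion:

    Change #42
    System:         payment gateway
    Change:         rotate SSL certificates
    When:           Tuesday 09:00 (low-traffic window)
    Risk:           medium - touches all checkout traffic
    Rollback:       reinstate the previous certificate (kept on the host)
    Stakeholders:   customer service, finance
    Plan vs actual: filled in after the change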

Here's to peace of mind - let's spend December at Christmas parties, not postmortems!

Friday 21 November 2008

Root Cause Analysis

To help me kill some time at an airport (which seems to be my second job these days), let me reach into my wardrobe of soap-box issues and pick something out. Ah, root cause analysis, here we go.

In my opinion, proper root cause analysis is the most important part of any operational support process.

Having a professional, predictable response and the skills to restore service quickly are critical - but you have to ensure that your support processes don't stop there. If they do, then you're simply doomed to let history repeat itself, and this means more downtime, more reactive firefighting, and less satisfied customers.


Good root cause analysis takes into account the entire event - systemwide conditions, the team's response, the available data, the policies applied - not just the technical issue which triggered the fault, and looks for ways to reduce the likelihood of recurrence.

Doing root cause analysis properly can be expensive, because it isn't enough to get to the bottom of why it happened this time - it's why it keeps happening, and why the system was susceptible to the issue in the first place, that you need to uncover to really add future value. Think of the time spent on it as an investment in availability, in freeing up your team to work more strategically (as well as enjoy their jobs more), and in happier users (which oddly seems to make happier engineers).

But what you learn by doing this isn't really worth the time you spend on it without the organizational discipline to follow up with real changes. If you're truly tracing issues back to their root, you'd be surprised how many are the result of a chain of events that could stretch right back to the earliest phases in projects. This needs commitment.

If you make money out of responding to problems then you'll probably want to ignore my advice. There is a whole industry of IT suppliers whose core business lives here, and while it's an admirable pursuit, don't take the habit with you when you join an internal team!

Wednesday 19 November 2008

The Confidence-o-Meter

A while back we had a project that just seemed destined to face every type of adversity right from the outset (we've all had at least one). Being a new line of business for us, we didn't even have a customer team when work needed to begin! It was going downhill, and with strict regulatory deadlines to meet, we needed to get it back on track. Additional complications arose because the team was new, and as such were still gelling together as they tackled the work. Let's throw in an unusually high dosage of the usual incomplete specifications, changing requirements, and unclear ownership that regularly plague software projects and you have a recipe for a pretty epic train smash.

They say necessity is the mother of invention (Plato?) and there was certainly no shortage of necessity here, so we got on with some serious invention.

The Problem

We needed a way to bring the issues out in the open, in a way that a newly forming team can take ownership of them without accusation and defensiveness creeping in. More urgently still, we had conflicting views on the status of each of the workstreams.

It seemed sensible to start by simply getting in the same room and getting on the same page. To give the time some basic structure, we borrowed some concepts from the planning poker process. There were some basic ideas that made sense for us - getting the team together around a problem, gathering opinion in a way that prevents strong personalities dominating the outcome, and using outliers to zero in on hidden information. As an added bonus, the quasi-familiarity of the process gave the team a sense of purpose and went some way to dispel hostility in a high pressure environment.

The Solution

We started by scheduling a weekly session and sticking to it. Sounds simple, but when the world is coming down around your ears, it is too easy to get caught up in all the reactivity and not make the space to think through what you're doing.

We set aside some time at the end of each week, and our format for the session was fairly simple:
• All the members of the delivery team state their level of confidence that the project will hit its deadline by calling out a number between 1 and 10, 1 being definitely not, 10 being definitely so.
• We record the numbers, then give the lowest and highest callers an opportunity to speak, uncovering what they know that makes them so much more or less confident than the rest of us. This way the whole group learns about (or dispels) issues and opportunities they may have been unaware of.
• In light of the new information from the outliers, everyone has an opportunity to revise their confidence estimate. This is recorded in the spreadsheet which lends this post its title.
• Finally, we take some time to talk over the most painful issues or most obvious opportunities in the project and the team, then pick the most critical one and commit to changing it over the coming week. We also review what we promised ourselves we'd change last week.
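
For the curious, here's a tiny sketch of the bookkeeping in Python - the names and numbers are invented, the structure is the point:

    import statistics

    history = []  # one average per weekly session, giving us the trend line

    def run_session(votes):
        # votes: name -> confidence (1-10) from the first call-out
        low, high = min(votes, key=votes.get), max(votes, key=votes.get)
        print(low, "and", high, "explain their outlying votes, then everyone re-votes")
        return votes  # in real life, replaced by the revised numbers

    revised = run_session({"Ann": 3, "Bob": 7, "Cas": 8})
    history.append(statistics.mean(revised.values()))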

Through this discipline the team made, in real time, a bunch of very positive changes to themselves and how they work together without having to stop working and reorganize. We also had a very important index that we could use to gauge exactly how worried we should be - which is pretty important from a risk mitigation perspective. The trend was important too - we were able to observe confidence rise over time, and respond rapidly to the dips which indicated that the team had encountered something that they were worried would prevent them from delivering.

The Outcome

This exercise gained us a much more stable view of a chaotic process, and let us start picking off improvements as we worked. By making the time to do this and being transparent with the discussions and decisions, the team felt more confident and in control of their work - which always helps morale.

Because we were able to give the business a coherent, accurate assessment of where we were at, we were able to confidently ask for the right support - it was easy to show where the rest of the organization could help, and demonstrate the impact if they didn't meet their obligations to us.

In summary, we got our issues out in the open and got our project back on the rails. And by the time we got together for our post-implementation retrospective, we were pleasantly surprised by how many of our most critical problems we'd already identified and fixed. If you're in a tough spot with a significant piece of work and a fixed deadline, consider giving something like this a try - I think it will work alongside any development methodology.

Monday 17 November 2008

Cloud Computing Isn't...

Thought for the day - when does a new idea gain enough structure to graduate from meme to tangible concept? Is there some quorum of 'experts' that need to agree on its shape? Perhaps we need a minimum number of books written on the topic, or a certain number of vendors packaging the idea up for sale? Or maybe it is as simple as all of us trying it our own way for long enough for observable patterns to emerge?

We might have already crossed this bridge with cloud computing thanks to the accelerated uptake of robust platforms such as EC2 and App Engine (and the adherence to the theme by newer offerings like Azure), but there is still a lot of residual confusion that we might start to mop up if we were so inclined.

The first thing we might stop doing is retrospectively claiming any sort of non-local activity as cloud computing. What's that? You've been using Gmail or Hotmail for years? No, sorry. You are not an ahead-of-the-curve early adopter, you are just a guy who has been using a free web-based email system for a while.

Before the inevitable torrent of angry emails rains down upon my inbox, let's pause to think about what we're trying to achieve here. Does classifying the likes of Hotmail and - well, insert your favorite SaaS here - as cloud computing help or hinder the adoption and development of cloud technology? I think we probably establish these analogies because we believe that the familiarity we create by associating a trusted old favorite with a radical new concept may add comfort to perceived risks. But what about the downside of such a broad classification?

These systems are typically associated with a very narrow band of functionality (for example, sending and receiving email or storing and displaying photos) and are freely available (supported by advertising or other 2nd order revenue). They tend to lack the flexibility, identity, and SLA that an enterprise demands. This analogy may well be restricting adoption in the popular mind. Besides, where do you draw the line? Reading this blog? Clicking 'submit' on a web form? Accessing a resource in another office on your corporate WAN? I'm not knocking anyone's SaaS, in fact the noble pursuits that are our traditional online hosted email and storage systems have been significant contributing forces in the development of the platforms that made the whole cloud computing idea possible.

So, if a lot of our common everyday garden variety SaaS != a good way to talk about cloud computing, then what is?

Let's consider cloud computing from the perspective of the paradigm shift we're trying to create. How about cloud computing as taking resources (compute power and data) typically executed and stored in the corporate-owned datacenter, and deploying them into a shared (but not necessarily public) platform which abstracts access to, and responsibility for, the lower layers of computing.

That may well be the winner of Most Cumbersome Sentence 2008, but I feel like it captures the essence to a certain degree. Let's test our monster sentence against some of the other attributes of cloud computing - again from the perspective of what you're actually doing in a commercial and operational sense when you build a system on a cloud platform:

• Outsourcing concern over cooling, power, bandwidth, and all the other computer room primitives.
• Outsourcing the basic maintenance of the underlying operating systems and hardware.
• Converting a fixed capital outlay into a variable operational expense.
• Moving to a designed ignorance of the infrastructure (from a software environment perspective).
• Leveraging someone else's existing investment in capacity, reach, availability, bandwidth, and CPU power.
• Running a system in which cost of ownership can grow and shrink in line with its popularity.

I think talking about the cloud in this way not only tells us what it is, but also a little about what we can do with it and why we'd want to. If you read this far and still disagree, then enable your caps lock and fire away!

Tuesday 11 November 2008

My agile construction yard

I have been successfully practicing agile for quite a while now, and I've always believed that, given a pragmatic application of the principles behind it, it can be used to manage any process. Mind you, having only ever tried to deliver software with agile, this remained personally unproven. Gauntlet?

So I figured that if I am going to keep going on about it, I am going to have to put my money (and my house) where my mouth is. So I did, and this is the story...

The requirements

I wanted a bunch of work done on my house - extending a room, replacing the whole fence line, building a new retaining wall, laying a stonework patio, new roof drainage, building a new BBQ area, and some interior layout changes - and I thought it's now or never. I spoke to the builder and told him all about agile, lean thinking, project management practices like SCRUM and XP, and how it can benefit both of us in delivering House 2.0. He asked if he could speak to the responsible adult who looks after me. "Great, a waterfall builder" I say to myself as I try not to be offended by the 'responsible adult' quip.

But we strike a deal; he's going to do big up front drawings and a quote, and we'll proceed my way at my own risk and responsibility. The game is on.


The first thing we do is run through all the things I want done, which ones are most important to me, and roughly how I want them all to look. I guess you could call this the vision setting. Then my contractor asks me the few big questions; the materials I want to use, budget, and when I want it ready by. He makes some high level estimations on time and cost, based on which I rearrange my priorities to get a few quick wins. We have a backlog.

The project

Our first 'sprint' is the fence line. We come across our first unknown already - the uprights that hold the fence into the ground are concreted in and we either have to take twice as long to tear them out, or build the new fence on the old posts. Direct contact throughout the process and transparency of information ensures that we make the decision together, as customer and delivery team, so that neither of us is left with the consequences of unforeseen situations. I want the benefit of new posts so I'm quite happy to eat the costs put forward.

Next we do the retaining wall, and we have a quick standup to go over the details - the exact height, and the type of plants growing across the top. Since the fence has been done, I go with some sandy bricks that match the uprights, and the wall is constructed without incident. The next thing we're going to tackle is the BBQ area; however, beyond that the roadmap calls for the room extension, and so we need to apply for planning consent in order to get the approval in time. Agile doesn't mean no paperwork and no planning, it means doing just enough just in time for when you need it.

Now we hit our first dependency - the patio must be laid first before we can build the BBQ area. That's cool, and through our brief daily catchups, we come up with an ideal layout and pick out some nifty blocks. A bit of bad weather slows down the cementing phase slightly, but we're expecting that - this is England. We use the opportunity to draft some drawings for the room extension and get the consent application lodged.

It's BBQ building time. I've been thinking about it since we started the project, and I decided I wanted to change it. The grill was originally going up against one wall, but wouldn't it be much more fun if it was right in the middle so everyone could stand 360 around it and grill their own meat? You bet it would. We built a couple of examples out of loose bricks (prototypes?) and then settled on a final design. It takes a bit more stone than the original idea, but it's way more awesome.

Then our project suffered its first major setback - the planning consent process uncovers that a whole lot of structural reinforcement will be needed if they're going to approve the extension. That pretty much triples the cost of adding the extra space. Is it still worth it at triple price? Not to me. Lucky we didn't invest in a lot of architect's drawings and interior design ideas, they'd be wasted now (specifications are inventory and inventory is a liability). So we start talking about alternatives, and come up with a plan to create the new space as storage and wardrobes - not exactly what I had in mind up front, but at less than half the original cost it still delivers 'business value'.

The retrospective

So how did it all turn out? Well, as a customer, I felt involved and informed throughout the whole process, and the immediate future was usually quite predictable. Throughout the project I had the opportunity to adjust and refine my requirements as I saw the work progress, and I always made informed tradeoffs whenever issues arose. I am happy, the builder is happy, and I got exactly what I wanted - even though what I really wanted ended up quite different to what I thought I wanted when we started.

Oh and if anyone wants a good building contractor in Surrey...

Friday 7 November 2008

Cost in the Cloud

Cost is slated as benefit number 1 in most of the cloud fanboy buzz, and they're mostly right: usage-based and CPU-time billing models do mean you don't have tons of up-front capital assets to buy - but that's not the same thing as saying all your cost problems are magically solved. You should still be concerned about cost - except now you're thinking about expensive operations and excess load.

Code efficiency sometimes isn't as acute a concern on a traditional hardware platform because you have to buy all the computers you'll need to meet peak load, and keep them running even when you're not at peak. This way you usually have an amount of free capacity floating around to absorb less-than-efficient code, and of course when you're at capacity there is a natural ceiling right there anyway.

Not so in the cloud. That runaway process is no longer hidden away inside a fixed cost; it is now directly costing you, for example, 40c an hour. If that doesn't scare you, then consider it as $3,504 per year - and that's for one instance. How about a bigger system of 10 or 15 instances? Now you're looking at somewhere between $35K and $52K for a process that isn't adding proportionate (or at worst, any) value to your business.

Yikes. So stay on guard against rogue processes, think carefully about regularly scheduled jobs, and don't create expensive operations that are triggered by cheap events (like multiple reads from multiple databases for a simple page view) if you can avoid it. When you are designing a system to run on a cloud platform, your decisions will have a significant impact on the cost of running the software.
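
As a trivial illustration of not letting cheap events trigger expensive operations, here's a minimal cache sketch in Python - the TTL and names are invented:

    import time

    _cache = {}
    TTL = 60  # hypothetical freshness window, in seconds

    def cached(key, expensive_lookup):
        # serve repeat page views from memory instead of re-reading several databases
        value, stamp = _cache.get(key, (None, 0.0))
        if time.time() - stamp > TTL:
            value = expensive_lookup()
            _cache[key] = (value, time.time())
        return value

    homepage = cached("home", lambda: "assembled from many database reads")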

Monday 3 November 2008

Eachan's Famous Interview Questions - Part II

A while back I posted some engineering manager/leader interview questions that I frequently use - designed to test how someone thinks, what their priorities are, and how they'd approach the job - rather than whether or not they can do the job at all. As I said back then, if you're at a senior level and you're testing to see if someone is capable of the basic job at all, then you're doing it wrong (rely on robust screening at the widest point - your time is valuable).

Like everything else, this is subject to continuous improvement (agile interviewing - we're getting better at it by doing it) and with more repetition you tend to develop more ways of sizing people up in a 1 hour meeting. So here is iteration 2:

1. What is the role of a project leader? Depending on your favorite SDLC, that might be a project manager, SCRUM master, or team leader - but what you're looking for is a distinction between line management (the maintenance of the team) and project management (the delivery of the work).

[You might not make such distinctions in your organization; it is important to note that all these questions are intended to highlight what an individual's natural style is, not to outline a 'right' way to do it.]

2. Walk through the key events in the SDLC and explain the importance of each step. It is unlikely any candidate (except maybe an internal applicant) is going to nail down every detail of your SDLC, but what you're hoping to see is a solid, basic understanding of how ideas are converted into working software. It seems overly simple, but you'd be surprised how many people, even those who have been in the industry many years, are really uneasy about this. Award extra credit for 'importances' that benefit the team as well as the business (for example - product demos are good for the team's morale etc).

3. Who is in a team? Another dead simple one, and what you are testing for is engagement, inclusion, and transparency. Everyone will tell you developers, and usually testers, but do they include NFRs like architects? Supporting trades like business analysts and project managers? How about the customer him/herself?

4. What is velocity, how would you calculate it, and why would you want to know? Their ability to judge what their team is capable of is the key factual basis for the promises they'll make, for how they'll monitor the team's performance, and for how they'll help the team improve over iterations.

5. Explain the software triangle. This is another one of my favorites - because the fundamental relationship between time, scope, and cost is as real a law as gravity yet so many engineering professionals still seem to live in some kind of weird denial. Perhaps afraid of falling off the edge of the earth? Nonetheless, someone who won't get swept along on a romanticized story of One Man's Heroic Triumph Over Project X will make sure you keep a sustainable team and not fall into the sarlacc pit of over-promising and under-delivering. You can also use this question as a springboard to explore how they'd negotiate tradeoffs with customers and how they'd make the costs of decisions visible.

6. How would you handle a team coming off a failed project? No one will ever preside over a flawless team that never drops anything, so being able to handle this effectively is a critical skill. For me, the ideal candidates have some answers to both 'what can we do to recover morale and re-motivate the team?' and 'what went wrong and how can we sidestep it next time?'

7. What's the definition of done? You need your own definition of done, but I'm always looking for people who include testing, documentation, successful build and integration, failure scenarios, maintenance plans etc in their definitions. How about commercial success? You can easily wander into estimation from here - protecting the time to build sustainable software is a vital prerequisite to actually doing it.

8. Who are your stakeholders? Another one that varies terrifically from place to place. Don't let them get away with 'the business' because remember, you're testing for a depth of understanding of How It's Done. Do they include system administrators? How about operators? Customers themselves? Do they prefer to work in a close, personal way with these individuals, or to abstract them away behind business analysts and product managers? It is all valuable decision making data for you.

9. Imagine you could wave a magic wand and either make your products recover from failure 50% quicker, or make them 50% less likely to fail in the first place - which would you choose? A bit of a wily question, but one that will expose their strategic vs operational bias. More interesting is the discussion around why they chose the way they did.

10. Imagine you have a black box which fails regularly. You may choose to have basic observation in real time or vastly detailed statistics on a 24 hour delay - which would you choose? Alternatively, you can ask this one in a less coarse way by looking for examples of different types of system and the circumstances in which each choice might be appropriate. This type of question, along with number 9, can also demonstrate their ability to theorize and generalize (while appreciating that they're doing so) without studying the details of a specific example. This is usually indicative of experience.

There are no 'right' or 'wrong' answers to most of these questions (although I would argue there are 'better' answers to some), just answers that will suit you well, and answers that are less compatible. Ultimately, exploring people in this way will help you predict how they'll perform given autonomy - and why give yourself more governance than you need?

Monday 27 October 2008

Fail More, Neo

If you expect everything to work all the time, if you believe everything will be perfect first time around, if you think everything you try and every idea you have will always be brilliant - then you are living in some kind of delusional hyper-fantasy. Stay there, trust me, it's better than out here.

But if you decide to take the red pill, and join us here in the real world, there are a few things you should know that will make your integration easier.


Firstly, machines have not taken over the world, and human beings are not just a kind of great big box of Duracell batteries to them. But this is mostly because we can't make machines awesome enough yet, see below...

Out here in the real world, we mostly learn by doing things (until we can make those machines that beam kung-fu directly into our brains). We try something out, observe the results, and then we do it again with some slight variations. Those little tweaks we make as we try again and again are based on the pain we feel each time it doesn't work out. We call this a feedback loop, and we've learned this way for thousands of years - if there was a better way to do it, a way we could just "get it right first time", then trust me, we'd be all over that by now!

We'll be honest with you - you can, even in the real world, get by with very little of such trial-and-error experience if you want. A well-established pattern of mundane mediocrity leading directly to obsolescence is readily available. In fact, you'd be surprised how popular a choice this actually is! Let's call it the grey pill.

Assuming you don't fancy the drab, lackluster, second-rate existence the grey pill guarantees, what else can you do with the humans under your command?

Firstly - and most importantly - allow them to try. Don't expect every idea to be killer, or everything to work out first time around. Allow - no wait - require experimentation and iteration. It is the number 1 way your humans will expand their understanding, and believe it or not, a string of failures is the shortest path to that one thing that does work brilliantly. Thomas Watson, an old IBM-running human we have out here, once said “if you want to succeed, double your failure rate”.

You might notice that some of your humans seem a little reluctant to embrace the idea - especially those recently defrosted from 'grey pill' institutions. How can you spruce them up into lean, mean, mistake-making machines?

People often fear the consequences of failure more than failure itself. So the best course of action is to make the consequences of failure something to look forward to - not something to hide from, and cover up. Why not try celebrating failure? If one of your humans has an idea, tries it out, and then brings back some knowledge and experience to the rest of the team, then you are much better off than you were before. Make a big deal out of it, demonstrate to the rest of the team that it's OK to try. Stretching yourself won't be punished.

That doesn't mean rewarding scattergun bravado - what you're trying to encourage is a culture of balanced risk and methodical approach.

I like the old saying "fail fast, fail cheap" because, as a statement, it gives permission to try new things, yet it also prescribes some basic guidelines. Take the shortest path you can to discovering your idea doesn't work, and invest the minimum you need in order to reach that point. After all, you'll need those resources for your next idea.

So, thanks for joining us out here in the real world. We really hope you'll make the most of it by embracing failure and trying new things out, because this is the only path to discovery and success - oh and we'll never build those dominant supercomputers to have a nifty war with if we don't believe in innovating!

Maybe this post will be a failure. Maybe my message won't get into the dreamworld (where we don't believe in failure) or grey pill land (where we don't try just in case). But do you know what? I won't mind if it doesn't - because I will have just eliminated one of the ways it doesn't work, and that's a step forward...

Friday 24 October 2008

Availability or Control?

In the web business, we usually consider availability to be paramount - and given that motivation, we're getting pretty good at things like graceful degradation and partial failure. But now that you've pulled your system apart and neatly isolated all the features, how do you cope with the situation where no service is preferable to partial service?

This can be true. Consider, if you will, a trading system operated by a team of risk managers. You have built the system to be fault tolerant and allow partial failures - and this usually works out great - but what happens if a failure in the infrastructure or application means the risk managers are no longer able to administer the system? It's still running publicly (thanks to your awesome failure isolation) so customers are still buying and selling. You can't change your prices and respond to changing market conditions - uh oh - exposure. What do we do?

One answer is a word we don't like - especially if we just built a reasonably decoupled system - dependency. Yuck, but there is no shame in creating some intentional dependencies that support the business rules. If you never want to execute trades unless you can manage your position, then what is the advantage to running the trading system without the liability tool? Nothing - if anything it's an undesirable risk.

So draw up some service dependencies, or make the applications depend on their monitors at runtime. It might not appeal to how we'd like to run the system, but the truth is it accurately reflects how we'd like to run the business.
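
A minimal sketch of that runtime dependency, assuming the risk tooling exposes some kind of health check (all the names here are invented):

    def risk_desk_alive():
        return True  # stand-in for a real health check on the risk-management tooling

    def accept_trade(trade):
        if not risk_desk_alive():
            # no control means no service: suspend rather than trade unmanaged
            raise RuntimeError("trading suspended: risk console unreachable")
        print("executing", trade)

    accept_trade("back Red Dragon @ 4.2")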

Monday 20 October 2008

And the problem with agile is...

I get drawn into a lot of debates about the pros and cons of agile (or maybe I just like a good fight?) and the standard attack pattern around quality is starting to become sooooo passe. So I'm going to tackle it here, in the hope that I can use a one-link defense in future. Kind of like TinyURL but for arguments.

Firstly, let me go on record by saying that I most certainly am an agile proponent, but at the same time I don't believe it is the single, solitary answer to how we should do everything in technology. You have to have more than one tool in your kit.

So let me go ahead and get this off my chest:

Agile != doing a poor job

Agile doesn't mean doing no testing, it means doing just enough testing to ensure appropriate quality.

Agile doesn't mean doing no documentation, it means doing just enough documentation to effectively convey meaning and understanding.

Agile doesn't mean doing no architecture, it means doing just enough design work to understand how to start iterating the solution.

Agile doesn't mean having no requirements, it means having just enough detail for the foreseeable future and embracing change beyond that.

It is also about appreciating that just enough can mean different things in different projects, different business problems, and different parts of the system.

Come on people, these aren't new ideas!

To be fair, I can see where some of this criticism comes from. There are cowboys out there who use agile as an excuse to dispense with necessary diligence or take ill-advised shortcuts. When it all comes crashing down, it sounds better for the individuals concerned to say "hey, we were doing agile" than "hey, we couldn't really be bothered planning this work properly". The fact is that crappy engineers exist. There is no such thing as an SDLC that turns good engineers into poor engineers - we just started accepting one as an excuse.

If agile does have something to answer for here, it is that this kind of poor work is much more visible much earlier (if concealment is your game). The reality is the same team would make the same (or worse) errors in judgement regardless of the approach they used, you just wouldn't know about it for 12 months - and by then, who knows whose fault that was?

Don't accept output any crappier than you otherwise would just because it has agile stamped on it - if anything you should expect better, because you will have had more opportunities for course correction along the way.

Thursday 16 October 2008

What Your Network Guy Knows

So you're getting into distributed systems; maybe you've got some real scalability issues on the horizon, or perhaps you want to better isolate failure, or be able to cope with more concurrent change in the system. So how do you do this webscale thing then?

Time for some homework. Listening to some vendor pitches, maybe reading some books, or getting an expensive consultant or two in for a while (I'll take your money from you if that's what you want) might possibly do it. But before all this gets out of hand, did you realize you're probably sitting right next to a distributed systems fountain of knowledge? You already have someone, right there in your team, who has spent their entire career working with the largest eventually consistent multi-master distributed systems in the world - the trick is they might not even know it themselves - and that someone is your network guy.

Let's test this assertion against a couple of technologies that network engineers deal with every day, and look at what we can take from them into our distributed systems thinking.

How about something fundamental - routing protocols. Networking gurus have a small army of acronyms at their disposal here; OSPF, EIGRP, IS-IS, BGP, and the sinister sounding RIP. These are essentially applications that run on network devices (and sometimes hosts themselves), map out the network topology, and provide data for devices to make packet forwarding decisions.

So what can we import from this technology?
1. Partitioning - networks are broken down into manageable chunks (subnetworks) which scope load (broadcasts), ringfence groups of systems for security, and limit traffic across slow and expensive links.
2. Scalability - routing protocols allow massive global networks to be established by summarizing contiguous groups of networks again and again, and ensuring any node can establish end-to-end connectivity without having to understand every single path in the network (just a default route).
3. Failure isolation - subnets are bordered by routing protocols, which form a natural boundary to most forms of network malarkey. In the event that a network becomes unpredictable (flapping), some routing protocols are able to mark it down for a predetermined time, which aids in local stabilization and prevents issues spilling over into healthy networks.
4. Self healing - when a failure in a network or a link between networks occurs, routing protocols observe the problem (by missing hellos or interfaces going down) and take action to reestablish reachability (working around the problem using alternate paths etc). Each node will recompute its understanding of the networks it knows how to reach, learn who its neighbors are and the networks they can reach, and then return to business as usual via a process called convergence - a really simple study in eventual consistency and variable consistency windows (see the sketch after this list).
5. Management - for the most part, networks separate their control messages from the data they transport - a good practice, especially when combined with techniques like QoS, because it significantly reduces the risk of losing control of the infrastructure under exceptional load conditions.
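
To make that convergence point concrete, here's a toy distance-vector simulation (in Python, since pseudocode never runs). This is nothing like a real OSPF or EIGRP implementation - the topology, link costs, and node names are all invented - but it shows nodes that know only their immediate neighbours collectively settling on shortest paths, and settling again after a link failure:

INF = float("inf")

# links: (node_a, node_b) -> cost; invented three-node topology
links = {("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 5}

def neighbours(node):
    for (a, b), cost in links.items():
        if a == node:
            yield b, cost
        elif b == node:
            yield a, cost

def converge(nodes):
    # Each node starts knowing only itself; tables are
    # node -> {destination: (cost, next_hop)}.
    tables = {n: {n: (0, n)} for n in nodes}
    changed, rounds = True, 0
    while changed:  # keep exchanging tables until nothing changes
        changed, rounds = False, rounds + 1
        for node in nodes:
            for neigh, link_cost in neighbours(node):
                # Bellman-Ford step: consider routes the neighbour advertises
                for dest, (cost, _) in tables[neigh].items():
                    candidate = link_cost + cost
                    if candidate < tables[node].get(dest, (INF, None))[0]:
                        tables[node][dest] = (candidate, neigh)
                        changed = True
    print(f"converged in {rounds} rounds")
    return tables

nodes = ["A", "B", "C"]
tables = converge(nodes)
print("A -> C:", tables["A"]["C"])   # (2, 'B') - via B, not the costly direct link

# Simulate a link failure and converge again from scratch
del links[("A", "B")]
tables = converge(nodes)
print("A -> C after failure:", tables["A"]["C"])  # (5, 'C') - direct link now

Real protocols converge incrementally rather than recomputing from scratch, but the eventual consistency lesson carries straight over.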

Now let's look at something at the application layer - DNS. This should be a somewhat more familiar tool (or you're kind of reading the wrong blog) and we touch it quite regularly but probably don't appreciate what goes on in the background. At its most basic level, DNS is a client/server system for providing a mapping between human-readable hostnames and machine-friendly IP addresses. Oh but it's so much more...

So what can we import from this technology?
1. Partitioning - DNS is hideously, frighteningly big: there are hundreds of thousands of nodes in this system, from the dozen or so root servers all the way down to the corporate internet access edge servers. It is a good example of dividing up a problem; to find us you'd work right to left through a fully qualified domain name, starting with the "." (root), we're in the "com" container (hosted by a registrar), then the "betfair" container (hosted by us), and finally you'd get back some data from a record matching "www" and arrive at our place.
2. Scalability - did I mention DNS is big? DNS uses a classic combination of master/slave nodes and caching on the client and server side to scale out. At the corporate edge, DNS proxies resolve addresses on behalf of internal clients and keep answers in a local cache, ISPs and those who run their own zones keep a number of slaves (many read only) and spread queries out amongst them, and finally an expiry timestamp (TTL) is set on query results permitting client side caching.
3. Resilience - clients can be configured with a list of servers, which they will cycle through should they receive no answer. Additionally, the DNS protocol is stateless, making it easy to hot-swap servers and to load balance using simple, lightweight algorithms.
4. CAP - DNS definitely prefers availability over consistency, the window for an updated record to propagate around the internet being ~24hrs in most cases. It's also highly tolerant to network segmentation: individual servers are happy to live separated from the rest of the DNS infrastructure for long periods of time, answering queries, and then catching up with all the changes in the zones they host once connectivity is reestablished (see the sketch after this list).
5. Operations - the hierarchical way the namespace is organized is perfectly matched to how authority is delegated. If you're going to have a massive system spread around the globe, you've got to think about how you're going to operate it, and the DNS model for this is based on allocating administration with ownership. This gives complete flexibility and control to namespace owners without risking the integrity of the system as a whole, and lets us operate the biggest distributed system in the world without employing the biggest IT team in the world.
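
To see that availability-over-consistency trade in miniature, here's a toy caching resolver in Python. The record, TTL, and "authoritative" lookup table are invented for illustration (there's no real DNS wire protocol here):

import time

# name -> (ip, ttl_seconds); a stand-in for the authoritative server
AUTHORITATIVE = {"www.example.com": ("203.0.113.10", 300)}

cache = {}  # name -> (ip, expires_at)

def resolve(name):
    now = time.time()
    hit = cache.get(name)
    if hit and hit[1] > now:
        return hit[0]              # served from cache - cheap, possibly stale
    ip, ttl = AUTHORITATIVE[name]  # "recurse" to the authoritative server
    cache[name] = (ip, now + ttl)
    return ip

print(resolve("www.example.com"))  # cache miss: asks upstream
AUTHORITATIVE["www.example.com"] = ("203.0.113.99", 300)  # record updated
print(resolve("www.example.com"))  # still the old answer until the TTL expires

Every cache along the real resolution chain adds the same stale-but-available window, which is why a record change can take the best part of a day to be seen everywhere.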

So buy your network guy a coffee. Ask him how his world works. If you can draw the philosophical parallels, it might be the most valuable couple of hours you've spent in ages.

Oh and by the way - distributed systems are all about the network, so you're going to need a friend here anyway...

Sunday 12 October 2008

Is Leadership a Noun or a Verb?

Is technical leadership something you know or something you do? Is it a finite thing you can point to, and check off a list, or is it an ongoing process, more akin to a journey? Let's look at what it means in practice...

Firstly you have to have some answers - or better yet know the right questions. You need to have a vision, and preferably some ideas on how to break that down into Big Goals. You need to define what success looks like for your team, and know what sorts of things contribute to that success and what sorts of things distract you from it. Maybe this is leadership the noun?

All good work so far, but worth nothing to your organization if no one knows about it - who is it that's supposed to be working towards these goals? Who is it that you depend upon to realize your vision?

So the next thing you need is for everyone to proverbially Get On Board. You have to communicate that vision as simply as possible and help every individual understand exactly how they can contribute to it. The whole team has to get emotional about it, believe in it, and live it through how they do their jobs each day. Perhaps this is leadership the verb?

So perhaps the answer is both? I certainly think so. Share an empty vision or set the wrong goals and you won't get where you need to be. Keep quiet about your plans and how they can be achieved and you rob the organization of huge value.

Oh and don't forget to keep your eye on the horizon - things change and from time to time that means adapting your strategy or losing ground!

Thursday 9 October 2008

IASA Connections - Day 3

Day 1, day 2, and now day 3:

Scott Ambler – Practice Leader in Agile Development, IBM Software Group
• Data modelling from an agile perspective.
• Don’t get too hung up on purist agile, good techniques are good techniques regardless of where they come from.
• Software development is more of a craft or an art than engineering – hence we tend to apply some practices that don’t make things the easiest for us.
• Agile is becoming mainstream – you cannot dodge it – and data is not going to change from being an asset, so how can these things coexist?
• Data and quality are 2 words that are hardly ever seen together, yet this is so important.
• In most “Big Requirements Up Front” projects, 45% of features are never used and 19% are rarely used – effectively a waste of over half the development budget.
• 66% of teams work around internal data groups; the most common reasons are difficulty in engagement, responses being too slow, and development teams not understanding the value. So data architects have work to do.
• Modelling is like anything else, it just needs to be good enough for the task at hand (repeatable results, not repeatable processes).
• If you can’t get active stakeholder participation, then cancel the project, because all that will happen is disaster.
• Clear box testing is just as important as black box testing, because you can test triggers, views, constraints, and referential integrity rather than just observable functionality.
• Do continuous integration of databases – there is nothing special about data so why shouldn’t it be treated the same?
• Make it easy for people to do the right thing – loads of verbose documentation and white papers are less likely to be effective than a small number of runnable assets and some 1:1 support to integrate them.

What I think
• Obviously Scott was wearing his data hat this time, but clearly the whole session was predicated on the assumption that relational databases are themselves the right solution…
• Really like the “repeatable results not repeatable processes” phrase, it is such a powerful idea – I am always battling the ‘1 process to rule them all’ crowd.
• Probably best whiteboard walkthrough of the “karate school” example I’ve ever seen.
• My approach to modelling has always been to define things in order of ‘most difficult to change later’ to ‘least difficult to change later’ so when you run out of steam the unknowns you’re left with are the cheapest ones.
• Abstractions – a lot of the examples in the talk were based around the assumption that applications directly accessed data, we need to think more about how we can build systems that don’t need to know details about schemas etc (access via object etc).
• Totally agree with the active stakeholder participation thing; if it’s important enough for them to expect you to deliver it then it’s important enough for them to invest themselves in it.

Dr Neil Roodyn - independent consultant, trainer and author
• A session titled “software architecture for real time systems” was about patterns for performance and performance tuning.
• Hard vs. soft real-time distinctions important.
• Time is so often not considered in system design – we think about features and quality and failure but so little about latency.
• Automated failure recovery is so much more important in real time computing because you cannot stop to allow human intervention upon failure.
• There are some strong similarities between real time computing thinking and distributed systems thinking:
1. Consistency
2. Initialisation
3. Processor communication
4. Load distribution
5. Resource allocation
• Asynchronous is kind of “cheating”, as it creates the illusion that actions have completed before the work has actually been done.
• The 3 most important considerations in real time computing are time, time, and time (haha very good Neil).
• Common software tricks and patterns for real time systems (obviously assuming real time performance is your overriding requirement; a couple of these are sketched after the list):
1. Use lookup tables for decision making
2. Use fixed size arrays
3. Avoid dynamic memory allocation
4. Reduce number of tasks in the system
5. Avoid multithreaded design
6. Only optimise popular scenarios
7. Search can be more efficient than hash
8. Use state machines
9. Timestamps instead of timers
10. Avoid hooks for future enhancements
11. Avoid bit packed variable length messages
12. Reduce message handshakes
13. Avoid mixed platform support
14. Minimise configurable parameters
• Overall, know your system and approach performance tuning scientifically: observe where time is lost and spend your energy there, don’t just guess.
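
Two of those tricks are easy to show in miniature – decision making via a precomputed lookup table (trick 1) and behaviour as an explicit state machine (trick 8). A Python sketch with invented states, events, and priorities:

# Trick 1: decisions as table lookups - O(1), branch-free, easy to verify.
PRIORITY = {
    ("order", "vip"): 0, ("order", "std"): 1,
    ("report", "vip"): 2, ("report", "std"): 3,
}

def priority(kind, tier):
    return PRIORITY[(kind, tier)]

# Trick 8: an explicit state machine - (state, event) -> new state.
TRANSITIONS = {
    ("idle", "start"): "running",
    ("running", "pause"): "paused",
    ("paused", "start"): "running",
    ("running", "stop"): "idle",
}

def step(state, event):
    return TRANSITIONS.get((state, event), state)  # unknown events are ignored

state = "idle"
for event in ["start", "pause", "start", "stop"]:
    state = step(state, event)
    print(event, "->", state)

Both replace branchy, hard-to-time logic with data you can inspect, test, and reason about - which is exactly why they keep latency predictable.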

What I think
• When we think about SLAs for latency we have to make sure we consider time from the user’s perspective – if you have a very fast back end but it takes ages for results to render for users, then is it really high performance?
• Even if you have a few processes in your system that need to be real-time, chances are the majority of your system does not, so don’t be afraid to mix memes because if you make the whole thing real time end-to-end you might be unnecessarily sacrificing some scalability or maintainability.
• Totally agree with points on developers needing to know more about the tin their systems run on and how this will lead to better software overall.
• I can’t help but think we’re getting lazy from our (relative) lack of constraints – back when you had 32K to play with you really thought hard about how you used that memory; when you had to load everything from tape you really planned that storage hit…

Neal Ford – Software Architect and Meme Wrangler, Thoughtworks
• Fairly by-the-book session on “SOA won’t save the world”, delivered via the humorous and engaging analogy of Chuck Norris facts.
• “Peripherally technical manager” types are a key cause of SOA overpromising.
• Good definition of a service as a unit of coarse-grained, self-contained business functionality.
• The tension will always be on between centralised planning and tactical action, so you need to learn how to plan ahead without the planning becoming a constraint on business agility.
• Beware of “cleaning up” the messy communications paths in the enterprise by hiding them inside a box (ESB) – you still have the same underlying problems, but you suddenly have less visibility of them.
• Beware of the difference between ‘standards based’ and ‘standardised’ i.e. most vendor ESB solutions share some common basic standard foundations but the functionality has been extended in a proprietary way – so it can still mean lock in.
• Keep track of the number of exceptions you’re granting against the governance you have in place – too many and you might have excessive governance.
• Using ubiquitous language is a must; perhaps even have a formal dictionary for each project.
• The business must be involved in integration projects, not just initiate them.
• We all like a bit of ESB-bashing but they can still be useful for connecting up things like mainframes and 3rd party systems to your nifty REST/WS fabric.
• Exchanging metadata is a great way to negotiate communication parameters (reliability, security, synchronicity etc) between services and consumers.
• SOA is undebuggable – but it is testable – so a good testing discipline is essential.

What I think
• The insurance industry is the place I’ve worked with the most legacy out there (you can’t retire certain mainframe systems until all the policyholders die!) and the ESB attachment is just the right milk for this cookie.
• We as engineers contributed to the overselling as much as Neal’s peripherally technical managers did – I think we got carried away with the interest that a bit of technology was suddenly getting, and we were happy to be on the bandwagon since we usually struggle so hard to get technical things on the agenda – how could we not ride this wave?
• There are more benefits to SOA than reuse yet that’s all we ever seem to talk about. How about what it can do for scalability? Failure isolation? Concurrent change? Hot maintenance?
• Yes. BPEL. Terrifying.

The Big Summary
Overall I think this was a very good flagship event, my thanks to the organisers. The turnout was a good size – small enough to feel personal and allow plenty of networking opportunities, yet big enough to ensure a variety of engaging discussions.

The IASA mission is a worthwhile one, and one that I think we don’t do enough about in any technology discipline. Whether we’re talking about architects, system administrators or developers, how much time do we spend investing in our community? In ourselves as a profession? When was the last time you looked for someone to mentor? Went out of your way to share your ideas beyond your organisational boundaries? Visited a university to see what the talent you’ll have to work with in 5 years is actually learning? Investing in ourselves as an industry is undervalued, and I’m happy to be part of a group trying to address this.

If there is 1 thing I would change for next year, it would be the format of the talks. There are already enough conferences you can go to and watch someone meander through a slide deck (this was great meandering though!). If we change the focus so that speakers only use 50% of the allotted time and treat their role as setting the scene, then we could use the other 50% to get a real group discussion going on the topic at hand. I would certainly find that more valuable, and I would suggest that promoting high intensity peer discussions on tough technology and strategy issues would probably better serve the mission of establishing architecture as a profession.

I have been assured that you will eventually be able to find the slides from the formal sessions and keynotes here.

So that was IASA Connections. Tomorrow I’m off to UC Berkeley for the day, so probably best to ease back on the beer this evening (it makes it harder to maintain the illusion of cleverness I’ll need). Sayonara.

Wednesday 8 October 2008

IASA Connections - Day 2

Ceteris paribus, what comes after day 1 of a conference? Well, day 2 of course…

David Platt - teacher and author (and mad professor?), Rolling Thunder Computing
• Day 2’s keynote on why software sucks, adapted from his book of the same name.
• Lots of statistics on why users hate our software, mostly boiling down to the theory that software is built by geeks for geeks but actually used by regular folks.
• 2 million web users in 1994 became 1,200 million by 2006, and almost all are now laymen (in 1994 almost all were geeks themselves).
• Platt’s rule, “know thy user for he is not thee”
• Users don’t want to use your application, they want to have used it – so understanding that you provide a means to an end rather than an end in itself is a helpful thought pattern.
• Customers don’t buy software, they buy the outcome it’s intended to create, e.g. you don’t buy a drill from B&Q, you are really buying a hole.
• Translating extra clicks into cash (the salary cost of time spent clicking) is a useful way to demonstrate this to management (a hypothetical worked example follows this list).
• Suggested solutions:
1. Add a virgin to your design team, i.e. get someone who has never used your system before to try it and observe their challenges.
2. Break convention when needed – for example, why have a save button if you could just have the system save things as you go?
3. Don’t allow edge cases to reach mainstream – if something is really unlikely to be used by more than 1% of visitors, why include it in the main real estate?
4. Instrument carefully – watch user experience, track clicks, see what users actually click and how frequently, how do they arrive at certain pages/features?
5. Always question each design decision, “is it taking us closer to ‘just working’ or further away?”
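
On that clicks-into-cash point, a hypothetical back-of-envelope calculation (every number below is invented) shows how quickly the waste adds up:

# Hypothetical back-of-envelope numbers - purely illustrative
users = 2000          # staff using the tool
extra_clicks = 8      # unnecessary clicks per user per day
secs_per_click = 3    # time lost per click
working_days = 230
hourly_cost = 40.0    # fully loaded salary cost, per hour

hours_lost = users * extra_clicks * secs_per_click * working_days / 3600
print(f"{hours_lost:,.0f} hours/year = ${hours_lost * hourly_cost:,.0f}")
# -> 3,067 hours/year = $122,667

Swap in your own numbers; the point is that a few “harmless” extra clicks compound into a figure management can’t ignore.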

What I think
• Some of the talk didn’t apply to Betfair per se, as online gaming/gambling etc joins the ranks of things like porn (can I say that?) in software that’s used for the experience, rather than to go through a process to arrive at a destination.
• Most of the examples were based on software that shipped with pointless features or usability issues. What wasn’t addressed was how architects and engineers relate to this – I agree that you can make terrible usability errors, but these are usually decisions made by product managers and other stakeholders rather than the result of developers’ work. More material on how engineers can push back on these decisions would help us as a profession.
• Instrument carefully is good advice, and in my opinion critical to effective A/B testing. Knowing what your customers like best, coupled with the ability to try new layouts or versions in a lightweight way, is pretty important to keeping the best software out there (a minimal bucketing sketch follows).
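
By way of illustration, here’s a minimal deterministic bucketing sketch – my own invention, not anything Betfair or anyone else necessarily runs – that hashes a stable user id so each visitor always lands in the same variant:

import hashlib

# variant name -> share of traffic; weights are invented
VARIANTS = [("control", 0.5), ("new_layout", 0.5)]

def assign(user_id: str) -> str:
    # Map the id to a stable number in [0, 1), then walk the weights
    digest = hashlib.md5(user_id.encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for name, weight in VARIANTS:
        cumulative += weight
        if point < cumulative:
            return name
    return VARIANTS[-1][0]  # guard against rounding at the top end

print(assign("user-42"))   # same id -> same variant, every time
print(assign("user-43"))

The determinism matters: a user who flips between variants on every visit pollutes your measurements and gets a confusing experience.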

Ted Neward - independent software development architect and mentor, Neward & Associates
• What are architects really about and what does architecture mean?
• We tend to get carried away with “what ifs” – like “what if I want to change logging systems later?” – so we build in 96 logging APIs and pretty much use 1 or 2.
• Do we have architectural styles in the same way as building architecture has styles?
• Defining pragmatism – avoiding idealism in favor of something appropriate.
• There is no such thing as an NFR; they are all just requirements.
• What do architects do:
1. Understand (what’s the problem, what are the constraints)
2. Reassess (changing goals, changing constraints)
3. Explore (more potential solutions, new opportunities)
• Beware of the coolness trap – cool technology doesn’t mean appropriate technology.
• The 3 kinds of architects:
1. Infrastructure – datacenters, networks, storage
2. Enterprise – CTO high level leadership
3. Solutions – the “per project” answers to business questions
• Patterns are not architecture; they are simply a vocabulary we can use to have high-speed discussions about technical ideas.
• When designing systems, make big decisions first, then next most detailed, then next most detailed…
• Idea of architecture catalogue (the things you do as an architect)
1. Communication – transports, exchanges, and formats.
2. Presentation – UI, style, and delivery.
3. State management – durable vs. transient, context vs. process.
4. Processing – implementation, transactions, and shared data.
5. Resource management – location, registry, discovery.
6. Tools – languages, data formats.
• If you don’t explicitly define architecture it will be implicitly defined for you!
• Architecture is a bigger picture view than a developer would typically take.

What I think
• Not sure I agree with the bigger picture assertion, as I think this promotes the ivory tower perception of architects – and besides, successful software demands that everyone involved in building it understands the bigger picture.
• You need to consider future options, but take into account likelihood and difficulty to change and watch for entering diminishing returns!
• One of the reasons traditional architects don’t suffer the same difficulties we do is that the constraints and practicalities in real world construction are so much more visible. We need to put more effort into making these understood.
• I like the “NFRs are just requirements” point because so few people talk about scalability and availability etc as business benefits, but they most certainly are!
• Agree with the patterns point, too many people get hung up on them but they do not describe a working solution.
• While I always like to talk about defining architecture as a trade, I am not sure about the catalogue idea – some of those decisions (like presentation, data formats, and languages) are surely best made by those closest to the problem, i.e. the delivery engineers.

Dr Neil Roodyn - independent consultant, trainer and author
• Gave a talk called “it’s all wrong, change it all” – a rather provocative title that attracted me to the session…
• We know things are going to change, so how can we put ourselves in a position where this is as easy as possible?
• We’re in a change heavy industry, however, humans are naturally change averse so this is something we have to work hard at.
• Abstractions actually make it harder to change things, because they bind code to a particular abstraction – and people tend to feel emotional attachment to their implementations!
• Big visions are so important - always remember that’s what you’re delivering against, not the details.
• We get carried away with reusability – make sure the expected demand for it to be reusable justifies the cost of building it in a reusable way.
• How you design your system can make it more change-friendly or less change-friendly, this is how architecture can help.
• People are the biggest factor – there is a fine line between continuous improvement and gold plating.
• Don’t underestimate the malleability of code.

What I think
• People typically associate change with risk or loss. Part of the trick to getting buy in to ideas and making people happy to adopt changes is to address good-old-fashioned “what’s in it for me?”
• Abstraction layers scare me because we’re not getting smarter at the same rate that we’re abstracting away from the primitives. I think we have a problem as an industry: convincing people to learn to be engineers (computer scientists) rather than quick and dirty coders (learn Java in 28 days and get a job).

Paul Preiss – President, IASA
• Dealing with how we establish architecture as a profession in its own right.
• To be a profession something must have:
1. An identifiable body of knowledge
2. The ability to enter the trade without having to have been something else first
• Helping to realize architecture as a profession is a key purpose of the IASA group.
• The American Institute of Architects formed in 1857 because (sound familiar?):
1. It is difficult to solve real architecture problems due to lack of targeted resources
2. It is difficult to find experienced architects
3. It is difficult to be of maximum value to our employers due to lack of stable job definition
4. It is difficult to tell a qualified architect from an unqualified one
• Architects tend to be in one of 4 major groups:
1. Corporate
2. Vendors
3. Thought leaders
4. Service integrators
• Professional perspectives and skills taxonomy are resources we’re trying to build up – providing that ‘identifiable body of knowledge’
• The skills one must master to truly deserve the title of architect are:
1. The IT environment
2. Business technology strategy
3. Design
4. Human dynamics
5. Quality attributes
6. Software architecture
7. Infrastructure architecture

What I think
• Working towards formalizing architecture as a practice, a role, and a valuable and necessary part of a team is a massive undertaking – we have work to do here as an industry.
• Perhaps the key to this is working backwards from result? Maybe we need to ask ourselves “how do we test for architecture?” If we can work out what we’d be able to observe in an organization with this nebulous architecture thing in it vs. one that does not, then we’ll have an answer and some benefits to point at!
• In the short term we can’t test for architecture, so how do we find architects? Well, you can test for thinking patterns, values, and philosophies, so look for the ones that contribute the most to your organization’s definition of success.
• Perhaps we need a charter? Some publicly visible set of principles that anyone engaging a software architect should expect to be met?
• Good to see the ‘ities’ discussed as central to what we do.
• I still say it’s science Mr. Preiss – but pragmatic science! We solve real-world business problems by using our skill and experience to apply the right technology.

And with that I must once again apply my problem solving skills to the problem of unconsumed beer. See you tomorrow, same bat-time, same bat-channel...

Tuesday 7 October 2008

IASA Connections - Day 1

This week I skipped across the Atlantic to attend the International Association of Software Architects & Architect Connections conference in San Francisco. Between my failed attempts at getting a chapter off the ground in Romania and being a long-standing member of the London chapter, IASA has been a hugely valuable organization to be associated with. There are regular activities at most major IT hubs in the world, with a strong focus on sharing ideas and meeting other people with similar technology challenges.

For the next 3 days I have a fairly hefty schedule of people to meet and sessions to attend - so I'll try and throw up a brief summary post at the end of each day covering what's discussed. So let's dive straight into day 1:

David Chappell - serial technology writer and speaker, Chappell & Associates
• Gave the first day’s keynote on cloud computing, covering the basics (this is still a very misunderstood topic) and the direction over the next few years.
• Cloud computing falls into 3 main categories: cloud delivered products like Windows Live, attached services such as spam filtering and email antivirus, and cloud platforms like EC2 and S3.
• There will always be a mix of 'onsite' software and cloud based services in every organization in the future - so clouds will not rule the world!
• Advantages of cloud computing include lower costs and less financial risk (as capacity grows with demand, rather than having to build for peak from day 1).
• Disadvantages of cloud computing include trust issues, data protection, and regulatory requirements.
• Important to distinguish between consumer (free and ad supported and usually less reliable) and business (usually paid directly and covered by an SLA) cloud solutions.
• Underlying message is really about change and how hard it is - and how this relates to making the paradigm shift we need to get the enterprise to accept cloud as a valid technology.

What I think
• The cynical side of me can’t help but think we’re throwing things we've been doing for ages (like hosted email virus and spam checking) into the cloud bucket now that we’ve come up with a new word for it.
• I consider some of the constraints of building on a cloud platform to actually be benefits – forcing developers to abstract away from hardware means you’re building systems that are inherently more scalable, available, and hot maintainable (whether deployed in a cloud or onsite).
• Lack of enterprise strength SLAs is keeping a lot of organizations away from cloud computing. Perhaps we as engineers have an obligation to help the business start to look less at SLAs and more at actual performance? Most of the time a cloud infrastructure will trash a fixed internal hardware deployment in overall availability, yet the SLAs don’t yet demonstrate this.
• Cost is very much still a factor when moving to a cloud deployment, but it becomes a marginal cost consideration – you don’t want to run an expensive piece of code on every hit, since that directly increases your hosting bill. In a traditional model it doesn’t matter so much if expensive code runs most of the time, provided it uses capacity that is normally idle outside of peak – in the cloud it costs you every time you execute it.

Dr Neil Roodyn - independent consultant, trainer and author (and big fan of XP)
• The “capital A” Architect vs. the idea of “little a” architecture, talking about what the role really should be and how it works in agile.
• In software we sometimes borrow a little too heavily from industrial engineering.
• Architecture is a practice created to try and put predictability back into a process that is inherently unpredictable (software delivery).
• Traditionally we try to separate design from construction – i.e. have really smart architects design the system and then any idiots can code it – and how this is flawed thinking.
• Change is inevitable in software projects, most of our traditional processes require rigid and stable requirements in order to be successful but this is not real life.
• Do the best developers only end up architects because we romanticize the role too much? Or is it the only way they see they can get career progression? Wouldn’t we rather have the best coders coding?
• Maybe we need to start looking at developers more as craftsmen and less like engineers?
• Be careful how you incentivize people, because that’s what you’ll get!

What I think
• Something omitted here is that the quality of an engineering team’s output can never exceed the quality of its input, so there is a strong dependency on working well with customers to truly understand requirements and make the right tradeoffs.
• Incentive is a very valid point, if you measure architects on creating verbose design documents or UML diagrams then that’s what you’ll get – try measuring them on quality, working software.
• The people who know best how to solve problems are the people closest to their effects, so architecture has to be about harnessing the individual developer’s skill, experience, and exposure to the domain – not trying to tell them what to do.
• I quite like the label “engineer” because it stands for solving real world problems with technology – but I do agree the process does need to be considered to be more creative than it currently is. I recently read something that likened software development to film making, which I think is a powerful analogy.
• Why does IT have such a bad reputation? I don’t think it is as simple as failed projects; I think it is how they’re ‘washed up’ afterwards too. Working closer with customers along the way makes sure they’re aware of the impact of their decisions (changing/adding requirements), so you can easily end up with projects over time/budget which are not considered failures by the business – because they expected it and were involved along the way.

Randy Shoup - Distinguished Architect, eBay
• The architectural forces governing eBay are scalability, availability, latency, manageability, and cost.
• Their architectural strategies are partition everything, asynchronous everywhere, automate everything, and remember that everything fails!
• Data partitioning means dividing DBs by functional area (user, items, feedback), then sharding so that each logical DB is horizontally partitioned (by a key such as itemID), which provides scalability for write heavy data sets (a toy sketch follows this list).
• CAP - their system prefers availability and network partition tolerance and thus trades away consistency. Variable window for different features.
• Functional segmentation at application server layer follows the same pattern as the databases (selling, search, checkout).
• Most operations are stateless which means the most basic, off the shelf load balancers can be used to horizontally scale application server tier.
• When travelling through the system, what state is generated is accumulated in cookies, URLs, and a scratch (periodically flushed) DB.
• Patterns used:
1. Event dispatch (new item listed, item sold)
2. Periodic batch (auction close process, import 3rd party data)
3. Adaptive configuration (dynamically adjust consumer thread pools and polling frequency based on SLAs)
4. Machine learning (dynamically adapt the experience by collecting data on, for example, recommendations during the day and then trying different ones the next day)
5. Failure detection (activity data is sent on a multicast message bus and listeners watch for certain error messages)
6. Rollback (the entire site is redeployed every 2 weeks and no change can be made that cannot be undone; features are rolled out in dependency tree order)
7. Graceful degradation (switch off non essentials, defer processing to more guaranteed async messages)
• Message reliability dealt with via framework.
• Code deployment decoupled from feature deployment, so new versions can be rolled out with dormant functionality (they call this wired off) that can then be turned on gradually.
• NOT agile and don’t claim to be!
• Completely redeploy entire site every 2 weeks – although this is more of a constant, rolling upgrade than the big bang it sounds like.
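
To make the functional-then-horizontal partitioning concrete, here’s a toy Python sketch. The shard counts, names, and simple modulo scheme are my own illustration, not eBay’s actual implementation:

# Functional split first: each area gets its own pool of shards.
FUNCTIONAL_DBS = {"user": 4, "item": 8, "feedback": 2}  # area -> shard count

def shard_for(area: str, key: int) -> str:
    # Horizontal split second: a stable key deterministically picks a shard.
    n = FUNCTIONAL_DBS[area]
    return f"{area}_db_{key % n}"

print(shard_for("item", 123456789))   # item_db_5
print(shard_for("user", 42))          # user_db_2

Plain modulo mapping makes resharding painful (every key moves when the shard count changes), which is why schemes like consistent hashing exist – but it shows the core idea: writes for a given key always land on the same, predictable home.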

What I think
• A very agreeable talk, particularly around deployment, versioning, and A/B testing.
• The architectural forces are very similar to our 6 principles (availability, scalability, maintainability, quality, security, and user experience) and really are just different words to describe the same good practices needed to meet the challenges we have in common.
• State is the enemy of scalability, and it is refreshing to see someone tackling this well - instead of saying "how can we build a distributed state tracking system this big?" they're saying "how can we build our system so that the functionality does not depend upon a central state system?"

Rob Daigneau - Chief Architect, SynXis
• Anti-patterns - human behavior in software projects and how to deal with it.
• Overdependence on methodology as the solution to all problems and a guarantee of smooth project execution – not enough focus on personalities and simple human nature.
• Individual anti-patterns:
1. Cyber addict – using only IM or email etc for communication is very efficient, but not necessarily effective as you can lose the meaning of your message.
2. Fortress – not sharing knowledge and being secretive leads to good old fashioned unsustainable codebases and key man dependencies.
3. Ugly baby – too little care taken with how and when criticism is issued can lead to valid advice and points being ignored, to the detriment of the project.
4. Perfectionist – believing that masterpieces are the only way every time can lead to unnecessary gold plating, artificially inflating the cost and time to market beyond what’s beneficial to the organization.
5. Conquistador – being overzealous about "the one way to do it" can mean failure to make good tradeoffs.
6. Workaholic – not having the right work/life balance will only lead to burn out and mistakes made.
• Team anti-patterns:
1. Hazing – throwing people into the deep end with very little coaching or attention to integrating with the existing team often means they take a long time to become productive.
2. Fight club – the organization can become a casualty in pointless battles between architects and developers, developers and testers etc unless teams learn to work together.
3. Wishful thinking – working out some estimates that show a certain milestone can’t be reached in a certain timeline means just that, no matter how much you’d like it not to.
4. Firing squad – if people are scared to disagree or the culture doesn’t encourage the introduction of new ideas; the organization can be robbed of great value that stays locked away in engineers’ heads.
5. Too many cooks – if there are no clear boundaries then consensus becomes a huge issue and teams become paralyzed, unable to make progress because of endless circular discussions.
• Leadership anti-patterns:
1. Deer in headlights – if a leader won’t (can’t?) make a decision then the company stands still and competitors gain ground.
2. Cliff jumpers – taking risks for the sake of risks or making decisions on dangerously incomplete data can waste time, money and patience.
3. Spin zone - talking the talk without walking the walk gets transparent pretty quickly.
4. Drill sergeant – exercising authority in an aggressive way makes very short-term gains and never gets a team behind a goal.
5. False summits - artificial deadlines or objectives can be dangerous, if the activity doesn’t turn out to be committed to, a loss of morale and trust in management can result.

What I think
• Reasonably fundamental stuff for experienced management but good to see it getting airtime in this crowd; architects typically come up via technical tracks and probably don’t spend enough time on their influencing and consensus building skills, given how important a part of the role this is.
• This would have been a good opportunity to talk more about incentive and how it drives behavior. A lot of anti-patterns can be traced back to KPIs that encourage/reward the wrong things.
• Preventing burnout and keeping sustainable development going are some key reasons I like SCRUM – working in sprints gives you a cyclic mix of high intensity work interspersed with more passive, thoughtful phases.
• I liked the Gerald Weinberg quote, "no matter how it looks at first, it is always a people problem".

Ken Spencer – Practice Manager, Solid Quality Mentors
• How agile applies to traditional, formal technical projects, using a current project for the US government as a case study.
• Manufacturing background, so most examples look back to this experience.
• Looking at why most projects fail – largely because people fundamentally keep doing what they know.
• Given that we almost automatically repeat what we know, the good news is that being successful once significantly increases the likelihood of being successful again.
• Spend a commensurate amount of time planning - you need plans, but you need to know when to just start work and learn from doing.
• Good enough – what are the testing criteria and when are you into diminishing returns?
• The ‘love curve’ is a cycle you go through when taking over a failed project: initially distrusted through bad experience, then looked upon as a savior, then reality sets in, then they see sustainable success.
• Getting back to basics – when things go off the rails, look first at the simplest units of work the team ought to be doing; this is usually where you will find your answers.

What I think
• Largely based on the agile manifesto, so we have a pretty strong philosophical connection here.
• Slightly damning of waterfall; at the end of the day it isn’t all bad and isn’t evil. I think it’s quite suitable for very repeatable, commodity projects where long-horizon predictability is possible and the phase space of all possible features is finite.
• Not sure I like the label ‘formalized’ when used to describe waterfall vs. agile, as this kind of indicates less control, visibility, and organization around agile when in fact I think these are some of its strengths.

Whew! Time for beer now, see you tomorrow…