Monday 27 October 2008

Fail More, Neo

If you expect everything to work all the time, if you believe everything will be perfect first time around, if you think everything you try and every idea you have will always be brilliant - then you are living in some kind of delusional hyper-fantasy. Stay there, trust me, it's better than out here.

But if you decide to take the red pill, and join us here in the real world, there are a few things you should know that will make your integration easier.

Firstly, machines have not taken over the world, and human beings are not just a kind of great big box of Duracell batteries to them. But this is mostly because we can't make machines awesome enough yet, see below...

Out here in the real world, we mostly learn by doing things (until we can make those machines that beam kung-fu directly into our brains). We try something out, observe the results, and then we do it again with some slight variations. Those little tweaks we make as we try again and again are based on the pain we feel each time it doesn't work out. We call this a feedback loop, and we've learned this way for thousands of years - if there was a better way to do it, a way we could just "get it right first time", then trust me, we'd be all over that by now!

We'll be honest with you - you can, even in the real world, get by with very little of such trial-and-error experience if you want. A well-established pattern of mundane mediocrity leading directly to obsolescence is readily available. In fact, you'd be surprised how popular a choice this actually is! Let's call it the grey pill.

Assuming you don't fancy the drab, lackluster, second-rate existence the grey pill guarantees, what else can you do with the humans under your command?

Firstly - and most importantly - allow them to try. Don't expect every idea to be killer, or everything to work out first time around. Allow - no wait - require experimentation and iteration. It is the number 1 way your humans will expand their understanding, and believe it or not, a string of failures is the shortest path to that one thing that does work brilliantly. Thomas Watson, an old business-machine-building human we have out here, once said “if you want to succeed, double your failure rate”.

You might notice that some of your humans seem a little reluctant to embrace the idea - especially those recently defrosted from 'grey pill' institutions. How can you spruce them up into lean, mean, mistake-making machines?

People often fear the consequences of failure more than failure itself. So the best course of action is to make the consequences of failure something to look forward to - not something to hide from, and cover up. Why not try celebrating failure? If one of your humans has an idea, tries it out, and then brings back some knowledge and experience to the rest of the team, then you are much better off than you were before. Make a big deal out of it, demonstrate to the rest of the team that it's OK to try. Stretching yourself won't be punished.

That doesn't mean rewarding scattergun bravado - what you're trying to encourage is a culture of balanced risk-taking and a methodical approach.

I like the old saying "fail fast, fail cheap" because as a statement, it gives permission to try new things, yet it also prescribes some basic guidelines. Take the shortest path you can to discovering your idea doesn't work, and invest the minimum you need to reach that point. After all, you'll need those resources for your next idea.

So, thanks for joining us out here in the real world. We really hope you'll make the most of it by embracing failure and trying new things out, because this is the only path to discovery and success - oh and we'll never build those dominant supercomputers to have a nifty war with if we don't believe in innovating!

Maybe this post will be a failure. Maybe my message won't get into the dreamworld (where we don't believe in failure) or grey pill land (where we don't try just in case). But do you know what? I won't mind if it doesn't - because I will have just eliminated one of the ways it doesn't work, and that's a step forward...

Friday 24 October 2008

Availability or Control?

In the web business, we usually consider availability to be paramount - and given that motivation, we're getting pretty good at things like graceful degradation and partial failure. But now that you've pulled your system apart and neatly isolated all the features, how do you cope with the situation where no service is preferable to partial service?

This can happen. Consider, if you will, a trading system operated by a team of risk managers. You have built the system to be fault tolerant and to allow partial failures - and this usually works out great - but what happens if a failure in the infrastructure or application means the risk managers can no longer administer the system? It's still running publicly (thanks to your awesome failure isolation) so customers are still buying and selling. You can't change your prices or respond to changing market conditions - uh oh - exposure. What do we do?

One answer is a word we don't like - especially when we've just built a reasonably decoupled system - dependency. Yuck, but there is no shame in creating some intentional dependencies that support the business rules. If you never want to execute trades unless you can manage your position, then what is the advantage of running the trading system without the liability tool? None - if anything it's an undesirable risk.

So draw up some service dependencies, or make the applications depend on their monitors at runtime. It might not appeal to how we'd like to run the system, but the truth is it accurately reflects how we'd like to run the business.
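To make that concrete, here's a minimal sketch of what such an intentional dependency could look like, in Python. The health-check URL, polling interval, and method names are all hypothetical - the real plumbing will be whatever your estate already has - but the point is simply that the customer-facing service checks its management tooling at runtime and suspends itself when the people steering it go blind.

```python
import threading
import time
import urllib.request

RISK_CONSOLE_URL = "http://risk-console.internal/health"   # hypothetical endpoint

def risk_console_healthy(timeout=2.0):
    """True only if the risk managers' tooling answers its health check."""
    try:
        with urllib.request.urlopen(RISK_CONSOLE_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

class TradingService:
    """Customer-facing service with a deliberate runtime dependency on its management tool."""

    def __init__(self):
        self.accepting = True
        threading.Thread(target=self._watchdog, daemon=True).start()

    def _watchdog(self, interval=5.0):
        # No risk tooling, no new trades - the dependency is intentional.
        while True:
            self.accepting = risk_console_healthy()
            time.sleep(interval)

    def place_bet(self, bet):
        if not self.accepting:
            raise RuntimeError("market suspended: risk tooling unreachable")
        # ...normal pricing/matching logic would go here...
        return {"status": "accepted", "bet": bet}
```

Deliberately refusing to serve feels wrong after all that decoupling work, but it encodes the business rule directly instead of leaving it to a 3am judgement call.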

Monday 20 October 2008

And the problem with agile is...

I get drawn into a lot of debates about the pros and cons of agile (or maybe I just like a good fight?) and the standard attack pattern around quality is starting to become sooooo passe. So I'm going to tackle it here, in the hope that I can use a one-link defense in future. Kind of like TinyURL but for arguments.

Firstly, let me go on record by saying that I most certainly am an agile proponent, but at the same time I don't believe it is the single, solitary answer to how we should do everything in technology. You have to have more than one tool in your kit.

So let me go ahead and get this off my chest:

Agile != doing a poor job

Agile doesn't mean doing no testing, it means doing just enough testing to ensure appropriate quality.

Agile doesn't mean doing no documentation, it means doing just enough documentation to effectively convey meaning and understanding.

Agile doesn't mean doing no architecture, it means doing just enough design work to understand how to start iterating the solution.

Agile doesn't mean having no requirements, it means having just enough detail for the foreseeable future and embracing change beyond that.

It is also about appreciating that just enough can mean different things in different projects, different business problems, and different parts of the system.

Come on people, these aren't new ideas!

To be fair, I can see where some of this criticism comes from. There are cowboys out there who use agile as an excuse to dispense with necessary diligence or take ill-advised shortcuts. When it all comes crashing down, it sounds better for the individuals concerned to say "hey, we were doing agile" than "hey, we couldn't really be bothered to plan this work properly". The fact is that crappy engineers exist. There is no such thing as an SDLC that turns good engineers into poor engineers - we just started accepting that as an excuse.

If agile does have something to answer for here, it is that this kind of poor work is much more visible much earlier (if concealment is your game). The reality is the same team would make the same (or worse) errors in judgement regardless of the approach they used, you just wouldn't know about it for 12 months - and by then, who knows whose fault that was?

Don't accept output any crappier than you otherwise would just because it has agile stamped on it - if anything you should expect better, because you will have had more opportunities for course correction along the way.

Thursday 16 October 2008

What Your Network Guy Knows

So you're getting into distributed systems; maybe you've got some real scalability issues on the horizon, or perhaps you want to better isolate failure, or be able to cope with more concurrent change in the system. So how do you do this webscale thing then?

Time for some homework. Listening to some vendor pitches, maybe reading some books, or getting an expensive consultant or two in for a while (I'll take your money from you if that's what you want) might possibly do it. But before all this gets out of hand, did you realize you're probably sitting right next to a distributed systems fountain of knowledge? You already have someone, right there in your team, who has spent their entire career working with the largest eventually consistent multi-master distributed systems in the world - the trick is they might not even know it themselves - and that someone is your network guy.

Let's test this assertion against a couple of technologies that network engineers deal with every day, and look at what we can take from them into our distributed systems thinking.

How about something fundamental - routing protocols. Networking gurus have a small army of acronyms at their disposal here: OSPF, EIGRP, IS-IS, BGP, and the sinister-sounding RIP. These are essentially applications that run on network devices (and sometimes hosts themselves), map out the network topology, and provide data for devices to make packet forwarding decisions.

So what can we import from this technology?
1. Partitioning - networks are broken down into manageable chunks (subnetworks), which scope load (broadcasts), ringfence groups of systems for security, and limit traffic across slow and expensive links.
2. Scalability - routing protocols allow massive global networks to be established by summarizing contiguous groups of networks again and again, and ensuring any node can establish end-to-end connectivity without having to understand every single path in the network (just a default route).
3. Failure isolation - subnets are bordered by routing protocols, which form a natural boundary to most forms of network malarkey. If a network becomes unpredictable (flapping), some routing protocols can mark it down for a predetermined time, which aids local stabilization and prevents issues spilling over into healthy networks (a rough software analogue of this is sketched just after this list).
4. Self healing - when a failure occurs in a network or a link between networks, routing protocols observe the problem (through missed hellos or interfaces going down) and take action to reestablish reachability (working around the problem using alternate paths etc). Each node recomputes its understanding of the networks it knows how to reach, learns who its neighbors are and the networks they can reach, and then returns to business as usual via a process called convergence (a really simple study in eventual consistency and variable consistency windows).
5. Management - for the most part, networks separate their control messages from the data they transport. A good practice, especially when combined with techniques like QoS, because it significantly reduces the risk of losing control of the infrastructure under exceptional load conditions.
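As promised in point 3, here's a rough software analogue of flap damping, sketched in Python under my own assumptions - the thresholds, timings, and class name are all invented. Think of it as marking a flaky downstream dependency down for a hold-down period so its instability doesn't spill into the rest of the system.

```python
import time

class HoldDown:
    """Rough analogue of route flap damping, applied to a flaky dependency.

    After max_failures failures inside window seconds, the dependency is
    marked down for hold_down seconds and callers fail fast instead of
    letting the instability propagate.
    """

    def __init__(self, max_failures=3, window=30.0, hold_down=60.0):
        self.max_failures = max_failures
        self.window = window
        self.hold_down = hold_down
        self.failures = []          # timestamps of recent failures
        self.down_until = 0.0

    def available(self):
        return time.monotonic() >= self.down_until

    def record_failure(self):
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            self.down_until = now + self.hold_down    # dampen the flapping

    def call(self, fn, *args, **kwargs):
        if not self.available():
            raise RuntimeError("dependency held down, failing fast")
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.record_failure()
            raise
```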

Now let's look at something at the application layer - DNS. This should be a somewhat more familiar tool (or you're kind of reading the wrong blog) and we touch it quite regularly, but we probably don't appreciate what goes on in the background. At its most basic level, DNS is a client/server system for mapping human-readable hostnames to machine-friendly IP addresses. Oh, but it's so much more...

So what can we import from this technology?
1. Partitioning - DNS is hideously, frighteningly big, there are hundreds of thousands of nodes in this system, from the dozen or so root servers all the way down to the corporate internet access edge servers. It is a good example of dividing up a problem; to find us you'd work right to left through a fully qualified domain name, starting with the "." (root), we're in the "com" container (hosted by a registrar), then the "betfair" container (hosted by us), and finally you'd get back some data from a record matching "www" and arrive at our place.
2. Scalability - did I mention DNS is big? DNS uses a classic combination of master/slave nodes and caching on both the client and server side to scale out. At the corporate edge, DNS proxies resolve addresses on behalf of internal clients and keep answers in a local cache; ISPs and those who run their own zones keep a number of slaves (many read-only) and spread queries out amongst them; and finally a time-to-live (TTL) is set on query results, permitting client-side caching.
3. Resilience - clients can be configured with a list of servers, which they will cycle through should they receive no answer. Additionally, the DNS protocol is stateless, making it easy to move servers around hot and to load balance using simple, lightweight algorithms (a toy resolver showing these habits follows this list).
4. CAP - DNS definitely prefers availability over consistency, the window for an updated record to propagate around the internet being ~24hrs in most cases. It's also highly tolerant of network segmentation - individual servers are happy to live separated from the rest of the DNS infrastructure for long periods, answering queries, and then catching up with all the changes in the zones they host once connectivity is reestablished.
5. Operations - the hierarchical way the namespace is organized is perfectly matched to how authority is delegated. If you're going to have a massive system spread around the globe, you've got to think about how you're going to operate it, and the DNS model for this is based on allocating administration along with ownership. This gives complete flexibility and control to namespace owners without risking the integrity of the system as a whole, and lets us operate the biggest distributed system in the world without employing the biggest IT team in the world.
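And the toy resolver promised in point 3 - a Python sketch of the two DNS habits that translate most directly into application code: walk a list of servers when one goes quiet, and cache answers for their TTL. The server addresses and the query function are stand-ins of my own, not a real DNS implementation.

```python
import time

class TinyResolver:
    """Toy lookup client: server fallback plus TTL-based client-side caching."""

    def __init__(self, servers, query_fn, ttl=300):
        self.servers = list(servers)   # e.g. ["10.0.0.2", "10.0.0.3"]
        self.query_fn = query_fn       # query_fn(server, name) -> answer, raises OSError on failure
        self.ttl = ttl
        self.cache = {}                # name -> (answer, expires_at)

    def resolve(self, name):
        hit = self.cache.get(name)
        if hit and hit[1] > time.monotonic():
            return hit[0]                       # answer still within its TTL
        last_error = None
        for server in self.servers:             # cycle through on failure
            try:
                answer = self.query_fn(server, name)
                self.cache[name] = (answer, time.monotonic() + self.ttl)
                return answer
            except OSError as err:
                last_error = err                # dead server, try the next one
        raise RuntimeError(f"no server answered for {name}") from last_error
```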

So buy your network guy a coffee. Ask him how his world works. If you can draw the philosophical parallels, it might be the most valuable couple of hours you've spent in ages.

Oh and by the way - distributed systems are all about the network, so you're going to need a friend here anyway...

Sunday 12 October 2008

Is Leadership a Noun or a Verb?

Is technical leadership something you know or something you do? Is it a finite thing you can point to, and check off a list, or is it an ongoing process, more akin to a journey? Let's look at what it means in practice...

Firstly you have to have some answers - or better yet know the right questions. You need to have a vision, and preferably some ideas on how to break that down into Big Goals. You need to define what success looks like for your team, and know what sorts of things contribute to that success and what sorts of things distract you from it. Maybe this is leadership the noun?

All good work so far, but worth nothing to your organization if no one knows about it - who is it that's supposed to be working towards these goals? Who is it that you depend upon to realize your vision?

So the next thing you need is for everyone to proverbially Get On Board. You have to communicate that vision as simply as possible and help every individual understand exactly how they can contribute to it. The whole team has to get emotional about it, believe in it, and live it through how they do their jobs each day. Perhaps this is leadership the verb?

So perhaps the answer is both? I certainly think so. Share an empty vision or set the wrong goals and you won't get where you need to be. Keep quiet about your plans and how they can be achieved and you rob the organization of huge value.

Oh and don't forget to keep your eye on the horizon - things change and from time to time that means adapting your strategy or losing ground!

Thursday 9 October 2008

IASA Connections - Day 3

Day 1, day 2, and now day 3:

Scott Ambler – Practice Leader in Agile Development, IBM Software Group
• Data modelling from an agile perspective.
• Don’t get too hung up on purist agile, good techniques are good techniques regardless of where they come from.
• Software development is more of a craft or an art than engineering – hence we tend to apply some practices that don’t make things the easiest for us.
• Agile is becoming mainstream – you cannot dodge it – and data is not going to change from being an asset, so how can these things coexist?
• Data and quality are 2 words that are hardly ever seen together, yet this is so important.
• In most “Big Requirements Up Front” projects, 45% of features never used, 19% of features rarely used, so this is effectively a waste of over half the development budget.
• 66% of teams work around internal data groups, most common reasons are difficulty in engagement, responses too slow, and development teams don’t understand the value. So data architects have work to do.
• Modelling is like anything else, it just needs to be good enough for the task at hand (repeatable results, not repeatable processes).
• If you can’t get active stakeholder participation, then cancel the project because all that will happen is disaster
• Clear box testing is just as important as black box testing, because you can test triggers, views, constraints, and referential integrity rather than just observable functionality.
• Do continuous integration of databases – there is nothing special about data so why shouldn’t it be treated the same?
• Make it easy for people to do the right thing – loads of verbose documentation and white papers are less likely to be effective than a small number of runnable assets and some 1:1 support to integrate them.

What I think
• Obviously Scott was wearing his data hat this time, but clearly the whole session was predicated on the assumption that relational databases are themselves the right solution…
• Really like the “repeatable results not repeatable processes” phrase, it is such a powerful idea – I am always battling the ‘1 process to rule them all’ crowd.
• Probably best whiteboard walkthrough of the “karate school” example I’ve ever seen.
• My approach to modelling has always been to define things in order of ‘most difficult to change later’ to ‘least difficult to change later’ so when you run out of steam the unknowns you’re left with are the cheapest ones.
• Abstractions – a lot of the examples in the talk were based around the assumption that applications directly accessed data, we need to think more about how we can build systems that don’t need to know details about schemas etc (access via object etc).
• Totally agree with the active stakeholder participation thing, if it’s important enough for them to expect you to deliver it then it’s important enough for them to invest themselves in it.

Dr Neil Roodyn - independent consultant, trainer and author
• A session titled “software architecture for real time systems” was about patterns for performance and performance tuning.
• Hard vs. soft real-time distinctions important.
• Time is so often not considered in system design – we think about features and quality and failure but so little about latency.
• Automated failure recovery is so much more important in real time computing because you cannot stop to allow human intervention upon failure.
• There are some strong similarities between real time computing thinking and distributed systems thinking:
1. Consistency
2. Initialisation
3. Processor communication
4. Load distribution
5. Resource allocation
• Asynchronous is kind of “cheating”, as returning responses before the work is done creates the illusion that actions have already completed.
• The 3 most important considerations in real time computing are time, time, and time (haha very good Neil).
• Common software tricks and patterns for real time systems (obviously assuming real time performance is your overriding requirement):
1. Use lookup tables for decision making
2. Use fixed size arrays
3. Avoid dynamic memory allocation
4. Reduce number of tasks in the system
5. Avoid multithreaded design
6. Only optimise popular scenarios
7. Search can be more efficient than hash
8. Use state machines
9. Timestamps instead of timers
10. Avoid hooks for future enhancements
11. Avoid bit packed variable length messages
12. Reduce message handshakes
13. Avoid mixed platform support
14. Minimise configurable parameters
• Overall know your system and approach performance tuning scientifically, observe where time is lost and spend energy there, don’t just guess.
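A couple of those tricks are easier to show than describe, so here's a tiny Python sketch of points 1 and 8 combined - a table-driven state machine. The states and events are invented for illustration; the idea is that the hot path is a single lookup that can be verified exhaustively offline, rather than a pile of branching logic.

```python
# Points 1 and 8 combined: the transition table is data, the hot path is a lookup.
# States and events here are made up purely for illustration.
TRANSITIONS = {
    ("idle",    "order_in"): "pricing",
    ("pricing", "priced"):   "quoting",
    ("pricing", "reject"):   "idle",
    ("quoting", "filled"):   "idle",
    ("quoting", "timeout"):  "idle",
}

def next_state(state, event):
    # One lookup instead of nested branching; the table can be checked offline.
    return TRANSITIONS.get((state, event), state)

assert next_state("idle", "order_in") == "pricing"
assert next_state("quoting", "timeout") == "idle"
assert next_state("idle", "timeout") == "idle"      # unknown events leave state unchanged
```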

What I think
• When we think about SLAs for latency we have to make sure we consider time from the user’s perspective – if you have a very fast back end but it takes ages for results to render for users, then is it really high performance?
• Even if you have a few processes in your system that need to be real-time, chances are the majority of your system does not, so don’t be afraid to mix memes because if you make the whole thing real time end-to-end you might be unnecessarily sacrificing some scalability or maintainability.
• Totally agree with points on developers needing to know more about the tin their systems run on and how this will lead to better software overall.
• I can’t help but think we’re getting lazy from our (relative) lack of constraints – back when you had 32K to play with you really thought hard about how you used that memory, when you had to load everything from tape you really planned that storage hit…

Neal Ford – Software Architect and Meme Wrangler, Thoughtworks
• Fairly by-the-book session on “SOA won’t save the world” but via the humorous and engaging analogy of Chuck Norris facts.
• “Peripherally technical manager” types are the key causes for SOA overpromising.
• Good definition of a service as a unit of coarse-grained, self-contained business functionality.
• The tension will always be on between centralised planning and tactical action, so you need to learn how to plan ahead without the planning becoming a constraint on business agility.
• Beware of “cleaning up” the messy communications paths in the enterprise by hiding them inside a box (ESB) – you still have the same underlying problems, but you suddenly have less visibility of them.
• Beware of the difference between ‘standards based’ and ‘standardised’ i.e. most vendor ESB solutions share some common basic standard foundations but the functionality has been extended in a proprietary way – so it can still mean lock in.
• Keep track of the number of exceptions you’re granting against the governance you have in place – too many and you might have excessive governance.
• Using ubiquitous language is a must; perhaps even have a formal dictionary for each project.
• The business must stay involved in integration projects, not just initiate them.
• We all like a bit of ESB-bashing but they can still be useful for connecting up things like mainframes and 3rd party systems to your nifty REST/WS fabric.
• Exchanging metadata is a great way to negotiate communication parameters (reliability, security, synchronicity etc) between services and consumers.
• SOA is undebuggable – but it is testable – so a good testing discipline is essential.

What I think
• The insurance industry is the place I’ve worked with the most legacy out there (you can’t retire certain mainframe systems until all the policyholders die!) and the ESB attachment is just the right milk for this cookie.
• We as engineers contributed to the overselling as much as Neal’s peripherally technical manager did – I think we got carried away with the interest that a bit of technology was suddenly getting and we were happy to be on the bandwagon as we usually struggle so hard to get technical things on the agenda – how could we not ride this wave?
• There are more benefits to SOA than reuse yet that’s all we ever seem to talk about. How about what it can do for scalability? Failure isolation? Concurrent change? Hot maintenance?
• Yes. BPEL. Terrifying.

The Big Summary
Overall I think this was a very good flagship event, my thanks to the organisers. The turnout was a good size – small enough to feel personal and allow plenty of networking opportunities, yet big enough to ensure a variety of engaging discussions.

The IASA mission is a worthwhile one, and one that I think we don’t do enough about in any technology discipline. Whether we’re talking about architects, system administrators or developers, how much time do we spend investing in our community? In ourselves as a profession? When was the last time you looked for someone to mentor? Went out of your way to share your ideas beyond your organisational boundaries? Visited a university to see what the talent you’ll have to work with in 5 years is actually learning? Investing in ourselves as an industry is undervalued, and I’m happy to be part of a group trying to address this.

If there is 1 thing I would change for next year, it would be the format of the talks. There are already enough conferences you can go to and watch someone meander through a slide deck (this was great meandering though!). If we change the focus so that speakers only use 50% of the allotted time and treat their role as setting the scene, then we could use the other 50% to get a real group discussion going on the topic at hand. I would certainly find that more valuable, and I would suggest that promoting high intensity peer discussions on tough technology and strategy issues would probably better serve the mission of establishing architecture as a profession.

I have been assured that you will eventually be able to find the slides from the formal sessions and keynotes here.

So that was IASA Connections. Tomorrow I’m off to UC Berkeley for the day, so probably best ease back on the beer this evening (it makes it harder to maintain the illusion of cleverness I’ll need). Sayonara.

Wednesday 8 October 2008

IASA Connections - Day 2

Ceteris paribus, what comes after day 1 of a conference? Well, day 2 of course…

David Platt - teacher and author (and mad professor?), Rolling Thunder Computing
• Day 2’s keynote on why software sucks, which was adapted from his book by the same name.
• Lots of statistics on why users hate our software, mostly boiling down to the theory that software is built by geeks for geeks but actually used by regular folks.
• 2 million web users in 1994 had become 1200 million by 2006; and almost all are now laymen (in 1994 almost all were geeks themselves).
• Platt’s rule, “know thy user for he is not thee”
• Users don’t want to use your application, they want to have used it – so understanding that you provide a means to an end rather than an end in itself is a helpful thought pattern.
• Customers don’t buy software, they buy the outcome it’s intended to create, e.g. you don’t buy a drill from B&Q, you are really buying a hole.
• Translating extra clicks into cash (the salary cost of time spent clicking) is a useful way to demonstrate this to management.
• Suggested solutions:
1. Add a virgin to your design team, i.e. get someone who has never used your system before to try it and observe their challenges.
2. Break convention when needed – for example, why have a save button if you could just have the system save things as you go?
3. Don’t allow edge cases to reach mainstream – if something is really unlikely to be used by more than 1% of visitors, why include it in the main real estate?
4. Instrument carefully – watch user experience, track clicks, see what users actually click and how frequently, how do they arrive at certain pages/features?
5. Always question each design decision, “is it taking us closer to ‘just working’ or further away?”

What I think
• Some of the talk didn’t apply to Betfair per se, as online gaming/gambling etc joins the ranks of things like porn (can I say that?) in software that’s used for the experience, rather than to go through a process to arrive at a destination.
• Most of the examples are based on software that shipped with pointless features or usability issues. What wasn’t addressed was how architects and engineers relate to this – I agree that you can make terrible usability errors, but these are usually decisions made by product managers and other stakeholders rather than the result of developers’ work. More material on how engineers can push back on these decisions would help us as a profession.
• Instrument carefully is good advice, and in my opinion is critical to effective A/B testing. Knowing what your customers like best, coupled with the ability to try new layouts or versions in a lightweight way, is pretty important to keeping the best software out there.

Ted Neward - independent software development architect and mentor, Neward & Associates
• What are architects really about and what does architecture mean?
• We tend to get carried away with “what ifs” – like “what if I want to change the logging system later?” – so we build in 96 logging APIs and pretty much use 1 or 2.
• Do we have architectural styles in the same way as building architecture has styles?
• Defining pragmatism – avoiding idealism in favor of something appropriate.
• There is no such thing as NFR, they are all just requirements.
• What do architects do:
1. Understand (what’s the problem, what’s the constraints)
2. Reassess (changing goals, changing constraints)
3. Explore (more potential solutions, new opportunities)
• Beware of the coolness trap – cool technology doesn’t mean appropriate technology.
• The 3 kinds of architects:
1. Infrastructure – datacenters, networks, storage
2. Enterprise – CTO high level leadership
3. Solutions – the “per project” answers to business questions
• Patterns are not architecture; they are simply a vocabulary we can use to have high-speed discussions about technical ideas.
• When designing systems, make big decisions first, then next most detailed, then next most detailed…
• Idea of architecture catalogue (the things you do as an architect)
1. Communication – transports, exchanges, and formats.
2. Presentation – UI, style, and delivery.
3. State management – durable vs. transient, context vs. process.
4. Processing – implementation, transactions, and shared data.
5. Resource management – location, registry, discovery.
6. Tools – languages, data formats.
• If you don’t explicitly define architecture it will be implicitly defined for you!
• Architecture is a bigger picture view than a developer would typically take.

What I think
• Not sure I agree with the bigger picture assertion, as I think this promotes the ivory tower perception of architects – and besides, successful software demands that everyone involved in building it understands the bigger picture.
• You need to consider future options, but take into account likelihood and difficulty to change and watch for entering diminishing returns!
• One of the reasons traditional architects don’t suffer the same difficulties we do is that the constraints and practicalities in real world construction are so much more visible. We need to put more effort into making these understood.
• I like the ‘no such thing as NFRs’ point because so few people talk about scalability and availability etc as business benefits, but they most certainly are!
• Agree with the patterns point, too many people get hung up on them but they do not describe a working solution.
• While I always like to talk about defining architecture as a trade, I am not sure about the catalogue idea – some of those decisions (like presentation, data formats, and languages) are surely best made by those closest to the problem; i.e. the delivery engineers.

Dr Neil Roodyn - independent consultant, trainer and author
• Gave a talk called “it’s all wrong, change it all” a rather provocative title that attracted me to the session…
• We know things are going to change, so how can we put ourselves in a position where this is as easy as possible?
• We’re in a change heavy industry, however, humans are naturally change averse so this is something we have to work hard at.
• Abstractions actually make it harder to change things, because they bind code to a particular abstraction – and people tend to feel emotional attachment to their implementations!
• Big visions are so important - always remember that’s what you’re delivering against, not the details.
• We get carried away with reusability – make sure the expected demand for it to be reusable justifies the cost of building it in a reusable way.
• How you design your system can make it more change-friendly or less change-friendly, this is how architecture can help.
• People are the biggest factor – there is a fine line between continuous improvement and gold plating.
• Don’t underestimate malleability of code.

What I think
• People typically associate change with risk or loss. Part of the trick to getting buy in to ideas and making people happy to adopt changes is to address good-old-fashioned “what’s in it for me?”
• Abstraction layers scare me because we’re not getting smarter at the same rate that we’re abstracting away from the primitives. I think we have a problem as an industry, which is convincing people to learn to be engineers (computer scientists) over quick and dirty coders (learn Java in 28 days and get a job).

Paul Preiss – President, IASA
• Dealing with how we establish architecture as a profession in its own right.
• To be a profession something must have:
1. An identifiable body of knowledge
2. A way to enter the trade without having to have been something else first
• Helping to realize architecture as a profession is a key purpose of the IASA group.
• American Institute of Architects formed in 1857 because (sound familiar?):
1. It is difficult to solve real architecture problems due to lack of targeted resources
2. It is difficult to find experienced architects
3. It is difficult to be of maximum value to our employers due to lack of stable job definition
4. It is difficult to tell a qualified architect from an unqualified one
• Architects tend to be in one of 4 major groups:
1. Corporate
2. Vendors
3. Thought leaders
4. Service integrators
• Professional perspectives and skills taxonomy are resources we’re trying to build up – providing that ‘identifiable body of knowledge’
• The skills one must master to truly deserve the title of architect are:
1. The IT environment
2. Business technology strategy
3. Design
4. Human dynamics
5. Quality attributes
6. Software architecture
7. Infrastructure architecture

What I think
• Working towards formalizing architecture as a practice, a role, and a valuable and necessary part of a team is a massive undertaking – we have work to do here as an industry.
• Perhaps the key to this is working backwards from result? Maybe we need to ask ourselves “how do we test for architecture?” If we can work out what we’d be able to observe in an organization with this nebulous architecture thing in it vs. one that does not, then we’ll have an answer and some benefits to point at!
• In the short term we can’t test for architecture, so how do we find architects? Well, you can test for thinking patterns, values, and philosophies, so look for the ones that contribute the most to your organization’s definition of success.
• Perhaps we need a charter? Some publicly visible set of principles that anyone engaging a software architect should expect to be met?
• Good to see the ‘ities’ discussed as central to what we do.
• I still say it’s science Mr. Preiss – but pragmatic science! We solve real-world business problems by using our skill and experience to apply the right technology.

And with that I must once again apply my problem solving skills to the problem of unconsumed beer. See you tomorrow, same bat-time, same bat-channel...

Tuesday 7 October 2008

IASA Connections - Day 1

This week I skipped across the Atlantic to attend the International Association of Software Architects & Architect Connections conference in San Francisco. Between my failed attempts at getting a chapter off the ground in Romania and being a long-standing member of the London chapter, IASA has been a hugely valuable organization to be associated with. There are regular activities at most major IT hubs in the world, with a strong focus on sharing ideas and meeting other people with similar technology challenges.

For the next 3 days I have a fairly hefty schedule of people to meet and sessions to attend - so I'll try and throw up a brief summary post at the end of each day covering what's discussed. So let's dive straight into day 1:

David Chappell - serial technology writer and speaker, Chappell & Associates
• Gave the first day’s keynote on cloud computing; covered the basics (this is still a very misunderstood topic) and the direction over the next few years.
• Cloud computing falls into 3 main categories; cloud delivered products like Windows Live, attached services such as spam filtering and email antivirus, and cloud platforms like EC2 and S3.
• There will always be a mix of 'onsite' software and cloud based services in every organization in the future - so clouds will not rule the world!
• Advantages of cloud computing include lower costs and less financial risk (as capacity grows with demand, rather than having to build for peak from day 1).
• Disadvantages of cloud computing include trust issues, data protection, and regulatory requirements.
• Important to distinguish between consumer (free and ad supported and usually less reliable) and business (usually paid directly and covered by an SLA) cloud solutions.
• Underlying message is really about change and how hard it is - and how this relates to making the paradigm shift we need to get the enterprise to accept cloud as a valid technology.

What I think
• The cynical side of me can’t help but think we’re throwing things we've been doing for ages (like hosted email virus and spam checking) into the cloud bucket now that we came up with a new word for it.
• I consider some of the constraints of building on a cloud platform to actually be benefits – forcing developers to abstract away from hardware means you’re building systems that are inherently more scalable, available, and hot maintainable (whether deployed in a cloud or onsite).
• Lack of enterprise strength SLAs is keeping a lot of organizations away from cloud computing. Perhaps we as engineers have an obligation to help the business start to look less at the SLA and more at actual performance? Most of the time a cloud infrastructure will trash a fixed internal hardware deployment in overall availability, yet the SLAs don’t yet demonstrate this.
• Cost is very much still a factor when moving to a cloud deployment, but it becomes a marginal cost consideration – you don’t want expensive code running on every hit, since that unnecessarily increases the cost of hosting. In a traditional model it doesn’t matter so much if expensive code runs most of the time, provided it uses capacity that’s normally idle outside of peak – in the cloud it costs you every time you execute it.

Dr Neil Roodyn - independent consultant, trainer and author (and big fan of XP)
• The “capital A” Architect vs. the idea of “little a” architecture, talking about what the role really should be and how it works in agile.
• In software we sometimes borrow a little too heavily from industrial engineering.
• Architecture is a practice created to try and put predictability back into a process that is inherently unpredictable (software delivery).
• Traditionally we try to separate design from construction – i.e. have really smart architects design the system and then any idiots can code it – and how this is flawed thinking.
• Change is inevitable in software projects, most of our traditional processes require rigid and stable requirements in order to be successful but this is not real life.
• Do the best developers only end up as architects because we romanticize the role too much? Or is it the only way they see they can get career progression? Wouldn’t we rather have the best coders coding?
• Maybe we need to start looking at developers more as craftsmen and less like engineers?
• Be careful how you incentivize people, because that’s what you’ll get!

What I think
• Something omitted here is that the quality of an engineering team’s output can never exceed the quality of its input, so there is a strong dependency on working well together with customers to truly understand requirements and make the right tradeoffs.
• Incentive is a very valid point, if you measure architects on creating verbose design documents or UML diagrams then that’s what you’ll get – try measuring them on quality, working software.
• The people that know how best to solve problems are the people closest to their effects, so architecture has to be about harnessing the individual developer’s skill, experience, and exposure to the domain – not trying to tell them what to do.
• I quite like the label “engineer” because it stands for solving real world problems with technology – but I do agree the process does need to be considered to be more creative than it currently is. I recently read something that likened software development to film making, which I think is a powerful analogy.
• Why does IT have such a bad reputation? I don’t think it is as simple as failed projects, I think it is how they’re ‘washed up’ afterwards too. Working more closely with customers along the way makes sure they’re aware of the impact of their decisions (changing/adding requirements), so you can end up with projects that run over time/budget but are not considered failures by the business – because they expected it and were involved along the way.

Randy Shoup - Distinguished Architect, eBay
• The architectural forces governing eBay are scalability, availability, latency, manageability, and cost.
• Their architectural strategies are partition everything, asynchronous everywhere, automate everything, and remember that everything fails!
• Data partitioning is dividing DBs by functional areas (user, items, feedback), then sharding so that each logical DB is horizontally partitioned (by a key such as itemID), which provides scalability for write heavy data sets (a toy version of this routing is sketched just after these notes).
• CAP - their system prefers availability and network partition tolerance and thus trades away consistency. Variable window for different features.
• Functional segmentation at application server layer follows the same pattern as the databases (selling, search, checkout).
• Most operations are stateless which means the most basic, off the shelf load balancers can be used to horizontally scale application server tier.
• When travelling through the system, what state is generated is accumulated in cookies, URLs, and a scratch (periodically flushed) DB.
• Patterns used:
1. Event dispatch (new item listed, item sold)
2. Periodic batch (auction close process, import 3rd party data)
3. Adaptive configuration (dynamically adjust consumer thread pools and polling frequency based on SLAs)
4. Machine learning (dynamically adapt experience by collecting data for the day on for example recommendations and then try different ones next day)
5. Failure detection (collect data on activity send on multicast message bus and listeners watch for certain error messages)
6. Rollback (entire site redeployed every 2 weeks and no change can be made that cannot be undone; features rolled out in dependency tree order)
7. Graceful degradation (switch off non essentials, defer processing to more guaranteed async messages)
• Message reliability dealt with via framework.
• Code deployment decoupled from feature deployment, so new versions can be rolled out with dormant functionality (they call this wired off) that can then be turned on gradually.
• NOT agile and don’t claim to be!
• Completely redeploy entire site every 2 weeks – although this is more of a constant, rolling upgrade than the big bang it sounds like.
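As flagged in the data partitioning note above, here's a toy Python version of that two-level routing - first by functional area, then by key within the area. The pool names, shard counts, and naming scheme are all made up; eBay's real topology is obviously far richer, but the two-level decision is the idea worth stealing.

```python
# Shards per functional area - the areas echo the talk, the counts are invented.
FUNCTIONAL_POOLS = {"user": 8, "item": 16, "feedback": 4}

def shard_for(area, key):
    """Route first by functional area, then by key (e.g. itemID) within that area."""
    shards = FUNCTIONAL_POOLS[area]
    return f"{area}-db-{key % shards:02d}"

print(shard_for("item", 123456789))   # item-db-05
print(shard_for("user", 42))          # user-db-02
```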

What I think
• A very agreeable talk, particularly around deployment, versioning, and A/B testing.
• The architectural forces are very similar to our 6 principles (availability, scalability, maintainability, quality, security, and user experience) and really are just different words to describe the same good practices needed to meet the challenges we have in common.
• State is the enemy of scalability, and it is refreshing to see someone tackling this well - instead of saying "how can we build a distributed state tracking system this big?" they're saying "how can we build our system so that the functionality does not depend upon a central state system?"

Rob Daigneau - Chief Architect, SynXis
• Anti-patterns - human behavior in software projects and how to deal with it.
• Overdependence on methodology as solution to all problems and smooth project execution – not enough focus on personalities and simple human nature.
• Individual anti-patterns:
1. Cyber addict – using only IM or email etc for communication is very efficient, but not necessarily effective as you can lose the meaning of your message.
2. Fortress – not sharing knowledge and being secretive leads to good old fashioned unsustainable codebases and key man dependencies.
3. Ugly baby – too little care taken with how and when criticism is issued can lead to valid advice and points being ignored to the detriment of the project.
4. Perfectionist – believing that masterpieces are the only way every time can lead to unnecessary gold plating, artificially inflating the cost and time to market beyond what’s beneficial to the organization.
5. Conquistador – being overzealous about "the one way to do it" can mean failure to make good tradeoffs.
6. Workaholic – not having the right work/life balance will only lead to burn out and mistakes made.
• Team anti-patterns:
1. Hazing – throwing people into the deep end with very little coaching or attention to integrating with the existing team often means they take a long time to become productive.
2. Fight club – the organization can become a casualty in pointless battles between architects and developers, developers and testers etc unless teams learn to work together.
3. Wishful thinking – working out some estimates that show a certain milestone can’t be reached in a certain timeline means just that, no matter how much you’d like it not to.
4. Firing squad – if people are scared to disagree or the culture doesn’t encourage the introduction of new ideas; the organization can be robbed of great value that stays locked away in engineers’ heads.
5. Too many cooks – if there are no clear boundaries then consensus becomes a huge issue and teams become paralyzed, unable to make progress because of endless circular discussions.
• Leadership anti-patterns:
1. Deer in headlights – if a leader won’t (can’t?) make a decision then the company stands still and competitors gain ground.
2. Cliff jumpers – taking risks for the sake of risks or making decisions on dangerously incomplete data can waste time, money and patience.
3. Spin zone - talking talk not walking walk gets transparent pretty quick.
4. Drill sergeant – exercising authority in an aggressive way makes very short-term gains and never gets a team behind a goal.
5. False summits - artificial deadlines or objectives can be dangerous, if the activity doesn’t turn out to be committed to, a loss of morale and trust in management can result.

What I think
• Reasonably fundamental stuff for experienced management but good to see it getting airtime in this crowd; architects typically come up via technical tracks and probably don’t spend enough time on their influencing and consensus building skills given how important a part of the role this is.
• This would have been a good opportunity to talk more about incentive and how it drives behavior. A lot of anti-patterns can be traced back to KPIs that encourage/reward the wrong things.
• Preventing burnout and keeping sustainable development going are some key reasons I like SCRUM – working in sprints gives you a cyclic mix of high intensity work interspersed by more passive, thoughtful phases.
• I liked the Gerald Weinberg quote: "no matter how it looks at first it is always a people problem".

Ken Spencer – Practice Manager, Solid Quality Mentors
• How agile applies to traditional, formal technical projects, using a current project for the US government as a case study.
• Manufacturing background, so most examples look back to this experience.
• Looking at why most projects fail – this is because people fundamentally keep doing what they do.
• Given that we almost automatically repeat what we know, the good news is that being successful once significantly increases the likelihood of being successful again.
• Spend a commensurate amount of time planning - you need plans but you need to know when to just start work and learn from doing.
• Good enough – what are the testing criteria and when are you into diminishing returns?
• The ‘love curve’ is the cycle you go through when taking over a failed project: initially distrusted through bad experience, then looked upon as savior, then reality sets in, then they see sustainable success.
• Getting back to basics – when things go off the rails, look first at the simplest units of work the team ought to be doing; this is usually where you will find your answers.

What I think
• Largely based on the agile manifesto, so we have a pretty strong philosophical connection here.
• Slightly damning of waterfall; at the end of the day it isn’t all bad and isn’t evil. I think it’s quite suitable for very repeatable, commodity projects where long-horizon predictability is possible and the phase space of all possible features is finite.
• Not sure I like the label ‘formalized’ when used to describe waterfall vs. agile, as this kind of indicates less control, visibility, and organization around agile when in fact I think these are some of its strengths.

Whew! Time for beer now, see you tomorrow…

Thursday 2 October 2008

Scheduled Reboots and Nature's Way

One of the basic aspects of a biological computing mindset is the appreciation that nothing lasts forever. Everything degrades, corrupts, and dies over time - and that is perfectly normal, because it's duly replaced by a fresh-faced youngster, eager to service the rest of the organism [system] from a nice, fresh cellular structure [empty memory space].

This applies to systems in exactly the same way as it does to organisms. How many issues can you recall where memory leaks, counter errors, and freaky edge conditions all occurred after a server had been running for exactly X days, or once a service had processed more than Y connections? I'm sure we could swap tales of woe late into the evening.

This being the case, why do we feel this rottweiler-like dedication to keeping individual devices going for the longest possible duration? I think there are two sources: a kind of point-scoring pride effect engendered by the output of the "uptime" command, and good old fashioned poor system design. Perhaps one even leads to the other...

So - we design systems poorly. If we want a product to be available, why do we build it in a way that its availability depends upon a piece of tin that we accept is inherently unreliable? So now the application is the server. This means the only way we can increase its availability is by increasing the availability of the underlying hardware. Not only is this expensive, it's doomed to failure because, as we accepted, servers grow old. So we spend a lot of time and money trying to achieve something we already decided we cannot. No wonder we're so excited when that uptime counter rolls over to a nice big number!

Do you know what would be better? Accepting that product availability - the uptime of the whole system overall - is what we're really reaching for, and besides, it's how our customers will measure us. Next we need to apply this philosophy to how we design systems, let go of our attachment to keeping individual servers on life support, and put together services that don't rely on any one node, network, or storage device in order to serve our customers.

If you can master that arcane art, then you'll be able to arbitrarily recycle resources, anytime, when there is absolutely nothing whatsoever wrong at all - because this helps keep it that way.
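For the sake of argument, here's what deliberately recycling resources might look like, sketched in Python. The orchestration hooks are left as parameters because they'll be whatever your tooling already provides; the ages, jitter, and polling interval are numbers I made up.

```python
import random
import time

def recycle_forever(list_nodes, drain_and_restart, pool_is_healthy,
                    min_age_hours=72, jitter_hours=24, check_every=600):
    """Retire the oldest node on purpose, while nothing is wrong.

    list_nodes() -> [(name, started_at_epoch), ...], drain_and_restart(name)
    and pool_is_healthy() are whatever your orchestration tooling provides -
    all hypothetical here.
    """
    while True:
        nodes = list_nodes()
        if nodes and pool_is_healthy():            # never recycle into a live problem
            name, started_at = min(nodes, key=lambda n: n[1])
            age_hours = (time.time() - started_at) / 3600
            # Jitter stops the whole estate lining up to restart at the same moment.
            if age_hours > min_age_hours + random.uniform(0, jitter_hours):
                drain_and_restart(name)
        time.sleep(check_every)
```

If pointing something like this at your estate sounds terrifying, that's a pretty good measure of how far the design still is from the goal above.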

Oh, and you'll never be that guy with the box that's been up so long he's scared to reboot it just in case it doesn't come back!