Sunday 27 September 2009

Screen Scraping for Dummies

If you run a website with any sort of valuable content, then you are almost guaranteed to run into scraping sooner or later. Screen scraping is, more or less, an automated program taking an impression of a web page and then parsing it to pull out the specific bits of data that the scraper is interested in - which, in theory, are then stored or used in some other way.

The piece of software that does this scraping is commonly called a robot, or bot, and it is really just an automated web client that accesses and uses sites in the exact same way as its fleshy counterparts, just with machine precision and repetitiveness. A bot may be a large and complex program running on a server with its own database etc, or as simple as a script running in a browser on a desktop.
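To make the "parse it and pull out specific bits of data" step concrete, here is a minimal sketch of the parsing half of a bot, using only Python's standard library. The page markup, the `price` class name, and the `scrape_prices` helper are all made up for illustration; a real bot would fetch the page over HTTP rather than read it from a string.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real bot would download this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <div class="listing"><span class="name">Widget</span><span class="price">9.99</span></div>
  <div class="listing"><span class="name">Gadget</span><span class="price">24.50</span></div>
</body></html>
"""


class PriceScraper(HTMLParser):
    """Collects the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        # Only keep text that sits inside a price span.
        if self._in_price:
            self.prices.append(float(data.strip()))


def scrape_prices(html):
    """Return the prices found in the given HTML, in page order."""
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices


print(scrape_prices(SAMPLE_PAGE))
```

Everything else a bot does (scheduling, storage, retrying) is wrapped around a loop like this one, which is why it can repeat the same journey so tirelessly.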


This is typically regarded as undesirable behavior by many sites because, in most cases, it’s a source of load and associated with unprofitable usage. Whenever we draw a page impression, which we’ll do for every bot hit just as we do for every human visitor, we consume web server time and, worse, back end time. When we structure highly functional pages loaded with dynamic content we can create a very engaging user experience, and all that functionality is built on plenty of back end work and data. When used at regular human pace that's usually OK, but under the relentless rate of hammering robots are capable of, it starts to become expensive.

If you've got a case of bots, you have to start by identifying them. Their repetitiveness and flawless precision are, in this case, their downfall and we can usually spot them easily through proper analysis of web logs - no human user is as mechanically regular, millisecond quick, and consistent in journey as your average droid.
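As a rough illustration of that kind of log analysis, the sketch below flags clients whose requests are either very fast or suspiciously evenly spaced. It assumes the web log has already been parsed into (client IP, timestamp) pairs; the function name and all of the thresholds are hypothetical and would need tuning against your own traffic.

```python
from collections import defaultdict
from statistics import mean, pstdev


def find_suspected_bots(hits, min_requests=20, max_mean_gap=1.0, max_jitter=0.05):
    """Return the client IPs that look automated.

    hits: iterable of (client_ip, unix_timestamp) pairs from a parsed web log.
    The thresholds are illustrative assumptions, not recommended values.
    """
    by_client = defaultdict(list)
    for ip, ts in hits:
        by_client[ip].append(ts)

    suspects = []
    for ip, stamps in by_client.items():
        if len(stamps) < min_requests:
            continue  # too little traffic to judge either way
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        too_fast = mean(gaps) <= max_mean_gap       # quicker than a person reads
        too_regular = pstdev(gaps) <= max_jitter    # metronome-like spacing
        if too_fast or too_regular:
            suspects.append(ip)
    return sorted(suspects)
```

Speed and regularity are only two of the signals mentioned above; consistency of journey (the same URLs in the same order) could be checked in a similar way by grouping on the request path as well.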

Spotting droids isn’t too hard, but then you've got to decide what you're going to do about it. Most scrapers aren't malicious and often don't realise the headache they're giving you. In the first instance it's best to try some detective work, see if you can find out who they are, and get in touch. Domain registrars can be a great resource for this, but don't overlook the obvious - maybe they even have an account with you if the data they're using requires signup to view.

Beyond that it gets tricky, and can easily turn into an IP address blocking arms race. With some caching finesse or smart layer 7 rules you can throttle bot activity to more palatable levels, or persist them to their very own node that they can thrash all day without impacting the experience for the rest of your users.
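Throttling is normally done in the load balancer or caching layer rather than in application code, but the idea behind it can be sketched in a few lines. The example below is a per-client token bucket, which is one common way to rate limit and not necessarily what any particular layer 7 device does; the class names, the rate, and the capacity are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allows short bursts up to `capacity`, then `rate` requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to the time elapsed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Throttle:
    """Keeps one token bucket per client IP (hypothetical helper)."""

    def __init__(self, rate=1.0, capacity=5, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_ip):
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_ip] = bucket
        return bucket.allow()
```

A request for which `allow` returns False could be rejected, delayed, or served from a stale cache. Because the limit is per client, a human browsing at normal pace never notices it, while a hammering bot is held to the configured rate.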


If robots are a very big problem for you, then try and take it as a compliment on the value of your data and perhaps consider publishing it via a productised API or feed - if you make it simple enough to consume, then most scrapers will willingly change over to a more reliable integration mechanism and perhaps even pay you for it.

Sunday 20 September 2009

Conferences - jolly good or just a jolly?

It has always interested me that two different people can look at the exact same conference, and one declare it to be an unmissable event relevant to their work and the other consider it a waste of time; a free day out of the office at best.

It always leads to a discussion about which events are worthwhile and which aren't, but here's the secret - whether a conference is worthwhile or not is only 10% the event itself and 90% what you do while there and afterward.

You could carefully plan the sessions you attend, take the time to meet people with similar technical problems or who have worked with relevant technology and follow up with them later and, when you return to the office, share your new knowledge around and adopt some better practices.

Or

You could have a few days out, enjoy some vendor's hospitality (free beer always tastes better), and check out some new bars and, when you return to the office, settle comfortably back into your old habits.

That, ladies and gentlemen, is the difference between a jolly good conference and just a jolly.

Friday 4 September 2009

Delegation from the other side

I really like Manager Tools - it's probably the best single source of personal development material that you're ever going to find bundled into easy-to-consume episodic content. Recently I thought I'd go back and listen to the basics series again, because you can never spend too much time on the fundamentals of good management and, like Mark and Mike say, if you don't do anything else except those basic practices, you'll probably do OK.

It was a really good refresher, and worthwhile doing from time to time. Then I got to the juggling koan cast, which was about how to handle that inevitable situation when your boss gives you another big task to work on when you're already maxed out with too much to do. It's worth listening to - so I won't spoil it entirely - except to say that the answer focuses on developing your delegation muscles, passing on some of your things to make space for the new action and helping your team to grow as a nice side effect.

The majority of the cast focused on how to handle events after you accept this new delegated responsibility, and I liked the point about deliverables gaining size as they flow down the delegation tree (i.e. your small tasks are bigger challenges for more junior members of your team). However, I think there was also a valuable lesson here about delegation from the recipient's perspective.

The Manager Tools perspective on this was to accept the new delegation, and to view it as the expression of confidence and trust and the development opportunity that it really is. Many new managers would worry about whether they have the time to take it on or not (no thanks boss) or attempt to enter into a negotiation about what else to drop.* The reason I like this angle is that I don't think there is enough coverage out there on followership.

Leadership is glamorous and fashionable to write about and speak on, so we do a lot of it, but good followership is important to develop too. There is an inherent responsibility on all the individuals in an organisation to be rational participants, and I think a group who know and practice the right behaviors on both sides of that equation performs better. Anyway, isn't part of what we're doing when we're developing our people just cultivating better followership?

Something I'd suggest you ponder when you listen to the cast is the difference between being handed an action and being given responsibility for an outcome. There was heavy emphasis on not entering into negotiations with your boss about whether or not you are going to take this thing on. However, I think it's both valid and valuable to question the task and look for the underlying business value.

This isn't arguing about whether or not you or your department have the time, or forcing your boss to manage your priorities for you; it's simply clarifying what you're really being expected to achieve rather than what you're being asked to do. In my experience, particularly because I run technical departments, there is often a difference between what people want to achieve and what they initially ask for: they think they have resolved the outcome they have in mind into a few actions, and sometimes those actions don't add up.

If you can have that sort of open discussion, then you might agree to accept delegation of a slightly different task than was first put your way, but one that much more directly leads to the goal your boss really had in mind.

* Key caveat here about dropping stuff. Things will eventually be dropped. When you're handling the most important things you can and have delegated the rest, your directs will do the same, and therefore [due to the hierarchical effect created when everyone delegates their least critical things] what gets dropped should eventually be the least important tasks overall.