Sunday 27 September 2009

Screen Scraping for Dummies

If you run a website with any sort of valuable content, then you are almost guaranteed to run into scraping sooner or later. Screen Scraping is more or less an automated program taking an impression of a web page and then parsing it to pull out some specific bits of data that the scraper is interested in - which theoretically is then stored or used in some other way.

The piece of software that does this scraping is commonly called a robot, or bot, and it is really just an automated web client that accesses and uses sites in the exact same way as it's fleshy counterparts, just with machine precision and repetitiveness. A bot may be a large and complex program running on a server with it’s own database etc, or as simple as a script running in a browser on a desktop.

This is typically regarded as undesirable behavior by many sites because, in most cases, it’s a source of load and associated with unprofitable usage. Whenever we draw a page impression, which we’ll do for every bot hit just as we do for every human visitor, we consume web server time and, worse, back end time. When we structure highly functional pages loaded dynamic content we can create a very engaging user experience, and all that functionality is built on plenty of back end work and data. When used at regular human pace that's usually OK, but under the relentless rate of hammering robots are capable of, it starts to become expensive.

If you've got a case of bots, you have to start by identifying them. Their repetitiveness and flawless precision is, in this case, their downfall and we can usually spot them easily through proper analysis of web logs - no human user is as mechanically regular, millisecond quick, and consistent in journey as your average droid.

Spotting droids isn’t too hard, but then you've got to decide what you're going to do about it. Most scrapers aren't malicious and often don't realise the headache they're giving you. In the first instance it's best to try some detective work, see if you can find out who they are, and get in touch. Domain registrars can be a great resource for this, but don't overlook the obvious - maybe they even have an account with you if data they're using requires signup to view.

Beyond that it gets tricky, and can easily turn into an IP address blocking arms race. With some caching finesse or smart layer 7 rules you can throttle bot activity to more palatable levels, or persist them to their very own node that they can thrash all day without impacting the experience for the rest of your users.

If robots are a very big problem for you, then try and take it as a compliment on the value of your data and perhaps consider publishing it via a productised API or feed - if you make it simple enough to consume, then most scrapers will willingly change over to a more reliable integration mechanism and perhaps even pay you for it.

No comments: