Thursday, May 15th, 2008

Ways To Identify Bad Bots That Execute Analytics Tags

Since I wrote the post on how bad bots can wreak havoc on web analytics by executing the JavaScript-based tags that many analytics packages use to track users, I've been giving some thought to how one might identify analytics-executing spiders to lessen their impact...

Segmentation

One possible way is to use segmentation. Segmentation lets you set a variable for each visitor that you can then use to group different types of visitors. You might have a segmentation variable with values like "employee", "customer", and "general public".

You set the segmentation variable for Google Analytics by adding a line to your tracking code that looks something like:

pageTracker._setVar("segmentation value");   // new GA tracking code
__utmSetVar('segmentation value');           // old GA tracking code
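
For context, here's roughly where that call sits in the standard ga.js page tag of the era (a sketch, not taken from this post; "UA-xxxxxx-x" and the "customer" value are placeholders for your own account ID and segment name):

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-xxxxxx-x");  // your Google Analytics account ID
pageTracker._setVar("customer");                    // segmentation value for this visitor
pageTracker._trackPageview();
</script>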

The other tag-based analytics programs have a similar feature (often more advanced with the ability to set multiple variable values).

The theory behind this approach is that the bots may not actually execute the JavaScript. Instead, they may see the tracking code on your page, recognize it as tracking code, grab your Google Analytics ID, and make a programmatic call to Google Analytics (as described by Peter van der Graff). In that case, if you've set a segmentation value, they most likely won't be programmed to detect it, and they'll show up in your reports as "(not set)".
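
To make that concrete, here's a rough, hypothetical sketch of what such a programmatic hit might look like (the __utm.gif parameters are simplified, and the account ID and hostname are made up). The point is that the bot never runs _setVar, so no segmentation data is sent with the hit:

// Hypothetical bot behavior: fire a GA pageview without running ga.js.
var accountId = "UA-xxxxxx-x";                      // scraped from the target page's tracking code
var hit = "http://www.google-analytics.com/__utm.gif" +
          "?utmwv=1" +                              // tracker version (simplified)
          "&utmn=" + Math.floor(Math.random() * 1000000000) +  // cache-busting number
          "&utmhn=www.example.com" +                // site being "visited"
          "&utmp=%2Fsome-page%2F" +                 // page path
          "&utmac=" + accountId;                    // the scraped account ID
new Image().src = hit;                              // request the tracking pixel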

That said, at least in some of the cases I've seen, the bots are able to set segmentation variables, which makes me think the bots use embedded browsers in the same manner as the screen capture program WebShots (which is controllable from the command line).

Segmentation + A Honey Pot

A better option is to use segmentation in addition to a honey pot. Traditionally, a honey pot is a link a real user would never click on, and which is excluded by robots.txt so no well-behaved spider follows it either. Only bad bots load honey pot pages. In this case the robots.txt exclusion isn't strictly necessary, since no "good bot" currently executes JavaScript-based analytics tags, and if a good bot ever did, you could tell it was a good bot by how it identifies itself in its user-agent string.
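
As a concrete example (a sketch; /bot-trap.html is a made-up URL), the honey pot can be an ordinary tagged page reachable only through a link that a real user would never see or click:

<!-- Honey pot link: invisible to real users, but a bot following links
     in the HTML will happily fetch /bot-trap.html -->
<a href="/bot-trap.html" style="display:none">&nbsp;</a>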

Essentially, you set up a page and hope the bad bot visits it. When it does, you set the segmentation variable to something like "bad bot", and the bot then shows up in your analytics reports. Because you're using a segmentation variable, just about any report can be broken down by the presence of this value.

You'll want to make sure you don't then overwrite the segmentation variable. If you use a honey pot, you may want to set the segmentation variable only on the honey pot page; that way you won't wipe out the value on a subsequent page. Or, at the very least, test for the cookie that holds the segmentation value before setting other segmentation values.
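
Here's a minimal sketch of that guard, assuming ga.js, a pageTracker object created as in the snippet above, and made-up segment names. _setVar stores its value in the __utmv cookie, so checking for that cookie tells you whether a value has already been set:

<script type="text/javascript">
// On the honey pot page only: flag the visit as a bad bot.
pageTracker._setVar("bad bot");
pageTracker._trackPageview();
</script>

<script type="text/javascript">
// On every other page: only set a default segment if no __utmv cookie
// exists yet, so a "bad bot" value isn't overwritten later.
if (document.cookie.indexOf("__utmv=") == -1) {
  pageTracker._setVar("general public");
}
pageTracker._trackPageview();
</script>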

This solution is far from perfect. For starters, the bad bot may never go to your honey pot page - some will and some won't. These bad bots are not spiders; they're not trying to find pages on your site. They're programmed to act like humans, so they'll just click on a few links and leave. However, if you do see "bad bot" among your segmentation values, it tells you that you may have bigger problems.

A Honey Pot Without Segmentation

You can also use the honey pot concept without segmentation. In this case you'd simply look for the honey pot page in your content reports and examine entrance paths to that page to determine who's sending you bad traffic.

This may be sufficient. While segmentation variables can be applied to many different types of reports, the variable won't be set every time a bad bot crawls your site, so it may not be as helpful as you'd think. The general knowledge that bad bots are crawling your site, plus their entrance paths, may be as good as it gets.

But do realize that just because a bot says it's coming from xyz.com doesn't mean it actually came from xyz.com - referrers can be faked too. The particulars of your situation will determine how best to interpret what you see in the data.

Segmentation + Honey Pot, Take 2 (best option)

The best solution is probably to use both of the approaches above, but independently of each other.

To find bad bots that execute the analytics tags programmatically, set a segmentation variable for all users (bad bots or not), and use the lack of a segmentation value to identify them. Then use path analysis on a honey pot page to find the bad bots that execute the JavaScript with an embedded browser.

That gives you coverage on both types of possible bad bot implementations.
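
Pulling it together, here's a minimal sketch of that arrangement (again assuming ga.js, an existing pageTracker object, and a made-up /bot-trap.html honey pot). Note that in this arrangement the honey pot page carries only the normal tag; you find it through content reports and entrance paths rather than through a "bad bot" segment:

<script type="text/javascript">
// Every page, including the honey pot: give all JavaScript-executing
// visitors a segment. Programmatic bots that call Google Analytics
// directly never run this, so they show up in reports as "(not set)".
if (document.cookie.indexOf("__utmv=") == -1) {
  pageTracker._setVar("general public");
}
pageTracker._trackPageview();
</script>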

Wrap Up

Let me reiterate that there's no way to catch every bad bot - at least I can't think of one. But if you use some of the strategies I've mentioned here and see that bad bots are crawling your site and affecting your analytics, you're ahead of the game. The worst-case scenario is that bad bots change your data, lead you to false conclusions, and cause you to make bad decisions. If you know your data have been affected, you're more likely to think twice and not rely on faulty data to make decisions.

And also remember, as I've discussed in a previous post, there are ways to use segmentation to set up bot-free zones in your web analytics. Segments like "paying customers" or "registered users" can be relatively immune to bot traffic.

Good luck!

Categories: Bad Bots, Web Analytics
