blog
HOME · CREATIVE · WEB · TECH · BLOG
May 15th, 2008

Ways To Identify Bad Bots That Execute Analytics Tags

Since I wrote the post on how bad bots can wreak havoc on web analytics by executing the Javascript-based tags that many analytics packages use to track users, I’ve been giving some thought to how one might identify analytics-executing spiders to lessen their impact…

Segmentation

One possible way is by using segmentation. Basically segmentation lets you set a variable for each user that you can use to group different types of users. You might have a segmentation variable with values like “employee”, “customer”, and “general public”.

You set the segmentation variable for Google Analytics by adding a line to your tracking code that looks something like:

pageTracker._setVar("segmentation value");  [using new GA tracking code]
__utmSetVar('segmentation value');          [using old GA tracking code]

The other tag-based analytics programs have a similar feature (often more advanced with the ability to set multiple variable values).

The theory behind this approach is that the bots may not actually execute the Javascript. Instead, they may see the tracking code on your page, recognize it as a tracking code, grab your Google Analytics ID and make a programmatic call to Google Analytics (as described by Peter van der Graff). In this case, if you’ve set a segmentation value, they most likely won’t be programmed to detect it, and as a result they’ll show up in your reports as “not set”.

That said, at least in some of the cases I’ve seen, the bots are able to set segmentation variables, which makes me think the bots use embedded browsers in the same manner as the screen capture program WebShots (which is controllable from the command line).

Segmentation + A Honey Pot

A better option is to use segmentation in addition to a honey pot. Traditionally, a honey pot is a link a real user would never click on and which is excluded by robots.txt, so no well-behaved spider would follow it either. Only bad bots load honey pot pages. In this case robots.txt exclusion is not necessary, since no “good bot” currently executes Javascript-based analytics tags, and if a good bot ever did, you could tell it was a good bot by how it identifies itself in its ‘user-agent’ string.

Essentially, you’re setting up a page and hoping the bad bot goes to the page. When the bad bot goes to the page you set the segmentation variable to something like “bad bot” and you can then see the bad bot in your analytics reports. Because you’re using a segmentation variable, just about any report can be broken down by the presence of this value.
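Here is a minimal sketch of that honey pot tagging, assuming the new GA tracking code's pageTracker object is already loaded on the page. The path /honey-pot.html and the function name tagHoneyPotVisit are hypothetical names I made up for illustration; only _setVar comes from the tracking code shown earlier.

```javascript
// Hypothetical honey pot URL: a page no real visitor is expected to reach.
var HONEY_POT_PATH = '/honey-pot.html';

// Sets the "bad bot" segment only when the current page is the honey pot.
// `tracker` is any object exposing _setVar (for example GA's pageTracker).
// Returns true when the segment was set, false otherwise.
function tagHoneyPotVisit(tracker, path) {
  if (path !== HONEY_POT_PATH) {
    return false;
  }
  tracker._setVar('bad bot');
  return true;
}

// In the browser this would be called as:
//   tagHoneyPotVisit(pageTracker, window.location.pathname);
```

Passing the tracker and path in as arguments keeps the check easy to try outside a browser.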

You’ll want to make sure you don’t then overwrite the segmentation variable. If you use a honey pot, you may want to set the segmentation variable only on the honey pot page so that you won’t wipe out the value on a subsequent page. Or at least test the cookie that holds the segmentation value before setting other segmentation values.
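One way to test the cookie first might look like the sketch below. It assumes the segmentation value lives in a cookie named __utmv in the form "domainhash.value", which is my understanding of how the GA tracking code stores it; check your own cookies before relying on that. The function names readSegment and chooseSegment are hypothetical.

```javascript
// Pulls the segmentation value out of a cookie string such as document.cookie.
// ASSUMPTION: the value is stored as "__utmv=<domainHash>.<value>".
// Returns null when no such cookie is present.
function readSegment(cookieString) {
  var parts = cookieString.split(';');
  for (var i = 0; i < parts.length; i++) {
    var pair = parts[i].replace(/^\s+/, '');
    if (pair.indexOf('__utmv=') === 0) {
      var raw = pair.substring('__utmv='.length);
      var dot = raw.indexOf('.');
      return dot === -1 ? null : decodeURIComponent(raw.substring(dot + 1));
    }
  }
  return null;
}

// Returns the segment that is safe to set, or null when the visitor is
// already marked "bad bot" and the value should be left alone.
function chooseSegment(cookieString, proposed) {
  return readSegment(cookieString) === 'bad bot' ? null : proposed;
}

// In the browser this might be used as:
//   var segment = chooseSegment(document.cookie, 'customer');
//   if (segment) { pageTracker._setVar(segment); }
```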

This solution is far from perfect. For starters, the bad bot may not choose to go to your honey pot page. Some may and some may not. These bad bots are not spiders - they’re not trying to find pages on your site. They’re programmed to act like humans, so they’ll just click on a few links and leave. However, if you see “bad bot” in your segmentation values, it tells you that you may have bigger problems.

A Honey Pot Without Segmentation

You can also use the honey pot concept without segmentation. In this case you’d simply look for the honey pot page in your content reports and examine the entrance paths to the page to determine who’s sending you bad traffic.
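If you can export visit data, that entrance-path check could be sketched roughly as follows. The record shape (a referrer string plus a list of pages) and the function name honeyPotEntrances are assumptions for illustration, not something any analytics package exports in exactly this form.

```javascript
// Counts, per referrer, the visits that touched the honey pot page.
// Each visit is a hypothetical record: { referrer: string, pages: [string] }.
function honeyPotEntrances(visits, honeyPotPath) {
  var counts = {};
  for (var i = 0; i < visits.length; i++) {
    var visit = visits[i];
    if (visit.pages.indexOf(honeyPotPath) === -1) {
      continue; // this visit never loaded the honey pot
    }
    var source = visit.referrer || '(direct)';
    if (Object.prototype.hasOwnProperty.call(counts, source)) {
      counts[source] += 1;
    } else {
      counts[source] = 1;
    }
  }
  return counts;
}

// Example with made-up data:
var sampleVisits = [
  { referrer: 'xyz.com', pages: ['/index.html', '/honey-pot.html'] },
  { referrer: '', pages: ['/honey-pot.html'] },
  { referrer: 'xyz.com', pages: ['/index.html'] }
];
var entrances = honeyPotEntrances(sampleVisits, '/honey-pot.html');
```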

This may be sufficient. While segmentation variables can be used on many different types of reports, because the segmentation variable won’t be set every time a bad bot crawls your site, they may not be as helpful as you might think. The general knowledge that there are bad bots crawling your site and their entrance paths may be as good as it gets.

But do realize that just because a bot says it’s coming from xyz.com doesn’t mean it’s actually coming from xyz.com - that can be faked too. The particulars of your situation will determine how best to interpret what you see in the data.

Segmentation + Honey Pot, Take 2 (best option)

Using both of the approaches above separately is probably the best solution.

To find bad bots that are executing the analytics tags programmatically, set a segmentation variable for all users (bad bots or not), and use the lack of a segmentation value to identify programmatic bad bots. Then you use path analysis on a honey pot to find the bad bots that are executing the Javascript with an embedded browser.

That gives you coverage on both types of possible bad bot implementations.
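To make the two-pronged idea concrete, here is a rough sketch that sorts exported visits into the two suspect groups. The record shape, the labels and the function name classifyVisit are all hypothetical; it assumes every legitimate page sets some segmentation value.

```javascript
// Classifies one visit using the two signals described above.
// visit is a hypothetical record: { segment: string or null, pages: [string] }.
function classifyVisit(visit, honeyPotPath) {
  // No segmentation value at all: the tag was probably called
  // programmatically without running the page's Javascript.
  if (!visit.segment) {
    return 'programmatic bot suspect';
  }
  // Segment is set but the honey pot was loaded: probably an embedded
  // browser that executes Javascript and follows links no human would.
  if (visit.pages.indexOf(honeyPotPath) !== -1) {
    return 'embedded-browser bot suspect';
  }
  return 'no bot signal';
}
```

A visit with no bot signal isn’t proven human, it just didn’t trip either wire.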

Wrap Up

Let me reiterate that there’s no way to catch every bad bot - at least I can’t think of a way. But if you use some of the strategies I’ve mentioned here and see that bad bots are crawling your site and affecting your analytics, then you’re ahead of the game. The worst case scenario is that bad bots cause you to make bad decisions by changing your data and leading you to false conclusions. If you know your data have been affected, you’re more likely to think twice and not rely on faulty data to make decisions.

And also remember, as I’ve discussed in a previous post, there are ways to use segmentation to set up bot-free zones in your web analytics. Segmentation on things like “Paying customers” or “registered users” can all be relatively immune to bot traffic.

Good luck!


May 15th, 2008

Proving The Value Of SEO When Botnets Corrupt Your Analytics

If distributed attacks on web analytics become commonplace, those of us doing SEO are in for a world of hurt… After all, how can you prove your worth if your statistics can’t be relied on? But, while painful, botnet analytics hacking really isn’t the end of SEO - let me explain…

Ensuring that pages are built correctly

At its heart, one of the core practices of SEO is to ensure that pages are built properly - that their focus and content can be easily understood by search engine spiders. “Web designers” just want pages to look pretty. “Web programmers” just want pages to be connected to data sources in cool ways or do some other cool trick. Neither of those groups really cares about how the pages perform. SEO will always need to be there to make sure the page can be understood by search engines. With the advent of botnet web analytics corruption, it may be difficult for an SEOer to prove s/he’s done her/his job, but it’s not impossible…

E-commerce

Sites that have e-commerce as their primary goal will be far less affected than “marketing” sites trying to get market exposure because e-commerce transactions will be difficult to fake. Conversion ratios may be corrupted, but the actual sales (in dollars, pounds, euros, etc.) will still be a concrete measure of success. Further, segmentation by things like “paying customer” will become invaluable.

Registered users

Building on segmentation possible with e-commerce, segmentation based on whether the user is registered will be invaluable as well. Using segmentation, you can look just at the activity of registered users which should be bot free provided you use a good captcha, and possibly e-mail confirmation. (At the very least it will take a very targeted attack to mess with your user segmentation data - it won’t be affected by random attacks).

Ranking in SERPs

Your actual ranking in the SERPs is also something that can’t be faked. Programs like WebPosition, while hated by the search engines, will be critical in measuring success and validating organic traffic. [If you're getting traffic off the keyword 'art', but don't rank for 'art', then you know you have a bot problem.]
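As a rough sketch of that cross-check, you could compare the keywords that send you traffic against the keywords a rank checker says you actually rank for. The input shapes and the function name suspiciousKeywords are assumptions for illustration.

```javascript
// Returns keywords that send traffic but for which the site does not rank.
// trafficByKeyword: hypothetical map of keyword -> visits from analytics.
// rankedKeywords: hypothetical list of keywords from a rank-checking tool.
function suspiciousKeywords(trafficByKeyword, rankedKeywords) {
  var ranked = {};
  for (var i = 0; i < rankedKeywords.length; i++) {
    ranked[rankedKeywords[i].toLowerCase()] = true;
  }
  var suspects = [];
  for (var keyword in trafficByKeyword) {
    if (!Object.prototype.hasOwnProperty.call(trafficByKeyword, keyword)) {
      continue;
    }
    var isRanked = ranked[keyword.toLowerCase()] === true;
    if (trafficByKeyword[keyword] > 0 && !isRanked) {
      suspects.push(keyword);
    }
  }
  return suspects.sort();
}

// Example with made-up numbers:
var suspects = suspiciousKeywords(
  { 'art': 500, 'web analytics bots': 40 },
  ['Web Analytics Bots']
);
```

A keyword on this list isn’t proof of a bot, but it’s a good place to start digging.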

The cost of good analytics

The analytics company that works the hardest to combat botnets will be the big winner here. Since Google Analytics, Microsoft Gatineau, and now IndexTools (Yahoo!) are free packages, it’s unclear whether Google, Microsoft or Yahoo! really have the incentive to combat the effects of botnets. I mean, look at Google’s track record with click fraud in AdWords. Even if you accept that they try to stop it, there’s still plenty of it that goes undetected. If the tool that is best able to detect and squash botnets is a paid service, there will be a very real increase in the cost of quality analytics.

It will also be more expensive to interpret analytics since you can’t just take the numbers at face value. You’ll have to jump through more hoops to come to a conclusion, and that work isn’t likely to be done well by people without significant experience. Those experts will come at a price.

I can also see 3rd party verification services cropping up that may not engage in SEO, but validate the results of SEO companies. Again, an added cost, though probably only something that would be done for very high-end projects.

Conclusion

Botnets aren’t going to destroy the SEO industry. If anything they’ll increase the level of professionalism. Word will get out that fraud and deception are possible with SEO and reputation will become absolutely critical. I’m guessing it will knock some smaller players (both customers and providers) out of the game - or at least knock them down to a place where they don’t even try to validate results. But in cases where money talks, money will be able to buy at least “decent” analytics and the people who are best qualified to interpret the data.

Still, analytics hacking by botnets will become a much bigger problem in the future, and it will radically change the SEO industry…


May 12th, 2008

Good spiders that execute Javascript

The general wisdom is that spiders don’t execute Javascript, yet a few are popping up here and there… There was a YOUmoz article that demonstrated that SearchMe’s spider was executing Javascript-based Google Analytics tags.

Another spider that executes Javascript is the spider Alexa uses to create site thumbnails. I believe this is the ia_archiver bot used by archive.org - a sister company to Alexa (both are owned by Amazon). If the spider encounters a Javascript on the page that contains something like

window.location = "http://www.google.com/";

it will take the thumbnail of the other page defined by window.location, not the page it was initially sent to crawl. This is true even if window.location is inside an if statement and the default (“OK”) action is to not go to the other URL.
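To illustrate the kind of if statement I mean, here is a sketch where clicking “OK” keeps the visitor on the page and only “Cancel” redirects. The function name maybeLeave and the dialog text are made up, and the confirm function and location object are passed in so the snippet can be tried outside a browser.

```javascript
// `ask` stands in for window.confirm and `loc` for window.location.
// OK (true) keeps the visitor on the page; Cancel (false) navigates away.
function maybeLeave(ask, loc) {
  if (!ask('Stay on this page?')) {
    loc.href = 'http://www.google.com/';
  }
}

// In the browser this might be wired up as:
//   maybeLeave(function (msg) { return window.confirm(msg); }, window.location);
```

A human who clicks “OK” never leaves the page, yet the thumbnail spider still ends up capturing the redirect target.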

It is a little unclear whether Alexa’s spider is executing Javascript or looking for the presence of window.location, though I’d guess it actually executes the Javascript because otherwise it couldn’t faithfully render certain pages. In other words, Alexa’s thumbnail spider needs to execute Javascript to do its job properly.

Just because SearchMe’s and Alexa’s spiders execute Javascript doesn’t make them “bad bots”. We know SearchMe’s bot declares itself via “user agent”, and while I haven’t dug through my log files to confirm it, I would be shocked if Alexa’s bot didn’t do the same. I’d also bet that both follow robots.txt.

That means these are essentially “good spiders”, not “bad bots” since they’re not trying to cause problems. However, as the YOUmoz article points out, SearchMe’s spider can inadvertently cause problems by executing the Google Analytics tracking code.


May 9th, 2008

Red Text Isn’t Always Bad In Google Analytics

Take a look at the following graphic… What does it tell you about search engine traffic?

[Image: Red text in Google Analytics looks like a problem, but isn't...]

It would appear that your search engine traffic has gone down by 2.76%. But that’s not actually the case at all. If you drill into search engines you see the following…

[Image: Google Analytics showing improved search engine traffic]

Organic traffic has actually gone up 5.7% even though the first graphic said it had gone down 2.76%. The reason is simple - search engine traffic went down relative to the other types of traffic. In other words, direct traffic went up more than 5.7%. The first graphic is the product of a zero-sum game - if one area’s share increases, the others’ shares have to decrease, even though, on their own, they’re increasing.

Let’s put it another way. If direct traffic and referring sites had gone up 5.7% as well, then the percentages in the first graphic would all have been zero even though all of them went up 5.7%.
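The arithmetic can be sketched with made-up numbers. The traffic figures below are hypothetical, and I’m assuming the first graphic reports the relative change in each source’s share of total traffic; the function names shareOf and compareSource are my own.

```javascript
// Share of total traffic that comes from one source.
function shareOf(traffic, source) {
  var total = 0;
  for (var key in traffic) {
    if (Object.prototype.hasOwnProperty.call(traffic, key)) {
      total += traffic[key];
    }
  }
  return traffic[source] / total;
}

// Relative change in raw visits versus relative change in share of total.
function compareSource(before, after, source) {
  return {
    volumeChange: after[source] / before[source] - 1,
    shareChange: shareOf(after, source) / shareOf(before, source) - 1
  };
}

// Hypothetical visits: search grows 5.7%, direct grows 20%, referral is flat.
var before = { search: 1000, direct: 1000, referral: 1000 };
var after = { search: 1057, direct: 1200, referral: 1000 };
var result = compareSource(before, after, 'search');
// result.volumeChange is positive (about 5.7%) while result.shareChange
// is negative, because the total grew faster than search did.
```

If all three sources grow by the same 5.7%, shareChange comes out as zero for each of them.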

The takeaway is to be careful when interpreting web analytics stats. Digging deeper can clear up initial misconceptions.

