Thursday, April 5th, 2007

Not indexed by Google? Check your robots.txt file…

Timeouts on robots.txt can be deadly...

A couple of months ago we took over a consumer health content web site for one of our clients. We had designed the site, but the client is a fairly large company with its own web division, which built and initially hosted the site. Things didn't work out so well: six months after launch the site had been indexed by Yahoo!, but Google still hadn't indexed a single page.

After we took over the site and had a look via Google's Webmaster Tools, we realized what the problem was - the server hosting the site timed out whenever robots.txt was requested. Yahoo! took this to mean there was no robots.txt (which was actually correct) and indexed the site as if none existed. Google, however, simply wouldn't proceed without a definite answer on robots.txt - it has to either receive a robots.txt file or a 404 "Not Found" response. Without one of those, Googlebot stops dead in its tracks.
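The distinction above is easy to check for yourself. Here's a small sketch (the function names are my own, not any official tool) that requests robots.txt and classifies the response the way the two crawlers did: a 200 means the file exists, a 404 is a definitive "no file", and anything else - a timeout, a connection failure, a server error - is the ambiguous state that stalled Googlebot on this site.

```python
import urllib.request
import urllib.error

def classify(status):
    """Map an HTTP status (or None for a timeout/connection failure)
    to what it means for a crawler deciding whether to proceed."""
    if status == 200:
        return "found"    # robots.txt exists; obey it
    if status == 404:
        return "absent"   # definitively no robots.txt; crawl freely
    return "problem"      # timeout or server error: no definite answer

def check_robots(base_url, timeout=10):
    """Fetch robots.txt from base_url and classify the outcome."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        return classify(err.code)
    except (urllib.error.URLError, TimeoutError):
        return classify(None)
```

Run against the site described here, `check_robots` would have returned "problem" - and that's the state you want to catch before a search engine does.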

Of course, Googlebot exhibited the better behavior in this scenario - Yahoo!'s Slurp shouldn't have just assumed there was no robots.txt. I'd also give Google a huge thumbs up for their work on Google Webmaster Tools - while it didn't highlight the problem, it did report it if you looked thoroughly enough.

This is just one of the ways the robots.txt file can trip you up. For one, robots.txt matching is case sensitive; also, some search engines support wildcard characters while others don't (most do, however). The best resource I've found for checking your robots.txt file is Google's Webmaster Central - the use of their robots.txt checker tool is explained in their blog...
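The case-sensitivity pitfall is worth seeing concretely. This sketch uses Python's standard `urllib.robotparser` (which does path-prefix matching, no wildcards) against a hypothetical robots.txt; note that only the exact casing in the rule is actually blocked.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt illustrating the case-sensitivity pitfall:
rules = """\
User-agent: *
Disallow: /Private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Path matching is case sensitive, so only the exact casing is blocked.
print(rp.can_fetch("*", "/Private/report.html"))  # False (blocked)
print(rp.can_fetch("*", "/private/report.html"))  # True  (not blocked!)
```

If your URLs are reachable in more than one casing, a rule written for one casing silently leaves the others crawlable.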

Categories: Spiders/Bots, Web Site Configuration

