Media/Image Crawlers Need to See HTML Pages
A month or so ago I was refining the robots.txt file for netterimages.com, an e-commerce site that sells rights-managed medical illustrations. At the time I (mistakenly) thought that googlebot crawled the HTML pages and that googlebot-image then came along and fetched the image files googlebot had seen. So I used robots.txt to block googlebot-image from the HTML pages (there were performance issues, which have since been fixed).
But that’s not how it works. I haven’t actually looked at the log files to see whether googlebot-image crawls HTML pages itself or whether googlebot just notes whether googlebot-image would be allowed to crawl them. Either way, the end result was that images started dropping from Google’s index, and since 80% of our referral traffic comes from Google Images, our traffic started going down.
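A quick way to catch this kind of mistake before it costs traffic is to test the rules locally. The sketch below uses Python's standard `urllib.robotparser` to check what each crawler may fetch. The rules and the paths (`/illustrations/`, `/images/`) are made up for illustration and are not the actual netterimages.com file. Python's parser also doesn't match rules exactly the way Google does, so treat it as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules resembling the mistake: the image crawler is kept
# away from the HTML pages but is still allowed to fetch the image files.
BROKEN_ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /illustrations/
Allow: /images/

User-agent: *
Allow: /
"""

# Hypothetical fix: no special restriction for the image crawler.
FIXED_ROBOTS_TXT = """\
User-agent: *
Allow: /
"""


def can_crawl(robots_txt, user_agent, path):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    html_page = "/illustrations/heart.html"  # assumed page URL
    image_file = "/images/heart.jpg"         # assumed image URL
    for name, rules in (("broken", BROKEN_ROBOTS_TXT), ("fixed", FIXED_ROBOTS_TXT)):
        for agent in ("Googlebot", "Googlebot-Image"):
            for path in (html_page, image_file):
                print(name, agent, path, can_crawl(rules, agent, path))
```

The case to watch for is the image crawler being blocked from an HTML page that embeds images you want indexed. Based on what happened here, that seems to be enough to get the images dropped.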
Luckily, I keep on top of the analytics for the site and noticed the drop in traffic pretty quickly. I fixed it when a little over 1,000 of the 15,000 images we have in Google Images had been dropped. Even after the fix, the number of images has continued to drop, and we’re down to 10,000 images in Google Images. This is probably just the normal delay between the crawl and the public index, and hopefully it will recover quickly…
The good news is that we’ve never actually seen a sale come from Google Images (people who use Google Images want free images; they’re not looking to buy). So it’s only affecting brand awareness (slightly), not revenue, and it shouldn’t be long before the dropped images reappear in the index.
Tags: Image Search, spiders
Categories: Image Search, Spiders/Bots