blog
HOME · CREATIVE · WEB · TECH · BLOG

Thursday, April 5th, 2007

Don’t baffle spiders with duplicate content!

www.slicksurface.com vs. slicksurface.com

One of the complications search engine spiders (like Googlebot, MSNbot, and Yahoo!'s Slurp) face is distinguishing between content on URLs like www.slicksurface.com and slicksurface.com (without the www). In some cases (don't ask me to cite any) the content on the two is different. Having different sites on www and no-www isn't common these days, but it was more common when web sites were first appearing 10+ years ago.

www.slicksurface.com vs. slicksurface.com vs. www.slicksurface.ca vs. slicksurface.ca

These days it's more common to have multiple domains pointing to the same site. For example, we have slicksurface.com and slicksurface.ca; other companies might have the .net or .org versions of their domains pointing to their sites. When I was looking at someone's resume the other day I went to their site and looked at the source code of their home page. They had 5 domains listed in a "see-also" meta tag, and for each one of those domains both the www and no-www forms of the URL worked - so that was 10 URLs for every page! I'm not trying to rag on him - I made similar mistakes when I was first trying my hand at SEO. I used to think having the extra domains would give me more inbound links and make my site rank better, but the search engines are a lot smarter than that...

Trying to figure out the mess...

The search engines have to make sense of the mess they're presented with. Some people prefer their URLs with a www, other people prefer them without. The search engines try to pick one based on which form the preponderance of the links coming into a site or a page use, but they can get confused pretty easily if some of the links are www and some are no-www. The end result can be that some of the pages on your site are indexed with www and others without. In the case of the guy I mentioned above - the pages on his site could theoretically be indexed with 10 different hosts!

What won't happen is that the pages will get indexed both ways. Search engines want to know the one authoritative source for every piece of information out on the web. If they see the same information at multiple locations they call that "duplicate content" - they'll do their best, but they'll only keep the one they think is the best or most authoritative.

It gets worse with dynamic & database-driven sites

The worst case scenario can happen with large dynamic sites where there are multiple URLs for the same page or the content on the page is slightly different every time the page is loaded. If some of the content changes between the time the spider loads the www version and the no-www version, but the bulk of the content is still the same, then the spider might think the sites are separate sites and handle each site separately and give one (or both) duplicate content penalties.

In database-driven sites there are often multiple URLs for the same exact page. On our IMG2D-driven sites (like netterimages.com), image detail page URLs can look like http://www.netterimages.com/image/5355.htm (the SEO'd version) or http://www.netterimages.com/image/detail.htm?variantID=5355 (the version pointing to the page template). It can be an even bigger problem for search results pages where, for example, http://www.netterimages.com/image/didymus.htm and http://www.netterimages.com/image/didymus.htm?page=1 might actually be the same page. If there are multiple search parameters, just putting the parameters in a different order gives the spider yet another URL for the same page.
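To make the parameter problem concrete, here is a minimal Python sketch of how a site might collapse those variations into one URL before writing a link or deciding whether to redirect. It is only an illustration, not how IMG2D does it: the function name canonical_url and the DEFAULT_PARAMS table are made up for this example, and it assumes that page=1 really is the same as leaving the parameter off.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption for this sketch: parameters listed here add nothing to the URL
# when they carry their default value (page=1 is the first page anyway).
DEFAULT_PARAMS = {"page": "1"}

def canonical_url(url):
    """Return one canonical form for URLs that differ only in query details."""
    parts = urlsplit(url)
    # Parse the query string into (name, value) pairs, keeping blank values.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Drop parameters that are at their default value.
    pairs = [(k, v) for k, v in pairs if DEFAULT_PARAMS.get(k) != v]
    # Sort so that parameter order no longer produces distinct URLs.
    pairs.sort()
    # Rebuild the URL without a fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(pairs), ""))
```

With this, the didymus.htm and didymus.htm?page=1 URLs above come out identical, and two URLs whose search parameters are merely in a different order come out identical too.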

The biggest problem, by far, on dynamic sites is session IDs in URLs. Never, ever give a search spider a session ID... Your site will soon be overwhelmed by spiders requesting multiple versions of the same page, the search engine will eventually figure out it can't make sense of your site, and your site won't get far in organic search.
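One way to keep session IDs out of the links a spider sees is to strip them from every URL before it's written into a page. The Python sketch below shows the idea under some loud assumptions: the function name strip_session_id is invented for this example, the parameter names in SESSION_PARAMS are just common guesses (use whatever your platform really calls them), and it only handles session IDs carried in the query string.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed session parameter names (lowercase); adjust for your own platform.
SESSION_PARAMS = {"phpsessid", "sid", "sessionid"}

def strip_session_id(url):
    """Return the URL with any query-string session ID parameters removed."""
    parts = urlsplit(url)
    # Keep every parameter except the ones that look like session IDs.
    pairs = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )
```

For example, detail.htm?variantID=5355&PHPSESSID=abc123 would come back as detail.htm?variantID=5355, so every visitor (and every spider) gets the same URL for the page.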

Syndicated content

Another duplicate content scenario is syndicated content. One of our clients distributes press releases via PR Newswire. We were helping them monitor the progress of one of their projects and one week there were 10,000 versions of the press release available on the web (according to Google); a couple of weeks later there were a couple hundred (and the number kept dropping). This drop was confusing to the client because they wanted a hard number to report back to their client, and we had to explain that this is pretty much expected with any syndicated content, since the search engines want a single authoritative source for each piece of content.

Penalties if you get it wrong...

If the search engines have a really difficult time figuring out which version is authoritative they'll actually penalize you for it. "Duplicate Content Penalties" are widely discussed in SEO and webmaster forums and blogs. The bottom line is you don't want to serve duplicate content. In my mind the primary rule of SEO is to not confuse the search engine spider, and duplicate content can seriously confuse a search engine.

So what can you do?

The general idea is that you pick one host name and run with it, and every time one of the others is requested you redirect the request to the one you've picked. Never actually serve a page with a URL that's not from your preferred host/subdomain.
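Independent of any particular web server, the rule boils down to a very small decision. Here's a rough Python sketch of it. The names PREFERRED_HOST and redirect_target are made up for illustration, and it assumes plain http:// URLs.

```python
# The one host name every request should end up on (swap in your own).
PREFERRED_HOST = "www.slicksurface.com"

def redirect_target(host, path_and_query):
    """Return the URL to redirect to, or None if the host is already preferred."""
    # Host names are case-insensitive; also ignore an explicit port like ":80".
    hostname = host.split(":")[0].lower()
    if hostname == PREFERRED_HOST:
        # Already on the preferred host, so serve the page normally.
        return None
    # Any other host: send the visitor to the same path on the preferred host.
    return "http://" + PREFERRED_HOST + path_and_query
```

Whatever comes back other than None should be sent as a permanent (301) redirect rather than served as a page.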

A fix for Apache

Assuming you're using Apache (as most of you are), if you are able to edit the config files for your server (or get someone to do it for you) you can insert the following code in your virtual host config file...

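# Turn on the rewrite engine for this virtual host
RewriteEngine On
# If the requested host is anything other than the preferred host...
RewriteCond %{HTTP_HOST} !^www\.slicksurface\.com$
# ...permanently redirect to the same path on the preferred host
RewriteRule ^(.*)$ http://www.slicksurface.com$1 [R=301,L]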

Of course you'll want to replace www.slicksurface.com with your preferred domain. With this code you can have as many different "aliases" set up on the server as you want - from as many different domains (like the guy with 5 domains and 10 different hosts/subdomains). The code does pass search parameters onto the preferred URL (the parts of the URL that come after a question mark).

Here's a quick rundown of some of what it's doing (if it looks like Greek to you)...

On the line with HTTP_HOST - the ! means "not", so it's saying "if the HTTP_HOST is not the preferred host". The ^ marks the beginning of the host name, the $ marks the end. Without these it would be saying "if the HTTP_HOST does not contain the preferred host", and something like foo.www.slicksurface.com wouldn't get redirected. Also, the \ before each period is there because an unescaped period has special meaning in a regular expression (it matches any single character), so \. says to match a real period.
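If you'd like to see the anchors and the escaped periods at work, the same pattern can be tried out in Python. This is just a sketch: Python's re module isn't the regex engine Apache uses, but ^, $, and \. behave the same way for this pattern. The helper name is_preferred and the test host wwwXslicksurface.com are made up for the example.

```python
import re

# The pattern from the RewriteCond line, with both anchors and escaped periods.
anchored = re.compile(r"^www\.slicksurface\.com$")
# The same pattern without ^ and $: it matches anywhere inside the host name.
unanchored = re.compile(r"www\.slicksurface\.com")
# The same pattern without the backslashes: each . matches any one character.
unescaped = re.compile(r"^www.slicksurface.com$")

def is_preferred(pattern, host):
    """True if the pattern matches the host, so no redirect would be issued."""
    # search() looks for a match anywhere, so only ^ and $ pin it to the ends.
    return pattern.search(host) is not None
```

With the anchored pattern, foo.www.slicksurface.com is not treated as the preferred host (so it gets redirected). With the unanchored one it is treated as preferred and slips through. The unescaped pattern would wrongly accept a host like wwwXslicksurface.com.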

On the last line - the "R=301" part tells it that the redirect is permanent. This is by far the best for SEO provided you're not going to change your mind. If the redirect should be temporary, use R=302.

Setting preferred domain with Google Webmaster Tools

If you don't have access to your Apache config files and your problem is just www vs. no-www then you can fix the problem with Google by signing up for Google Webmaster Central, and using their "Preferred Domain" tool (shown below).

Google Webmaster Tools - Preferred Domain

Of course, using Google Webmaster Tools won't help you with Yahoo!, MSN/Live, or Ask - it's just a Google thing.

The Bottom Line

The bottom line is to be kind to the search engines and not confuse them. The nicer you are to them, the nicer they'll be to you...

Categories: Duplicate Content, Spiders/Bots
