I feel sorry for Google…
Well, the setup of the 166,000 page test is complete, and 166,242 pages have been added to slicksurface.com in the form of a medical thesaurus.
In the process (which took a week since I was doing it in my spare time) I made a few stumbles and have to say I really feel sorry for the folks at Google (and Yahoo!) who have to make sense of my mistakes...
First, I put everything into a directory called /medical-dictionary/, then I realized that MeSH was a thesaurus, not a dictionary, so after submitting sitemaps with over 100,000 URLs to Google, I changed the directory name. I tried to be nice and submit a URL removal request, but it was denied because I had a 301 redirect from /medical-dictionary/ to /medical-thesaurus/ in place (URL removal requests are only honored if a 404 or 403 is returned, or the URL is excluded with robots.txt.)
Speaking of robots.txt, I managed to mess that up too. I was going to exclude the directories that weren't yet complete (there are five main 'tables' in MeSH - concepts, terms, semantic types, qualifiers, and descriptors - each of which I put in a separate subdirectory). I set up the exclusions, but later realized they were for /medical-dictionary/, not /medical-thesaurus/, so they didn't block anything under the new directory name. Oh well... More confusion...
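A quick check would have caught the robots.txt mix-up before the spiders did. Here's a minimal sketch using Python's standard urllib.robotparser. The rules and the file name x.htm are made up for illustration (my actual robots.txt isn't reproduced here); the point is just that rules written for the old directory name say nothing about the new one.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(robots_txt, paths, agent="*"):
    """Return the subset of paths that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]

# Hypothetical rules, written against the OLD directory name by mistake.
ROBOTS_TXT = """\
User-agent: *
Disallow: /medical-dictionary/terms/
Disallow: /medical-dictionary/descriptor/
"""

# Hypothetical page paths under the old and new directory names.
candidates = [
    "/medical-dictionary/terms/x.htm",
    "/medical-thesaurus/terms/x.htm",
]

# Only the old-name path is blocked; the new one is wide open to crawlers.
still_blocked = blocked_paths(ROBOTS_TXT, candidates)
```

Running something like this against the list of unfinished subdirectories after every rename is a lot cheaper than finding out from the crawl logs.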
I mentioned having a 301 from /medical-dictionary/ to /medical-thesaurus/, but that went out the window too. In some cases I was putting so many files in a directory that OS X was slow handling the folder, and Dreamweaver would completely crash if it got anywhere near the directories. So I had to break up three of the subdirectories into smaller subdirectories, which completely changed the URLs the spiders would have crawled on some of the initial pages I put up, and that made the redirect pointless.
One lesson I learned in the process was to split files into directories by the end of a numeric string, not the beginning. Let's say you have a numeric string that starts with 000001 and goes up. At first I took the first two digits and put all the ones that started with '00' in a directory, but that was still too many files, and meanwhile the '99' directory was empty. It didn't take long to figure out I needed to use the last digits in the number ('01' in the example), since they're pretty much evenly distributed. But still, when you're generating 25,000, 40,000 or 95,000 files at a time it takes a while, and every little problem takes a long time to fix.
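To see why the last digits work better, here's a small sketch in Python. It assumes sequential six-digit IDs from 000001 to 025000, which is a simplification (the real MeSH IDs aren't perfectly sequential), and the function names are just mine for illustration.

```python
from collections import Counter

def shard_by_prefix(record_id, width=2):
    """Directory name taken from the FIRST digits (the approach that failed)."""
    return record_id[:width]

def shard_by_suffix(record_id, width=2):
    """Directory name taken from the LAST digits (the approach that worked)."""
    return record_id[-width:]

# Assumed data: 25,000 sequential, zero-padded IDs.
ids = [f"{n:06d}" for n in range(1, 25001)]

prefix_counts = Counter(shard_by_prefix(i) for i in ids)
suffix_counts = Counter(shard_by_suffix(i) for i in ids)

# Prefix sharding piles everything into a handful of directories,
# while suffix sharding spreads the files over all 100 of them.
print(len(prefix_counts), max(prefix_counts.values()))
print(len(suffix_counts), max(suffix_counts.values()))
```

With these assumed IDs the prefix version ends up with only three directories, one of them holding 10,000 files, while the suffix version fills 100 directories with 250 files each.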
If you look at the URLs it looks like there are thousands of directories, but in fact there aren't. I figured, if I was going to put things in a directory, I might as well put the unique ID as the directory name, since elements of the URL do count somewhat towards the keywords for the page. So for example, you see a URL like:
http://www.slicksurface.com/medical-thesaurus/descriptor/D006809/humanities.htm
but in fact the 'real' (hidden) URL is:
http://www.slicksurface.com/medical-thesaurus/descriptor/9/D006809.htm
I use Apache's mod_rewrite to map an SEO'd URL to the file on disk (which is in a more manageable, but less SEO-optimized, directory structure).
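The mapping itself is simple. The sketch below mimics it in Python rather than showing my actual mod_rewrite rule. It assumes the ID is one capital letter followed by digits (like D006809), that the on-disk subdirectory is the last digit of the ID (as in the example above), and that the trailing keyword file name is ignored; the function name and pattern are made up for illustration.

```python
import re

# Public, keyword-rich form: /medical-thesaurus/<table>/<ID>/<keywords>.htm
PUBLIC_URL = re.compile(
    r"^/medical-thesaurus/(?P<table>[a-z-]+)/(?P<id>[A-Z]\d+)/[^/]+\.htm$"
)

def public_to_disk(path, shard_width=1):
    """Map a public URL path to the assumed on-disk path, or None if no match."""
    m = PUBLIC_URL.match(path)
    if m is None:
        return None
    record_id = m.group("id")
    shard = record_id[-shard_width:]  # last digit(s) of the ID pick the directory
    return f"/medical-thesaurus/{m.group('table')}/{shard}/{record_id}.htm"

disk_path = public_to_disk("/medical-thesaurus/descriptor/D006809/humanities.htm")
print(disk_path)
```

Since only the ID is used to find the file, the keyword part of the URL can change without moving anything on disk.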
The other big error I made was that, in the first iteration of the 'terms' directory, I had bad URLs to other documents - some of the links pointed to wwww.slicksurface.com, not www.slicksurface.com. Given that there are 95,000 terms, that took a day or so to fix and upload.
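For what it's worth, a fix like this can be scripted. Here's a rough sketch of the kind of find-and-replace involved, not the exact process I used; the function name and the sample HTML are invented for illustration.

```python
import re

# Match the mistyped host only when it isn't part of a longer hostname.
BAD_HOST = re.compile(r"(?<![\w.-])wwww\.slicksurface\.com")

def fix_hostname(html):
    """Return (fixed_html, number_of_replacements)."""
    return BAD_HOST.subn("www.slicksurface.com", html)

# Invented sample page with one bad link and one good link.
sample = (
    '<a href="http://wwww.slicksurface.com/medical-thesaurus/">bad</a> '
    '<a href="http://www.slicksurface.com/medical-thesaurus/">good</a>'
)

fixed, count = fix_hostname(sample)
print(count)
```

Checking the replacement count per file also makes it easy to see which pages were affected before re-uploading anything.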
The bottom line was that I should have waited for everything to be complete before submitting sitemaps - that would have stopped me from confusing Google and Yahoo! with bad URLs in sitemaps and 'missing' pages on their crawl. But part of me thought it would be smarter to put up some of the smaller subdirectories first so as not to overwhelm them with all the URLs at once.
Given all the problems, it will be interesting to see how long it takes before the pages start getting indexed. Needless to say, it's a bit of a worst-case scenario since, from Google's perspective, it would seem I tried my hardest to confuse them. I didn't really mean to, but that was the net effect...
Now Dan's pushing me to get the supplementary records up, which are mostly chemicals and drugs. That will nearly double the number of pages yet again and will require quite a bit of work, since I haven't taken a close look at the data and haven't imported it into a database yet. I'm just going to let things sit for a few months first and come back to it sometime in the summer...
Tags: Google Webmaster Central, slicksurface.com, spiders
Categories: Duplicate Content, Google, Spiders/Bots