A little while ago, we ran into a problem where Google and Yahoo strangely appeared to have stopped indexing most (but not all) pages on one of our clients' sites.
Pages in the top level of navigation on the site were being properly indexed by the search engines. However, just about every page from the second navigation level down was missing from the search indexes.
Google's error report was "Network Unreachable". Not very helpful!
(By the way - for those who are uninitiated in this area - we used Google's Webmaster Tools to get a view of what Google was doing with regards to indexing our pages.)
The site in question was using a number of technologies that we thought could be contributing to the issue:
- URL rewriting to create "friendly" URLs from database-driven URLs with query strings
- Hosted on an advanced web hosting network, with a strict firewall
- A redirection technology, to generate proper 302 redirects for URLs that used to exist on the earlier version of the site so that the search engines would update their indexes properly
However, the only functional difference we could see between the site in question and others that used similar structures, similar content, similar technologies, and the same content management system under the hood was that the front end of the site had been changed from classic ASP code to ASP.Net 2.0. It took a lot of digging, but we uncovered the problem. It's related to ASP.Net browser definition files:
Browser definition files contain definitions for individual browsers. At run time, ASP.NET uses the information in the request header to determine what type of browser has made the request. Then ASP.NET uses .browser files to determine the capabilities of the browser, and how to render markup to that browser.
MSDN Library: Browser Definition File Schema (browsers Element)
Nothing wrong with that, but apparently around March 2006 Google updated the Googlebot user agent string to:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Prior to this change, the user agent string couldn't be matched to an existing browser definition, so ASP.Net used the
default.browser definition file. Unfortunately, once the user agent string changed, ASP.Net couldn't find a match for everything after "Mozilla" in the user agent string, so it fell back to the "Mozilla/1.0" compatibility settings. And that leads to all sorts of problems.
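To see why the new string is, in principle, perfectly recognisable, here's a quick Python sketch that matches the "Mozilla/5.0 (compatible; ...)" crawler format and pulls out the crawler name. The regex here is purely illustrative (it's not ASP.Net's actual matching logic, nor necessarily the pattern used in the fix below):

```python
import re

# Googlebot's post-March-2006 user agent string, as quoted above.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Illustrative pattern: recognise "Mozilla/5.0 (compatible; <crawler>; +<url>)"
# style crawler strings and capture the crawler token and info URL.
CRAWLER_PATTERN = re.compile(
    r"Mozilla/5\.0 \(compatible; (?P<crawler>[^;]+); \+(?P<url>[^)]+)\)"
)

match = CRAWLER_PATTERN.match(GOOGLEBOT_UA)
print(match.group("crawler"))  # Googlebot/2.1
print(match.group("url"))      # http://www.google.com/bot.html
```

A custom browser definition file does essentially this kind of pattern match, which is why the fix below works.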
Thankfully, ASP.Net is quite configurable. You can create your own browser definition files that override the built-in defaults shipped with the IIS web server. The solution is to create a browser definition file that properly recognises the user agent strings from Google and Yahoo. Place this file in the
/App_Browsers folder of your website, and the problem is solved.
Thank you to Brendan Kowitz (here's his blog) for this solution. His blog post discusses this issue in more detail and provides additional technical insight.
Paste the following code into a file named genericmozilla5.browser and add that file to the /App_Browsers directory in your website. Simple as that!
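The exact file from Brendan Kowitz's post isn't reproduced here; the following is a minimal sketch of what such a genericmozilla5.browser definition might look like. The element names follow the MSDN browser definition schema, but the regex, the id, and the capability values are illustrative assumptions; check them against Brendan's post and the schema documentation before relying on them:

```xml
<browsers>
  <!-- Illustrative sketch: match "Mozilla/5.0 (compatible; <crawler>; +<url>)"
       style user agents (Googlebot, Yahoo Slurp) so they inherit sensible
       uplevel capabilities instead of the Mozilla/1.0 fallback. -->
  <browser id="GenericMozilla5" parentID="Mozilla">
    <identification>
      <userAgent match="Mozilla/5\.0 \(compatible; (?'crawler'[^;]+); \+(?'url'[^\)]+)\)" />
    </identification>
    <capabilities>
      <capability name="browser"           value="${crawler}" />
      <capability name="majorversion"      value="5" />
      <capability name="minorversion"      value="0" />
      <capability name="version"           value="5.0" />
      <capability name="cookies"           value="true" />
      <capability name="tables"            value="true" />
      <capability name="crawler"           value="true" />
    </capabilities>
  </browser>
</browsers>
```

The key idea is the userAgent match expression: because it recognises the full "Mozilla/5.0 (compatible; ...)" string, ASP.Net no longer falls back to the Mozilla/1.0 settings for these crawlers.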
Note for cm3 Acora CMS Developers and Owners
The latest versions of the cm3 Acora CMS white site ship with this fix pre-installed, so you don't need to do anything. If you're running an earlier version of the ASP.Net white site, follow the instructions above.
Related Issues and Handy Tools
More About Browser Capabilities Mismatches
Sander Gerz goes into some detail about problems caused by a mismatch in browser definition files.
A Way to Test Different User Agents
Another handy tool I found when diagnosing the problem is the User Agent Switcher Firefox add-on by Chris Pederick. Just add it to Firefox and change your user agent string to see how Googlebot sees your site. It also includes user agents for Yahoo Slurp and MSNbot 1.1, as well as various Internet Explorer versions and the iPhone.
Google Webmaster Tools
Invaluable tools for managing and improving your website.