What's all this about then?
Well, one day I was playing with Google to see what position my site was at, and
out of idle curiosity I thought I'd see where I was in Microsoft's Google beater,
the new MSN beta search engine.
So...
One Google later and I find myself on the first page at position 5. I then tried
MSN and was surprised to find I was down in position 67, well off the front page.
I clicked on a few of the sites in the first few results and for some reason the fact that a couple were IIS-based caught my eye. I compared with the first page of Google's results and didn't find a single IIS-based server.
Odd. Pure coincidence perhaps. But how to check?
I decided to try and analyse the results produced by Google and MSN and compare them for bias.
I tried a few manual queries, found a few similar anomalies, and started writing a few automated test scripts.
After a bit of tapping away on the evening train commute, I had a way of running a set of words through each engine, generating a list of sites and interrogating them to determine what software they were running.
Manually looking at the results from the first few words did indeed seem to show a bias towards IIS servers. But to show a pattern it was time for some detailed analysis. After a few attempts at analysing the results in OpenOffice, I couldn't find a way of presenting the data the way I wanted. I doodled down on a bit of paper what I thought would be the fundamentally interesting details, such as overall results, analysis of the first page of results and overall coverage through a large set of results, and set about writing a bit of perl to generate some results.
The initial set of words indeed showed a significant difference between the results from Google and the results from the beta MSN search, with the Google results mirroring the Netcraft data and the MSN results showing a distinct swing from Apache to IIS. Hmmm.
The results on this page are the collated stats across the full set of words (see below). Each phrase is then broken down into a detailed page which shows the data for the full set of 100 results, a summary of the top 10 hits and a chart showing the coverage of webservers across the search results.
A perl script was used to perform searches against Google and MSN and scrape the results. I intentionally avoided the search APIs to be sure I was getting the same results as a normal user. Each server was then identified using basic fingerprinting: an initial HEAD query, followed by more specific queries as required, and finally trying OPTIONS matching. This proved to be sufficient to identify all but a few esoteric servers which were either intentionally hardened, or non-generic custom servers with no identification (e.g. directory.yahoo.com). Then it was a simple matter of slicing and dicing.
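To give a flavour of the fingerprinting step, here is a minimal sketch in Python (not the original perl). It only covers the HEAD-then-OPTIONS part and skips the "more specific queries" stage. The names `FAMILIES`, `classify_server` and `probe` are made up for this illustration, and matching on substrings of the Server header is an assumption about how such a classifier might work, not a description of the actual scripts.

```python
import http.client
from urllib.parse import urlsplit

# Hypothetical mapping from Server-header substrings to server families.
FAMILIES = (
    ("microsoft-iis", "IIS"),
    ("apache", "Apache"),
)


def classify_server(header):
    """Map a raw Server header value to a coarse family name."""
    if not header:
        return "Unknown"
    lowered = header.lower()
    for needle, family in FAMILIES:
        if needle in lowered:
            return family
    return "Other"


def probe(url, timeout=10):
    """Try HEAD first, then OPTIONS, and return the classified family."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn_cls = http.client.HTTPSConnection
    else:
        conn_cls = http.client.HTTPConnection
    for method in ("HEAD", "OPTIONS"):
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            conn.request(method, parts.path or "/")
            response = conn.getresponse()
            family = classify_server(response.getheader("Server"))
        except (OSError, http.client.HTTPException):
            # Treat network or protocol failures as "no identification".
            family = "Unknown"
        finally:
            conn.close()
        if family != "Unknown":
            return family
    return "Unknown"
```

Calling `probe("http://www.example.com/")` would return a family name such as "Apache" or "IIS", or "Unknown" when the server gives nothing away on either request.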
All of the scraping, analysis and charting was done using a few simple perl scripts. I will make these available so others can have a play with their own analysis or verify these results. All of the raw data from the searches is available in the (not quite) .csv.gz files.
A few highlights of these statistics. Firstly, for some reason some MSN queries return a result set that contains tracking links for every URL. These return a URL starting "http://g.msn.com/9SE/1?" followed by the original URL, followed by some tracking information. One such search that originally skewed the results was for "MP3". It would be interesting to see what other search terms return tracked result sets.
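Unwrapping those tracked links before fingerprinting might look something like the following Python sketch. The prefix is the one quoted above; the "&&" separator between the original URL and the tracking information is purely an assumption for illustration, as is the function name `strip_tracking`.

```python
TRACKING_PREFIX = "http://g.msn.com/9SE/1?"
# Assumption: the tracking data is separated from the original URL by "&&".
TRACKING_SEPARATOR = "&&"


def strip_tracking(url):
    """Return the original URL from a tracked MSN link, else the URL unchanged."""
    if not url.startswith(TRACKING_PREFIX):
        return url
    remainder = url[len(TRACKING_PREFIX):]
    # Keep everything before the first separator; drop the tracking data.
    original, _, _ = remainder.partition(TRACKING_SEPARATOR)
    return original
```

Untracked URLs pass straight through, so the same function can be applied to every result regardless of which engine it came from.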
The queries for Microsoft and Linux show results skewed towards IIS and Apache respectively, so people are eating their own dogfood. Interestingly, the percentage of Microsoft sites using IIS is much lower (64%) than that of Linux sites running Apache (94%), suggesting that not everyone is as confident with their webserver.
Staying with the Linux theme, the 'linux' query interestingly returns RedHat, Debian and Novell on the first page of Google results, but none of these show up on the first page of MSN results, and Debian doesn't even feature in the top 100.
On the whole it seems that the MSN search engine is indeed placing IIS-hosted sites higher in the results more frequently than other webservers. Frequently the MSN search is placing more IIS servers in the important top 10 results than Google, even where the result set for a query has actually returned fewer IIS servers overall on MSN.
Looking at the coverage graphs, most search phrases return a more even spread of IIS servers throughout the result sets from the MSN searches.
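The top 10 versus overall comparison boils down to two percentages per server family. A minimal Python sketch of that arithmetic, assuming the results for one query have already been reduced to an ordered list of family names (the names `share` and `top_vs_overall` are hypothetical, not taken from the perl scripts):

```python
from collections import Counter


def share(families, name):
    """Percentage of entries in `families` equal to `name` (0.0 if empty)."""
    if not families:
        return 0.0
    return 100.0 * Counter(families)[name] / len(families)


def top_vs_overall(ranked_families, name, top_n=10):
    """Return (share within the first top_n results, share across all results)."""
    return share(ranked_families[:top_n], name), share(ranked_families, name)
```

With a made-up ranking of 100 results where 4 of the first 10 and 14 overall are IIS, `top_vs_overall(ranking, "IIS")` gives a 40% share of the front page against 14% overall, which is the kind of gap the detailed pages are meant to expose.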
So what's going on?
I have no idea. I doubt it's all a big conspiracy... but some possible explanations
spring to mind:
Perhaps the MSN search has simply been coded by developers used to talking
to IIS machines and so it just does that job better?
Perhaps the MSN spider is taking advantage of some specific IIS features to
provide enhanced indexing?
A friend of mine, Tim Meadowcroft, suggested that the Google Zeitgeist may be favouring news sites, and so suggested his own set of alternative words.
The alternative results can be found here.
Ivor Hewitt
January 2005,
Surrey, England.
Top 100 summary
These are the total results totted up across all words for Google and MSN search. The current figures from Netcraft, which show Apache at 68.43% and IIS at 20.86%, match quite nicely with the Google figures.
Top 10 summary
This shows the distribution of webservers just within the crucial top ten front page.
Detailed Results:
This is the full list of the current search words/phrases used by the validator. The detailed breakdown will show the full analysis of each word and the web server coverage graphs. The source data shows the raw csv data from the queries and also a dump of the HEAD responses from each server.
All content Copyright © 2005 Ivor Hewitt.
http://www.ivor.it - Technology - http://www.ivor.org - The Hedge.