I recently installed the latest Piwik beta (0.2.32) and am really happy with it, except for one point. Most of our visitors come from Google search, so I was wondering why Piwik showed most of them coming from live.com. I checked my access log and found that the live.com bot is not correctly recognized as a crawler. Worse, the bloody bot sends a referrer containing the most important keyword of the page it is crawling, so Piwik displays lots of traffic that appears to come from live.com search.
Examples from my access log:
220.127.116.11 - - [19/Mar/2009:09:29:21 +0100] "GET /waadt/villars-tiercelin HTTP/1.0" 200 11576 "http://search.live.com/results.aspx?q=villars" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
18.104.22.168 - - [19/Mar/2009:09:29:24 +0100] "GET /waadt/ependes HTTP/1.0" 200 11488 "http://search.live.com/results.aspx?q=ependes" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
22.214.171.124 - - [19/Mar/2009:09:29:39 +0100] "GET /waadt/molondin HTTP/1.0" 200 11440 "http://search.live.com/results.aspx?q=molondin" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322)"
I will now try to find out how to identify the bloody bot from Redmond. I can either try to implement the detection on my own, or maybe someone can give me a hint on where to start? Of course I will make the patch available to the community as soon as it’s working.
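To get a feeling for how well the pattern can be isolated, here is a rough Python sketch that flags log lines resembling the three above. It is only a heuristic built from those samples, not how Piwik detects crawlers: it assumes the Apache combined log format, and the names `LOG_RE`, `SUSPECT_AGENT` and `looks_like_live_bot` are made up for this example. It treats a request as suspicious when the referrer host is search.live.com, the user agent is exactly the string seen above, the request uses HTTP/1.0, and the search keyword also appears in the requested path. A real visitor with the same browser could match too, so the result would need checking against more log data.

```python
import re
from urllib.parse import urlparse, parse_qs

# Apache "combined" log format, as in the lines quoted above (assumption).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# The exact user agent string seen in all three sample lines.
SUSPECT_AGENT = (
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; "
    ".NET CLR 1.1.4322)"
)


def looks_like_live_bot(line):
    """Return True if a log line matches the pattern seen in the samples."""
    m = LOG_RE.match(line)
    if m is None:
        return False  # not a combined-format line

    referrer = urlparse(m.group("referrer"))
    if referrer.hostname != "search.live.com":
        return False
    if m.group("agent") != SUSPECT_AGENT:
        return False
    if m.group("proto") != "HTTP/1.0":
        return False

    # The referrer's search keyword is taken from the page being fetched,
    # so it should show up in the requested path.
    keyword = parse_qs(referrer.query).get("q", [""])[0].lower()
    return bool(keyword) and keyword in m.group("path").lower()
```

Running this over the access log and counting the hits per day should show whether the fake live.com traffic disappears once these requests are excluded. Since Piwik itself is written in PHP, the actual patch would have to apply the same checks there.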