We are tracking our whole website in PHP in the backend. We now get a lot of crawlers, hosted on AWS instances, that access the site once and leave again. We assume they are automated crawlers looking for vulnerabilities in our software stack. I want to filter them out and put them into a separate category. I don’t want to exclude them from tracking, since the information is still valuable to us. The problem is that all those accesses are skewing our bounce rate, conversion rate and general hit counts.
What is the best way to separate legitimate hits from the ones described?
Hi,
There was a really extensive discussion about filtering bots in the PHP tracker on the German forum:
To sum it up:
- There is no rule that fits every use case.
- By default, Piwik ignores bots it recognizes via the user agent. (You can disable this behaviour, though.)
- You’ll probably want to build your own bot detection on top of device-detector.
- Maybe track users you detect as bots to a separate site ID, or add a Custom Dimension/Variable, so you can create segments to differentiate between them.
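The custom variable idea from the last point could look roughly like the sketch below. It assumes the `PiwikTracker` class from the Piwik PHP tracking client is loaded and uses its `setCustomVariable()` and `doTrackPageView()` methods. The helper names, the slot index 1 and the variable name "VisitorType" are made up for illustration.

```php
<?php
// Sketch only: helper names, slot index and variable name are illustrative assumptions.

// Map your own detection result to a label that a segment can filter on.
function visitorTypeLabel(bool $detectedAsBot): string
{
    return $detectedAsBot ? 'bot' : 'human';
}

// $tracker is expected to be a PiwikTracker instance (or anything with the same two methods).
function trackWithVisitorType($tracker, bool $detectedAsBot, string $pageTitle): void
{
    // Visit-scoped custom variable in slot 1.
    $tracker->setCustomVariable(1, 'VisitorType', visitorTypeLabel($detectedAsBot), 'visit');
    $tracker->doTrackPageView($pageTitle);
}
```

With this in place, a segment on the custom variable value should be able to separate the two groups within a single site, provided the custom variable slot is free on your installation.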
@Lukas I read the whole thread and there definitely seems to be no straightforward way to solve this issue. I think it would already make sense to track users where JavaScript is deactivated in a separate site.
I just don’t know how I want to implement that either, since all solutions presented on SO are not very elegant.
Is there a way in the Piwik PHP tracker to mark a page view as a bot view, or do I really have to change the device detector on the tracking server itself?
Hey @Minalcar,
There are two approaches: You may either
- Try to filter out the undesired hits on the reporting end by setting up two segments, one excluding your assumed bots and one focusing on them. You would need fairly precise criteria to separate the two segments, and since you cannot block single IPs or ranges in segments, that will become quite difficult.
- Try to detect each visit's type and assign it to one of your two visitor groups before calling the tracking API.
I’m afraid you will have to come up with some kind of custom solution before calling the PHP tracker to get anywhere near your desired result.
Here is why:
If you apply changes within the DeviceDetector, teaching it to detect the assumed bots, that will lead to those hits disappearing from your stats, which is not what you want.
What you would like to do (keep the bot hits, but listed in a different segment / site ID) will require you to assign the different site ID or custom parameter when calling the tracking API.
You will not be able to check JavaScript execution on the first visit when using the PHP tracker, so I guess you will have to focus on other signals like missing referrers, missing cookies, odd user agents or IP addresses (if you are willing to keep these updated as new bots show up).
You’ll need a custom isABotIMO() function, similar to the DeviceDetector but focused only on your own ideas (mind, DeviceDetector will still run if not disabled, so there is no need to replicate it unless you intend to track even more bots than right now).
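A minimal sketch of such a function, built only on the signals named above (user agent, IP address, missing referrer and cookies). Everything in it is an assumption to adapt: the optional parameters, the user-agent patterns and the IP prefix (203.0.113. is a documentation range used as a placeholder). None of it is part of Piwik or DeviceDetector.

```php
<?php
// Hypothetical helper: the rules below are examples, not a vetted bot list.
// $server and $cookies default to the PHP superglobals but can be passed in for testing.
function isABotIMO(?array $server = null, ?array $cookies = null): bool
{
    $server  = $server ?? $_SERVER;
    $cookies = $cookies ?? $_COOKIE;

    $userAgent = $server['HTTP_USER_AGENT'] ?? '';
    $referrer  = $server['HTTP_REFERER'] ?? '';
    $ip        = $server['REMOTE_ADDR'] ?? '';

    // Signal 1: empty or script-like user agent (example patterns only).
    if ($userAgent === '' || preg_match('/curl|wget|python|scan/i', $userAgent)) {
        return true;
    }

    // Signal 2: IP address starts with a prefix you maintain yourself.
    foreach (['203.0.113.'] as $prefix) {
        if (strpos($ip, $prefix) === 0) {
            return true;
        }
    }

    // Signal 3: neither a referrer nor any cookie was sent.
    return $referrer === '' && count($cookies) === 0;
}
```

Note that signal 3 also matches genuine first-time visitors who type the URL directly, so treat it as a starting point and tighten it for your own traffic.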
In the end, your PHP tracking code should follow the logic below:
if (isABotIMO()) {
    /* call the PHP tracker with site ID Y to track bots */
}
else {
    /* call the PHP tracker with site ID X to track normal visitors */
}
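Filled in with the Piwik PHP tracking client, that branch might look like the sketch below. It assumes `PiwikTracker.php` has been loaded and that isABotIMO() is your own detection function. The site IDs 1 and 2, the tracker URL and the `pickSiteId()` helper are placeholders, not values from this thread.

```php
<?php
// Sketch only: site IDs, URL and helper name are placeholders.
const SITE_ID_VISITORS = 1; // "X": normal visitors
const SITE_ID_BOTS     = 2; // "Y": assumed bots

// Choose the Piwik site the current hit should be recorded under.
function pickSiteId(bool $isBot): int
{
    return $isBot ? SITE_ID_BOTS : SITE_ID_VISITORS;
}

// Only track when the tracking client and your detection function are actually available.
if (class_exists('PiwikTracker') && function_exists('isABotIMO')) {
    $tracker = new PiwikTracker(pickSiteId(isABotIMO()), 'https://piwik.example.org/');
    $tracker->doTrackPageView('Page title');
}
```

Keeping the site choice in one small function means the rest of the tracking code stays identical for both groups.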