How to force Matomo to merge duplicate visitor entries

I recently realized that we’re missing a significant amount of analytics data because Google indexes our PDF files and links searchers directly to them. Those visitors never load the JS tracking code, so Matomo never sees the traffic.

So after doing some testing and brainstorming, I’ve implemented a simple little Cloudflare worker that intercepts requests for PDF files, checks to make sure the referrer isn’t our own website, and then sends a POST to our Matomo installation with info like the IP, user agent, etc. That’s great (our daily visitors doubled!), but I noticed that we seem to be getting a number of duplicate entries for PDF downloads now. See the screenshot below:
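For anyone following along, here is a rough sketch of the kind of hit such a worker might send. It is not the actual worker code: the `PdfRequestInfo` shape and the `isExternalReferrer` and `buildDownloadHit` helpers are hypothetical names used only for illustration. The parameter names (`idsite`, `rec`, `url`, `download`, `urlref`, `ua`, `cip`, `token_auth`) come from the Matomo Tracking HTTP API; note that overriding the IP with `cip` requires `token_auth`.

```typescript
// Hypothetical shape of what the worker extracts from the incoming request.
interface PdfRequestInfo {
  url: string;       // full URL of the requested PDF
  referrer: string;  // Referer header value, "" if absent
  visitorIp: string; // e.g. taken from the CF-Connecting-IP header
  userAgent: string; // User-Agent header value
}

// Assumption: requests referred by our own host are already covered by the
// JS tracker on the linking page, so only external referrers are tracked.
function isExternalReferrer(referrer: string, ownHost: string): boolean {
  if (referrer === "") return true;
  const match = /^https?:\/\/([^\/:?#]+)/i.exec(referrer);
  if (match === null) return true; // unparseable referrer: treat as external
  return (match[1] || "").toLowerCase() !== ownHost.toLowerCase();
}

// Build a form-encoded body for a Matomo download hit.
function buildDownloadHit(info: PdfRequestInfo, idSite: number, tokenAuth: string): string {
  const pairs: [string, string][] = [
    ["idsite", String(idSite)],
    ["rec", "1"],
    ["url", info.url],
    ["download", info.url], // marks the hit as a download
    ["urlref", info.referrer],
    ["ua", info.userAgent],
    ["cip", info.visitorIp], // IP override, needs token_auth
    ["token_auth", tokenAuth],
  ];
  return pairs
    .map((p) => encodeURIComponent(p[0]) + "=" + encodeURIComponent(p[1]))
    .join("&");
}
```

The resulting string would be sent as the body of a POST to the Matomo tracking endpoint with a form-encoded content type.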

I know Matomo is supposed to be smart about detecting page views as the same visitor, but it would be nice if the same would extend to downloads. Is there something I can do to “force” Matomo to view those 2 events as the same visitor?

Sometimes Matomo does group downloads into the same “visit” but still shows 6 separate downloads for the exact same file over the course of a few seconds.

Again, it would be nice if these would just be counted as 1 download, because otherwise it artificially inflates our statistics to the point that they’re not nearly so useful. Any ideas about how we can clean up this data to make it more useful? :grin:

Hi @577895
Since you track some downloads server-side, I think you should also remove the client-side tracking for these events. If you track twice, it’s normal that you get hits twice!

Hello @heurteph-ei,
As I mentioned previously, I’m excluding all traffic with our website as the referrer, so we’re not actually tracking twice. This is traffic to PDF files which cannot trigger JS in any case.

The issue is that we get multiple requests from the same visitor for the same PDF in the span of a few seconds, and each one gets counted as a separate request. All I want to do is have them merged so that it doesn’t skew our statistics.

Hi @577895
Sorry, I misunderstood the problem :blush:
I think you have to determine somewhere in your worker whether the user is the same or not (e.g. via a cookie value received while serving the PDF, or simply by the IP). With this information, you can send a visitor ID to Matomo:

  • _id (recommended) — The unique visitor ID, must be a 16 characters hexadecimal string. Every unique visitor must be assigned a different ID and this ID must not change after it is assigned. If this value is not set Matomo (formerly Piwik) will still track visits, but the unique visitors metric might be less accurate.

From:
https://developer.matomo.org/api-reference/tracking-api#optional-user-info

Thanks, @heurteph-ei, I was wondering about that. Yesterday I tried setting the _id to Cloudflare’s cf-ray header, since that seemed like a decent option for a unique identifier. But it didn’t help; I’m still getting duplicate views.

However, I just took a closer look at my Matomo logs. The visitor IP that CF reports stays the same (I get it from the CF-Connecting-IP header and pass it to Matomo via the cip parameter), but the subsequent requests in the log are actually coming from different IPs! I wonder if the PDF file gets served along a couple of different “hops” in the network (because the files are cached by Cloudflare, maybe they’re moved to a datacenter closer to the user or something :man_shrugging:) and my worker dutifully intercepts each of those requests and sends them along to Matomo. As it turns out, cf-ray is not a good option for a unique visitor ID, since it changes for each request.

IMO, Matomo still should be able to merge those duplicates into one, since they look nearly if not exactly identical, but if I can figure out how to keep my CF worker from sending the duplicate requests in the first place, then I might be able to stop the problem from that end. I’ll be happy with either solution. If anyone has any ideas on how to accomplish it, I’ll be happy to hear them!
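One way to stop duplicates on the worker side is a short deduplication window: remember when a hit was last sent for a given key (for example visitor ID plus URL) and skip anything that repeats within a few seconds. The sketch below is only an illustration; the `RecentHits` class and its method names are made up. It keeps state in memory, which is a big assumption for a Cloudflare Worker: memory is not shared between isolates or datacenters, so this would be best-effort at most, and a shared store (such as Workers KV or a Durable Object) would be needed for anything stricter.

```typescript
// Remembers when a hit was last sent for each key and suppresses repeats
// that arrive within `windowMs` of that hit. Hypothetical helper.
class RecentHits {
  private seen: Map<string, number> = new Map();
  private windowMs: number;

  constructor(windowMs: number) {
    this.windowMs = windowMs;
  }

  // Returns true if a hit for `key` should be sent at time `nowMs`.
  // The window is measured from the last hit that was actually sent;
  // suppressed repeats do not extend it.
  shouldSend(key: string, nowMs: number): boolean {
    const last = this.seen.get(key);
    if (last !== undefined && nowMs - last < this.windowMs) {
      return false;
    }
    this.seen.set(key, nowMs);
    return true;
  }

  // Drops entries older than the window so the map does not grow forever.
  prune(nowMs: number): void {
    const expired: string[] = [];
    this.seen.forEach((sentAt, key) => {
      if (nowMs - sentAt >= this.windowMs) expired.push(key);
    });
    for (const key of expired) this.seen.delete(key);
  }

  size(): number {
    return this.seen.size;
  }
}
```

In a worker, the call might look like `if (recent.shouldSend(visitorId + "|" + url, Date.now())) { /* send hit to Matomo */ }`, with the window set to something like 10 to 30 seconds.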

I’m going to try creating my own unique _id by hashing the visitor ip from CF and the URL and see if that works.
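A minimal sketch of that idea, for illustration only: it turns the IP and URL into the 16-character hexadecimal string that `_id` expects. To keep the example dependency-free it uses two passes of a simple FNV-1a-style hash with different seeds, which is not what the worker necessarily uses; inside a Worker, a SHA-256 digest via `crypto.subtle.digest` truncated to 16 hex characters would be the more usual choice. The function names and the second seed value are arbitrary.

```typescript
// 32-bit FNV-1a-style hash over the UTF-16 code units of `text`.
// Non-cryptographic; used here only to keep the sketch self-contained.
function fnv1a32(text: string, seed: number): number {
  let hash = seed >>> 0;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep unsigned
  }
  return hash;
}

// Zero-padded 8-character lowercase hex for an unsigned 32-bit number.
function toHex8(n: number): string {
  return ("00000000" + n.toString(16)).slice(-8);
}

// Derives a stable 16-character hex visitor ID from the given parts.
// The same parts always yield the same ID.
function visitorIdFrom(parts: string[]): string {
  const input = parts.join("|");
  const high = fnv1a32(input, 0x811c9dc5); // standard FNV offset basis
  const low = fnv1a32(input, 0x9e3779b9);  // arbitrary second seed (assumption)
  return toHex8(high) + toHex8(low);
}
```

One design point to keep in mind: `visitorIdFrom([ip, url])` gives one ID per visitor and file, so the same person downloading two PDFs would count as two visitors. Hashing only the IP (`visitorIdFrom([ip])`) would keep a single ID across files, at the cost of merging everyone behind a shared IP.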

Update
I updated my CF worker to set a unique _id for each visitor by hashing the visitor’s IP and the URL. However, I’m still seeing some duplicate visits, although I’m hopeful that maybe setting the _id is helping at least part of the time. Here’s an example:

(screenshot of the two visits)

The user agent changed from Google Go (com.google.android.apps.searchlite) to Chrome mobile, but the unique _id didn’t change between the 2 visits, so that means the visitor’s IP and access URL were the same. (I wonder if one of the visits was a “prefetch” or something?)
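If some of the duplicates really are prefetches, one thing that could be tried is checking the request headers before sending anything to Matomo. This is a guess, not a confirmed cause. Browsers may mark prefetches with a `Sec-Purpose: prefetch` header (older Chrome versions used `Purpose: prefetch`), and PDF viewers often fetch a file in several chunks using `Range` requests. The helper name and the plain-object header shape (lowercase keys) are assumptions for the sake of the sketch.

```typescript
// Returns true when a request looks like a prefetch or a follow-up byte-range
// chunk rather than the start of a real download. `headers` is assumed to be
// a plain object with lowercase header names.
function looksLikeSecondaryRequest(headers: Record<string, string>): boolean {
  const purpose = (headers["sec-purpose"] || headers["purpose"] || "").toLowerCase();
  if (purpose.indexOf("prefetch") !== -1) {
    return true;
  }

  // Assumption: a range starting at byte 0 (or no Range header at all) is the
  // start of a download; any other range is a later chunk of the same file.
  const range = (headers["range"] || "").trim().toLowerCase();
  if (range !== "" && !/^bytes=0-/.test(range)) {
    return true;
  }

  return false;
}
```

The worker would then skip the Matomo call whenever this returns true. Whether that removes real downloads along with the noise would need checking against the logs.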

Also, I think I figured out the reason for different IPs showing in the logs… Evidently CF workers can “work” from different datacenters. In this case, the IP addresses that submitted these requests to the Matomo API were different between these 2 requests. But since I’m using the visitor’s IP address for my unique _id it didn’t affect my _id at all. Nonetheless, the problem still isn’t solved. :frowning: