What can explain the mysterious "too many Direct Entries" phenomenon?


(Jeff F.) #1

Hi everybody,
I’ve been using Piwik for the last ten years (!) and I still love it. There is, however, something I never quite managed to figure out and solve: the wildly overreported (or misclassified) “Direct entries” phenomenon.

I did read FAQ entry no. 51 as well as the campaigns tracking documentation, in addition to every thread mentioning “Direct Entries” in this forum. It has been mentioned in one way or another in dozens of threads (including 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, etc.), but the discussion usually ends up nowhere close to a satisfying answer, either with the poster saying “oh I found the issue”, or “ah I actually had redirects in place” (not my case AFAIK, see further below) or “try a newer version” (and then the original poster disappears) or in some cases the thread received no answers at all… so I’m trying my luck here, with some thorough data to back me up (multiple websites, multiple servers, years of data).

Summary of the problem (TL;DR): on almost all the websites I maintain (except one or two), I get 30 to 50% of the visits being accounted as “Direct Entries” by Matomo—no matter the time of the year, no matter the server or what runs the website. On two different Matomo servers, running different Matomo versions, this strange phenomenon has always been there.

Since a picture/table is worth a thousand words, let me share a summary of statistics, comparing various websites on various servers:

This sample of data is made over the entire period between 2016-08-09 and 2019-02-07, so it’s covering 2.5 years, and I trust that’s enough to have some sort of data significance.

What you can notice there:

  • for almost all my websites, there is an insane (usually between 40 and 50%) “direct entry” rate (sometimes as crazy as 80%), whereas on Google Analytics you get maybe 7-10% being accounted as direct entry (unfortunately for this comparison, I don’t run GA on most websites because, well, it’s evil :wink:)
  • my personal blog is an exception, the percentage of “Direct Entries” is lower because it receives an extraordinary (compared to other sites) amount of referrals/SEO (I’m not sure why, but oh well), so it is an outlier.

Notes:

  • Most websites are HTTP, some (my personal blog and “Org. P”) are HTTPS, and none of the websites have weird redirections going on (as far as I can tell from using wget or some of the random online “redirection check” tools, unless you have a better recommendation for a command to check this on Linux)
  • Beyond the notion of relative percentage, the absolute number of direct entries stays very consistent from year to year. Some years you might get fewer referral visits, yet the direct entries remain pretty much the same.
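For anyone who wants to repeat the redirect check without an online tool, here is a minimal Python sketch that follows `Location` headers one hop at a time and prints each hop, so you can see whether the scheme changes or the `?pk_campaign=` query string gets dropped along the way. The function names `fetch_head` and `trace_redirects` are made up for this example, and it only sees server-side HTTP redirects (not JavaScript or meta-refresh ones); some servers also answer HEAD differently from GET.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_head(url):
    """Issue one HEAD request without following redirects.

    Returns (status_code, location_header_or_None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=fetch_head, max_hops=10):
    """Follow Location headers hop by hop; return a list of (url, status)."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # Location may be a relative URL
    return hops
```

Usage would be something like `for hop in trace_redirects("http://the_domain/?pk_campaign=test"): print(hop)`; if the last URL in the chain no longer carries the campaign parameter, a redirect is eating it.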

Considering I’ve done no TV/radio advertising (and that in the very rare cases where I do online advertising, I use piwik tracking links), I have a really hard time imagining that nearly half of my hundreds of thousands of visitors type my website URL into their browser instead of coming from another website, from a search engine, or social media. My websites are not well-known public brands, it’s not Apple’s website, nor Amazon or YouTube… For the “Org. P” website, in December and January I have run a couple of social media advertising campaigns (with the ?pk_campaign=some_campaign_name_here tracking links, of course), and the results were mostly the same: still roughly the same percentage of direct entries (between 40 and 50%), plus about a thousand visits tracked as being part of my ad “campaigns”, and the rest is just normal traffic.

Interestingly enough, the fact that I did have a couple thousand “campaign” visits, amounting to 7% of visits, separately from “social networks” visits, tells me that it’s not a case of “something on my website/server is stripping off Matomo’s campaign URL parameters”, because if that was the case then it would be 0%.

So, at this point, I’m wondering what the “direct entries” truly mean. It’s making me doubt the validity of Matomo’s data, as I don’t believe for a second that I’m getting 60 thousand people (out of 140 thousand) entering the website address “manually” into their browser each month. It defies common sense and social psychology; I’m quite sure people are lazier than that! I can see three possibilities then:

  • I am wrong, and truly half of the trackable human population knows the brand of most of my websites (unlikely) and bothers typing it or accessing it from a bookmark (unlikely), month over month. Oh, and that also means Google Analytics is wrong.
  • There is a bug in all Piwik/Matomo versions and it consistently misclassifies a significant portion of traffic as being “Direct Entries”
  • The Internet is full of bots that ping every website out there all the time and they get considered as being “Direct entry” (but then: why would the traffic spike at the same time as the referral traffic, and how does Google Analytics deal with it and Matomo does not, and why would those bots be running the page’s javascript, etc.?), in which case that means I should completely ignore (segment out) all visits that are “direct entry” when looking at my website stats?

Hypotheses I’ve considered and discarded:

  • Could “something” (a redirection, a script, etc.) be “stripping off” the referrer information or stripping off campaign tracking URL parameters? Nope, because:
    • I did get 7% visits from the campaigns I ran recently, and if it was stripping it off then it should be 0%
    • I didn’t find anything odd going on with wget (or page redirection checking tools online)
    • During that time the “Direct entries” didn’t change, still between 40 and 50%.
  • Is it because of non-javascript visitors, or adblockers? No, because then I wouldn’t be seeing the “direct entries” numbers in Matomo any more than the other numbers… Note that Piwik/Matomo being blocked by EasyList is a fact of life since 2013 or so, I always presume visits on the website are 2-3x higher than reported, and so far I’ve been estimating that 80% of the visitors on the “Org. P” website are unseen by Matomo because they are using adblockers or some other technique.

So, yeah… I am totally puzzled by this. Why would “Direct Entries” represent such a big (>20%) portion of visitors on various websites? Did any of you encounter this problem, know other things to investigate or other possible explanations?


(Lukas Winkler) #2

Hi,

This is something I (and probably many others) am also seeing.

My hypothesis:

Keep in mind that direct entry doesn’t really mean that someone typed in the URL, it rather means that there was no way Matomo could find out where the visitor came from.
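To make that concrete, here is a rough Python sketch of the kind of decision a tracker has to make for each visit. This is a simplified model for illustration and not Matomo’s actual code: the function name `classify_visit`, the tiny `SEARCH_ENGINE_HOSTS` list and the rule that a same-host referrer counts as direct are all assumptions of the sketch. Only the `pk_campaign` parameter comes from this thread.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical, deliberately tiny list; a real tracker ships a much larger one.
SEARCH_ENGINE_HOSTS = {"www.google.com", "www.bing.com", "duckduckgo.com"}


def classify_visit(landing_url, referrer):
    """Return 'campaign', 'search', 'website' or 'direct' for one visit."""
    landing = urlsplit(landing_url)
    params = parse_qs(landing.query)

    # A campaign parameter on the landing URL wins over everything else.
    if params.get("pk_campaign"):
        return "campaign"

    # No campaign parameter and no referrer: nothing left to go on.
    if not referrer:
        return "direct"

    ref_host = urlsplit(referrer).hostname
    # Assumption of this sketch: an unparsable referrer or one on the same
    # host as the landing page is treated like having no referrer at all.
    if ref_host is None or ref_host == landing.hostname:
        return "direct"

    if ref_host in SEARCH_ENGINE_HOSTS:
        return "search"
    return "website"
```

The point is that “direct” is the fallback bucket: every visit whose referrer was never sent or got stripped on the way ends up there, whatever the visitor really did.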

The main source to know this is the “Referer” header, that the browser sends on every website visit and that points to the URL you came from.
But things are quite a bit more complicated than that. E.g. if you click on a link in webmail, you probably don’t want the website to know the full URL (which may contain things like the folder names of your emails). Therefore there is an HTTP header called Referrer-Policy that a server can send and that tells the browser how to create the referer (e.g. only set the domain part or don’t send it at all). This is also the reason why Matomo can’t show you Google keywords anymore: Google sets the referer to just https://www.google.com.

But there is another large aspect:

When you go from an HTTPS site to an HTTP site, the browser by default does not send a referer, as this would allow someone intercepting your internet connection to infer information about the secure site.

So if someone clicks on a link on https://some-random-website.example/ pointing to your website, no referer is set in the browser, Matomo has no way to know where the visitor is coming from and therefore you see it as a “direct entry”.

You may ask “I see visitors coming from Google and Google only uses HTTPS”, but this is just because Google overrides this setting on every link in the results page.
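Here is a small Python sketch of both mechanisms, i.e. the policy set by the linking site and the HTTPS to HTTP downgrade. It is a simplified model of a handful of Referrer-Policy values and not a full implementation of the spec; the function names `origin_of` and `referrer_sent` are made up for this example. The default used here, `no-referrer-when-downgrade`, was the long-standing browser default; newer browsers default to `strict-origin-when-cross-origin`, which also sends nothing on a downgrade.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Scheme and host only, e.g. 'https://www.google.com/'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


def referrer_sent(source_url, target_url, policy="no-referrer-when-downgrade"):
    """Return the Referer value a browser would send, or None if it is omitted.

    Simplified model covering only a few Referrer-Policy values.
    """
    src, dst = urlsplit(source_url), urlsplit(target_url)
    downgrade = src.scheme == "https" and dst.scheme == "http"
    same_origin = (src.scheme, src.netloc) == (dst.scheme, dst.netloc)
    full = source_url.split("#", 1)[0]  # fragments are never sent

    if policy == "no-referrer":
        return None
    if policy == "unsafe-url":
        return full
    if policy == "origin":
        # Origin only, and it is sent even on an HTTPS -> HTTP downgrade.
        return origin_of(source_url)
    if policy == "no-referrer-when-downgrade":
        return None if downgrade else full
    if policy == "strict-origin-when-cross-origin":
        if same_origin:
            return full
        return None if downgrade else origin_of(source_url)
    raise ValueError(f"policy not modelled in this sketch: {policy}")
```

With the default policy, `referrer_sent("https://some-random-website.example/page", "http://your-site.example/")` gives `None`, which the tracker can only file under direct entries. With `policy="origin"`, a link from an HTTPS search results page to an HTTP site still yields the bare origin, which matches seeing the search engine as a referrer but no keywords.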

So my summary is: try to move more of your websites to HTTPS. In the worst case the direct-entry numbers don’t improve, but your visitors can at least be sure that they are getting your website and not a malware-added variant of it because their ISP thinks that modifying the websites it delivers is a great idea.


(Jeff F.) #3

Thanks for the initial reply on this, the mystery remains so far however:

  • If I understand your point about the “Referrer-Policy” header correctly, I would see a lot of visitors coming from a generic website (ex: google.com) but not from specifics (ex: mail.google.com)? But that’s not what we’re observing in this case.
  • If otherwise the Referrer-Policy was somehow a reason the info was stripped entirely for people coming from their webmail or email clients (especially mail clients!), I should see a pattern that varies from normal website-referred visitors; i.e. I should see a spike of direct entries in particular when I’m sending out an email to the organization’s mailing list, and I should see the direct entries being dwarfed by campaign or website referrers when I have an ad going or when a news site writes an article about the organization’s cause; instead, I’m seeing a rather constant presence of direct entries, never below 30% in the case at hand:

    …unless those are from a search engine that sets the Referrer-Policy to “nothing at all” (?), or something else?
  • Regarding your 2nd hypothesis: the main website I’m concerned about lately (“Org. P”), which prompted this whole reflection, is using HTTPS everywhere (as demonstrated below), so I had already discarded the “maybe it’s losing the referrer going from HTTPS to HTTP” scenario…
$ wget http://the_domain
URL transformed to HTTPS due to an HSTS policy
Resolving the_domain... 138.68.xxx.xxx
Connecting to the_domain |138.68.xxx.xxx|:443... connected.
HTTP request sent, awaiting response... 200 OK

Hopefully I understood your reply correctly so far, and I welcome any further insights you folks may have for me!


(brody) #4

I am seeing 100% Direct Entries. I think this has something to do with my Apache headers.

What should my HTTP headers look like?

Also, what should I have in …

Cross-Origin Resource Sharing (CORS) domains?