We’re doing some direct reporting on the DB to total up the number of pages viewed in a visit using page buckets (1, 2, 3, n pages per visit). The goal is to produce a histogram, but all we need is the data.
However, we’re getting really unusual distributions that are not squaring with other analytics systems (and this is across several sites with independent third party analytics). We’ve decomposed the SQL and there are no obvious problems. Piwik itself doesn’t provide this reporting, but from the schema, it seems eminently doable. What are we missing here?
page_count AS bucket_size, COUNT(idvisit) AS num_visits
idsite, idvisit, COUNT(pla2.name) AS page_count
JOIN piwik_log_action pla2 ON (pllva2.idaction_url = pla2.idaction)
pla2.type = 1
GROUP BY idvisit
ORDER BY idvisit ASC) tmp
GROUP BY bucket_size
ORDER BY bucket_size;
Piwik does provide this report, see: Piwik Forums - Tuesday, January 9, 2018 - Web Analytics Reports - Piwik
Thanks for this. I did find that later (although the buckets are too wide). Was able to run down the issue over the weekend and while it’s not exactly related to Piwik, but may be of interest to other users who are looking to work around Piwik volume limitations.
We’re monitoring dozens of sites that in total have billions of monthly impressions. To avoid having Piwik crumble or throw a lot of hardware at the problem, we built a sampling rate into the data collection JS to throttle this down to a manageable level. This works fine for nearly everything we measure (acceptable error bands on the metrics), but totally screws up engagement metrics such as pages per visit and repeat visits (because the 2nd…nth pageview/visit may not be sampled). Hence the weird distribution.
Btw: feel free to suggest changes to page buckets in a new issue @ Issues · matomo-org/piwik · GitHub