Hi all!
We’ve been experiencing some Matomo UI outages over the past couple of months. Our most recent outage was addressed by changing the performance setting of our EFS filesystem to “Enhanced” in AWS. But now we’re facing a new issue: ever since that outage, a significant number of sessions and transactions are not being tracked in Matomo. I already connected with our Matomo expert on this, and he suggested it may be a tagging issue, but we’ve tested several tagging fixes and nothing seems to help. (We have also deployed Matomo across 30+ brands that all use different technologies/infrastructures, and we’re seeing the missing-transaction issue across all of them.) The timing also leads us to believe this is likely some hard-to-detect back-end issue, since the problem appears on every site and we didn’t make a company-wide tagging change that would correlate with it.
I’ve spent the past several days sifting through logs in both GA4 (tracking is fine there) and Matomo, trying to identify geo, device, OS, user-type, purchase-type, visitor-type, or page-load-timing patterns that could explain this, and I’ve found nothing. There’s no consistency to the missing sessions or purchases. The only potentially useful thing I found is that some of the missing purchases appear to happen in clumps.
Any idea what this could be? We know the missing transactions are real because the transaction IDs correlate with real purchases in our sales cube.
Have you found anything out yet? I’m seeing similar things and need to be able to explain these deviations. So far I’m only getting cryptic answers, or none at all. Thanks! Mel
Hey Mel. For background, we have an on-prem deployment. It ended up being an issue with our queued tracking. Below is the description from our cloud operations team; after the quote I’ve added a rough sketch of the per-queue processing approach.
"After about 3 hours of trial and error, we were able to get the queue processing working properly again. I ended up writing a script that will run the queue processing more explicitly, specifying the queue ID like they mentioned, but doing it for each of the queues within the redis cache. On prod we have 8 (1 per CPU core), and on dev we only have 1. What we were seeing was the number of requests not decreasing at all, any time we tried to run the singular queue processing with the crazy high request count, the lock was never released or would be lost, which would result in either the processed requests to be “rolled back”, aka added back into the queue, or there would be no requests processed at all.
The script I wrote takes into account the number of cores/queues and has retry logic if there are failures in processing. Over the last couple of hours we have eaten through 2GB of requests (we were at 7.51GB in the redis cache, now at 5.51GB). We are processing thousands of reqests per second based on the queuedtracking:monitor output. We were at 2.8 million when this version of the script first started out, now we are just below 2 million.
I am working on adding some extra monitoring to this, it looks like there is some basic alerting that can be set up when the queue size surpasses 250,000 requests, which I set up. I would like to create a cloudwatch alarm for this as well. I also am seeing that the old CloudWatch alarm that I had set up which is supposed to alert us when the Redis cache utilization exceeds 70% is not working/collecting data. I will fix that today as well."
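For anyone who wants to try something similar, here is a minimal sketch of what such a wrapper could look like, assuming the QueuedTracking plugin’s queuedtracking:process console command with a --queue-id option (as referenced in the quote). The console path, queue count, and retry settings are placeholders, not our actual script:

```python
#!/usr/bin/env python3
"""Rough sketch: process each QueuedTracking queue explicitly, with retries.

Assumptions (not from the original thread): the Matomo console lives at
/var/www/matomo/console, queuedtracking:process accepts a --queue-id option,
and queue IDs are 0-based. Adjust all of these for your own install.
"""
import subprocess
import sys
import time

MATOMO_CONSOLE = "/var/www/matomo/console"  # placeholder path
NUM_QUEUES = 8           # 1 per CPU core on prod; 1 on dev
MAX_RETRIES = 5          # retry a queue this many times before giving up
RETRY_DELAY_SECONDS = 30


def process_queue(queue_id: int) -> bool:
    """Run queuedtracking:process for a single queue; return True on success."""
    cmd = [
        "php", MATOMO_CONSOLE,
        "queuedtracking:process",
        f"--queue-id={queue_id}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        print(f"queue {queue_id} failed: {result.stderr.strip()}", file=sys.stderr)
        return False
    return True


def main() -> None:
    for queue_id in range(NUM_QUEUES):
        for attempt in range(1, MAX_RETRIES + 1):
            if process_queue(queue_id):
                print(f"queue {queue_id} processed (attempt {attempt})")
                break
            time.sleep(RETRY_DELAY_SECONDS)
        else:
            # Runs only if every attempt for this queue failed.
            print(f"queue {queue_id} still failing after {MAX_RETRIES} attempts",
                  file=sys.stderr)


if __name__ == "__main__":
    main()
```

The key point is simply that each queue is processed and retried independently, so one stuck or lost lock doesn’t stall the others.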
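And for the broken Redis utilization alarm mentioned at the end of the quote, here is a rough sketch of recreating it with boto3, assuming the queue lives in ElastiCache for Redis and that DatabaseMemoryUsagePercentage is the metric meant by “cache utilization”; the cluster ID and SNS topic ARN are placeholders:

```python
"""Rough sketch: CloudWatch alarm for Redis memory usage above 70%.

Assumptions: ElastiCache for Redis, DatabaseMemoryUsagePercentage as the
"utilization" metric, and placeholder cluster/SNS identifiers.
"""
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="matomo-redis-memory-above-70pct",
    Namespace="AWS/ElastiCache",
    MetricName="DatabaseMemoryUsagePercentage",
    Dimensions=[{"Name": "CacheClusterId", "Value": "matomo-queue-redis"}],  # placeholder
    Statistic="Average",
    Period=300,               # 5-minute datapoints
    EvaluationPeriods=3,      # alarm after 15 minutes above threshold
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",  # fire if the metric stops reporting
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```

Setting TreatMissingData to "breaching" makes the alarm fire when the metric stops reporting at all, which is essentially the failure mode described above: an alarm that quietly stops collecting data.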
After a couple of days, things returned to normal and we were capturing all sessions/transactions again.