Pageviews erroneously duplicated after import with import_logs.py


#1

Greetings,

In Piwik 2.3, I have my logs set to import once a day with import_logs.py, followed by archiving, as a cron job. Although the unique visits aren’t too high (in the high hundreds per day) and although the archiving is not being performed at the time of viewing the visitor log pages in Piwik, sometimes it can take several minutes for any visitor log page to load for a single day, or trying to load a visitor log page for a single day can even time out in the “oops…” etc error.

When these slow or failed loading times are happening, a concomitant symptom is that the page views in the log are duplicated dozens or hundreds of times in the visitor log display. I’m attaching a screenshot showing this so you can see the numbers in question.

Is this expected? This server is a few years old but it has plenty of RAM and HD space and it is performing all of its other web service related work very well. I’ve lengthened the mysql timeout which has helped with the fact that this situation always used to end with a mysql timeout, but that wasn’t the underlying issue causing the slow loads/no loads and heavily duplicated pageviews.

Piwik is working great for me in every other respect so it would be great to get some help figuring out what is going on with the slowness and duplication, since it makes it not possible for me to use Piwik right now when it’s hard to load pages at all and then they have wrong pageview numbers on them. I don’t see any error output in the Piwik logs or in my archive.log that the archive call outputs to, but perhaps I’m looking in the wrong place.

Thank you for your help! Here is my script that is called by cron to do the log import and archive:


#!/bin/bash
#
/path_to_public_html/piwik/misc/log-analytics/import_logs.py \
--url=http://www.site.com/piwik \
--enable-reverse-dns \
--idsite=1 \
--exclude-path="/wp-login.php?action=register" \
--exclude-path="/wp-login.php" \
--exclude-path="/wp-admin/admin-ajax.php" \
--exclude-path="/robots.txt" \
--exclude-path="/wp-cron.php" \
--exclude-path="/ascript1.php" \
--exclude-path="/ascript2.php" \
--exclude-path="/ascript3.php" \
--exclude-path="/anrssfeedurl/" \
--exclude-path="/anotherrssfeedurl/" \
--exclude-path="/piwik/*" \
--exclude-path="/ascript4.php" \
--exclude-path="/somestuffnoonecaresabout/*" \
--exclude-path="/someotherstuffnoonecaresabout/*" \
--exclude-path="/wp-admin/*" \
/var/log/httpd/httpdlog-$(date -d 'yesterday' "+%Y%m%d").gz

php /pathtopiwik/console core:archive \
--url=http://www.site.com/piwik/ \
> /var/log/piwik/piwik-archive.log


(Matthieu Aubry) #2

Greetings

What this 123X icons means, is that, the visitor in this row viewed that same web page, 123 times in a row. So Piwik will select very many rows from the DB to display you this 123X (since it selects at least these 123 rows). This can mean that to display visitor log it may take selecting thousands of rows, which is slow on your server.

the question is, why do you have such repeats in the visitor ? is it normal behavior that one person views many same pages, or is it maybe a bug do you think?


#3

Hi Matt,

Thanks for your response. I’m asking about it as a presumed bug since no one is reloading a single page 123 times on a visit or even 4 times under usual circumstances for this site. It’s some kind of error during the import I believe, or possible at the time that the visitor log page is loading in the Piwik front end.

I moved this server to a new host last week, so previously it was on shared hosting and the Piwik installation doing log import never showed a repeat of more than 3x for the couple weeks or so it was running before the move. The reloads and revisits were as you’d expect, ranging from 1x to sometimes 3x for something the visitor spent a lot of time with and/or came back to a few times. Now on this new machine (no longer shared) as you can see it shows many different visitors viewing a page anywhere between 4 to 65 to 123 times. Some visits also lack the duplicates and some have lower values which are still a little suspicious (like 12x and 16x and 44x).

Another hint that it’s a bug is that the specific unrealistically-high numbers 65 and 64 are repeated so frequently across different users’ visits. I’ve also looked at the raw log file and the repetitions aren’t found in it.

This is a CentOS 6.5 machine, 64-bit AMD Athlon, a bit old, but as mentioned doing everything else web-service-related normally. While trying to troubleshoot this a few times I’ve done fresh installs of Piwik and it’s also my impression that sometimes that issue with the duplicated revisits doesn’t happen on the same logfiles as input.

I can turn on any logging in the import script or elsewhere that could help to solve this. Thanks again!


#4

Oh, also wanted to add - this issue appears whether archiving occurs on page load or whether it is done automatically by the cron-invoked script after the import completes.


(Matthieu Aubry) #5

Can you post a full screenshot of visitor log ? for example i’m wondering if Ip addresses are looking correct, or are they all set to same ip address in visitor log ?


#6

Sorry, I can’t, because among other reasons it won’t let me load a visitor log that isn’t heavily filtered due to “oops” errors I mentioned above, but also because it’s sensitive data.

But I can tell you that the IPs are definitely correct and different visitors aren’t being conflated. For the filtered sets that I am able to view, I can see the same unique IP addresses in the original logs and in the eventual Piwik output, in both cases correctly listed as different visitors. The difference is that in the original log it shows that IP address requesting two pages once each, and in Piwik it shows them requesting two pages 64 times each.


#7

Today the same issue recurred, but there were also repeated marked-as-separate-visitor visits from the same IP occurring in the same second, usually four or eight of them, along with the usual repetitions of between 4x and 123x. Additionally, the stats for yesterday were removed so yesterday’s stats now show zero visits, while they showed about a thousand yesterday.


#8

Hello,

Here is a redacted screenshot showing a small custom visitor filter and its results. For the IP address ending in 59 there is really only a single visit of a single page, but in the Piwik stats it is split into three different visits comprising a total of 41 different pageviews all occurring in the same second with the same IP, so it doesn’t look like 39 different pageviews that got somehow concatenated, but instead one pageview by one visitor that was unintentionally duplicated into 41 pageviews viewed by 3 unique visitors, which looks like it happened during the import process.

Here is the only reference to this IP address and its page visit in the access log that was imported (the visit comprises two lines because there was a GET of the page and a POST to the Wordpress heartbeat function, although the POST isn’t the Piwik stats because my log_import.py arguments (visible above) request that the wordpress ajax calls are not imported into the Piwik stats:

XXX.XXX.XXX.59 - - [03/Jun/2014:20:32:12 +0200] “GET /PATHTOVISITEDPAGE/ HTTP/1.0” 200 15857 “GOOGLESEARCHTHATLEDTOPAGE” "VISITORSUSERAGENT"
XXX.XXX.XXX.59 - - [03/Jun/2014:20:32:15 +0200] “POST /wp-admin/admin-ajax.php HTTP/1.0” 200 589 “/PATHTOVISITEDPAGE/” “VISITORSUSERAGENT”

The log was a gzip in case that is interesting. Something else I’ve noticed about these imports is that they are slow on this machine compared to others I’ve seen do this import without all the issues. Part of that can be attributed to the fact that the machine is pretty old and has a single core, but it has a lot of RAM and the uncompressed logs are usually only 1-2 MB and it can take 20 minutes or more to import one. I moved to the GeoIP PECL method in order to see if that helped with the speed but it didn’t make a difference.

I’m happy to leave the troubleshooting of why Piwik deleted the previous day’s visit stats when it imported the current day’s visit stats until after the duplicate pageview thing is solved. Thank you for your help.


(Matthieu Aubry) #9

can you do reverse lookup on the IP address, what domain name / internet provider is revealed?


#10

I can, but it doesn’t seem relevant to this issue since it occurs with the majority of my visitors. I can tell you that the IP address corresponds to a host which is almost guaranteed not to have anything technically inept about its setup and it isn’t a bot. Why do you ask, do you think it is possible that somewhere in the GeoIP lookup the number of visits is being multiplied in the database?

I turned on verbose logging for Piwik and here are some things that are evident from the log:

Archiving is happening automatically many times a day, although I really only need my one archiving pass right after the log import because there is only a single import per day and no other data comes in. It’s unexpected to me that when I turned off the option to archive on HTML page load in favor of running my own archiving job once a day that archiving would still be taking place as an independent cron job, but it is because the verbose archiving log shows archiving starting at 5:19am (a classic “daily” cron job random time) but my archiving job runs only once at 7am.

I would like to be able to turn off all cron-triggered auto-archiving since I believe it is confusing things, since the cron-triggered archiving gets mysql errors and then restarts and it is also apparently deleting my previous days’ data if this message is correct:


INFO CoreConsole[2014-06-05 05:23:26] - Will process 1 other websites because some old data reports have been invalidated (eg. using the Log Import script) , IDs: 1

Here is an example of an error which repeats in many different archiving processes:


INFO Referrers[2014-06-04 14:52:54] LOAD DATA INFILE failed... Error was: SQLSTATE[28000]: Invalid authorization specification: 1045 Access denied for user 'piwik'@'localhost' (using password: YES)[28000]

Since Piwik can connect to the database under many other circumstances this error is unexpected. Another error from Piwik’s own cron-triggered archiving:


ERROR CoreConsole[2014-06-05 05:25:57] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

I had given the MySQL server a really long timeout due to other Piwik timeouts, so I have no idea why this small number of visitors of about 1000 is overwhelming the timeout during archiving. I’m suspicious that the difference between the original logs and the results in Piwik could be related to the fact that multiple archiving passes are run over the course of a day and that they don’t complete without MySQL errors.

Today when I first checked how my log import job had done, the data for the 2nd was still deleted, there was visitor log data for both the 3rd and the 4th, but in the graph on the upper left of the dashboard it showed zero visits for the 4th. It still wasn’t possible to ever load the default visitor log view without a timeout because so many visits had 70x or 150x visits and it is too much to show.