Losing data daily when archive.php runs with cron


#1

After using Piwik successfully for about a year, we decided to have cron manage the archiving of reports instead of triggering it on user visits. We’ve got around 100 sites with fairly low traffic, although a number may reach 100 hits a day: small potatoes in comparison to some of the posters on this forum. We did this prior to the 1.6 upgrade and it seemed to be quite successful.

We run PHP 5.3.x via Apache 2/mod_fcgid on a Debian server and it works really well. We made some tweaks to ensure that the archive process ran successfully (details in this gist) by removing limits on execution time in php.ini and in the httpd vhost configuration.

After a few days we noticed that February’s data was set at 0 for 25% of our sites. I manually ran archive.php and it was still 0 until I ran it with the --force-all-websites flag. This repeated the following day, and so on, for the last 7 days. We have since upgraded to 1.7 in the hope that it would resolve the issue, but 'twas of little or no avail.

Does anyone have any idea? I have since started logrotating so that I can try and spot any errors in the logs but haven’t found any evidence of errors.


(Matthieu Aubry) #2

Please run the archive.php with --force-all-websites --force-all-periods=8640000
does it archive the missing months?


#3

[quote=matt]
Please run the archive.php with --force-all-websites --force-all-periods=8640000
does it archive the missing months?[/quote]

Just --force-all-websites is enough. At the moment, I’m manually running it with --force-all-websites each morning. And each morning afterwards, the data disappears and I start again with --force-all-websites.

I’ve indulged you and done it with the additional --force-all-periods and it, too, has recovered the February data. Unfortunately, I’ll have to wait until tomorrow to see if it has permanently rectified the issue.


(Matthieu Aubry) #4

Please update to 1.7.1 and try to remove all parameters. Is the problem still there?


#5

I updated to 1.7.1 and left it overnight and I just arrived this morning to find no improvement.

I can stick some logs online if it’s helpful although I’m struggling to see the problem myself.


(Matthieu Aubry) #6

Actually, I have found a possible bug. Can you please try the following patch:


Index: misc/cron/archive.php
===================================================================
--- misc/cron/archive.php	(revision 5901)
+++ misc/cron/archive.php	(working copy)
@@ -278,7 +278,7 @@
 		    
 		    $visitsAllDays = array_sum($response);
 		    if($visitsAllDays == 0
-				&& $shouldArchivePeriods
+				&& !$shouldArchivePeriods
 				&& $this->shouldArchiveAllWebsites
 			)
 		    {

This might fix your problem. If not, please let me know and post the output of archive.php here after applying the patch. Thx
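
To make the one-line change easier to reason about, here is a minimal Python sketch of just the condition shown in the diff, before and after the patch. The function name `enters_branch` and the `patched` switch are made up for illustration and are not part of archive.php; the body of the guarded branch is not shown in the diff, so the sketch only reports whether the branch would be entered.

```python
def enters_branch(visits_all_days, should_archive_periods,
                  should_archive_all_websites, patched=True):
    """Evaluate the `if` condition from the diff (hypothetical helper).

    patched=False mirrors the removed line (`&& $shouldArchivePeriods`),
    patched=True mirrors the added line (`&& !$shouldArchivePeriods`).
    """
    if patched:
        periods_term = not should_archive_periods
    else:
        periods_term = should_archive_periods
    return bool(visits_all_days == 0
                and periods_term
                and should_archive_all_websites)


if __name__ == "__main__":
    # Compare both versions for a site with zero visits in the queried range.
    for periods in (True, False):
        before = enters_branch(0, periods, True, patched=False)
        after = enters_branch(0, periods, True, patched=True)
        print(f"shouldArchivePeriods={periods}: before={before}, after={after}")
```

For a site with zero visits in the queried range, the patch flips which value of `$shouldArchivePeriods` leads into the branch; a site with any visits never enters it in either version.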


#7

This has improved matters significantly. From 50% of our sites to 10%. I’ll wait for the last cron job of the day before I uncork the champagne and rebuild the last 10%. Thanks for your help so far!


(Matthieu Aubry) #8

Good to hear. I created a ticket to track progress: "archive.php does not archive weeks/month/year in some cases on low traffic websites" (Issue #2984, matomo-org/matomo on GitHub).

But it should fix the problem in all cases. So if you still see websites with no data tomorrow (at least 24 hours after applying the patch), please let me know, as we want to fix it in all cases.


#9

Bad news. By the time I arrived this morning, ~50% of the sites had reverted to 0 again.


(Matthieu Aubry) #10

Is your cron running hourly?
In General Settings, change the timeout from 3600 (if you set that) to 2000 and wait for the next cron run.


#11

We run it every three hours but with a timeout of 3600 (1 hour). I can still change it to 2000 if you feel it will help?


(Matthieu Aubry) #12

Yes, and run it every hour too; it might help.


#13

I set it to run every hour and set the refresh rate to 2000 and we are starting to see an increase in sites with 0 data.

I suspect that sites revert to 0 data overnight because their traffic drops to 0 between cron jobs.


(Matthieu Aubry) #14

Perhaps, but archive.php should also force archiving of all websites at least once a day: when the day finishes in a website’s timezone, it should trigger full archiving for that website. Please keep reporting if you see bugs; we will have to investigate further to make sure it’s fixed!


#15

After it has run overnight on an hourly schedule, we now see even more sites (73%) with 0 data.


(Matthieu Aubry) #16

OK, if you don’t mind, let’s try this:

  • In the cron job, append the output to a file, such as php …/archive.php --url=http://… >> /home/test/output-archive-runs.log
  • Then wait 24 hours
  • Then PM me or email me at matt att piwik the full log file containing at least 24 hourly runs
  • Also email me the list of website IDs that do not display the weekly/monthly/yearly data etc.
  • Also email a list of timezones (are all websites set to the same timezone in the Piwik website settings?)

I will try to understand the problem / bug.
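
The logging step in the list above could also be done with a small wrapper that stamps each run, which makes it easy to confirm the file really contains 24 hourly runs. This is only a sketch: `append_run_output` and the header format are invented for illustration, and the command to run is passed in by the caller rather than assumed here.

```python
import subprocess
import sys
from datetime import datetime, timezone


def append_run_output(command, log_path):
    """Run `command` (a list of arguments) and append its output to `log_path`.

    Each run is preceded by a header line with the UTC start time and the
    exit code, so individual runs can be told apart in the log.
    Hypothetical helper, not part of Piwik.
    """
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    # Capture stdout and stderr together, like `>> file 2>&1` would.
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"=== run started {started}, exit code {result.returncode} ===\n")
        log.write(result.stdout)
        if not result.stdout.endswith("\n"):
            log.write("\n")
    return result.returncode
```

It would be called from the cron entry in place of the `>>` redirect, with the archive.php command line passed as the `command` list. Counting the header lines afterwards shows how many runs the log actually covers.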

#17

Just a heads up. I PM’d you a link to the logs along with the IDs. You probably already know this :wink:


#18

Have you managed to resolve this issue yet, Matt?

We still have plenty of sites displaying 0 data for the month despite having logged plenty of hits. We are having to force all sites on a daily basis right now, which isn’t really a problem as I could just cron it anyway, but I’d much prefer a less aggressive measure.


(Matthieu Aubry) #19

The bug is in my queue, but it’s a very long queue right now. I will take a look in the next 2 weeks for sure. Stay tuned; it’s good that you have a workaround in the meantime.


#20

I experienced the same bug lately, and even if I can’t add any significant information to what was said above, it seems that only the current year, month, week, and day are not processed properly.

Which workaround would you recommend?