Run Matomo cronjob manually

Hello, I have started the Matomo cronjob manually in the SSH console, saving the output to a log file:

php /path/to/matomo/console core:archive --url=https://matomo.example.com/ > /path/to/logfiles/matomo-archive.log

It's the first time a “cronjob” has run in my Matomo installation.

The log file contains entries like this:

e[32mINFO [2023-04-19 11:53:29] 55728 e[39m Archived website id 2, period = day, date = 2020-12-11, segment = '', 47 visits found. Time elapsed: 0.507s
e[32mINFO [2023-04-19 11:53:29] 55728 e[39m Archived website id 2, period = day, date = 2020-12-10, segment = '', 65 visits found. Time elapsed: 0.507s
e[32mINFO [2023-04-19 11:53:30] 55728 e[39m Archived website id 2, period = week, date = 2020-12-07, segment = '', 470 visits found. Time elapsed: 0.506s
e[32mINFO [2023-04-19 11:53:30] 55728 e[39m Archived website id 2, period = day, date = 2020-12-05, segment = '', 77 visits found. Time elapsed: 0.506s
e[32mINFO [2023-04-19 11:53:30] 55728 e[39m Archived website id 2, period = day, date = 2020-12-04, segment = '', 50 visits found. Time elapsed: 0.506s
e[32mINFO [2023-04-19 11:53:30] 55728 e[39m Archived website id 2, period = month, date = 2020-12-01, segment = '', 2264 visits found. Time elapsed: 0.278s
e[32mINFO [2023-04-19 11:53:33] 55728 e[39m Archived website id 2, period = week, date = 2020-11-30, segment = '', 557 visits found. Time elapsed: 2.241s
e[32mINFO [2023-04-19 11:53:41] 55728 e[39m Archived website id 2, period = year, date = 2020-01-01, segment = '', 27898 visits found. Time elapsed: 8.520s

and many, many more with old dates.
The log file is over 2 MB …
Does this mean that old data had not been archived until now?

Hey melbao,
there can be many reasons why Matomo processes an older date. Someone or something could have invalidated the archives for that date, so they are reprocessed. Or the cronjob wasn’t run properly at all. Can you confirm your cronjobs are executed? Depending on your hoster or OS, there are additional hurdles to clear when setting up cronjobs.

As for your log output: it gets neater if you specify --no-ansi when you want to store it in a file; this suppresses control sequences like the e[32m you see at the beginning of every line.
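For example, based on the command above (the paths are the placeholders from the original post, adjust them to your installation):

php /path/to/matomo/console core:archive --url=https://matomo.example.com/ --no-ansi > /path/to/logfiles/matomo-archive.log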

hi @dbx12, my Matomo installation has been running for 9 years. A cronjob was never activated. On top of that, for the last few months it ran with only database read/write permissions, without CREATE. One manual core:archive run did not finish the whole job; five runs were necessary. After that, the log files are minimal and jobs for old dates are rare. Now I have activated an hourly cronjob, and the jobs only cover the current periods (day / week / month / year). I have also deactivated browser-triggered archiving, and the backend is much faster now. A lot of old data had simply never been archived.
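(For reference, a sketch of how browser-triggered archiving can be deactivated in config/config.ini.php, assuming the standard setting name; it can also be switched off in the general settings UI:)

[General]
; disable browser-triggered archiving so only the cronjob archives reports
enable_browser_archiving_triggering = 0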

Running the cronjob manually in the console is practical for very small Matomo databases, where an occasional run is enough. Even for medium-sized ones, an hourly cronjob should be set up. If too much work accumulates, a run takes a very long time and may not finish in one go. For many people this is probably the cause of the “Oops… there was a problem during the request.” error. A few manual core:archive runs in the console may solve that problem, but browser-triggered archiving will still make every page view wait, and that can end in a timeout.
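A sketch of such an hourly crontab entry (for crontab -e; the PHP binary path and file paths are assumptions, adjust them to your setup):

# run Matomo archiving once per hour, 5 minutes past the hour
5 * * * * /usr/bin/php /path/to/matomo/console core:archive --url=https://matomo.example.com/ --no-ansi > /path/to/logfiles/matomo-archive.log 2>&1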

Thanks for the hint about --no-ansi. With the real cronjob there are no problems in the log files.

Hmm, if it has been running for 9 years, then it is unlikely the cronjob wasn’t running at all. But I see the same “weird” behavior of one core:archive invocation not doing “everything”.
For the Matomo instance I supervise, we run hourly cronjobs, but even they have very different run times. Some need over an hour to finish, others are done after only 10 minutes. So this further supports your / our assumption of “one invocation not doing everything”. I guess Matomo does some prioritization and selects just the “most urgent” tasks; running the command repeatedly cleans out that queue, so even the not-so-important tasks get their chance.

If that is the situation with your hourly cronjob, shorten the interval, maybe to every 5 minutes. For me, with an hourly interval, every cronjob takes nearly the same time and the log files have nearly the same size. Only after an invalidation are the first 5 cronjobs very different.
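For example, assuming the crontab sketch from above, only the schedule field changes:

# run Matomo archiving every 5 minutes
*/5 * * * * /usr/bin/php /path/to/matomo/console core:archive --url=https://matomo.example.com/ --no-ansi > /path/to/logfiles/matomo-archive.log 2>&1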

It’s not that easy, since cronjobs come with an overhead for us (AWS hosted, cronjobs run on larger machines with more power). So increasing the rate at which cronjobs are started would increase cost as well, and finance does not like that :smiley: I think for now we can deal with the occasional hour-long cronjob.

I am not sure if you understood what I wanted to tell you.

A few days ago I invalidated all data in my Matomo installation/database. After that I started an hourly core:archive cronjob, with these log files:

#1   2 hours 15 minutes   2,300 KB
#2   1 hour 15 minutes    1,300 KB
#3   15 minutes           300 KB
#4   45 minutes           800 KB
#5   10 seconds           15 KB
#6   1 minute             20 KB
#7   1 minute             20 KB
#8   1 minute             20 KB
#9   1 minute             20 KB
#10  1 minute             20 KB
#11  1 minute             20 KB
#12  1 minute             20 KB
... all further runs ~1 minute / 20 KB for the last 3 days

During the first 4 runs, old data was archived. After that, only new data.
Example for one website (a newer log snippet):

INFO [2023-04-23 07:24:53] 16710  Start processing archives for site 123.
INFO [2023-04-23 07:24:53] 16710    Will invalidate archived reports for today in site ID = 123's timezone (2023-04-23 00:00:00).
INFO [2023-04-23 07:24:53] 16710    Will invalidate archived reports for yesterday in site ID = 123's timezone (2023-04-22 00:00:00).
INFO [2023-04-23 07:24:54] 16710  Archived website id 123, period = day, date = 2023-04-23, segment = '', 12 visits found. Time elapsed: 1.012s
INFO [2023-04-23 07:24:55] 16710  Archived website id 123, period = week, date = 2023-04-17, segment = '', 123 visits found. Time elapsed: 0.984s
INFO [2023-04-23 07:24:56] 16710  Archived website id 123, period = month, date = 2023-04-01, segment = '', 1234 visits found. Time elapsed: 1.133s
INFO [2023-04-23 07:24:58] 16710  Archived website id 123, period = year, date = 2023-01-01, segment = '', 12345 visits found. Time elapsed: 1.846s
INFO [2023-04-23 07:24:58] 16710  Finished archiving for site 123, 4 API requests, Time elapsed: 5.225s [123 / 234 done]

So, when your cronjobs run for more than a few minutes, they too have to process/archive old data each time, because the previous cronjobs couldn’t finish it. That means every cronjob stays expensive. If you shorten the interval, there is a chance that at some point all old data will have been processed/archived and only new data remains. That makes the cronjobs faster and less CPU-heavy. Yes, you have more cronjobs, but each one is faster and less CPU-heavy.