Archiver process 'stuck' on Tasks.purgeInvalidatedArchives

Hi,

We have an issue with archiving where it’s locked up (but doing something, I think) on the purgeInvalidatedArchives task.
The last entry in the archive log (the archiver is even running at -vvv / DEBUG level) is:
INFO [2021-02-17 00:22:29] 108 Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...
It’s been here for about 12 hours (I’ve left it to see if it will finish).
CPU utilisation on the server (ECS managed EC2 instance) has been at 50% since it started this process (around 00:25 this morning). So it seems to be doing something.
As far as I can tell, there’s not a lot of database activity.

Is there any way to tell what this is doing? Is it worth leaving it a little longer?

PS - this used to fail here due to memory issues (it would abort with an error at this point) - I've recently increased the PHP memory_limit to address that.

Matomo version: 4.0.5

Here’s what some of the metrics look like…

In ECS:
[Screenshots: ECS service metrics, 2021-02-17 15:19]

And this is CPU on the EC2 instance.

Archiving here starts around midnight, hits purgeInvalidatedArchives at 00:25, and pins the CPU at 50% - it's still there.
:man_shrugging:

Some further info (dates & logs) that could give some clues (sorry for the long post)…

We started capturing decent logs in CloudWatch for this on 13th Jan. At that point, all looked good:

2021-01-13T00:25:14.295+00:00	INFO [2021-01-13 00:25:14] 108 Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...
2021-01-13T00:25:15.133+00:00	INFO [2021-01-13 00:25:15] 108 Found 14 invalidated archives safe to delete in matomo_archive_numeric_2021_01.
2021-01-13T00:25:15.223+00:00	DEBUG [2021-01-13 00:25:15] 108 Deleted 322 rows in matomo_archive_numeric_2021_01 and its associated blob table.

This ran fine (daily) until 27th Jan, when we got this:
(PS - we have since increased the PHP memory limit)

2021-01-27T00:24:35.522+00:00	INFO [2021-01-27 00:24:35] 108 Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...
2021-01-27T00:25:02.621+00:00	ERROR [2021-01-27 00:25:02] 108 Fatal error encountered: /var/www/html/libs/Zend/Db/Statement/Pdo.php(233): Allowed memory size of 134217728 bytes exhausted (tried to allocate 61440 bytes)
2021-01-27T00:25:02.621+00:00	on /var/www/html/libs/Zend/Db/Statement/Pdo.php(233)

…but ran OK again on 28th:

2021-01-28T00:13:53.018+00:00	INFO [2021-01-28 00:13:53] 108 Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...
2021-01-28T00:14:34.463+00:00	INFO [2021-01-28 00:14:34] 108 Found 4337 invalidated archives safe to delete in matomo_archive_numeric_2021_01.
2021-01-28T00:14:43.125+00:00	DEBUG [2021-01-28 00:14:43] 108 Deleted 35129 rows in matomo_archive_numeric_2021_01 and its associated blob table.

It continued fine until the end of the month (completed OK on Jan 31st).
Then, on 1st Feb, we got another…

2021-02-01T00:04:32.372+00:00	INFO [2021-02-01 00:04:32] 108 Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...
2021-02-01T00:05:51.254+00:00	ERROR [2021-02-01 00:05:51] 108 Fatal error encountered: /var/www/html/libs/Zend/Db/Statement/Pdo.php(233): Allowed memory size of 134217728 bytes exhausted (tried to allocate 131072 bytes)
2021-02-01T00:05:51.254+00:00	on /var/www/html/libs/Zend/Db/Statement/Pdo.php(233)

Then on 2nd:

2021-02-02T00:20:17.583+00:00	DEBUG [2021-02-02 00:20:17] 107 Executing tasks with priority 9:
2021-02-02T00:20:17.591+00:00	DEBUG [2021-02-02 00:20:17] 107 Task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives is scheduled to run again for 2021-02-03.
2021-02-02T00:20:17.591+00:00	INFO [2021-02-02 00:20:17] 107 Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...

(and at that point the archiver locked up entirely - no more errors logged, and the CPU running at 100%)


We’ve been digging into this more deeply this week and increased the PHP memory limit - but on the next run:

2021-02-05T17:22:12.452+00:00	DEBUG [2021-02-05 17:22:12] 107 Executing tasks with priority 9:
2021-02-05T17:22:12.458+00:00	DEBUG [2021-02-05 17:22:12] 107 Task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives is scheduled to run again for 2021-02-06.
2021-02-05T17:22:12.458+00:00	INFO [2021-02-05 17:22:12] 107 Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...

(locked-up again at this point)

It’s still there, having been “running” for 36 hours (I’d left it in the hope it might finish whatever it’s doing - but I’m not convinced it will get any further).


Any ideas what might fix this?

I’d been wondering whether ./console core:purge-old-archive-data all might help?

If anyone has any thoughts or advice, would be much appreciated.

Many thanks. :crossed_fingers:

Hi @Gareth, if it NEVER ran successfully for you, then that would mean your database is full of old, invalidated archives that were never purged. So it's possible it's just taking an incredibly long time to remove them. The task doesn't go through every table, though, so that would seem unlikely.

To diagnose whether it’s the amount of invalidated archives, you could run this query:

SELECT COUNT(*) FROM archive_numeric_2020_01 WHERE name LIKE 'done%' and value = 4;
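
If it's useful, a variation of the same query should also show the full spread of done flags in a table, not just the invalidated ones (adjust the table name to your prefix):

SELECT value, COUNT(*) FROM archive_numeric_2020_01 WHERE name LIKE 'done%' GROUP BY value;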

For the moment I would recommend stopping the core:archive run that's stalled, then trying to run ./console core:purge-old-archive-data on specific tables to see if any stall in the same way.

EDIT: I see there are some successful runs, but the count would pile up after failed ones. Still it’s strange that such a simple task would just stall.

Hi @diosmosis,

Thanks for the reply. Yeah, it's really useful to know what should be happening behind the scenes. It is strange that it's locked up… there doesn't seem to be any DB activity, so it looks like something may be looping - could something corrupt in the archive table cause that?

Anyway, I’ve run the query on a few of the most recent tables, this is what I see:

Table                     Total Rows      WHERE value = 4 and name LIKE 'done%'
archive_numeric_2020_10       164092          0 
archive_numeric_2020_11      3122599          0
archive_numeric_2020_12      8508696        309
archive_numeric_2021_01      5733325       3197
archive_numeric_2021_02        65457       2142

I guess that for a table where invalidated archives have been successfully purged, that count (3rd column above) should be 0?

I will take a look at that core:purge-old-archive-data command and see if we can clear the blockage with that. I just need to take care and ensure no successfully archived data is lost, that’s all.

I’ll reply with the outcomes once I’ve tried a few things out.

Thanks again for your help!

Hi @diosmosis,

A quick question for you re your previous note… you mentioned trying to run ./console core:purge-old-archive-data on specific tables to see if any stall in the same way.

I just wanted to check with you how best to do that (i.e. how to target specific tables)?

I've run console help core:purge-old-archive-data to get more detail, but I'm not clear how the dates argument should be specified, or what the wisest order would be to run the commands in.

The help is…

The months of the archive tables to purge data from. By default, only deletes from the current month. Use ‘all’ for all dates. (default: [“2021-02-19”])


So I was wondering, should I run with specific dates? e.g.

console core:purge-old-archive-data "2021-01-31" -vvv

Or is it valid here to set a month range? e.g.

console core:purge-old-archive-data "2021-01" -vvv

(is that even a valid value here?)

I was thinking that the all switch may not be the best option for diagnosing where the issue is (it might just hang like the cron archiver, and we'd still not be sure where the problem lies).


Also (last question): what would be the best order to run them in, whether picking dates or months? Is it better to execute them chronologically ("2020-12", "2021-01", then "2021-02") or in reverse?
Likewise, if a specific date is the best option, should I select dates chronologically, or start with the most recent and work back?

Sorry for all the questions - I just want to be sure I'm running the commands the best way, since I don't want to risk losing any data or any useful troubleshooting info.

Thanks (sorry for the long post - again).

This is the correct format (a full date such as "2021-01-31") - give it a date within the month you want to purge.

This instinct (that the all option isn't the best way to diagnose where the issue is) is probably correct.

Working backwards is better, since you're more likely to have more archives that need purging in more recent dates. The exception to this is the January tables, which also contain the year archives.

Hi @diosmosis,

Thanks for the detailed answers to those questions. I will try this today.

One thing I did notice (in case this gives some clues)… when the purgeInvalidatedArchives archiver task starts, a fairly intensive SQL query kicks off. I’ve looked into it, and it seems to be the query from the getInvalidatedArchiveIdsSafeToDelete function here.

This query (when I ran it manually) took about 2 minutes to run - and probably another minute for the results to finish scrolling in my terminal. Is it normal for this query to produce such large results? (I'm guessing possibly yes, due to the raised group_concat_max_len just before the query runs.)
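
For reference, the query is roughly of this shape - I'm paraphrasing from what I captured, so the exact column list and ordering may differ from the real Matomo source:

SET SESSION group_concat_max_len = 131072;

SELECT idsite, date1, date2, period, name,
       GROUP_CONCAT(idarchive, '.', value ORDER BY ts_archived DESC) AS archives
FROM matomo_archive_numeric_2021_01
WHERE name LIKE 'done%'
GROUP BY idsite, date1, date2, period, name;

Each row's archives column is a comma-separated list of idarchive.value pairs, which is where the very long result strings come from.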

Thanks

That query returns information for every archive that has duplicates. If the task has not run successfully in some time, and duplicate archives have not been purged, it would return a lot of data. Looking at the code, though, I can see a possible issue. Can you apply this patch and see if it fixes the issue for you (you will have to kill ongoing archiving runs): https://github.com/matomo-org/matomo/compare/4.x-dev...loop-break

Strangely, though, if this were the case I would have expected the command to run forever for you as well.

Hi @diosmosis,

Ah yes, I see what you mean. I will look into how we could introduce this here.

I'm not too familiar with PHP, but I've just been reading about the difference between break and continue (here), and I can see how the lack of a "." might cause it to loop around the same duplicateArchives[0] element indefinitely - if I've understood it correctly.

If this helps, I've run the query manually again (from a mysql client) and looked at the results with this loop in mind. I wonder if the issue could be caused by the archives column being truncated, and therefore missing a "." in the last / truncated value?

Interestingly, the length of the string is 1024 characters and looks something like this… (see edit / correction below re. “1024”)

2418370.5,2417545.5,2416748.5,.........,2127808.5,2126853.5,2123

(I’ve removed much of the middle section here)

Now, I'm not sure whether it's my client doing this truncation, but I guess it could lead to archive items without a ".". I guess the break would avoid the indefinite loop - but I just wanted to let you know what I'd spotted in case it's relevant.

Thanks again.


EDIT - ignore the note about 1024 characters - I realised I’d forgotten to add SET SESSION group_concat_max_len=131072; in my client - however even with this I still get truncation:
Previously:

206 rows in set, 192 warnings (3 min 5.46 sec)

And with the increased group_concat length:

206 rows in set, 30 warnings (1 min 32.90 sec)

The 30 warnings are all "Row X was cut by GROUP_CONCAT()".
The rows in question terminate mid-id, similar to the example earlier.
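
In case it helps anyone checking their own data, the individual warnings can be listed in the mysql client immediately after running the query:

SHOW WARNINGS;
-- includes any "Row X was cut by GROUP_CONCAT()" entries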

I hope this is useful.

Hi @diosmosis,

Thanks for the help with this. With the patch applied, things look to be running better.
I see this in the logs now…

GROUP_CONCAT cut off the query result, you may have to purge archives again.

…which I guess we should have seen before - but the loop was preventing the code from getting to that point. I'll keep an eye on it for the next couple of days - but :crossed_fingers: so far it's looking much better.

Thank you. :+1:

Hi @Gareth, glad the change seems to be working for you, and thanks for going through the trouble of verifying it! If anything else comes up let us know.

Will do. Thanks for finding a fix @diosmosis.
One thing I did notice is that we still get a number of "GROUP_CONCAT cut off the query result, you may have to purge archives again" messages each night (on the midnight run of "tasks with priority 9"). Maybe those will clear themselves after some time. I'll keep an eye on it.
Thanks again :+1: