Database old archive data strong increased

Matomo version: 5.0.3
MySQL version: 10.5.23-MariaDB-0+deb11u1
PHP version: 8.3.6

Cronjob: @hourly (all fine).

I had considered, that the database size are more increased as expected. Normaly monthly 30 MB increasing. Last month 130 MB increased.

So, yet i had checked the database.

I have a mysqldump from 2024-04-01 and from 2024-05-01.

The archived data since 11/2011 are strong increased in the mysqldump from 2024-05-01 compared to the mysqldump from 2024-04-01.
Database tables: archive_blob_[year]_[month]

mysqldump                   mysqldump 
2024-04-01  Rows    Size    2024-05-01  Rows    Size    Difference  Rows    Size    

2021_11 	32960 	11		2021_11 	33468 	18		2021_11 	508		7
2021_12 	27599 	9		2021_12 	28101 	13		2021_12 	502		4
2022_01 	31278 	13		2022_01 	31804 	19		2022_01 	526		6
2022_02 	25667 	10		2022_02 	26135 	15		2022_02 	468		5
2022_03 	29756 	10		2022_03 	30266 	16		2022_03 	510		6
2022_04 	26783 	9		2022_04 	27277 	15		2022_04 	494		6
2022_05 	36019 	11		2022_05 	38690 	17		2022_05 	2671	6
2022_06 	31780 	10		2022_06 	34526 	17		2022_06 	2746	7
2022_07 	31573 	10		2022_07 	34426 	17		2022_07 	2853	7
2022_08 	32442 	12		2022_08 	35474 	18		2022_08 	3032	6
2022_09 	28875 	9		2022_09 	31646 	14		2022_09 	2771	5
2022_10 	27343 	9		2022_10 	30220 	14		2022_10 	2877	5
2022_11 	26116 	8		2022_11 	28891 	12		2022_11 	2775	4
2022_12 	26878 	8		2022_12 	29759 	12		2022_12 	2881	4
2023_01 	33821 	14		2023_01 	36319 	19		2023_01 	2498	5
2023_02 	29050 	9		2023_02 	31267 	14		2023_02 	2217	5
2023_03 	33571 	11		2023_03 	36288 	17		2023_03 	2717	6
2023_04 	31652 	10		2023_04 	34247 	17		2023_04 	2595	7
2023_05 	33864 	12		2023_05 	36330 	18		2023_05 	2466	6
2023_06 	33141 	11		2023_06 	35415 	17		2023_06 	2274	6
2023_07 	39598 	12		2023_07 	42171 	18		2023_07 	2573	6
2023_08 	33237 	11		2023_08 	35462 	15		2023_08 	2225	4
2023_09 	31637 	10		2023_09 	33901 	14		2023_09 	2264	4
2023_10 	36138 	12		2023_10 	38570 	16		2023_10 	2432	4
2023_11 	34245 	12		2023_11 	36647 	16		2023_11 	2402	4
2023_12 	38267 	12		2023_12 	40775 	17		2023_12 	2508	5
2024_01 	42349 	16		2024_01 	44945 	21		2024_01 	2596	5
2024_02 	31791 	12		2024_02 	34450 	17		2024_02 	2659	5
2024_03 	31323 	12		2024_03 	35051 	18		2024_03 	3728	6

In the next step, i had checked an archive_* date range database table. The 2024-05-01mysqldump has more *_chunk_* rows compared to the 2024-04-01 mysqldump.
Example:

mysqldump 2024-04-01

2641 	Actions_actions 	2 	2021-11-01 	2021-11-30 	3 	2023-04-04 14:37:49 	[BLOB - 15.2 KiB]
2641 	Actions_actions_url 	2 	2021-11-01 	2021-11-30 	3 	2023-04-04 14:37:49 	[BLOB - 687 B]
2641 	Actions_actions_url_chunk_0_99 	2 	2021-11-01 	2021-11-30 	3 	2023-04-04 14:37:49 	[BLOB - 16.3 KiB]
2641 	Actions_downloads 	2 	2021-11-01 	2021-11-30 	3 	2023-04-04 14:37:49 	[BLOB - 14 B]

2642 	Actions_actions 	2 	2021-11-01 	2021-11-07 	2 	2023-04-04 14:37:45 	[BLOB - 8.9 KiB]
2642 	Actions_actions_url 	2 	2021-11-01 	2021-11-07 	2 	2023-04-04 14:37:45 	[BLOB - 501 B]
2642 	Actions_actions_url_chunk_0_99 	2 	2021-11-01 	2021-11-07 	2 	2023-04-04 14:37:45 	[BLOB - 7.2 KiB]
2642 	Actions_downloads 	2 	2021-11-01 	2021-11-07 	2 	2023-04-04 14:37:45 	[BLOB - 14 B]
mysqldump 2024-05-01

10398 	Actions_actions 	2 	2021-11-01 	2021-11-30 	3 	2024-04-16 10:15:21 	[BLOB - 211.8 KiB]
10398 	Actions_actions_url 	2 	2021-11-01 	2021-11-30 	3 	2024-04-16 10:15:21 	[BLOB - 859 B]
10398 	Actions_actions_url_chunk_0_99 	2 	2021-11-01 	2021-11-30 	3 	2024-04-16 10:15:21 	[BLOB - 81.4 KiB]
10398 	Actions_actions_url_chunk_100_199 	2 	2021-11-01 	2021-11-30 	3 	2024-04-16 10:15:21 	[BLOB - 64.9 KiB]
10398 	Actions_actions_url_chunk_200_299 	2 	2021-11-01 	2021-11-30 	3 	2024-04-16 10:15:21 	[BLOB - 52.3 KiB]
10398 	Actions_actions_url_chunk_300_399 	2 	2021-11-01 	2021-11-30 	3 	2024-04-16 10:15:21 	[BLOB - 22.2 KiB]
10398 	Actions_downloads 	2 	2021-11-01 	2021-11-30 	3 	2024-04-16 10:15:21 	[BLOB - 14 B]

10399 	Actions_actions 	2 	2021-11-01 	2021-11-07 	2 	2024-04-16 10:15:15 	[BLOB - 83.6 KiB]
10399 	Actions_actions_url 	2 	2021-11-01 	2021-11-07 	2 	2024-04-16 10:15:15 	[BLOB - 774 B]
10399 	Actions_actions_url_chunk_0_99 	2 	2021-11-01 	2021-11-07 	2 	2024-04-16 10:15:15 	[BLOB - 37.9 KiB]
10399 	Actions_actions_url_chunk_100_199 	2 	2021-11-01 	2021-11-07 	2 	2024-04-16 10:15:15 	[BLOB - 41.0 KiB]
10399 	Actions_actions_url_chunk_200_299 	2 	2021-11-01 	2021-11-07 	2 	2024-04-16 10:15:15 	[BLOB - 7.8 KiB]
10399 	Actions_downloads 	2 	2021-11-01 	2021-11-07 	2 	2024-04-16 10:15:15 	[BLOB - 14 B]

In the next step, i had checked this archive_* data binary Blobs with PHP.
Example:

$blob_content = file_get_contents(__DIR__ . "/20240501-Actions_actions_url_chunk_300_399.bin");
$blob_content = unserialize(gzuncompress ($blob_content));

var_dump($blob_content);

/*
php blob.php > 20240501-Actions_actions_url_chunk_300_399.txt
*/

The blobs from the last mysqldump (20240501) have URLs inside, that not included in the older mysqldump (2024-04-01).

The question is: How does additional tracking data get into old archived data?

News:

The matter is very confusing. The mysql dumps have every time in the last days different sizes. No compression is used. The dumps are *.sql. No deletion of data or reports or elsewhere is activated. Only a cronjob @hourly core:archive.

actually: 2024-04-01 - 1,654 MB

expected: 2024-05-01 - 1,684 MB (+ 30 MB / monthly in the last months)

actually: 2024-05-01 - 1,780 MB (~100 MB bigger as expected)

actually: 2024-05-05 - 1,800 MB

actually: 2024-05-06 - 1,738 MB

Then I run: php ./console core:purge-old-archive-data all

actually: 2024-05-06 - 1,738 MB (~50 MB bigger as expected)

It is really confusing.

When I have it correctly observed, the content of the database table archive_invalidations are changed (deleted) with cron:archive (with the cronjob) infrequently? The size of this table is relevant with up to 20 MB.

Every cron core:archive log file contains at the end always a SCHEDULED TASKS log. In few core:archive cron jobs mosty between 22 and 02 o’clock are few or many solved tasks are logged.

The last information says, that it must be wait after a solved manually task like core:invalidation and core:archive before making a new mysql dump? Because, there are few or many tasks are still unsolved?

Example: Cron log file SCHEDULED TASKS section:

INFO      [2024-05-06 00:25:13] 19477  ---------------------------
INFO      [2024-05-06 00:25:13] 19477  SCHEDULED TASKS
INFO      [2024-05-06 00:25:13] 19477  Starting Scheduled tasks... 
INFO      [2024-05-06 00:25:13] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.invalidateOutdatedArchives...
INFO      [2024-05-06 00:25:13] 19477  Browser triggered archiving disabled, archives will be invalidated during core:archive.
INFO      [2024-05-06 00:25:13] 19477  Scheduler: finished. Time elapsed: 0.000s
INFO      [2024-05-06 00:25:13] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.deleteOldFingerprintSalts...
INFO      [2024-05-06 00:25:13] 19477  Scheduler: finished. Time elapsed: 0.014s
INFO      [2024-05-06 00:25:13] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeOutdatedArchives...
INFO      [2024-05-06 00:25:13] 19477  Purging archives in 698 archive tables.
INFO      [2024-05-06 00:25:15] 19477  Scheduler: finished. Time elapsed: 1.642s
INFO      [2024-05-06 00:25:15] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeOrphanedArchives...
INFO      [2024-05-06 00:25:16] 19477  Deleted 645 rows in archive tables (numeric + blob) for deleted segments for 2021-11-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 636 rows in archive tables (numeric + blob) for deleted segments for 2021-12-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 659 rows in archive tables (numeric + blob) for deleted segments for 2022-01-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 586 rows in archive tables (numeric + blob) for deleted segments for 2022-02-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 640 rows in archive tables (numeric + blob) for deleted segments for 2022-03-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 621 rows in archive tables (numeric + blob) for deleted segments for 2022-04-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1370 rows in archive tables (numeric + blob) for deleted segments for 2022-05-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1298 rows in archive tables (numeric + blob) for deleted segments for 2022-06-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1355 rows in archive tables (numeric + blob) for deleted segments for 2022-07-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1357 rows in archive tables (numeric + blob) for deleted segments for 2022-08-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1269 rows in archive tables (numeric + blob) for deleted segments for 2022-09-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1358 rows in archive tables (numeric + blob) for deleted segments for 2022-10-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1278 rows in archive tables (numeric + blob) for deleted segments for 2022-11-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1299 rows in archive tables (numeric + blob) for deleted segments for 2022-12-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1349 rows in archive tables (numeric + blob) for deleted segments for 2023-01-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1219 rows in archive tables (numeric + blob) for deleted segments for 2023-02-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1322 rows in archive tables (numeric + blob) for deleted segments for 2023-03-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1342 rows in archive tables (numeric + blob) for deleted segments for 2023-04-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1376 rows in archive tables (numeric + blob) for deleted segments for 2023-05-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1271 rows in archive tables (numeric + blob) for deleted segments for 2023-06-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1345 rows in archive tables (numeric + blob) for deleted segments for 2023-07-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1310 rows in archive tables (numeric + blob) for deleted segments for 2023-08-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1254 rows in archive tables (numeric + blob) for deleted segments for 2023-09-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1339 rows in archive tables (numeric + blob) for deleted segments for 2023-10-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1298 rows in archive tables (numeric + blob) for deleted segments for 2023-11-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1327 rows in archive tables (numeric + blob) for deleted segments for 2023-12-15.
INFO      [2024-05-06 00:25:16] 19477  Deleted 1368 rows in archive tables (numeric + blob) for deleted segments for 2024-01-15.
INFO      [2024-05-06 00:25:17] 19477  Deleted 1234 rows in archive tables (numeric + blob) for deleted segments for 2024-02-15.
INFO      [2024-05-06 00:25:17] 19477  Deleted 1364 rows in archive tables (numeric + blob) for deleted segments for 2024-03-15.
INFO      [2024-05-06 00:25:17] 19477  Deleted 854 rows in archive tables (numeric + blob) for deleted segments for 2024-04-15.
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 1.916s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.updateSpammerList...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.029s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\Referrers\Tasks.updateSearchEngines...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.040s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\Referrers\Tasks.updateSocials...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.026s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\Login\Tasks.cleanupBruteForceLogs...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.001s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\TwoFactorAuth\Tasks.cleanupTwoFaCodesUsedRecently...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.002s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\UsersManager\Tasks.cleanupExpiredTokens...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.001s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\UsersManager\Tasks.setUserDefaultReportPreference...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.007s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\UsersManager\Tasks.cleanUpExpiredInvites...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.002s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.062s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidationsForDeletedSites...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.058s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\PrivacyManager\Tasks.deleteReportData...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.021s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\PrivacyManager\Tasks.deleteLogData...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.000s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\PrivacyManager\Tasks.anonymizePastData...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.002s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\PrivacyManager\Tasks.deleteLogDataForDeletedSites...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.007s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.cleanupTrackingFailures...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.001s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\CoreAdminHome\Tasks.notifyTrackingFailures...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.002s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\CoreUpdater\Tasks.sendNotificationIfUpdateAvailable...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.002s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\Marketplace\Tasks.clearAllCacheEntries...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.009s
INFO      [2024-05-06 00:25:17] 19477  Scheduler: executing task Piwik\Plugins\Marketplace\Tasks.sendNotificationIfUpdatesAvailable...
INFO      [2024-05-06 00:25:17] 19477  Scheduler: finished. Time elapsed: 0.002s
INFO      [2024-05-06 00:25:17] 19477  done
INFO      [2024-05-06 00:25:17] 19477  ---------------------------

The question about that is: Is it possible to start all SCHEDULED TASKS manually before making a new mysql dump?
EDIT: Yes, it is possible with

php ./console core:run-scheduled-tasks --force --no-ansi > run-scheduled-tasks-$(date +"%Y%m%d%H%M%S").txt

--force ignored the scheduled task intervall like hourly/daily/weekly/monthly.

I run few commands manually:

  • console core:archive --url=https://example.com/
  • console core:run-scheduled-tasks --force (Info)
  • console core:purge-old-archive-data all (Info)

After that, the database tables

  • segment = 12 rows, 16.0 KiB (all marked as “deleted”).
  • archive_invalidations = 0 rows.

The database mysql dump is yet 1,733 MB (furthermore 50 MB more as expected).

The TRUNCATE of the database table archive_invalidations is fine, because it was 20 MiB.

The furthermore storing of “deleted” segments is not so fine. But it is only 16 KiB.

I think the “Save” of a segment is the cause of the increasing of the database, because, in the month (April) between the monthly mysql dumps …

  • 2024-04-01 - 1,650 MB
  • 2024-05-07 - 1,733 MB (50 MB more as expected)

… the only what was done differently compared to the months before was a “Save” of a segment (and afterwards “deleting”. This Problem: Problem with Segments).

So, the question here is: It is possible or is it so, that a “Save” of a segment increase the database archive_* tables? And when yes, how is it possible to make it undo?

So, a long way …

After, I had found ./console core:purge-old-archive-data all (includes ./console database:optimize-archive-tables all ) and ./console core:run-scheduled-tasks --force, i tested little bit things, but no way to decrease the database to the expected size.

So, after that, I had TRUNCATE all archive (report) database tables archive_blob_****_** and archive_numeric_****_** via phpMyAdmin. And also (in my case) TRUNCATE the database tables archive_invalidations and segment and goal.

So, yet my database has only the log (raw) data and matomo settings saved.
= 938 MB

Next step: .console core:archive --url=https:matomo.example.com, but this creates reports (archive) only from the current day, week, month, year.

So, before run core:archive, it is a ./console core:invalidate-report-data --dates=1996-01-01,2024-05-12 --ignore-log-deletion-limit is needed.

Date range:
--dates=[first database archive table year_month + day 01],[current year_month_current-day]
In my case 1996-01-01 and 2024-05-12.
This will be create many many rows (entries) in the database table archive_invalidations.

So, yet start the .console core:archive --url=https:matomo.example.com.
Yet, it will be created reports (archive) for all dates with tracked data.

After that, a ./console core:run-scheduled-tasks --force cleaned up the database (TRUNCATE archive_invalidations).

So, yet, my database is decreased and lower as expected. :heart:
= 1,583 MB

Result: maybe bugs and others

It seems, that the invalidation and the purge (and the run-scheduled-tasks) commands don’t work correct (but without error).
I had observed, that the database after this commands are not up to date. Used the mysql (Maria DB, INNO DB) a cache?

For creating backups (mysqldump), it is needed to make before a run-scheduled-tasks --force and a purge all, because otherwise junk data are backuped.

1 Like

Hi @melbao
I suggest you create a bug report in the GirHub repo:

1 Like