Looking for suggestion to improve archiving performance

Rama_Gurram · October 27, 2023, 3:20pm

We have more than 100k sites and want to know what we can delete or modify in the database to speed up the archiving process, if that is even possible. We cleared the old “matomo_archive_blob” and “matomo_archive_numeric” archive tables to free some space, but they are recreated after the batch job runs. Any suggestions, please?

Matomo_version: “3.14.1”

Thanks

heurteph-ei · October 31, 2023, 7:56am

Hi @Rama_Gurram
As a first suggestion, I would advise upgrading to Matomo 4!

If old unused sites are present, you can also delete them.

Also, as expected, in your situation, the archiving process is long because you have many sites to process…

melbao · October 31, 2023, 8:14am

Maybe you need to experiment with the archiving interval. Little data → long interval. A lot of data → short interval. Very much data → very short interval. Watch the archiving log files and the server CPU load.

Rama_Gurram · November 2, 2023, 10:19am

Thanks for the suggestions, but we can’t upgrade Matomo to version 4 since we have so many sites in production… Is there any specific feature in v4 that controls the duration of archiving? We tried a couple of changes, such as cleaning up old reports and logs, but no luck. Are there any further suggestions to reduce the archiving duration with 185k+ sites?

melbao · November 2, 2023, 11:43am

“cleaning up old reports” … If the cleaned-up old reports are not invalidated, this has no effect on archiving, because those reports are not rebuilt by a new archiving run. The raw data, however, is still stored in the database and may be slowing it down. If the raw data is no longer needed, delete it.

What is your cronjob interval?
Consider: when there is too much data, the archiving does not get through all the raw data in one cron job run, and it becomes overwhelmed. It is better to choose an interval at which each cron job run can archive all new raw data.
With 185k+ sites, running the cron job @daily is too infrequent. Better @hourly, half-hourly, or every 5 minutes.
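
As a rough illustration of choosing such an interval, here is a minimal JavaScript sketch. It assumes you have noted the duration of a few recent archiving runs from the log by hand. The function name, the candidate intervals, and the 1.5x headroom factor are all hypothetical, not anything Matomo provides.

```javascript
// Hypothetical helper: pick the shortest cron interval that still leaves
// headroom above the longest observed archiving run.
// All numbers are illustrative; take real durations from your archiving log.
function suggestIntervalMinutes(
  runDurationsMinutes,
  candidates = [5, 15, 30, 60, 240, 1440],
  headroom = 1.5
) {
  if (runDurationsMinutes.length === 0) {
    throw new Error("need at least one observed run duration");
  }
  const longest = Math.max(...runDurationsMinutes);
  const needed = longest * headroom;
  const sorted = [...candidates].sort((a, b) => a - b);
  // First candidate that is long enough, or the largest one as a fallback.
  const fit = sorted.find((c) => c >= needed);
  return fit !== undefined ? fit : sorted[sorted.length - 1];
}

// Runs of 12, 18 and 25 minutes need at least 37.5 minutes of room.
console.log(suggestIntervalMinutes([12, 18, 25])); // 60
```

If even the largest candidate is shorter than a single run, the sketch just returns that candidate; in that situation the runs themselves need to get faster first.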

Delete old RAW data

heurteph-ei · November 2, 2023, 12:17pm

Hi @Rama_Gurram

Sorry I don’t understand the reason…

Not sure, but I think newer Matomo versions optimize archiving better…

Rama_Gurram · November 8, 2023, 3:08pm

Thanks for the suggestions. We tried running the job every 4 hours, which reduced the duration somewhat compared with before. But we would like to understand how we can make sure that only the last year of data is archived, and how to keep only the last year of site data in the DB instead of all the older years.

Our intention is to keep the last year of historical data in the DB and to archive only that last year of data. If anyone has thoughts on how to achieve this, that would be a great help!

Thanks

heurteph-ei · November 9, 2023, 4:49pm

Hi @Rama_Gurram
Under Privacy > Anonymize data, there are sections for periodically deleting raw and aggregated data:

Regularly delete old raw data
Delete old aggregated report data
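
To get a feel for what a one-year window means for the archive tables, here is a small JavaScript sketch that only lists table names for inspection. It assumes the monthly naming scheme `<prefix>archive_numeric_YYYY_MM` / `<prefix>archive_blob_YYYY_MM`. The function name and the sample table list are hypothetical, and this is not how Matomo's own purge setting decides what to delete.

```javascript
// Hypothetical helper: return the monthly archive tables whose month lies
// before the retention window. It only inspects names, it deletes nothing.
function tablesOutsideRetention(tableNames, today, keepMonths = 12) {
  // Oldest month to keep, counted as months since year 0.
  const cutoff = today.getUTCFullYear() * 12 + today.getUTCMonth() - keepMonths;
  const pattern = /archive_(?:numeric|blob)_(\d{4})_(\d{2})$/;
  return tableNames.filter((name) => {
    const m = pattern.exec(name);
    if (!m) return false; // not a monthly archive table, leave it alone
    const month = Number(m[1]) * 12 + (Number(m[2]) - 1);
    return month < cutoff;
  });
}

const sampleTables = [
  "matomo_archive_numeric_2021_05",
  "matomo_archive_blob_2022_10",
  "matomo_archive_numeric_2022_11",
  "matomo_archive_numeric_2023_11",
  "matomo_log_visit",
];
const asOf = new Date(Date.UTC(2023, 10, 10)); // 10 November 2023
console.log(tablesOutsideRetention(sampleTables, asOf));
```

With the sample date, everything before November 2022 falls outside the window, so the first two tables are listed and the rest are kept.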

Rama_Gurram · November 10, 2023, 1:25pm

Thanks @heurteph-ei, we verified that Regularly delete old raw data is enabled with Delete logs older than 30 days, but we haven’t enabled Delete old aggregated report data. So, would enabling Delete old aggregated report data speed up archiving? If so, can we know why?

melbao · November 10, 2023, 2:55pm

it’s weekend.

@Rama_Gurram your intention is not really clear to me.

“to speed up the archiving process”
“we tried recurring every 4hours interval in a day”
“Our intention is to keep last one year of historical data in DB and only last year worth of data can be archived.”
“how to keep only last one year sites data in the DB instead all old years.”
“Regularly delete old raw data is enabled with Delete logs older than 30 days”
“we haven’t enabled Delete old aggregated report data.”

Phew, that’s a lot of input.

The workflow:
Tracking: raw data saving → Archiving: report data saving → Backend: Display reports.
For “Live” reports:
Tracking: raw data saving → Backend: Display reports.

There are 2 kinds of data in the database:

Raw data
Report data

Matomo tracking saves only raw data in the database. This raw data is read directly only to display the Live reports. All other reports don’t use the raw data but the archived data. With the tracking setup (tracking code, Tag Manager, etc.) it’s possible to define which raw data is saved. The archiving process can also be set up to use only part of the raw data. There are 2 ways to create archived data from the raw data.

Furthermore, it’s possible to delete old raw data or specific raw data.
Furthermore, it’s possible to delete old report data and archive new report data.

Deleted raw data is gone forever, but deleting raw data does not delete any report data; the report data is still stored.

Deleted report data can be re-archived from existing raw data. No raw data, no re-archiving.

You delete old raw data regularly, but you don’t delete old reports. If you delete the old reports, they are gone forever, because the raw data has already been deleted. No re-archiving is possible.
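
The rule “no raw data, no re-archiving” can be written down as a tiny JavaScript model. This is only a sketch of the reasoning: the function name is hypothetical, and the 30 days are the “Delete logs older than 30 days” setting mentioned earlier in the thread.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical model: a report for `reportDate` can only be rebuilt if the
// raw data for that day has not been purged yet.
function canRearchive(reportDate, today, rawRetentionDays) {
  const ageDays = Math.floor((today.getTime() - reportDate.getTime()) / DAY_MS);
  return ageDays <= rawRetentionDays;
}

const today = new Date(Date.UTC(2023, 10, 10)); // 10 November 2023
// 21 days old: raw data still exists, so the report could be rebuilt.
console.log(canRearchive(new Date(Date.UTC(2023, 9, 20)), today, 30)); // true
// About 10 months old: raw data is already purged, a deleted report stays deleted.
console.log(canRearchive(new Date(Date.UTC(2023, 0, 15)), today, 30)); // false
```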

Performance:
From now on: define with the tracking code or Tag Manager which raw data is saved. Data that is not used in the reports does not need to be saved.

For already saved data:

Delete the detailed raw data which is not used in the archiving/reports.
Delete all reports.
Invalidate all reports.
Re-archive the reports.

Saving only selected raw data makes the database smaller. In my Matomo, browser plugin data is not saved as raw data, because it is not necessary for my reports. A workflow for deleting this data from the raw data:

There are many options for which data is saved as raw data.

_paq.push(['disableBrowserFeatureDetection']);

// or disable automatic link tracking by leaving this call commented out
// _paq.push(['enableLinkTracking']);
// and track specific links manually from an onclick handler instead
// (url is the address of the clicked link)
_paq.push(['trackLink', url, 'link']); // 'link' or 'download'

Do you use “Heatmap”? You can define which data is saved for that.

Your archiving cron job interval is maybe too long. In my Matomo, with far fewer websites, the interval is @hourly. Each cron job run takes a few minutes and does not have to archive very old raw data left over from the previous run. That is important for good archiving performance: no overlapping archiving cron jobs, except during a re-archiving.

The last point is database performance, which is outside Matomo’s influence.
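
To check for overlapping runs, a sketch like the following could be fed with start and end times copied by hand from the archiving log. The function name and the `{start, end}` shape (minutes on a common clock) are hypothetical.

```javascript
// Hypothetical helper: `runs` are {start, end} pairs in minutes on a common
// clock. Returns pairs of runs where a later run started before an earlier
// one had finished.
function findOverlappingRuns(runs) {
  const sorted = [...runs].sort((a, b) => a.start - b.start);
  const overlaps = [];
  let latest = null; // run with the latest end time seen so far
  for (const run of sorted) {
    if (latest !== null && run.start < latest.end) {
      overlaps.push([latest, run]);
    }
    if (latest === null || run.end > latest.end) latest = run;
  }
  return overlaps;
}

// Hourly runs; the second one takes 70 minutes and collides with the third.
const hourlyRuns = [
  { start: 0, end: 50 },
  { start: 60, end: 130 },
  { start: 120, end: 150 },
];
console.log(findOverlappingRuns(hourlyRuns).length); // 1
```

An empty result means every run finished before the next one started, which is the situation described above.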

Optimizing the MySQL Server

Databases do not store data in human-readable form but in bits and bytes. The larger the database, the slower it gets, because reading is expensive. Normally, databases are built to handle big data. The problem is reading and writing at the same time on big databases or big database tables. Because of that it’s better to use many tables, but it depends on the website architecture. When you have a lot of data for the “same” table, the table becomes very big.