Looking for suggestion to improve archiving performance

melbao · November 10, 2023, 2:55pm

it’s weekend.

@Rama_Gurram your intention is not really clear for me.

“to speed up the archiving process”
“we tried recurring every 4hours interval in a day”
“Our intention is to keep last one year of historical data in DB and only last year worth of data can be archived.”
“how to keep only last one year sites data in the DB instead all old years.”
" Regularly delete old raw data is enabled with Delete logs older than 30 days"
" we haven’t enabled Delete old aggregated report data."

puh , thats many input.

The workflow:
Tracking: raw data saving → Archiving: report data saving → Backend: Display reports.
By “Live” reports:
Tracking: raw data saving → Backend: Display reports.

There are 2 Datadumps in the database:

Raw data
Report data

The Matomo Tracking saved only raw data in the database. This raw data is only used to display the Live reports. All other reports don’t use the raw data, but archived data. With the tracking action (Tracking code, Tag Manager, etc.) it’s possible to define what raw data are be saved. By the archiving process it’s possible to use only few of the raw data. There are 2 possibilitys to save archived data from the raw data.

Furthermore it’s possible to delete old raw data or specific raw data.
Furthermore it’s possible to delete old reports data and archive new report data.

Deleted raw data are forever deleted, but with a deleting of raw data no report data will be deleted — the report data are still saved.

Deleted report data can be re-archived from existing raw data. No raw data, no re-archiving.

You delete old raw data regularly, but don’t delete old reports. When you delete the old reports, this are forever deleted, because the raw data are deleted. No re-archiving is possible.

Performance:
From now: Define with the tracking code or Tag manager wich raw data are be saved. Not used data in the reports must not be saved.

For allready saved data:

Delete the detailed raw data wich is not used in the archiving/reports.
Delete all reports.
Invalidate all reports.
Re-archive the reports.

The defined savig of raw data makes the database smaller. In my matomo are browser plug-ins data not saved as raw data, because this is not neccessary in my reports. A workflow for delete this data in the raw data:

There are many possibilities what data are be saved as raw data.

_paq.push(['disableBrowserFeatureDetection']);

// or disable all link tracking
// _paq.push(['enableLinkTracking']);
// and use an onclick event for this on specific links
_paq.push(['trackLink', url, 'link']); // 'link' or 'download'

You use “Heatmap”? You can define wich data are saving for that.
https://developer.matomo.org/guides/heatmap-session-recording/reference#disableautodetectnewpageview

Your archiving cronjob interval is maybe to big. In my matomo with much less websites there is a interval with @hourly. Every cronjob runs few minutes and don’t archive very old raw data was leftover of the previous cronjob archiving. That is important for a good archiving performance: no overlapping archiving cronjobs, expect by a re-archiving.

The last point is the database performance outside of matomo influence.

Optimizing the MySQL Server

https://dev.mysql.com/doc/refman/8.0/en/optimizing-server.html

Databases save the data in non human readable data, but in bits and bytes. The greater the database, the slower the database, because the reading is expensive. Normaly, databases are constructed to handle big data. Problematic are read and write at the same time by big databases, respectivelly big database tables. Because of that it’s better to use many tables, but it depends on the website architecture. When you have very much data for a “same” table, than will be the table very big.