How to archive only recent data


#1

Hi all,

I’ve been wading through Piwik optimization for the past week as part of an internship for a company that creates and tracks sites, among other things. Piwik has been running on a CentOS instance on an Azure server for a while, tracking about 13,000 sites. The mySQL database is currently 57GB.

Until I came, site-triggered archiving had been enabled in addition to a once per hour cron job. The archive job has (I’m told) been steadily slowing down from about 8 minutes to 4+ hours. Now, when I run core:archive (outside of the crontab), It takes >24, and by my estimation will take about 10 days to complete.

When I try to archive only the past 24 hours of data, my log looks as follows:


INFO CoreConsole[2015-06-18 20:41:13] - Archiving was last executed without error 11 days 16 hours ago
INFO CoreConsole[2015-06-18 20:41:14] - Will pre-process 2 Segments for each website and each period: provider!@google, provider!@google;actions>1
INFO CoreConsole[2015-06-18 20:41:14] - Will only process the following periods: day (--force-periods)
INFO CoreConsole[2015-06-18 20:41:14] - Will process 99 websites with new visits since 1 days 0 hours , IDs: [...]


INFO CoreConsole[2015-06-18 20:41:14] - Will process 164 other websites because some old data reports have been invalidated (eg. using the Log Import script) , IDs: [...]


INFO CoreConsole[2015-06-18 20:41:14] - Will process 12955 other websites because the last time they were archived was on a different day (in the website's timezone) , IDs: [...]

99 + 164 sites to archive is no problem at all, but I’d rather not archive the 12,955 sites that haven’t had a hit in the past day. Is there a way to run core:archive without archiving sites that where last archived on a different day?

Thanks oodles!
Jack

Edit: a bit more info about our setup

Currently there is little to none of the high traffic optimizations implemented. We will, in the near future, update our version of PHP and run SQL’s OPTIMIZE TABLES. Both will require downtime, so we’ve got them planned for tonight. We are not using APC at this time, but if there are tangible performance benefits we certainly can.

Rather strangely, Piwik (the only thing running on this Azure instance) takes very little CPU and memory - CPU usage has peaked at ~11%, and memory for the core:archive command sits around 300M, even though 512M is allowed. We suspect there may be a memory leak somewhere, as top gives us ‘Mem: 28735772k total, 28157108k used, 578664k free, 291768k buffers’.


#2

Bump? Does everyone running Piwik archive every site every day, even when there are no views? I thought the --force-all-periods=84600 was meant to stop that from happening.


(Matthieu Aubry) #3

Does everyone running Piwik archive every site every day, even when there are no views?

Yes, this is how Piwik works (at the moment)

Is there a way to run core:archive without archiving sites that where last archived on a different day?

Not yet possible, but feel free to create a feature request: Issues · matomo-org/piwik · GitHub


#4

Will do, here it is! Adding the ability to only archive sites that were viewed within a time period · Issue #8263 · matomo-org/piwik · GitHub

Thanks Matt!