Duplicate entries in archive tables

Hi,

I wanted to remove a high-traffic site from the database and followed this article to do so.

After I deleted the archive tables and rebuilt them with archive.sh, I noticed that two tables were far too large in comparison to the others. Please see Screenshot1 for the numbers.

Investigating this, I found that some rows with identical content (apart from the primary key and timestamp) exist multiple times in the monthly tables (please see Screenshot2). At first I thought the archive script mistakenly writes a new entry every time it runs, but the timestamps show that the duplicates arise within a single run of the script. It also seems to affect all types of reports, not only the action-related ones you can see in the attached screenshot.
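In case anyone wants to reproduce this on their own installation, a query along these lines should surface the duplicates. Just a sketch: it assumes the default piwik_ table prefix and the standard archive columns (idarchive and ts_archived being the ones that differ between the copies); the table name is only an example.

[code]
-- Rows that are identical apart from idarchive and ts_archived
-- (example table name; adjust prefix and month to the affected table)
SELECT idsite, date1, date2, period, name, COUNT(*) AS copies
FROM piwik_archive_numeric_2011_01
GROUP BY idsite, date1, date2, period, name, value
HAVING COUNT(*) > 1
ORDER BY copies DESC;
[/code]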

I first noticed this with version 1.3 but have since updated to 1.4. Deleting and rebuilding the archive tables after the update did not improve the situation.

Since I analyze many sites with Piwik, my guess is that the archive script in some cases does not reset its collected data when switching to the next website, so for every website, the data already written for the previously archived websites gets written again. For example: after archiving websites 1 and 2, archiving website 3 also writes the collected data for sites 1 and 2 again. This might also be the reason for the memory issues some have mentioned.
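If that guess is correct, sites archived early in the run should accumulate noticeably more rows than sites archived late, since their data would be re-written for every subsequent site. A rough way to check, assuming the sites are processed in idsite order and using the same example table as above:

[code]
-- Under the above guess, row counts should shrink towards higher idsite
SELECT idsite, COUNT(*) AS total_rows
FROM piwik_archive_numeric_2011_01
GROUP BY idsite
ORDER BY idsite;
[/code]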

Strange, though, that it does not affect the data of all months.

Thanks,
Ruediger

This should get cleaned up once a day or so… but to minimize this number, you can increase the timeout in “General Settings”, for example to 1000 seconds; Piwik will then process reports only once per 1000 seconds.

[quote=matt]
This should get cleaned up once a day or so… but to minimize this number, you can increase the timeout in “General Settings”, for example to 1000 seconds; Piwik will then process reports only once per 1000 seconds.[/quote]

Hi Matt,

I archive via cronjob, so I think setting this value is irrelevant for me. But you’re right, the table sizes have become smaller over the weekend. There are still a lot of duplicate entries in them, but maybe it just takes many script runs to eliminate them completely.
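To keep an eye on whether the cleanup is making progress, I count the redundant archive copies per site and period. This assumes every finished archive writes a completion flag row whose name starts with ‘done’, which is how I read the archive tables; same example table name as before:

[code]
-- More than one 'done…' flag per site/period means redundant archive
-- copies that the daily purge should eventually remove
SELECT idsite, date1, date2, period, COUNT(*) AS archive_copies
FROM piwik_archive_numeric_2011_01
WHERE name LIKE 'done%'
GROUP BY idsite, date1, date2, period
HAVING COUNT(*) > 1;
[/code]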

Thanks,
Ruediger

/closed for me