Report archives are huge since update to 2.10


#1

Hi.

We’re a public school district and have been using Piwik on our website for the past three years. We usually get around 750,000+ visits per month and the archive blob tables are usually around 30~40MB and the archive numberic tables are usually around 100KB.

Since we’ve upgraded to 2.10 the table sizes have expanded by a huge margin. In January, we had 730,000 visits, but the archive blob table grew to 517MB and the archive numeric table 528KB.

We’re only two days into February and both tables are already 57MB and 432KB respectively.

The blob tables usually have around 60,000 rows, but for January they were over 1,000,000 rows!

What could be causing this? We haven’t changed any setting.

Thanks


#2

After only one more day, the February blob table is now above 143MB with 243,000 rows. At this rate it’ll be over 1GB by the end of the month.

Did anything change in the code from 2.9 to 2.10 that could have caused the archives to grow in size like that? Any pointers on how I can track the source of that increase?

Thanks


(Matthieu Napoli) #3

Thanks for reporting that. I don’t have the answer myself but I’ve asked to the team, that seems like a huge difference though…


#4

Hi @skyhawk669, the increase you’re experiencing is definitely not normal. I can think of the following possible reasons for it:

  1. Piwik isn’t correctly deleting old archives.

To check if this is the issue, you can run the following query on an archive table:

SELECT idsite, date1, date2, period, name, COUNT(*) as count FROM piwik_archive_numeric_… GROUP BY idsite, date1, date2, period, name HAVING count > 1;

If the result isn’t an empty set and there are a lot of rows, it means there are lots of duplicate archives that aren’t being deleted for some reason.

  1. Your users are creating lots and lots of segments and they are set to be pre-processed (ie, created when the cron archiving job is run).

You can check for this by looking at the number of segments users have created in the UI and checking whether they are set to be processed in real-time or not (in the segment editor dialog).

  1. Your website’s URL structure has changed and now there are many, many more URL path segments (ie, http://mysite.com/1/2/3/4/5/6/7/8/9/10/11/12/…). This would increase the size of certain reports unless some config options are set. This is probably not the issue though since it would only affect the blob tables.

Let me know if this helps you.


#5
  1. I have run the query and I have attached a couple images with the results.

  2. There are no segments being used.

  3. The URL structure hasn’t changed.

The only thing that changed was the upgrade from 2.9 to 2.10

Thanks


#6

I don’t see a period column in your query results, did you select and group by period?


#7

My mistake, here’s the results of the correct query.


#8

Ok the problem’s definitely duplicate archives that aren’t being deleted. Can you run one more query?

SELECT ts_archived FROM piwik_archive_numeric_… WHERE idsite = 1 AND date1 = 2015-02-01 AND date2 = 2015-02-28 AND period = 3 AND name = ‘Actions_nb_downloads’;


#9

That query doesn’t return anything…


#10

Sorry, the dates weren’t quoted. Try this query:

SELECT ts_archived FROM piwik_archive_numeric_… WHERE idsite = 1 AND date1 = ‘2015-02-01’ AND date2 = ‘2015-02-28’ AND period = 3 AND name = ‘Actions_nb_downloads’;

Please select from the same table you selected from in the last query.


#11

Here’s the result. I ran that on the same table as above.


#12

Ok, thanks! I’m not sure why old archive data isn’t being deleted, but my guess, though, is that you are being hit by this bug: Scheduled email reports not sending in 2.9.1 · Issue #6868 · matomo-org/piwik · GitHub or something similar where scheduled tasks aren’t being run.

You can try to run scheduled tasks manually by running “php /path/to/piwik/console core:run-scheduled-tasks” or “php /path/to/piwik/console core:run-scheduled-tasks --force” on the command line. If this is the cause of your problem, running this command will delete the extra data from your archive tables, but it’s not a permanent fix.

I will get back to you w/ further info.


#13

@capedfuzz

The “–force” command seemed to purge the files a bit, they’re smaller but still bigger than usual.

Thanks for you help.


#14

Hello,
I appear to be having the same problem. I tried running the “SELECT idsite, date1, date2, period, name, COUNT(*) as count FROM piwik_archive_numeric_2015_01 GROUP BY idsite, date1, date2, period, name HAVING count > 1;” query and got over a hundred rows with ‘count’ higher than 2, at least 6 with ‘count’ of 42. Here’s a screenshot of part of that, sorted by highest ‘count’:
[attachment 2141 Highdbcount.png]

We upgraded our 1.7.1 Piwik to 2.8.3 in the first week of January 2015, and then to 2.10.0 this past weekend. The visits we’re tracking have been consistent, but the piwik_archive_blob_2015_01 and piwik_archive_blob_2015_02 (which has barely started) are already triple the size of December 2014 (1,594,160 Rows for January vs 517,402 Rows for December). The piwik_numeric_2015_01 and 02 are similarly oversized (516,465 Rows for January vs 214,879 Rows for December 2014). We’re using MySQL 5.6

I tried running the “php C:/inetpub/[our piwik folder]/console core:run-scheduled-tasks --force” command as recommended above, but didn’t see any effect at all on the database size. I’ve attached a screenshot of the command results. Of particuarly concern is about 115 lines of “Purging temporary archives: skipped”. What might be causing the skipped purging? Is there something I can do directly through MySQL in the meantime (the db is now about 9Gb/50% larger than it should be).
[attachment 2139 ForcedScheduledTasks.png]

Also, are these [Deletelogs] configurations relevant?

delete_logs_enable = 1
delete_logs_schedule_lowest_interval = 1
delete_logs_older_than = 120
delete_max_rows_per_run = 500

Any advice would be appreciated, Thank you.


(Matthieu Aubry) #15

Hi there,

Can you confirm that you have setup the cron that looks like


./console core:archive [...]

If you have setup this cron command and you still don’t see the data being purged after a few days, maybe you could create an issue on our tracker? Issues · matomo-org/piwik · GitHub and link to this forum post


#16

Matt,

Yes, I have this line in cron. I’ve had it all along. This started happening after the upgrade to 2.10. I will submit on the tracker.

Thanks


#17

Hello Matt,

Yes the /console core:archive is set up as a scheduled Windows Task that I’ve run manually from command line and verified is running successfully as a Scheduled Task on an hourly basis. Reports are being generated just fine. The Piwik also has root access to the MySQL database. Manually running the archive doesn’t help the cleanup, in fact it usually grows the db even faster.

Edit: This may or may not be related to the core:run-scheduled-tasks --force recommendation above.
When I run php “C:\inetpub[our piwik folder]\console” core:archive --accept-invalid-ssl-certificate --url=https://outsite.com , the final few results show:
"
INFO CoreConsole[2015-02-11 19:15:38] SCHEDULED TASKS
INFO CoreConsole[2015-02-11 19:15:38] Starting Scheduled tasks…
INFO CoreConsole[2015-02-11 19:15:38] No task to run
INFO CoreConsole[2015-02-11 19:15:38] done
"
Is this supposed to happen?

In the meantime, is there any way I can clean some of it up directly from MySQL?
Thank you,
Patrick


#18

Here’s a few more screenshots to show the ballooned DB size. We were only at 16gb db size prior to the upgrade from 1.7.1 to 2.8.3, after which the db grew to 22gb, now after upgrading to 2.10.0 we’re around 25gb despite tracking about the same volume of visits. The log_link_visit_action and log_visit files appear to be rather big, though I don’t have records of how big they were prior to the upgrade.
[attachment 2149 DBFiles.png]
[attachment 2148 DatabaseUsage.png]

I also used the SELECT idsite, date1, date2, period, name, COUNT(*) as count FROM piwik_archive_numeric_2014_12 GROUP BY idsite, date1, date2, period, name HAVING count > 1; MySQL query to compare pre-upgrade and post-upgrade archive_numeric files. Pre-upgrade, the highest “count” wouldn’t exceed 4 or 5, but post-upgrade, the highest “count” were multiple 40’s
[attachment 2150 MonthComparison.png]

Our system is
Windows Server 2008 R2
IIS Webserver
PHP 5.6
MySQL 5.6

Hope this information can be helpful to figuring out what’s going on.


(Matthieu Aubry) #19

Hi Pats,

can you try upgrade to the latest Piwik beta? I would like to test early beta and RC releases, how do I enable automatic updates to use these development versions? - Analytics Platform - Matomo

if after 2 days you still see the huge DB size, then please create an issue on the tracker: Issues · matomo-org/piwik · GitHub

Question: have you created some Email reports, and do the reports send correctly as expected? (this would confirm the scheduled tasks work fine)


#20

Matt, Pats

I already have created an issue: Report archives have tripled in size since update to 2.10 · Issue #7181 · matomo-org/piwik · GitHub