Multi-threaded archiving/reporting when tracking thousands of sites

I was hoping someone could point me in the right direction. We are currently hosting tens of thousands of web sites and tracking about 20,000 of them on a dual-server Piwik system. We have Piwik 1.8.4 installed; the Piwik server runs RHEL 5.5 32-bit (Apache, PHP) and the MySQL server runs RHEL 6.2 64-bit.

Our hits per site tend to be fairly low (10 to 1,000 per day), but we have a dilemma with archiving/reporting. Due to the overhead of archive.php and the PHP engine in general, we are looking for a more efficient way to perform archiving/reporting. Since the current archive.php is a single-threaded, linear process, it is not geared toward thousands of sites, nor can it maintain the archiving/reporting rate we need. We need multiple batches of archiving jobs running concurrently to complete our archiving/reporting in a timely fashion.

One option was to reverse engineer the logic of archive.php and implement it as a SQL stored procedure. Another option would be to enhance archive.php to accept a parameter that groups the archiving into batches using the site GROUP field. We would then run concurrent archive.php instances, each with a unique set of sites (by GROUP), so that we are, in essence, manually multi-threading the process.
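That second option could be sketched as follows. This is only a dry run that prints the commands we would launch; the --group flag is our hypothetical patch to archive.php, not a stock option, and the paths are from our install:

```shell
#!/bin/sh
# Dry-run sketch: print one archive.php invocation per batch.
# --group is a hypothetical flag we would add to archive.php.
# In a real run, append '&' and redirect output to a per-batch log.
print_batch_commands() {
  n=$1
  g=0
  while [ "$g" -lt "$n" ]; do
    echo "/usr/bin/php /var/www/html/analytics/misc/cron/archive.php --url=http://your-piwik/ --group=$g"
    g=$((g + 1))
  done
}

print_batch_commands 4
```

Each printed command would then be launched in the background, giving one worker per batch with no overlap between their site lists.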

Could someone provide the SQL logic so archiving/reporting can be converted to a SQL routine/stored procedure, eliminating PHP entirely?

Are there any other ideas pertaining to multi-threaded archive/report generation?

thanks

[quote=Ken]Due to the overhead of archive.php and the PHP engine in general, we are looking for a more efficient way to perform archiving/reporting.[/quote]

Thanks for your message, it’s very interesting to hear about your use case of 20,000 sites. Definitely, making Piwik work well in this case is very important for us!

Can you please describe what the problem is with archive.php? Does it take too long to run when set up as a cron job: http://piwik.org/setup-auto-archiving/

Does it take 3h, 10h, or more to run?

[quote=Ken]Since the current archive.php is a single-threaded, linear process, it is not geared toward thousands of sites, nor can it maintain the archiving/reporting rate we need. […] We need multiple batches of archiving jobs running concurrently to complete our archiving/reporting in a timely fashion.[/quote]

When I originally developed the archive.php script, I actually tested that it works when the script is run multiple times at once: each running instance of archive.php should archive different websites. Could you try that and see if it helps? It seems like the right thing to do (rewriting the script in SQL is not possible).

We have been collecting tracking information for 8+ months without running archiving, so we have over 50 million records (piwik_log_visit) and a lot of data that needs to be reconciled and archived. Unfortunately, we did not understand the need for regular archiving until recently. Over the last 8 months we were not using the UI for our customers; we were only collecting the data. Now we are trying to turn on archiving so we have a usable customer UI that is quick and responsive.

The initial archive job is taking between 5 and 10 days to complete (average processing is about 15-45 seconds per site × 20,000 sites). However, it tends to crash for one reason or another and requires babysitting and restarting. We are not using cron; we are running it directly on the command line and piping results to a log file, since it is so problematic. We estimate that subsequent daily jobs will take 18-20 hours to complete, based on our average processing rate and the number of web site visits per day. This is not a sustainable rate.

We have thrown more and more hardware (CPU, RAM) at this process. We have experimented with tuning MySQL. At this point our MySQL is basically running entirely in RAM and is not suffering much from paging or disk IO. All the tuning has made some improvements, but the process is still too slow, simply because there are so many sites and it is a linear, single-threaded process. PHP seems to be the bottleneck, due to the heavy transactions between PHP and MySQL.

It appears that both the PHP server and the MySQL server have more CPU, RAM, and disk IO resources available, which is why we are trying to determine ways to multi-thread this. Ideally, we are looking for a database-side-only process, which would eliminate the PHP half of the equation, although PHP could still be used if it were efficient enough to handle the volume of sites we have.

So I am open to ideas or information.

thanks

  1. Try launching archive.php multiple times in parallel: the script should work, and will archive several websites concurrently. This should take less than 20,000 × 30s for sure!
    Please report back on this, as it’s a really interesting use case. For example, does it work faster when running 2 at once? 3? 4? 8? :wink:

  2. Also, please make sure you read and apply all the advice in: Optimize and Scale Piwik - Analytics Platform - Matomo

Ok, ready to help you further!

Sorry Matt, but running parallel archive.php instances simply does not work correctly, nor does it achieve our goal of going faster. What ends up happening is that archive.php archives the same sites repeatedly in each thread. This was the very first thing we tried and experimented with when we realized that we needed more threads running.

What happens when you launch multiple archive.php instances is that you end up with multiple archive jobs all working on the same sites multiple times. This results in a hit on CPU/RAM/disk IO performance but no real gain in effectiveness. Again, if the underlying PHP contained the logic to break the archive job into groups (batches), or had the ability to skip sites currently being archived by another thread, it would work well; however, our testing has proved otherwise. I could send you the logs showing this if you like.
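For illustration, the "skip sites another thread is working on" behavior could even live outside of PHP, as a per-site lock around each archive command. This is only a sketch: mkdir is used as a portable atomic lock, an echo stands in for the real archive call, and the site IDs and lock paths are made up:

```shell
#!/bin/sh
# Sketch: each worker claims a per-site lock before archiving.
# mkdir is atomic, so two workers can never claim the same site at once.
archive_site() {
  id=$1
  lockdir="/tmp/piwik-archive-site-$id.lock"
  if mkdir "$lockdir" 2>/dev/null; then
    echo "archiving site $id"
    # ...run archive.php for this one site here...
    rmdir "$lockdir"
  else
    echo "site $id already claimed by another worker"
  fi
}

for id in 1 2 3; do
  archive_site "$id"
done
```

Two workers running this same loop would naturally interleave, each skipping whatever site the other has already claimed.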

Our current plan of attack is to modify and enhance archive.php to allow an additional input parameter utilizing the new GROUP field. We would then call multiple archive.php threads, each with a different parameter for its unique GROUP. This way each thread works only on its own group (or batch) of sites, which should eliminate the overlap.
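Assuming GROUP is just an integer 0..N-1, each thread's site list could be derived as simply as site ID modulo N. A sketch (in practice we would read the real IDs, or the GROUP column, from the piwik_site table; the IDs 1..12 below are placeholders):

```shell
#!/bin/sh
# Sketch: worker GROUP g out of NUM workers handles sites where id % NUM == g.
# Site IDs 1..12 stand in for the real list from the piwik_site table.
site_ids_for_group() {
  group=$1
  num=$2
  for id in 1 2 3 4 5 6 7 8 9 10 11 12; do
    if [ $((id % num)) -eq "$group" ]; then
      echo "$id"
    fi
  done
}

site_ids_for_group 2 4   # prints 2, 6, 10 (one per line)
```

Every site ID lands in exactly one group, so four workers using groups 0-3 would cover all sites with no overlap.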

…All this would be a non-issue if there were a stored SQL procedure we could leverage to perform direct SQL manipulation on the back end. We are evaluating the possibility of engineering such a stored procedure, but I am hoping someone has already done it. We are also concerned that if we build our own stored procedure, it may have issues with future Piwik development and changes.

No, sorry, there is no existing stored procedure for that. I guess it’s no surprise, as most of the archiving work is done in PHP :).
Also, it would probably be a PITA if the code changed and you needed to update the stored procedure.

Nonetheless, I would be very interested if you went ahead and provided a procedure, as I see similar issues with 6,000 sites, even though archiving only runs for about 3-4 hours a day for now.

[quote=Ken]
Sorry Matt but running parallel archive.php simply does not work correctly nor does it achieve our goal of going faster. What ends up happening is that the archive.php archives the same sites repeatedly in each thread. This was the very first thing we tried and experimented with when we realized that we need more threads running.[/quote]

Thanks for the report, that’s a bug. I fixed it in: http://dev.piwik.org/trac/changeset/7042

please replace your archive.php with this file: http://dev.piwik.org/svn/trunk/misc/cron/archive.php

Does it now work to run several in parallel? Does it improve performance?

Looks like the process starts, then throws this error.

Fatal error: Call to undefined method Piwik_Http::getUserAgent() in /var/www/html/analytics/misc/cron/archive.php on line 469

Sorry, a few things are changing in SVN, and changing only one file wasn’t enough.

Try also replacing this file: http://dev.piwik.org/svn/trunk/core/Http.php

If it doesn’t work, let me know and I will make a new build that includes the fix.

[My mistake, disregard below. I forgot I had switched the file back to the original… I am retesting. I am leaving the info below so you can see how it was working.]

Note these are the two commands I am running: simply running archive.php twice, but piping to two different log files.

[~]# /usr/bin/php /var/www/html/analytics/misc/cron/archive.php --url=http://ip removed/ > /var/log/piwik-archive-1.log &
[~]# /usr/bin/php /var/www/html/analytics/misc/cron/archive.php --url=http://ip removed/ > /var/log/piwik-archive-2.log &

[i]JOB #1
[2012-09-25 19:20:40] [15236f8c] START
[2012-09-25 19:20:40] [15236f8c] Starting Piwik reports archiving…
[2012-09-25 19:21:10] [15236f8c] Archived website id = 1, period = day, Time elapsed: 29.915s
[2012-09-25 19:21:57] [15236f8c] Archived website id = 1, period = week, 0 visits, Time elapsed: 46.379s
[2012-09-25 19:22:27] [15236f8c] Archived website id = 1, period = month, 0 visits, Time elapsed: 30.469s
[2012-09-25 19:22:48] [15236f8c] Archived website id = 1, period = year, 54 visits, Time elapsed: 21.174s
[2012-09-25 19:22:49] [15236f8c] Archived website id = 1, today = 0 visits, 4 API requests, Time elapsed: 128.489s [1/19760 done]
[2012-09-25 19:23:20] [15236f8c] Archived website id = 2, period = day, Time elapsed: 31.471s
[2012-09-25 19:23:31] [15236f8c] Archived website id = 2, period = week, 0 visits, Time elapsed: 10.659s
[2012-09-25 19:23:34] [15236f8c] Archived website id = 2, period = month, 0 visits, Time elapsed: 3.704s
[2012-09-25 19:23:37] [15236f8c] Archived website id = 2, period = year, 20 visits, Time elapsed: 2.058s
[2012-09-25 19:23:37] [15236f8c] Archived website id = 2, today = 0 visits, 4 API requests, Time elapsed: 47.896s [2/19760 done]
[2012-09-25 19:23:38] [15236f8c] Archived website id = 3, period = day, Time elapsed: 1.904s
[2012-09-25 19:23:52] [15236f8c] Archived website id = 3, period = week, 376 visits, Time elapsed: 13.855s
[2012-09-25 19:24:47] [15236f8c] Archived website id = 3, period = month, 15278 visits, Time elapsed: 55.118s

JOB#2
[2012-09-25 19:20:40] [41e0db5a] START
[2012-09-25 19:20:40] [41e0db5a] Starting Piwik reports archiving…
[2012-09-25 19:21:19] [41e0db5a] Archived website id = 1, period = day, Time elapsed: 39.122s
[2012-09-25 19:21:57] [41e0db5a] Archived website id = 1, period = week, 0 visits, Time elapsed: 37.178s
[2012-09-25 19:22:27] [41e0db5a] Archived website id = 1, period = month, 0 visits, Time elapsed: 30.678s
[2012-09-25 19:22:54] [41e0db5a] Archived website id = 1, period = year, 54 visits, Time elapsed: 26.574s
[2012-09-25 19:22:54] [41e0db5a] Archived website id = 1, today = 0 visits, 4 API requests, Time elapsed: 134.094s [1/19760 done]
[2012-09-25 19:23:21] [41e0db5a] Archived website id = 2, period = day, Time elapsed: 26.625s
[2012-09-25 19:23:23] [41e0db5a] Archived website id = 2, period = week, 0 visits, Time elapsed: 2.039s
[2012-09-25 19:23:34] [41e0db5a] Archived website id = 2, period = month, 0 visits, Time elapsed: 11.365s
[2012-09-25 19:23:37] [41e0db5a] Archived website id = 2, period = year, 20 visits, Time elapsed: 2.609s
[2012-09-25 19:23:37] [41e0db5a] Archived website id = 2, today = 0 visits, 4 API requests, Time elapsed: 42.645s [2/19760 done]
[2012-09-25 19:23:38] [41e0db5a] Archived website id = 3, period = day, Time elapsed: 1.539s
[2012-09-25 19:23:52] [41e0db5a] Archived website id = 3, period = week, 376 visits, Time elapsed: 13.860s
[2012-09-25 19:24:47] [41e0db5a] Archived website id = 3, period = month, 15278 visits, Time elapsed: 55.137s[/i]

Can you try starting them at a 1-minute interval, just to see if that changes anything?

It works on my box (running SVN trunk), so it’s just a matter of finding out why it doesn’t work on yours; it definitely should!

Sorry to pull up an old topic, but I am running Piwik 1.10.1 and I have this problem. I’ve tried 2 Windows boxes, 2 Linux boxes, and all 4 at once using a shared MySQL backend, and unfortunately they all run the same reports. The only time a job will skip already-completed reports is if those reports were finished before I started the other archive.php process.
For example, if job 1 starts and completes sites with IDs 1-10 and I then start another archive.php, it will skip IDs 1-10, but while it is skipping, job 1 will have gotten to ID 15. Job 2 will then also run IDs 10-15, even though job 1 has already completed them. It is as if it is pulling stale or cached data from when the job started, instead of the latest data from MySQL. Our Piwik install has about 2,000 websites, and archiving currently takes about 72-96 hours. From what I can see, the bottleneck is that the job only runs on 1 CPU instead of using the other 31 processors available on each system. MySQL is barely working, but the PHP processes are usually maxing out one full core of the system. If I can get this process under 24 hours, so that we can put it in a cron job, that would be amazing!

Thanks in advance for any assistance!

There might be a bug. We could fix it, but please contact pro services for paid support for this one, as it’s non-trivial: http://piwik.org/consulting/