Piwik 1.7.2-rc8 - log import slow - poor performance

Hi, I’ve recently discovered Piwik and was trying to use it for a high-traffic site (~4 million pages/mo). I did some tests with regular Javascript tracking and all seemed good, and Piwik works fine (congrats!). However, what I really want to do is use this via the log import, since we also want to track file downloads and other information, and real-time info is not a priority, since we already rely on Analytics. It isn’t going well :frowning:

  1. I installed the latest RC on a dev box (1.7.2-rc8), and went to town. I fed it a daily log file from the production site, with 3.1 million lines, from/to local webserver. I’m using a virtual machine with 4 Xeon cores on a host with SSDs and 4GB of RAM. The import is, at lack of better words, horribly slow. I tried pumping more RAM into MySQL via increasing table_cache to 512 and key_buffer_size to 768M, but still no joy. Best I could get was 111 requests/second, and that was with 4 recorders. From the looks of top, Piwik itself takes ages to process a single request.

  2. After importing nearly half a million requests (and then hitting Ctrl+C), I went to see how it Piwik itself behaved. The leftmost widgets took ages to load, and said 18k visits and 260k pageviews (which fits with the above partial import). However, any widgets other than those (and the blog) simply don’t load.

So… any suggestions?

  1. try increasing recorders which might help (but probably not)
    Yes it is slow to track half a million pages, we have much work yet to do on optimizing for large traffic patterns, and welcome contributors/donators to help work in this area

  2. don’t use browser trigger archiving on suich big setup but make sure you use: How to Set up Auto-Archiving of Your Reports - Analytics Platform - Matomo

I have experience with PHP/MySQL and some handling of large datasets. From what I’ve read around the Piwik site, seems there are some improvements that could be made (or already are in progress). Please stop me if you’ve heard all of this before, or if this has been the subject of development discussion, as I just got to to the party and this may be old ground:

  1. InnoDB may be a much better choice for this workload:
    ---- a) High concurrency. I don’t know if you’ve benchmarked this with Piwik specifically, but these days a minimally-tuned InnoDB engine is far better under high load than MyISAM ever will, for a good number of workloads.
    ----- b) Transaction support. This is a really big one for Piwik, because it means you can do several hundreds at a time (often 1000-2000) in a single transaction. I.e. START TRANSACTION; INSERT INTO table (fields) VALUES (record1),(record2)… COMMIT. I’ve tested this myself at work and it’s massively better than doing one at a time without transactions, or with autocommit enabled.
    ------c) Resiliency. Also a big one. We’ve all seen MyISAM tables become inaccessible because of corrupted indexes. Just use InnoDB, enable binary logging, and forget about a crash being a big issue.

  2. It seems that Piwik is processing one record at a time via GET instead of a load of them via POST (I’ve seen the dev ticket on this). This is also a major bottleneck, as having many parallel Apache processes doing one record at a time is bound to be a huge waste of time and resources. I mean, it’s possible that just the connecting/parsing/database-connect steps take longer than the actual work.

  3. Not as important as the above two, but on cases where the log is located on the same machine that’s running Piwik (something that is bound to happen often), it’d probably be better to insert directly into the database, or if that’s too far removed from the API, use Piwik’s API directly via scripts instead of doing an HTTP call.