Hello,
I recently added massive segmentation to my piwik (eg country), and it ended up in a lot more work to do when archiving (over 200 times more actually, even after I narrowed the country list to country with visit in the relevant period).
I was still using the archive.sh logic and archiving 5 month took above 15hours on my local xampp (i7 2600k win 7) for a 2000 visitor/day website. What was frustrating is that the cpu was barely tickled by the task (maybe 1 or 2 % load on it).
I then dived in the new (well, not so new, but better late than never) archive.php, still in misc/cron and started to play with the curl logic.
First problem was that as is, the script will start as many HTTP request at once as there are segments, and this is not an option with over 200 of them.
So I implemented a basic queue mechanism, and it ran well, but then apache was consuming the most resources with 3 threads. I was not able to run the whole archiving with more than one thread on my local computer which must be a problem with xampp or my machine (though since it’s not production I don’t mind that much not to be able to handle heavy http loads), and from there did not run the single thread one until the end since it was going to take over ten hours again.
So I ended up implementing the // with queue logic with background cli tasks. And this works VERY fast. My initial 15 hours became 11 minutes with 5 // processes (cpu around 60% avg, mysql only around 45%).
As it is for now, the script makes a lot more sense with a lot of segment because I kept the original logic (with curl, //isme really occurs for segments). The main logic could of course be changed to archive several sites in //, but as it is, it can help out a lot for sites with many sites and segments. And, since my production server is running ubuntu, the script does (well, should for far most) handle both windows and linux commands.
The cli // logic is comparable to the curl one, processes are spawned in background, they all start with creating a lock file (all file work in piwik/tmp), then trigger the actual command, dump the result (with eventual errors) in a log file and finally remove the lock file. Once the configured number of processes is reached, the script starts checking the tmp dir for finished tasks (eg log file without associated lock file) and reads it just like curl would read the server response. I also implemented a timeout for cli processes, even though they are not killed in this version, the archive script will report the timeout and will consider the task as errored in the same way as before.
Anyway, have a look at the code, I commented the changes, it’s of course a POC, and can be enhanced (for example, url param is not actually required anymore in cli mode), but I though it was still worth mentioning.
Source code of the modified misc/cron/archive.php :
[PHP] Piwik fast // archive - Pastebin.com
Cheers