Faster archive processing / country segmentation

dcz · November 27, 2012, 11:05am

Hello,

I recently added massive segmentation to my Piwik (e.g. by country), and it ended up creating a lot more work when archiving (over 200 times more, actually, even after I narrowed the country list down to countries with visits in the relevant period).

I was still using the archive.sh logic, and archiving 5 months took over 15 hours on my local XAMPP (i7 2600K, Windows 7) for a 2000 visitors/day website. What was frustrating is that the CPU was barely tickled by the task (maybe 1 or 2% load).

I then dived into the new (well, not so new, but better late than never) archive.php, still in misc/cron, and started to play with the curl logic.

The first problem was that, as is, the script starts as many HTTP requests at once as there are segments, and this is not an option with over 200 of them.

So I implemented a basic queue mechanism, and it ran well, but then Apache was consuming the most resources with 3 threads. I was not able to run the whole archiving with more than one thread on my local computer, which must be a problem with XAMPP or my machine (though since it’s not production, I don’t mind that much not being able to handle heavy HTTP loads). I did not let the single-thread run finish either, since it was going to take over ten hours again.

So I ended up implementing the parallel (//) logic, with a queue, using background CLI tasks. And this works VERY fast. My initial 15 hours became 11 minutes with 5 parallel processes (CPU around 60% on average, MySQL only around 45%).

As it is for now, the script makes a lot more sense with a lot of segments, because I kept the original logic (as with curl, parallelism really only occurs for segments). The main logic could of course be changed to archive several sites in parallel, but as it is, it can already help a lot for installations with many sites and segments. And, since my production server is running Ubuntu, the script does (well, should, for the most part) handle both Windows and Linux commands.

The CLI parallel logic is comparable to the curl one. Processes are spawned in the background; each one starts by creating a lock file (all file work happens in piwik/tmp), then triggers the actual command, dumps the result (with any errors) in a log file and finally removes the lock file. Once the configured number of processes is reached, the script starts checking the tmp dir for finished tasks (i.e. a log file without an associated lock file) and reads each log just like curl would read the server response. I also implemented a timeout for CLI processes: even though they are not killed in this version, the archive script will report the timeout and consider the task as errored, in the same way as before.
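
For illustration, the lock file / log file idea can be sketched roughly like this. This is not the code from the patch: the function names (wrapForBackground, collectFinished) and the ".lock" / ".log" file naming are made up for the example, and it only covers the Linux shell side (Windows would need a different wrapper).

```php
<?php
// Sketch only: names and file layout are hypothetical, not taken from the patch.

/**
 * Wrap a command so that it creates a lock file, runs, writes its output
 * (errors included) to a log file, and removes the lock file at the end.
 * POSIX shell syntax only.
 */
function wrapForBackground(string $command, string $taskId, string $tmpDir): string
{
    $lock = escapeshellarg("$tmpDir/$taskId.lock");
    $log  = escapeshellarg("$tmpDir/$taskId.log");
    // The trailing "&" detaches the subshell so that exec() returns at once.
    return "(touch $lock; $command > $log 2>&1; rm -f $lock) > /dev/null 2>&1 &";
}

/**
 * Return [taskId => output] for every finished task, i.e. every log file
 * that no longer has a matching lock file. Collected logs are deleted.
 */
function collectFinished(string $tmpDir): array
{
    $done = [];
    foreach (glob("$tmpDir/*.log") ?: [] as $log) {
        $taskId = basename($log, '.log');
        if (file_exists("$tmpDir/$taskId.lock")) {
            continue; // lock still present: the task is still running
        }
        $done[$taskId] = file_get_contents($log);
        unlink($log);
    }
    return $done;
}
```

A task would then be started with something like exec(wrapForBackground($cmd, 'task1', $tmpDir)), and the main loop would call collectFinished($tmpDir) periodically, treating any task whose log has not shown up after the timeout as errored.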

Anyway, have a look at the code. I commented the changes. It’s of course a POC and can be enhanced (for example, the url param is not actually required anymore in CLI mode), but I thought it was still worth mentioning.

Source code of the modified misc/cron/archive.php :
[PHP] Piwik fast // archive - Pastebin.com

Cheers

halfdan · November 27, 2012, 2:38pm

This, Sir, is amazing! Job well done.

dcz · November 27, 2012, 3:26pm

Thanks

It’s really nice to be able to play with country segmentation and get city level statistics.

matthieu · December 31, 2012, 1:45am

Could you please submit a diff from TRUNK of your changes? I could take a look if they make sense in core and would improve performance for other users!
Thanks

dcz · December 31, 2012, 9:52am

Here you go.

Just remove the .pdf to view the actual diff files, it looked handier to attach these here, but since no zip can be attached …

Anyway, I updated my proposal to the latest trunk version (7618 for this file). The trimmed version is just handier to read since it won’t include the changes where only whitespace at the end of a line differs.

As is, it’s functional, but some choices are to be made by the Piwik team before the script can be finalized: $is_win and $phpbin are not dynamically set, the url param is not required in CLI mode, it implements a dynamic country segment optimization which could be handled elsewhere or follow some config choices, etc.

As is, the change set is minimal and follows KISS logic. I can spend a bit more time on this if you like, that is, after you (we?) make some decisions.

Cheers.

matthieu · January 1, 2013, 9:40am

Thanks for the patch. Unfortunately it’s a bit too complicated for me to review and commit now. Maybe we can revisit it later, or when we ourselves experience the performance problem with many segments.

Thanks!!

dcz · January 1, 2013, 1:18pm

Happy new year ;p

From my experience, the current logic just cannot handle country segmentation (or any heavy segmentation, that is, over 3-5 segments) because running hundreds of HTTP connections at the very same time is the kind of thing that would DDoS a significant proportion of servers.
At the very least, there is a need for some queuing mechanism.

Anyway, I really shared this because a) it’s the whole point of open source, b) I found little documentation about async processes in PHP and c) even if my implementation is a POC, I’m pretty sure that the final solution is to be found near it.

Now if you need me, just whistle, even though I don’t think (actually, I’m sure) that I am any better than the piwik devs.

cheers

matthieu · January 2, 2013, 4:24am

I think the way we would like to implement this would be by using the Piwik Bulk API mechanism: http://piwik.org/docs/analytics-api/reference/#toc-advanced-users-send-multiple-api-requests-at-once

rather than use the async calls, and indeed use a queueing mechanism to prevent hundreds of small requests. If you’re keen to provide a simpler patch without the PHP async calls that works with hundreds of segments, that would be awesome!

dcz · January 3, 2013, 9:36am

Both logics are already included: just try setting $this->archive_cli to false and you’ll have the HTTP logic with the queue, done in HTTP_archiveVisitsAndSegments.

The max number of parallel requests is so far hard coded in HTTP_archiveVisitsAndSegments:


		// maximum amount of // requests
		$max_connect = 3;

3 may seem low, but even this was not functional in my test case (again, this was on Windows 7 with XAMPP and so on, but still), and I don’t feel like testing more than that on my production server, which runs Ubuntu on decent hardware. Adding around 200 segments only increased archiving time from 2 to 5 minutes on average with the CLI logic on the production box.

I kind of miss the point of the bulk request mechanism because, in my mind (and I may be missing things), I do not see any reason to trigger each archive request through HTTP. I really get why it can be handy to trigger archiving remotely, but this does not need more than one remote HTTP request to start, and a few more to check the result (a periodic check, like every minute, until the work is reported as finished with all the data we could want). The extra coding effort required seems negligible compared to the resources saved (50% from my experience).

As far as I can see, using a parameter to switch from HTTP to CLI logic, either in config or from the initial request, could be the solution to fit all use cases and is not that hard to maintain. IMO the HTTP logic is only necessary for people whose host denies the use of exec (or popen on Windows, which is a bit more likely, even though I really think that it would be far from the majority). And this can easily be determined in the script before falling back to the HTTP logic if needed.
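
A minimal sketch of that detection, assuming exec is what is needed on Linux and popen on Windows as described above. The helper names (isFunctionUsable, chooseArchiveMode) and the 'cli' / 'http' return values are made up for the example and are not in the patch.

```php
<?php
// Sketch only: helper names are hypothetical and do not come from the patch.

/**
 * True if $name exists and is not listed in the given comma-separated
 * disable_functions value.
 */
function isFunctionUsable(string $name, string $disabledList): bool
{
    $disabled = array_filter(array_map('trim', explode(',', $disabledList)));
    return function_exists($name) && !in_array($name, $disabled, true);
}

/** Pick 'cli' when background processes can be spawned, else 'http'. */
function chooseArchiveMode(): string
{
    $isWin  = PHP_OS_FAMILY === 'Windows';   // constant available since PHP 7.2
    $needed = $isWin ? 'popen' : 'exec';     // same split as described above
    $disabledList = (string) ini_get('disable_functions');
    return isFunctionUsable($needed, $disabledList) ? 'cli' : 'http';
}
```

The result of chooseArchiveMode() could then serve as the default value of the switch, with a config setting or request parameter still able to force the HTTP logic.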

Of course, if you really need the HTTP logic alone, it’s simple: just drop the $archive_cli declaration, the associated “if” and the CLI_archiveVisitsAndSegments method. I think that the other changes (apart from the dynamic country segmentation, maybe) are good to keep (since building the full task list before processing makes sense as soon as we’re dealing with queues).

Also, the logic used for the queue is trivial: add x items to the queue, then fire all the requests and grab the results. This could be “optimized” to add x items to the queue, then fire all queued requests, constantly check for results and add new requests as finished ones are removed.
I put “optimized” in quotes because I also think that this is debatable. Since the archiving server is likely to be the tracking server itself, you probably do not want to overload it. The current option goes in that direction, since the average number of parallel requests will be lower than the maximum set and there will be tiny breaks that allow the rest of the server tasks to go on.
This of course only means something with segments (the more segments, the more it matters), since otherwise queuing just does not occur at all (one single request at a time).
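
To make the “add x items, fire them all, grab the results” idea concrete, here is a rough sketch. It is not the code from HTTP_archiveVisitsAndSegments: runInBatches and curlBatch are hypothetical names, and the curl part is only one possible way to fire a batch.

```php
<?php
// Sketch only: function names are hypothetical, not the ones used in the patch.

/**
 * Process $tasks in batches of at most $maxParallel. $runBatch receives one
 * batch, fires all of it, blocks until every item is done and returns
 * [key => result]. The pause between batches is what keeps the average
 * load below the configured maximum.
 */
function runInBatches(array $tasks, int $maxParallel, callable $runBatch): array
{
    $results = [];
    foreach (array_chunk($tasks, $maxParallel, true) as $batch) {
        $results += $runBatch($batch);
    }
    return $results;
}

/** One possible $runBatch: fetch every URL of the batch in parallel with curl_multi. */
function curlBatch(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 1.0); // wait for activity instead of spinning
        }
    } while ($running && $status === CURLM_OK);

    $out = [];
    foreach ($handles as $key => $ch) {
        $out[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    return $out;
}
```

With a list of archive URLs keyed by segment, the call would look like runInBatches($urls, 3, 'curlBatch'). The “optimized” variant would instead top the pool back up to the maximum each time a request finishes.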

matthieu · January 5, 2013, 12:11am

Thanks, it looks interesting. Can you please create a ticket in Trac and post the patch there with a quick summary of the changes? I will consider including it in the next release. Thanks!

dcz · January 15, 2013, 7:31am

hum,

It’s only the third time that I attempt to post the link to the ticket -dev . piwik . org/trac/ticket/3658

Hopefully this time, it will actually stay there …