Importing a large amount of historical data

I want to move from Urchin to Piwik. There are about a dozen websites from which I’d like to import roughly the last year of historical data.

One of them is quite busy (in Piwik terms, anyway) - I’ve about 230 million lines of historical log data to read in, and that takes more than a day. No doubt a huge contributor is that the import goes through the Piwik HTTP API - I expect it’d be orders of magnitude faster if there were an offline importer, but then I guess lots more code would have needed to be written.

Having read in the data, Piwik is unusably slow, and I gather the answer is to archive it, which is where I run into problems. The archiver fails with errors about being out of memory, even though I’ve set PHP’s memory_limit to 1GB; watching htop, the biggest I’ve seen a PHP process get was just over 400MB - but I could have missed it getting bigger, as the process takes a while.

I thought I could maybe read in a day’s data, archive, and then do the next day’s, but I have to use --force-all-periods to archive the old data, and then on every run it’s going to reprocess all the old data. I also have to use --force-all-websites, which means it repeatedly tries to reprocess all data for all websites.

What’s the correct approach to read in a reasonably large amount of historical data and then archive it?

One way I did think of was to fire up a new VM and then to reset its date to day N+1, read in the data for day N, run archive.php, reboot, and repeat - but that seems very convoluted, and I don’t even really know that it would work - though it seems plausible.

Orders of magnitude faster: no, because we already import 500 lines at once (or you can try different settings via parameters).
Also, it’s common to have to increase memory to 2, 5, 8 or more GB for huge jobs. It should work; if not, we could tweak archiving to better take into account this use case of archiving 200M+ page views at once. Let us know…
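For example, something along these lines - the flag names below are what recent versions of import_logs.py accept (check --help on your copy, as names and defaults can differ between versions), and the URL and site ID are placeholders:

python misc/log-analytics/import_logs.py \
  --url=http://your-piwik.example/ \
  --idsite=1 \
  --recorders=8 \
  --recorder-max-payload-size=500 \
  /var/log/apache2/access.log

--recorders sets how many parallel threads send tracking requests, and --recorder-max-payload-size sets how many log lines go into each tracking request.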

So what should I do here - bang the max memory for PHP up to 8GB and do the archive run again?

One problem I foresee is that, as I said, I have several websites to import, but none is anywhere near as big as the big one. Do I need to import all the data for all of them first and then run the archive process? It seems I always need --force-all-periods and --force-all-websites to archive historical data, so I’d end up reprocessing everything, and I expect that’s going to be slow.

I’ve had a look at the archive.php code, and it doesn’t look like it’d take much effort to hack it to only process a particular website, and just a bit more effort to turn that into something specifiable on the command line. Is that a reasonable assessment?
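As an alternative to hacking the code, I’m also wondering whether something like the following would do the same job - assuming I’ve understood correctly that requesting a report through the API triggers archiving for just that site and period. The site ID, URL, dates and token here are placeholders:

#!/bin/bash
# Hypothetical: archive one site, one day at a time, by requesting a report for it,
# which (with browser-triggered archiving enabled) should archive that site/period.
SITE=5
TOKEN=my_token_auth
for DATE in 2012-01-{01..31}; do
  curl -s "http://piwik.example/index.php?module=API&method=VisitsSummary.get&idSite=$SITE&period=day&date=$DATE&format=xml&token_auth=$TOKEN" > /dev/null
done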

I have one install where I’ve already read in this data but the archive process failed - if I rerun it there with the larger memory, will it happily start over?

Data that has been processed will not be reprocessed, even with --force-all-periods.

If that’s correct, Fabian, then maybe I should try again to import, say, a month’s data, then archive, then import another month. I just had another weird one, though - I set memory_limit to 8192M, verified that by checking the output of a phpinfo() page, but now archive.php fails and the Apache error log shows:

PHP Fatal error: Allowed memory size of 167772160 bytes exhausted

which is 160M - a long way from 8192M. I reviewed the logs and, sure enough, previously, when I had it set to 1024M, it was failing with

PHP Fatal error: Allowed memory size of 1073741824 bytes exhausted

but why on earth it’s failing now as above is beyond me.

Retrying now with memory_limit of 2048M . . .

The failure mentioned above is very weird, and not Piwik-related - I have memory_limit specified in php.ini, but I have lower values for memory_limit specified in the vhost files for OTHER vhosts - not for Piwik. Somehow, though, Piwik is being influenced by these other values: I changed them from 160M to 240M, and now the Piwik Apache error logs show it running out of memory at 240M.

What’s even weirder is that I have error logs from earlier attempts when I had memory_limit set to 1G and THOSE logs show the failure happening at 1G. Go figure . . . .

Can you try and run the archive.php script with:

php -d memory_limit=8096M misc/cron/archive.php …

The problem isn’t with the CLI php binary (and anyway, the CLI php.ini has memory_limit=-1) - it’s with the web server processes, and I see the PHP error messages in the Apache error logs. But for the life of me, I can’t make sense of them. When I had memory_limit set to 1G, I saw error messages indicating it had exhausted just over 1GB, which makes sense. But when I’ve had it set higher than that, I’ve seen it fail with messages indicating that the effective limit is 160M or 240M - values I’ve set in OTHER vhosts, but NOT the Piwik vhost.
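To try to pin down what limit the Piwik vhost actually sees, I’m going to drop a throwaway probe into the docroot, along these lines (the file name, path and URL are just placeholders for my setup):

echo '<?php echo ini_get("memory_limit");' > /var/www/piwik/limit-probe.php
curl http://piwik.example/limit-probe.php
rm /var/www/piwik/limit-probe.php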

Maybe try changing these values in global.ini.php?

; Minimum advised memory limit in php.ini file (see memory_limit value)
minimum_memory_limit = 128

; Minimum memory limit enforced when archived via misc/cron/archive.php
minimum_memory_limit_when_archiving = 768
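(If you do change them, it’s better to override the values in your config/config.ini.php rather than editing global.ini.php directly, so an upgrade doesn’t wipe the change - for example, with values along these lines:)

[General]
minimum_memory_limit = 1024
minimum_memory_limit_when_archiving = 2048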

It seems to be working for me now. I’m importing a few days at a time and then running the archiver. I have memory_limit set to 1024M in php.ini and in all the vhosts. There’s definitely something odd going on with PHP/Apache on this box, because with the limit set to a higher value, I was getting failures logged at a much lower one.

Just to update: I finally got this done, and it took about a month! I was able to import about 10 days’ worth of logs at a time from the busiest site and run archive.php each day. I don’t think this is a particularly busy site - it gets about 7K visits a day - so I don’t know what you’d do if you wanted to import data from a much busier site into Piwik. Or maybe Piwik just isn’t the tool for that job.
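For anyone else attempting this, the workflow that finally worked boiled down to something like the following - the URL, site ID and log file names are specific to my setup, so treat it as a sketch:

#!/bin/bash
# Import roughly 10 days of logs for the big site...
for LOG in /var/log/archive/access.log-2012-02-0{1..9} /var/log/archive/access.log-2012-02-10; do
  python misc/log-analytics/import_logs.py \
    --url=http://piwik.example/ --idsite=1 --recorders=8 "$LOG"
done
# ...then archive with a generous memory limit before importing the next batch.
php -d memory_limit=2048M misc/cron/archive.php --url=http://piwik.example/ --force-all-periods --force-all-websites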

I have some other questions now but I think another thread would be more appropriate.

I was doing the import with 8 recorders on an 8-core Xeon, though it might have had to fight for some of those cores from time to time. No doubt more hardware would do somewhat better, but I can’t help feeling that the way the importer works - calling the website over HTTP - is a limiting factor. Certainly Urchin imports logs much faster than Piwik does, but then it was designed to import logs that way from the start and has had longer to refine its code.

BTW this isn’t a complaint, more of an observation, and the users are happy with piwik, which is the main thing.