Log import question

(I managed to install Python 2.6 on the old server, so I can crunch some 26GB worth of logs undisturbed now)

Is there a way to import old access logs, keeping the search engine visits/terms but discarding other bots? It looks like the current --enable-bots option is an all or nothing kind of deal, is that correct?

Ideally, it could be nice to be able to see trends and topics of what real searches had been bringing visitors to my doorstep, not just the ones going forward :slight_smile:

Update: never mind. I think it just wants me to run archive.php more frequently than I had been.


--enable-bots will track bot requests to your site.

However, Piwik will always track the referrer of each visit; that does not depend on this setting.
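If you want a quick look at the search terms sitting in the old logs before importing them, here is a minimal sketch. It is not how Piwik detects search engines; it assumes Python 3, the Apache Combined Log Format, and a small made-up table (`SEARCH_PARAMS`) of search hosts and their query parameters. The function names are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Combined Log Format tail: "request" status bytes "referrer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

# Assumption: a tiny, incomplete map of search hosts to their query parameter.
SEARCH_PARAMS = {"google.": "q", "bing.": "q", "yahoo.": "p"}


def search_term(line):
    """Return the search phrase found in the referrer of one log line, or None."""
    match = LINE_RE.search(line)
    if not match:
        return None
    ref = urlsplit(match.group("referrer"))
    host = ref.netloc.lower()
    for marker, param in SEARCH_PARAMS.items():
        if marker in host:
            values = parse_qs(ref.query).get(param)
            return values[0].strip().lower() if values else None
    return None


def top_terms(lines, limit=10):
    """Count search phrases over an iterable of log lines, most common first."""
    terms = (search_term(line) for line in lines)
    return Counter(term for term in terms if term).most_common(limit)
```

Calling `top_terms(open("access.log", errors="replace"))` would give a rough ranking to compare against what Piwik reports after the import.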

Yes, it seems to be working now, for the most part.

I split the 25GB logfile into parts of 5 million lines each, and with 2 recorders, each part has so far taken between 1.5 and 2.25 hours to process (2 done so far, currently chewing on a third).
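For anyone wanting to do the same kind of split, a minimal sketch in Python 3 follows. The function name `split_log` and the `.partNNN` file naming are my own choices for illustration, not anything that ships with Piwik; each part can then be fed to import_logs.py separately.

```python
from pathlib import Path


def split_log(src, out_dir, lines_per_part=5_000_000):
    """Stream src into numbered part files of at most lines_per_part lines."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    # Read as bytes so odd encodings in old access logs cannot raise errors.
    with open(src, "rb") as fh:
        for number, line in enumerate(fh):
            if number % lines_per_part == 0:
                # Start a new part file every lines_per_part lines.
                if out is not None:
                    out.close()
                part = out_dir / f"{Path(src).name}.part{len(parts) + 1:03d}"
                parts.append(part)
                out = open(part, "wb")
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

Only one line is held in memory at a time, so the size of the source file should not matter much.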

All of this information is for a single site, a year’s worth of data for Site #1, and I’m now running the archive.php to force all updates for the year in between each run of the import. I think I need to adjust something else, too, because with one run of the archive.php, I got a server error, but it didn’t seem to affect the data being saved.

I had disabled the cron so that it wouldn’t try to process any data while my logs were being imported, so I don’t think that caused the error. But if it pops up too many more times while I’m crunching these logs, I’ll ask more about it :slight_smile:

Okay, this is the error that pops up while running archive.php after doing a 5 million record log import:


[2012-11-05 22:13:54] START
[2012-11-05 22:13:54] Starting Piwik reports archiving...
[2012-11-05 22:13:54] Archived website id = 1, period = day, Time elapsed: 0.657s
[2012-11-05 22:21:13] ERROR: Got invalid response from API request: /index.php?module=API&method=VisitsSummary.getVisits&idSite=1&period=week&date=last52&format=php&trigger=archivephp. Response was '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head> <title>500 Internal Server Error</title> </head><body> <h1>Internal Server Error</h1> <p>The server encountered an internal error or misconfiguration and was unable to complete your request.</p> <p>Please contact the server administrator,  root@localhost and inform them of the time the error occurred, and anything you might have done that may have caused the error.</p> <p>More information about this error may be available in the server error log.</p> <hr> </body></html> '
[2012-11-05 22:21:13] Archived website id = 1, period = week, 0 visits, Time elapsed: 438.503s
[2012-11-05 22:23:03] Archived website id = 1, period = month, 1546585 visits, Time elapsed: 110.095s
[2012-11-05 22:24:06] Archived website id = 1, period = year, 1546585 visits, Time elapsed: 63.008s
[2012-11-05 22:24:06] Archived website id = 1, today = 0 visits, last days = 469572 visits, 4 API requests, Time elapsed: 612.265s [1/12 done]

and this is all that pops up in the error log:


[Mon Nov 05 15:21:13 2012] [warn] (104)Connection reset by peer: mod_fcgid: read data from fastcgi server error.
[Mon Nov 05 15:21:13 2012] [error]  Premature end of script headers: index.php

This one site is not currently doing any active tracking; I wanted to get the historical data caught up before adding the tracking for the current visits… is that what might be causing this error? I hadn’t seen this error pop up on previous log imports for newer data (smaller logs that were only 3 weeks to 6 weeks old), so I’m wondering if that’s the cause, or if maybe the amount of data I’m asking it to process in one bite is too much?

What is the php.ini memory limit set to? If it’s around 512M, maybe bump it up to 1024M and see if that helps?

I also found this link that may help.

http://www.virtualmin.com/node/19119

Also, if you’re doing a major log import, please use the script from trunk, as Cyril has made a few speed improvements in: Log analytics list of improvements · Issue #3163 · matomo-org/matomo · GitHub

Since this is the only active site on this (about to be decommissioned) server now, mem limit is set to 2048M, with an unlimited execution time, but I left the input parsing at 180 seconds… should I increase that?

Actually, now that I’m looking at another run of the archive.php script, I think it’s only throwing that error when it comes across 0 visits for the current day… it just spit out two of them on this most recent run, the second being for a site that gets a very low amount of visits typically.
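One way to check that hunch is to pair each ERROR line in the archive.php output with the "Archived website" line that follows it. The sketch below is Python 3 and assumes the output format matches the lines pasted above; `failed_requests` is a hypothetical helper name.

```python
import re
from urllib.parse import urlsplit, parse_qs

ERROR_RE = re.compile(
    r"ERROR: Got invalid response from API request: (?P<url>\S+?)\. Response was"
)
ARCHIVED_RE = re.compile(
    r"Archived website id = (?P<site>\d+), period = (?P<period>\w+), "
    r"(?:(?P<visits>\d+) visits, )?Time elapsed: (?P<elapsed>[\d.]+)s"
)


def failed_requests(lines):
    """Return one dict per ERROR line: site id, period, and elapsed seconds."""
    failures = []
    pending = None
    for line in lines:
        err = ERROR_RE.search(line)
        if err:
            query = parse_qs(urlsplit(err.group("url")).query)
            pending = {
                "site": query.get("idSite", ["?"])[0],
                "period": query.get("period", ["?"])[0],
                "elapsed": None,  # filled in from the next matching line
            }
            failures.append(pending)
            continue
        done = ARCHIVED_RE.search(line)
        if done and pending and done.group("period") == pending["period"]:
            pending["elapsed"] = float(done.group("elapsed"))
            pending = None
    return failures
```

On the run pasted above this would report the week period for site 1 with about 438 seconds elapsed. That long run time may point to a timeout rather than to the 0-visit day, though that is only a guess from the one log.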

Matt, I’m using the import_logs.py that came with v1.9 – is there a better one to download? I didn’t see an obvious link to grab that version of the script.

Also, what’s the best way (if any) to manually purge old visitor log data?

I keep getting that same FastCGI error when attempting to manually purge old visitor data via the Privacy Settings page.

To purge old data see: Troubleshooting - Analytics Platform - Matomo

It won’t complete the Purge logs task… it keeps failing before completing, and the error logs show:


[warn] (104)Connection reset by peer: mod_fcgid: read data from fastcgi server error. 
[error] Premature end of script headers: piwik.php

It looks like some time limit or memory limit was reached?

Who is your hosting provider?

For this particular Piwik site, there isn’t one; we have a dedicated server running Virtualmin. The reason it’s the last site on the server is that we’ve moved everything to a new dedicated server, and I’m taking advantage of the fact that there aren’t any other sites on it to crunch 25GB of logs from the past year :slight_smile:

Currently, the PHP memory limit is set to 2048MB, and I bumped the PHP input parsing time up to 600 seconds. I’m not sure what other limits to increase, but I thought that since I could run archive.php from the shell, maybe there was a way for me to purge the logs the same way. I’d initially thought the purge logs problem was because of the database size (currently sitting at about 2.5GB), but I don’t believe that’s the case anymore.

Either way, I would like to purge any logs older than 120 days before I move the site to its new home (after all the access log crunching is done, of course)

Here’s another article that looked interesting.

From: Julien Jabouin <chatlumo.ovh@gm…> - 2010-09-25 10:44
Thanks,

I’ll try this.

2010/9/25 Travers Carter <tcarter@…>:

Hi,

I’m not sure if this is the exact cause of your problem, but a few
config changes I would try are below:

On Sat, 25 Sep 2010 04:11:18 +0200, Julien Jabouin <chatlumo.ovh@…>
wrote:

#cat /etc/apache2/mods-available/fcgid.conf

AddHandler fcgid-script .fcgi
IPCConnectTimeout 20
IdleTimeout 300
IdleScanInterval 240
BusyTimeout 300
BusyScanInterval 120
ErrorScanInterval 6
ZombieScanInterval 3
ProcessLifeTime 3600
SpawnScoreUpLimit 10
IPCConnectTimeout 300
IPCCommTimeout 300

You should add “MaxRequestsPerProcess 1000” to the above list,
where 1000 is whatever value you set PHP_FCGI_MAX_REQUESTS to.

If the two don’t match some requests can fail.

For the newer (Apache) version of mod_fcgid the directive is called
FcgidMaxRequestsPerProcess, but it is the same thing.

See
http://httpd.apache.org/mod_fcgid/mod/mod_fcgid.html#fcgidmaxrequestsperprocess

cat /var/www/site1/php-cgi

#!/bin/bash
PHP_FCGI_CHILDREN=4
PHP_FCGI_MAX_REQUESTS=1000
export PHP_FCGI_CHILDREN
export PHP_FCGI_MAX_REQUESTS
exec /usr/bin/php5-cgi

Don’t set PHP_FCGI_CHILDREN. mod_fcgid won’t send more than one request at a time to a process, so you should increase the (Fcgid)MaxProcessesPerClass setting in Apache instead.

See
http://wherethebitsroam.com/blogs/jeffw/apache-php-fastcgi-and-phpfcgichildren
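
The advice in that email about keeping MaxRequestsPerProcess and PHP_FCGI_MAX_REQUESTS equal can be checked with a small script. This is only a sketch in Python 3: the helper names are made up, and it assumes the two values appear as plain lines like the ones quoted in the email.

```python
import re


def fcgid_max_requests(conf_text):
    """Return the (Fcgid)MaxRequestsPerProcess value, or None if it is unset."""
    match = re.search(
        r"^\s*(?:Fcgid)?MaxRequestsPerProcess\s+(\d+)",
        conf_text,
        re.MULTILINE | re.IGNORECASE,  # Apache directive names are case-insensitive
    )
    return int(match.group(1)) if match else None


def php_max_requests(wrapper_text):
    """Return PHP_FCGI_MAX_REQUESTS from the php-cgi wrapper script, or None."""
    match = re.search(
        r"^\s*(?:export\s+)?PHP_FCGI_MAX_REQUESTS=(\d+)", wrapper_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def limits_match(conf_text, wrapper_text):
    """True only when both limits are set and equal."""
    conf = fcgid_max_requests(conf_text)
    php = php_max_requests(wrapper_text)
    return conf is not None and conf == php
```

Reading the two files quoted in the email (the fcgid.conf and the php-cgi wrapper) and passing their contents to `limits_match` would flag the missing directive in the original configuration.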
