Importing multiple IIS logs from multiple servers, automated and unattended

First off, I’m new here but Piwik is great work. Thanks for making it.

I’ve installed and configured Piwik on Ubuntu 12 for the main purpose of parsing logs FTP’d to me from a web server farm I do not have access to. I extracted one of their uploaded archives and it expands to a folder with dozens of files from each of several servers, so I have hundreds of files to import into Piwik.

Note that these are log files for the same domain spread via load balancer to multiple web servers, not separate websites.

I have been able to successfully run python import_logs.py on a single file when I reference it by its exact name.
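For reference, the kind of single-file command that works for me looks roughly like this (the Piwik path, URL, site ID and log file name are placeholders for my actual setup):

    python /var/www/piwik/misc/log-analytics/import_logs.py \
        --url=http://piwik.example.com/ \
        --idsite=1 \
        /home/me/extracted-logs/server1/u_ex130609.log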

I’ve been through the docs and every related forum thread that covers multiple server logs and/or parsing IIS logs, but haven’t found the answers to these questions:

  1. Is Piwik’s import_logs.py script or the DB schema smart enough to import log data only once? In other words, if I mistakenly run import_logs.py on the same file twice, is my data duplicated?

  2. Is Piwik capable of dealing with log files from multiple web servers all serving the same domain (e.g. a load-balanced cluster of web servers)?

  3. Does anyone have an example script using find | xargs or some such that would find all the log files and pass them to import_logs.py? I’ve been unable to get this working; so far I only manage a one-off import of a single file. (I sketch what I have in mind at the end of this post.)

  4. As I watch import_logs.py run on a single file, I see:

116386 lines parsed, 74800 lines recorded, 36 records/sec (avg), 200 records/sec (current)
116386 lines parsed, 74800 lines recorded, 36 records/sec (avg), 0 records/sec (current)
116386 lines parsed, 74800 lines recorded, 36 records/sec (avg), 0 records/sec (current)
116386 lines parsed, 74800 lines recorded, 36 records/sec (avg), 0 records/sec (current)
116386 lines parsed, 74800 lines recorded, 36 records/sec (avg), 0 records/sec (current)
116705 lines parsed, 75000 lines recorded, 36 records/sec (avg), 200 records/sec (current)
116705 lines parsed, 75000 lines recorded, 36 records/sec (avg), 0 records/sec (current)
116705 lines parsed, 75000 lines recorded, 36 records/sec (avg), 0 records/sec (current)
116705 lines parsed, 75000 lines recorded, 36 records/sec (avg), 0 records/sec (current)
116705 lines parsed, 75000 lines recorded, 36 records/sec (avg), 0 records/sec (current)

Does this output indicate I have a problem, or just a very slow server? (It’s a t1.micro at Amazon with no usage other than running import_logs.py, but I think I will need to boost it to a much larger instance size given how slow this appears to be running.)
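To make question 3 concrete, this is the kind of command I have in mind (the paths, URL and site ID are placeholders for my setup, and --recorders is just a guess at a sensible value):

    find /home/me/extracted-logs -type f -name '*.log' -print0 \
      | xargs -0 -n 10 python /var/www/piwik/misc/log-analytics/import_logs.py \
          --url=http://piwik.example.com/ \
          --idsite=1 \
          --recorders=2

The intent is that find lists every log file from every server, and xargs appends them, a few at a time, to the end of the import_logs.py command line.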

Thanks for your answers.

  1. Not yet; this feature is requested in Log analytics list of improvements · Issue #3163 · matomo-org/matomo · GitHub

  2. Yes.

  4. Piwik does bulk imports, so that is normal behavior (it can take more than 1 second to import 200 requests).

But yes, 36 req/s on average is slow (we typically see 100 or 200 req/s on clients’ servers).

Regarding (2) above, where import_logs.py is supposed to deal properly with logs from multiple servers: I am running a script on a folder full of logs, and I see this output:

Purging Piwik archives for dates: 2013-06-09 2013-06-10
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: How to Set up Auto-Archiving of Your Reports - Analytics Platform - Matomo for more info.

This appears while processing a log for these dates from server 2 or 3, after it has already processed the file for the same date range from server 1.

Is this a misleading message (it seems like it purged the data from server 1’s import), or are there additional steps needed to ensure the logs from servers 2 through N are added to the data imported from server 1?

Maybe it’s a misleading message; you’ll have to tell me :wink:

But basically it says that you must execute archive.php to see the newly imported data in your reports.
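For example, something along these lines (the Piwik path and URL are placeholders; adjust them to your install):

    php /var/www/piwik/misc/cron/archive.php --url=http://piwik.example.com/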

I tried running archive.php per your suggestion. It failed several times; each time I doubled the amount of memory available to PHP scripts (it’s now set to 512MB) and the script timeout limit (it’s now set to 300 seconds). It gives this message…

SQLSTATE[HY000]: General error: 126 Incorrect key file for table ‘/tmp/#sql_31c3_7.MYI’; try to repair it
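(For reference, raising the limits as described above is roughly equivalent to running the script like this for a one-off CLI run; the values are just where I’ve ended up so far, and the path and URL are placeholders:)

    php -d memory_limit=512M -d max_execution_time=300 \
        /var/www/piwik/misc/cron/archive.php --url=http://piwik.example.com/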

In MySQL, run REPAIR TABLE xxx.
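For example (the credentials, database and table name below are placeholders; substitute your own):

    mysql -u piwik_user -p piwik_db -e "REPAIR TABLE piwik_log_visit;"

You can also run the same REPAIR TABLE statement from any mysql client session connected to the Piwik database.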