Working on Apache Logs

Hey folks!

Yes, I did read the FAQ and the manual, and I spent about an hour googling around. So far I have only come across outdated versions or information that is no longer valid.
Summary: I have Piwik up and running, even with the GeoIP stuff, and so far it works great. Now, I can’t modify the pages of the websites, so I have to rely on the Apache logs, which I already modified to include the requested (sub)domain. Logs are in /var/log/http/$domain/$fqdn-access.log, e.g. /var/log/http/tree.com/stump.tree.com-access.log. New hosts and domains are always coming and going, so there is an unknown number of subdomains to handle.

I built my log-import script like this:


#! /usr/local/bin/bash

# Configuration
BIN="/usr/local/www/piwik/misc/log-analytics/import_logs.py"
URL="http://server/piwik/"
SMP="4"
EXTRA="--enable-http-errors --enable-http-redirects --enable-static --enable-reverse-dns --enable-bots --add-sites-new-hosts"

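# Pass every access log to the importer; NUL separators keep odd file names intact.
# $EXTRA is left unquoted on purpose so it splits into separate options.
find /var/log/httpd/ -type f -iname "*access*" -print0 | xargs -0 "$BIN" --url="$URL" --recorders="$SMP" $EXTRA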

This actually adds new sites for… new sites (duh), as required. All nice and easy - yay!

Now the tricky part, also known as Problem (dun-dun-dun)…
The log files are deleted each month (actually backed up, then deleted). This also means that until then, the logs are not rotated. The aforementioned script runs every 3 hours and re-imports the complete files each time, resulting, as far as I can see, in pretty much duplicated entries. Hence the problem. (Is this a bug or a missing feature?)

I also noticed there is an archive.php script, whose purpose is currently a mystery to me. Does it delete the duplicated entries? After a run of archive.php, a site that had 4 visits (really 2 visits, but 4 because each log entry was imported twice) still shows 4. The count does drop if I delete all the piwik_archive_* tables, but uh… This would mean I’d have to:

  • run the update script,
  • run the archive script,
  • drop all archive tables.

every 3 hours!

I know I am missing something blatantly obvious here. The question is: How do I actually go about updating Piwik from Apache logs (which can’t be rotated)?

Thank you very much in advance,
great work with piwik,
-Christian.

Import only the NEW log lines, every 3 hours or so. (Piwik can’t detect duplicate log entries. This is a missing feature; hopefully someone can sponsor it in the future.)

Then set up auto-archiving (archive.php), see: How to Set up Auto-Archiving of Your Reports - Analytics Platform - Matomo. This is important because it pre-processes the reports.
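
Since the logs can't be rotated, one way to import only the new lines is to remember how many lines of each file were already handled. The following is just a rough sketch of that idea, not something that ships with Piwik: the function name new_lines and the STATE_DIR location are made up for illustration, and it assumes the log files are only ever appended to or emptied.

```shell
#!/usr/bin/env bash
# Sketch: print only the lines appended to a log file since the previous call.
# new_lines and STATE_DIR are hypothetical names, not part of Piwik.

STATE_DIR="${STATE_DIR:-/var/db/piwik-import-state}"

new_lines() {
    local log="$1"
    # One small state file per log, named after the log's path.
    local state="$STATE_DIR/$(printf '%s' "$log" | tr '/' '_').count"
    local prev=0 total

    mkdir -p "$STATE_DIR" || return 1
    [ -f "$state" ] && prev=$(cat "$state")
    total=$(wc -l < "$log" | tr -d ' ')

    # File shrank, e.g. after the monthly cleanup: start again from the top.
    [ "$total" -lt "$prev" ] && prev=0

    if [ "$total" -gt "$prev" ]; then
        # Skip the lines seen before; head caps the output at the line count
        # taken above, so lines written in the meantime wait for the next run.
        tail -n "+$((prev + 1))" "$log" | head -n "$((total - prev))"
    fi
    printf '%s\n' "$total" > "$state"
}
```

The output could then be written to a temporary file and handed to import_logs.py in place of the full log, or piped into it if your version of the importer can read from standard input (check its --help output, I am not assuming it here). Note that the sketch saves the new line count even when the import fails afterwards, so a real script should only update the state once the import has succeeded.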