Importing Apache logs as a long-term strategy


#1

I’m the admin for hundreds of client websites. We’ve outgrown Urchin and we like Piwik as an alternative for analytics/reporting for our sites. So far it looks awesome and meets all of our needs for features and API. I need to write a script to reproduce the functionality of Urchin’s scheduler, importing Apache logs for each site with import_logs.py on a schedule. I already have Piwik installed and import_logs.py is working well for select sites and logs. But there are a few things I need to understand before making import_logs.py our long-term strategy for analytics data.

  1. When I import the same Apache log again later to collect new data, I get:

Purging Piwik archives for dates: 2012-11-27 2012-11-25 2012-11-28 2012-11-26 2012-11-29
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: How to Set up Auto-Archiving of Your Reports - Analytics Platform - Matomo for more info.

Does purging archives mean that Piwik is deleting data for duplicate dates? That is, if Piwik encounters a date in a log that it already has data for, will it delete all data for that date? Or does purging archives mean that it leaves the data in place and purges only Piwik’s internal logs for those dates, from which it creates the reports?

  2. In general, what is the behavior of Piwik if we import a log that contains some dates that have already been imported for a site? Will duplicate dates be ignored, or will they be imported as normal, adding more data to the same dates?

If our Apache logs are rotated once a week, we don’t want to wait a week before being able to import new data for a site. But if importing the same log again to pick up new entries causes all the previous data to be duplicated, then we’d have to wait until a log has been rotated and is no longer being written to before importing it into Piwik, correct?

Good job on Piwik! Thanks!


How to prevent duplicates in log analytics?
(Matthieu Aubry) #2
  1. Purging means the data is “marked as invalid” and will be re-processed the next time Piwik has the opportunity.

  2. Duplicates are not ignored; this is a missing feature.

If you need a Professional Support plan, please consider contacting us at: http://piwik.org/consulting/


#3

Some months have passed, but if you are still facing the problem, here is my solution:

In /etc/cron.hourly (or wherever you prefer):


# read number of processed records
eval $(awk '{ print "count="$1 }' /var/log/httpd/count.url)

# import log files, skipping processed records
python /var/www/usage/misc/log-analytics/import_logs.py --idsite=3 --recorders=2 --skip=$count --enable-http-errors --enable-http-redirects --enable-static --url=http://piwik.url /var/log/httpd/access_log >/dev/null

# save number of processed records
wc /var/log/httpd/access_log > /var/log/httpd/count.url
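
A slightly more defensive variant of the same idea, offered only as a sketch: the script above will skip everything once the log is rotated (the saved count is then larger than the new file), and lines written while the import is running can be counted without having been imported. The helper names `count_lines` and `lines_to_skip`, and the snapshot path in the commented usage, are hypothetical; the import_logs.py flags are the ones already used above.

```shell
#!/bin/sh
# Sketch only: function names and the snapshot path are assumptions,
# not part of Piwik or import_logs.py.

# Print the number of lines in the file given as $1.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Print how many lines of the log ($1) to skip, based on the count saved
# in the state file ($2). Falls back to 0 when there is no usable state
# or when the log is now shorter than the saved count (likely rotated).
lines_to_skip() {
    saved=0
    if [ -f "$2" ]; then
        saved=$(cat "$2")
    fi
    case "$saved" in
        ''|*[!0-9]*) saved=0 ;;   # empty or non-numeric state
    esac
    current=$(count_lines "$1")
    if [ "$saved" -gt "$current" ]; then
        saved=0
    fi
    echo "$saved"
}

# Possible usage (commented out so that sourcing this file imports nothing):
#   LOG=/var/log/httpd/access_log
#   STATE=/var/log/httpd/count.url
#   SNAP=/tmp/access_log.snapshot          # hypothetical path
#   cp "$LOG" "$SNAP"                      # import a fixed snapshot
#   skip=$(lines_to_skip "$SNAP" "$STATE")
#   python /var/www/usage/misc/log-analytics/import_logs.py --idsite=3 \
#       --skip="$skip" --url=http://piwik.url "$SNAP" >/dev/null \
#     && count_lines "$SNAP" > "$STATE"    # save only if the import succeeded
```

Importing a copy means the saved count always matches exactly what was imported. The rotation check is only a heuristic: if the new log has already grown past the old count before the next run, the first lines of the new log will still be skipped, so running this right after rotation (or resetting the state file from the rotation hook) is safer.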

bye,
tnapul


#4

Hi Mate

I am having similar issues.

Is this the best way to deal with it, or do you have an updated script/procedure?

Thanks

PS: I know this is a very old thread.


#5

Can anyone help? This doesn’t seem to work very well with the latest version of Piwik.

Is there any other way of doing it?