How to prevent duplicates in log analytics?


#1

Hello,
I’m trying to set up Piwik log analytics to complement my already working JavaScript analytics. I’m on shared hosting that gives me access to an (archived) log file, which is updated every hour with new log entries. I read all of the guides, FAQs and tutorials I could find on importing access logs, and I successfully imported the log for my website. The next obvious step would be to set up a cron job to pull new log entries every hour; however, I’m not sure how to prevent duplicate data from being pulled in. I couldn’t find any info on whether import_logs.py has a built-in mechanism to prevent duplicates, so I’m assuming that it doesn’t. I’m also not technical enough to come up with my own solution, so my question is: can anybody point me to a place where I could find some info on how to solve this problem, or maybe even share their own solution?


(Steve Mercer) #3

The Log Analytics FAQ covers this.


#4

As I mentioned in my original post, I’m not a technical person, but I have read the FAQ and, to be honest, I do not see the issue of duplicates mentioned anywhere. I apologize for asking something that may be obvious, but does it mean that import_logs.py has a built-in mechanism to prevent duplicates?

A response from matthieu in a 2012 post, which can be found here: Importing apache logs as long-term strategy, suggests that “duplicates are not ignored, this is a missing feature”, but has it been fixed since? Further responses, even from 2015, imply that this is still an issue.


(Steve Mercer) #5

I haven’t tried it but the FAQ suggests that you “likely would import log files hourly or daily into Piwik” and shows the command to put in a cron job.


#6

Yep, and it works. I tried it myself, no problem there. The issue I’m concerned with is that my logs are rotated on a monthly basis. If I run a cron job once a day or once an hour, it would import the same data multiple times, and I’m just not knowledgeable enough to figure out how to handle it.


(Steve Mercer) #7

In that case I’ll have to defer to someone else who may know.


(Ilmtr) #8

Any updates on preventing duplicates in log analytics? Is setting up a pre-processor to split these appending log files into unique hourly / daily / etc. log files currently the best practice? It would be nice if Matomo could handle this built in.
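
For what it’s worth, a minimal sketch of such a pre-processor in Python might look like the following. It is only an illustration under stated assumptions, not anything shipped with Matomo: the log is assumed to be a plain-text file that is only ever appended to between rotations, and the function name `extract_new_lines` and the JSON state file are hypothetical. The idea is to remember the byte offset reached on the previous run and copy only the lines after it into a fresh file, which could then be imported instead of the full log.

```python
import json
import os


def extract_new_lines(log_path, state_path, out_path):
    """Copy lines appended to log_path since the last run into out_path.

    Hypothetical helper: the state file format is an assumption of this sketch.
    Returns the number of new lines written.
    """
    # Load the byte offset reached on the previous run (0 on the first run).
    offset = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            offset = json.load(f).get("offset", 0)

    # If the log is now shorter than the saved offset, assume it was rotated
    # and start again from the beginning of the new file.
    if os.path.getsize(log_path) < offset:
        offset = 0

    count = 0
    with open(log_path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(offset)
        for line in src:
            # Stop at a trailing partial line; it is picked up on the next run.
            if not line.endswith(b"\n"):
                break
            dst.write(line)
            offset += len(line)
            count += 1

    # Persist the new offset so the next run starts where this one stopped.
    with open(state_path, "w") as f:
        json.dump({"offset": offset}, f)
    return count
```

A cron job could call this once an hour and pass the resulting file to import_logs.py only when the returned count is greater than zero. Note that the rotation check is a heuristic: if a rotated log grows past the saved offset before the next run, the lines before that offset would be skipped, so the run interval should be short relative to the rotation period.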