If I have the tracking scripts on the page and also import the server logs, will I be double-counting visitors, or does the import script de-duplicate?
This is similar to this very old request, which is also relevant. Did this ever get implemented?
Does the import_logs.py script detect whether it has parsed a log file before? We are currently working to implement Piwik, but it seems that subsequent runs of the import_logs.py script import additional requests.
Thanks
Paul
Reading the feature request related to your problem:
opened 10:52PM - 12 Jul 16 UTC · enhancement
[Log Analytics](http://piwik.org/log-analytics/) is a powerful tool of the Piwik platform, used by thousands of people in many interesting use cases. It is quite powerful and relatively easy to use, and has many options and features. We aim to make our tools as easy as possible to use; this issue is about making Log Analytics easier to use and even more flexible.
## Issue: the log data is not deduplicated
When you import logs into Piwik, Piwik always imports and tracks all of them. When you import the same log file again, it is imported again through the Tracking API, and the data ends up duplicated in the Piwik database.
## Why this is not good enough
Our users rightfully expect Piwik to be easy to use and to do the right thing. Recently @Synchro reported this issue and did not expect Log Analytics to import the same data again and again. See the description at: https://github.com/piwik/piwik/issues/10248#issuecomment-232085596
Over the years many users have reported experiencing this issue.
## Existing workaround
So far most people manage to use Log Analytics despite this limitation. The common workaround is to create one log file per hour or per day and import each log file only once. Commonly, people write a script that ensures each log file is imported exactly once, for example by ingesting log files into Piwik only after they have been rotated.
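The import-once workaround described above can be sketched as a small wrapper script. This is a minimal illustration, not part of Log Analytics itself; the log directory, Matomo URL, and state-file name are assumptions.

```python
import json
import subprocess
from pathlib import Path

LOG_DIR = Path("/var/log/nginx")         # assumption: rotated logs live here
STATE_FILE = Path("imported_logs.json")  # local record of files already imported

def already_imported() -> set:
    """Load the set of log-file names that were imported in earlier runs."""
    if STATE_FILE.exists():
        return set(json.loads(STATE_FILE.read_text()))
    return set()

def mark_imported(done: set) -> None:
    """Persist the updated set of imported file names."""
    STATE_FILE.write_text(json.dumps(sorted(done)))

def import_once() -> None:
    done = already_imported()
    # Only touch rotated files (e.g. access.log.1), never the live access.log,
    # so each finished file is fed to import_logs.py exactly once.
    for log in sorted(LOG_DIR.glob("access.log.*")):
        if log.name in done:
            continue
        subprocess.run(
            ["python", "import_logs.py", "--url=https://matomo.example", str(log)],
            check=True,
        )
        done.add(log.name)
    mark_imported(done)
```

Note this only prevents re-importing a whole file; a file that was partially imported and re-run would still produce duplicates, which is exactly the gap the proposals below try to close.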
## Solution
Ideally, we do not want people to worry whether they have imported a given log file, or even whether a log file was partially imported before and is re-imported again. We want Piwik to automatically [deduplicate](https://en.wikipedia.org/wiki/Data_deduplication) the [tracking API](http://developer.piwik.org/api-reference/tracking-api) data.
So far I see two possible ways to fix this issue:
### 1. New Piwik Tracking API feature: request ID deduplicator
The [Tracking API](http://piwik.org/docs/tracking-api/) could introduce a new feature letting Tracking API users specify a `request ID` for each request. Piwik would store the `request ID` for each request and use it as a unique key: if any Tracking API request with a given `request ID` has already been tracked/imported for a given `date`, the request would be skipped. Each `request ID` would be imported at most once per day.
The Log Analytics tool would then simply create a request ID for each parsed log line and pass it with the Tracking API request, letting the Tracking API deduplicate the requests. Log Analytics could derive this request ID as a hash of the log line, for example.
- Pros: other Tracking API SDKs and clients will be able to use this feature to deduplicate the data.
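A minimal sketch of how option 1 could behave, assuming the request ID is a hash of the raw log line and the server keeps a per-day set of seen IDs (the `track` function and its semantics are hypothetical, not an existing Piwik API):

```python
import hashlib

def request_id(log_line: str) -> str:
    # Deterministic ID: the same log line always hashes to the same value,
    # so re-importing that line yields a duplicate ID the server can skip.
    return hashlib.sha256(log_line.encode("utf-8")).hexdigest()[:16]

# Hypothetical server-side dedup store, keyed by (date, request ID).
seen = set()

def track(date: str, log_line: str) -> bool:
    """Return True if the request was tracked, False if skipped as a duplicate."""
    key = (date, request_id(log_line))
    if key in seen:
        return False
    seen.add(key)
    return True
```

Re-importing the same log file would then call `track` with identical `(date, request ID)` pairs, and every line after the first import would be skipped.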
### 2. Implement request ID deduplicator in Log Analytics only
Alternatively, we could implement this feature exclusively in Log Analytics, making the tool clever enough to send each log line's tracking data to the Piwik Tracking API only once.
The Log Analytics Python app could, for example, keep track of the log files imported before, as well as the request IDs/hashes of all log lines already imported, indexed by date, perhaps in an SQLite database.
- Pros: maybe easier to implement.
- Cons: this works only when people import their data from a single server (when several servers use Log Analytics, they would not share the "request ID database" amongst them and so may import the same data twice).
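Option 2 could be sketched with Python's built-in `sqlite3` module, using a `(day, line_hash)` primary key so the database itself rejects duplicates. The table name and schema are assumptions for illustration, not the actual implementation:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # in production this would be a file on disk
conn.execute(
    "CREATE TABLE IF NOT EXISTS imported ("
    " day TEXT NOT NULL,"
    " line_hash TEXT NOT NULL,"
    " PRIMARY KEY (day, line_hash))"
)

def should_send(day: str, log_line: str) -> bool:
    """Record the line; return True only the first time it is seen for that day."""
    h = hashlib.sha256(log_line.encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO imported (day, line_hash) VALUES (?, ?)", (day, h)
    )
    # INSERT OR IGNORE reports 0 modified rows when the key already exists.
    return cur.rowcount == 1
```

The importer would call `should_send` for every parsed line and skip the Tracking API request when it returns `False`; since the state lives in a local database, the single-server limitation mentioned in the cons above still applies.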
## Summary
This feature would be awesome to have, and would make Log Analytics much more flexible and easier to use and set up.
What do you think?
It seems this has not been implemented yet.
Also, the above GitHub feature request only concerns duplicates coming from importing server log files several times, not hits coming from both a server log file and the HTTP (or JavaScript) Tracking API. How would Matomo identify that a hit from a server log file and one from the HTTP API are the same? For a given hit, the IP and page URL may match (not necessarily the URL, if for example the Tracking API changes it), but log files have no knowledge of the visitor ID. You also can't rely on timestamps, because there are delays between:
- the server time for the page hit
- the client time when the page is loaded
- the Matomo time when the tracking request is received