Discrepancies between Google Analytics and Matomo server-side imports with import_logs.py - with ratio 40 to 1

We are seeing a significant discrepancy between Matomo, which imports data by reading the server logs, and Google Analytics, which we keep running in parallel.

To give a practical example, in July 2023 Matomo recorded 4,112,600 visits, while Google Analytics recorded a more credible 94,097: roughly a forty-fold difference.
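As a quick check of the figures quoted above, the ratio can be computed directly. The variable names are only for illustration.

```python
# Visit counts for July 2023 as quoted in the post.
matomo_visits = 4_112_600
ga_visits = 94_097

# Number of Matomo visits recorded per Google Analytics visit.
ratio = matomo_visits / ga_visits
print(f"Matomo / Google Analytics = {ratio:.1f}x")
```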

We cannot pinpoint the cause, but we are certain that Matomo is inflating the statistics, not least because our hosting service could not come close to handling such volumes.

The import system is set up via crontab in the following way:

0 22 * * * python3 /var/www/html/matomo/misc/log-analytics/import_logs.py --url=http://19*...1/matomo/ --idsite=2 --recorders=4 --enable-http-errors --enable-http-redirects --enable-static --enable-bots /var/log/httpd/443-access_log > /home//***logs/matomo_import.log

The import log shows this:

Logs import summary

532450 requests imported successfully
5191 requests were downloads
14095 requests ignored:
    0 HTTP errors
    0 HTTP redirects
    14095 invalid log lines
    0 filtered log lines
    0 requests did not match any known site
    0 requests did not match any --hostname
    0 requests done by bots, search engines...
    0 requests to static resources (css, js, images, ico, ttf...)
    0 requests to file downloads did not match any --download-extensions

Website import summary

532450 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 3900 seconds
Requests imported per second: 136.51 requests per second

The strangest thing is that the summary reports 0 requests to static resources, even though the log is a plain access log with entries like:

*** - - [04/Oct/2023:15:12:41 +0200] "GET /templates/*/js/_box.js?ver=1696425161 HTTP/1.1" 200 7596

From your experience, is there anything we can do to get credible visits from Matomo using reading Apache logs?

Perhaps the problem lies in the parameters passed in the crontab. I’m afraid those parameters disable the filters that make a log import comparable to JavaScript tracking. Is this possible?
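One way to judge how much flags such as --enable-static could inflate the numbers is to count, outside Matomo, how many log lines are requests for static files. The following is only a rough sketch: the extension list, the regular expression and the function names are assumptions made for illustration, not what import_logs.py uses internally.

```python
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit
import re

# Hypothetical extension list for illustration; import_logs.py keeps its own.
STATIC_EXTENSIONS = {"css", "js", "png", "jpg", "jpeg", "gif", "ico", "svg", "ttf", "woff", "woff2"}

# Pulls the request path out of the quoted request field of an access-log line.
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+)\s+(?P<path>\S+)\s+\S+"')

def is_static(path):
    """True if the URL path (query string ignored) ends in a static extension."""
    ext = splitext(urlsplit(path).path)[1].lstrip(".").lower()
    return ext in STATIC_EXTENSIONS

def tally(lines):
    """Count static vs. other requests; lines without a request field are 'unparsed'."""
    counts = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m is None:
            counts["unparsed"] += 1
        else:
            counts["static" if is_static(m.group("path")) else "other"] += 1
    return counts

# The sample line from the post, with straight quotes.
sample = [
    '*** - - [04/Oct/2023:15:12:41 +0200] "GET /templates/*/js/_box.js?ver=1696425161 HTTP/1.1" 200 7596',
]
print(tally(sample))
```

To run it on a real log, pass an open file instead of the sample list, for example `tally(open("/var/log/httpd/443-access_log", errors="replace"))`.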

I’m here to list the problems I’m still experiencing.

From a first analysis, the problem seemed to be the crontab, whose parameters told the importer to import everything (HTTP errors, redirects, static resources and bots).

The new crontab is therefore like this:

0 22 * * * python3 /var/www/html/matomo/misc/log-analytics/import_logs.py --url=http://19...1/matomo/ --idsite=2 /var/log/httpd/443-access_log > /home/* /***logs/matomo_import.log

But the import still does not work properly, and bots continue to be recorded in the visit logs.

We then enabled user-agent logging in the access log file:

192.168.32.229 - - [22/Oct/2023:03:28:20 +0200] "GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js?ver=1697760000 HTTP/1.1" 200 3440 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

But even with this change the import is still incorrect: the summary still reports 0 requests done by bots.

Logs import summary

129793 requests imported successfully
4453 requests were downloads
193049 requests ignored:
    625 HTTP errors
    1116 HTTP redirects
    13443 invalid log lines
    0 filtered log lines
    0 requests did not match any known site
    0 requests did not match any --hostname
    0 requests done by bots, search engines...
    177865 requests to static resources (css, js, images, ico, ttf...)
    0 requests to file downloads did not match any --download-extensions
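To read a summary like the one above more easily, it can be parsed into numbers and cross-checked. This is a small sketch with a hypothetical helper, `parse_summary`; it only assumes the "count label" layout shown in the post.

```python
import re

# The summary from the post, copied verbatim.
SUMMARY = """\
129793 requests imported successfully
4453 requests were downloads
193049 requests ignored:
    625 HTTP errors
    1116 HTTP redirects
    13443 invalid log lines
    0 filtered log lines
    0 requests did not match any known site
    0 requests did not match any --hostname
    0 requests done by bots, search engines...
    177865 requests to static resources (css, js, images, ico, ttf...)
    0 requests to file downloads did not match any --download-extensions
"""

def parse_summary(text):
    """Return (top_level, ignored_breakdown) dicts mapping label -> count."""
    top, ignored = {}, {}
    for line in text.splitlines():
        m = re.match(r'(\s*)(\d+)\s+(.*)', line)
        if not m:
            continue
        indent, count, label = m.groups()
        # Indented lines are the per-reason breakdown of the ignored requests.
        (ignored if indent else top)[label.rstrip(':')] = int(count)
    return top, ignored

top, ignored = parse_summary(SUMMARY)
total = top['requests imported successfully'] + top['requests ignored']
static = ignored['requests to static resources (css, js, images, ico, ttf...)']
print(f"static share of all requests: {static / total:.1%}")
print("bots filtered:", ignored['requests done by bots, search engines...'])
```

The breakdown adds up to the "requests ignored" total, so the zero in the bots line is a real result of the import and not a gap in the report.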

At the moment I don’t know how to solve the problem. It would seem that the import system via logs does not work well.

Do you have any suggestions?

That is a little bit curious.

On the other hand, my web host also generates statistics from the access logs, while I track only with the Matomo JavaScript tracker, with few restrictions. The difference is big: Matomo records only about half as many visits as the web host reports. Both figures exclude bots.

I found a solution, but it requires modifying the script.

Add a new format next to the existing regular expressions:

_TEST_EXTENDED_LOG_FORMAT = (_COMMON_LOG_FORMAT +
    r'\s+(?P<user_agent>.+)'
)

Then register it in FORMATS so that the script knows about it:

FORMATS = {
    'common': RegexFormat('common', _COMMON_LOG_FORMAT),
    'test': RegexFormat('test', _TEST_EXTENDED_LOG_FORMAT),
    # ... the remaining built-in formats stay unchanged ...
}

Force the use of this format in the import command:

python3 import_logs.py --url=http://localhost/y-analytics 443-access_log.txt --idsite=8 --log-format-name=test

It is a dirty fix, but it works: bots are now detected in this particular access log:

Logs import summary

79 requests imported successfully
10 requests were downloads
780 requests ignored:
    3 HTTP errors
    2 HTTP redirects
    21 invalid log lines
    0 filtered log lines
    0 requests did not match any known site
    0 requests did not match any --hostname
    **39 requests done by bots, search engines...**
    715 requests to static resources (css, js, images, ico, ttf...)
    0 requests to file downloads did not match any --download-extensions
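The effect of the extra user_agent group can be checked in isolation. The sketch below uses a simplified stand-in for the common log format (the real _COMMON_LOG_FORMAT in import_logs.py may differ in detail) and a hypothetical `looks_like_bot` helper that is much cruder than Matomo's own bot detection. It suggests, as an assumption rather than a confirmed diagnosis, that a line with an unquoted user agent can still match a pattern that has no user_agent group, in which case the user agent is never seen.

```python
import re

# Simplified stand-in for the common log format; the real _COMMON_LOG_FORMAT
# in import_logs.py may differ in detail.
COMMON = (
    r'(?P<ip>\S+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)'
)
# Same idea as the 'test' format: everything after the length is the user agent.
EXTENDED = COMMON + r'\s+(?P<user_agent>.+)'

# The Googlebot line from the post, with straight quotes.
LINE = ('192.168.32.229 - - [22/Oct/2023:03:28:20 +0200] '
        '"GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js?ver=1697760000 HTTP/1.1" '
        '200 3440 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 Mobile Safari/537.36 '
        '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)')

def looks_like_bot(user_agent):
    """Hypothetical check: a crude substring test, not Matomo's bot list."""
    return user_agent is not None and 'bot' in user_agent.lower()

common_match = re.match(COMMON, LINE)
extended_match = re.match(EXTENDED, LINE)

# The common pattern matches the start of the line but captures no user agent.
ua_common = common_match.groupdict().get('user_agent')
ua_extended = extended_match.group('user_agent')
print(looks_like_bot(ua_common), looks_like_bot(ua_extended))
```

If the Apache configuration can be changed, logging the referrer and user agent in quotes (the usual combined format) would probably avoid the need to patch the script, but that was not tested in this thread.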