Pageview meaning with import_logs.py


#1

I am running piwik 2.04-b1.

I ran some tests importing Apache log files for our static web server. I call the import from a shell script as follows:

/usr/bin/python /var/www/html/piwik/misc/log-analytics/import_logs.py --debug --url=https://w3stat.unil.ch/piwik/ /var/tmp/stats/$server/access --idsite=xxx --config=/var/www/html/piwik/config/config.ini.php --recorders=2 --log-hostname=www2.unil.ch --hostname=www2.unil.ch --enable-static --enable-bots --enable-http-errors --enable-http-redirects --enable-reverse-dns --strip-query-string --output=/var/log/piwik/prod.out 2>&1

I get some strange results. For example, for one specific URL I get:

/ci/situnil/img/vd_blue.gif 26 unique downloads 26 downloads.

However, running a simple shell command on the same log file that was imported gives:

grep "ci/situnil/img/vd_blue.gif" /var/tmp/stats/prod/access_full | wc -l
1698

How can you explain such a difference? I am not sure I really understand what a pageview means in Piwik. I expected it to correspond to the number of lines containing "GET ci/situnil/img/vd_blue.gif" in the log file.
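A raw grep counts every line that contains the path, whatever the HTTP status or the client. One way to make the comparison more concrete is to break that count down by status code and by distinct client IP. The following is only a minimal sketch in Python 3, assuming the Apache combined log format; the regular expression, the `breakdown` helper and the sample lines are hypothetical and are not part of import_logs.py.

```python
import re
from collections import Counter

# Assumed layout: Apache combined log format. Illustrative regex only.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def breakdown(lines, path_fragment):
    """Count requests whose path contains path_fragment.

    Returns (total, Counter of status codes, set of client IPs).
    """
    total = 0
    by_status = Counter()
    ips = set()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or path_fragment not in m.group("path"):
            continue
        total += 1
        by_status[m.group("status")] += 1
        ips.add(m.group("ip"))
    return total, by_status, ips

# Hypothetical sample lines, NOT taken from the real log file.
sample = [
    '10.0.0.1 - - [22/Jan/2014:00:00:01 +0100] "GET /ci/situnil/img/vd_blue.gif HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [22/Jan/2014:00:00:02 +0100] "GET /ci/situnil/img/vd_blue.gif HTTP/1.1" 304 - "-" "Mozilla/5.0"',
    '10.0.0.2 - - [22/Jan/2014:00:00:03 +0100] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

total, by_status, ips = breakdown(sample, "ci/situnil/img/vd_blue.gif")
print(total, dict(by_status), len(ips))
```

To run it against a real file, pass an open file object instead of `sample`, for example `breakdown(open("/var/tmp/stats/prod/access_full"), "ci/situnil/img/vd_blue.gif")`.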

I am really confused …

I see the same discrepancy for some specific pages, not only for downloads.

Any help is welcome

Best regards


(Matthieu Aubry) #2

Are you comparing the right dates, i.e. are you looking at the same dates in Piwik as in your log files?


#3

Yes I do. …


(Matthieu Aubry) #4

Can you find a small log file with which this error can be reproduced, i.e. a log file with 5-10 lines that shows the problem clearly?


#5

OK, so I ran the following test.

I ran this command:
/usr/bin/python /var/www/html/piwik/misc/log-analytics/import_logs.py --url=https://w3stat.unil.ch/piwik/ /var/tmp/stats/app/access_test --idsite=564 --config=/var/www/html/piwik/config/config.ini.php --recorders=2 --log-hostname=www3.unil.ch --hostname=www3.unil.ch --enable-static --enable-bots --enable-http-errors --enable-http-redirects --enable-reverse-dns --strip-query-string --output=/var/log/piwik/test.out

You can have a look at the results for this site on our Piwik site: https://w3stat.unil.ch/piwik using piwik/debug4piwik as user/pwd.

You will see that the Piwik results are wrong both for the visitor log (some IPs are ignored) and for the Actions > Pages report.

I am really confused about that…
I get the same results for all my parsed log files. They all come from an Apache web server using the combined (NCSA…) format…

Below is the content of the access_test log file:

46.229.160.208 - - [22/Jan/2014:00:08:50 +0100] "GET /wpmu/dalai-lama/files/2013/04/dalai_lama_fr.pdf HTTP/1.1" 200 5754305 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9) Gecko"
83.139.189.139 - - [22/Jan/2014:00:08:53 +0100] "GET /wpmu/alumnil/tag/technology/ HTTP/1.0" 200 30463 "http://www3.unil.ch/wpmu/alumnil/tag/technology/" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/20100101 Firefox/17.0"
83.139.189.139 - - [22/Jan/2014:00:08:54 +0100] "GET /wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/ HTTP/1.0" 200 34367 "http://www3.unil.ch/wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/20100101 Firefox/17.0"
83.139.189.139 - - [22/Jan/2014:00:08:55 +0100] "GET /wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/index.php HTTP/1.0" 301 - "http://www3.unil.ch/index.php" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/20100101 Firefox/17.0"
83.139.189.139 - - [22/Jan/2014:00:08:55 +0100] "GET /wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/index.php HTTP/1.0" 301 - "http://www3.unil.ch/index.php" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/20100101 Firefox/17.0"
83.139.189.139 - - [22/Jan/2014:00:08:56 +0100] "GET /wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/index.php HTTP/1.0" 301 - "http://www3.unil.ch/index.php" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/20100101 Firefox/17.0"
150.82.33.22 - - [22/Jan/2014:00:09:05 +0100] "GET /wpmu/pgact/files/2011/02/DPPG_Meeting_2011_PROGRAM_A4_ohne-header.jpg HTTP/1.1" 200 106526 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:26.0) Gecko/20100101 Firefox/26.0"
130.223.16.101 - - [22/Jan/2014:00:09:14 +0100] "GET /wpmu/cinn/feed/ HTTP/1.1" 304 - "-" "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 Lightning/2.6.3"
65.55.24.218 - - [22/Jan/2014:00:09:14 +0100] "GET /wpmu/musique/feed/ HTTP/1.1" 200 54282 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
130.223.16.101 - - [22/Jan/2014:00:09:14 +0100] "GET /wpmu/allezsavoir/feed/ HTTP/1.1" 304 - "-" "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 Lightning/2.6.3"
65.55.24.218 - - [22/Jan/2014:00:09:15 +0100] "GET /wpmu/fae/fagenda/action~agenda/tag_ids~69971,69999,69988,8732/ HTTP/1.1" 200 73655 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
83.233.207.74 - - [22/Jan/2014:00:11:11 +0100] "GET /wpmu/esvdc/forums/users/foresvdc/ HTTP/1.1" 200 11085 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"
83.233.207.74 - - [22/Jan/2014:00:11:12 +0100] "GET /wpmu/esvdc/forums/forum/research-2/ HTTP/1.1" 200 13327 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"
And here is the content of /var/log/piwik/test.out:

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log /var/tmp/stats/app/access_test…
Purging Piwik archives for dates: 2014-01-21
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: How to Set up Auto-Archiving of Your Reports - Analytics Platform - Matomo for more info.

Logs import summary

13 requests imported successfully
2 requests were downloads
0 requests ignored:
    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

13 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 0 seconds
Requests imported per second: 37.22 requests per second

Thanks a lot for your help …

Best regards


#6

Hello,
Any suggestions for this problem? Thanks …

Isa


(Matthieu Aubry) #7

Can you reproduce this problem on a new Piwik server, with a particular log file?

If you can give me exact steps to reproduce the problem, I can look into whether we can fix it.


#8

I have the problem with ALL my access_log files …
As you asked, I tried to select a small log file for you so that you could check it.

I don't understand what you mean by a new Piwik server.
And what would that change? I already tried defining a new Piwik site and the problem was still there … and I have the same problem with all the Piwik sites for which I generate statistics using the import_logs.py script.
Could you have a look at the results for the site related to the log file I sent you, on our Piwik site: https://w3stat.unil.ch/piwik (using piwik/debug4piwik as user/pwd)?

Best regards


(Matthieu Aubry) #9

"You will see that the Piwik results are wrong both for the visitor log (some IPs are ignored) and for the Actions > Pages report."

Could you explain in detail what is wrong in your opinion?


#10

For example, some IPs are missing from the Visitor Log report for 22 January.

65.55.24.218 and 83.233.207.74 are not there, although they are present in the log file (see my preceding message).
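To check which IPs to expect in the visitor log, the log file can be summarised per client IP together with the user agents seen for each. This is a minimal sketch in Python 3, assuming the Apache combined format; the regular expression and the `requests_per_ip` helper are hypothetical, and the three sample lines are copied from the test log file above.

```python
import re
from collections import defaultdict

# Assumed layout: Apache combined log format. Illustrative regex only.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def requests_per_ip(lines):
    """Map each client IP to (request count, set of user agents)."""
    counts = defaultdict(int)
    agents = defaultdict(set)
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines that do not look like combined format
        counts[m.group("ip")] += 1
        agents[m.group("ip")].add(m.group("agent"))
    return {ip: (counts[ip], agents[ip]) for ip in counts}

# Three lines taken from the access_test log file in this thread.
sample = [
    '65.55.24.218 - - [22/Jan/2014:00:09:14 +0100] "GET /wpmu/musique/feed/ HTTP/1.1" 200 54282 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '65.55.24.218 - - [22/Jan/2014:00:09:15 +0100] "GET /wpmu/fae/fagenda/action~agenda/tag_ids~69971,69999,69988,8732/ HTTP/1.1" 200 73655 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '83.233.207.74 - - [22/Jan/2014:00:11:11 +0100] "GET /wpmu/esvdc/forums/users/foresvdc/ HTTP/1.1" 200 11085 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"',
]

result = requests_per_ip(sample)
for ip, (count, user_agents) in sorted(result.items()):
    print(ip, count, sorted(user_agents))
```

Listing the user agent next to each IP makes it easy to see, for instance, that the 65.55.24.218 requests identify themselves as bingbot.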

And the Actions > Pages report is empty, although my log file contains requests such as:

83.139.189.139 - - [22/Jan/2014:00:08:54 +0100] "GET /wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/ HTTP/1.0" 200 34367 "http://www3.unil.ch/wpmu/alumnil/participez-a-la-construction-dun-nouvel-avenir-technologique-et-social/" "Mozilla/5.0 (Windows NT 5.2; rv:17.0) Gecko/20100101 Firefox/17.0"

I should also mention that for the same site I can see some page views (in the Actions > Pages report) because I use the WP Piwik plugin for this site.

The Actions > Downloads report is the only one that seems to be correct.

So in conclusion, I cannot reconcile the results for each individual WordPress site, generated using the WP Piwik plugin, with the results for all my WordPress sites, generated using import_logs.py. Indeed, the totals for all WP sites (i.e. 250 sites!) are much lower than for one individual site. That is what alerted me that something was wrong with import_logs.py.

Thanks for your help…


(Matthieu Aubry) #11

Thanks for the report! Using your log file I found a bug in Piwik :)

I created a ticket, "Log Analytics --enable-bots does not enable bots tracking" (issue #4628 in matomo-org/matomo on GitHub), and also fixed it. Can you try the small change in import_logs.py? It is the one at the top of the commit "Fixes #4628 --enable-bots now works as expected + our importLogs inte…" (matomo-org/matomo@63f1ce6 on GitHub).


#12

I just can't test anything right now, since import_logs.py from version 2.04b8 does not work at all :X (see my report "301 Moved Permanently") …

So I am waiting for a corrected version of import_logs.py before testing.

Is your fix only related to counting bot accesses?


(Matthieu Aubry) #13

Can you please try with beta 10? It works fine for me.


#14

import_logs.py from version 2.04b11 still gives the same error, so there is no way to test yet!!


#15

OK, now that I have import_logs.py working with Python 2.7 and Piwik 2.1rc1, I could come back and test the pageviews problem.

I am sorry to report that I get exactly the same problem with my log imports. Some pages and IPs are missing (see my tickets of 4 February). So is there a particular definition of pages for Piwik and import_logs.py?

Thanks for your answers


(Matthieu Aubry) #16

Can you describe exactly whether there is still a bug with 2.1 RC? As far as I know, the log import is working fine. If you find a bug, please create a ticket or post a comment with a log file to reproduce the problem, plus an explanation of what you think the bug is. Thanks!


#17

The problem is exactly the same… I cannot tell you more than what I described at the beginning of this topic… The test log file is still the same, and the missing pages and IPs are exactly the same as those I reported at the beginning of this topic…
I definitely think import_logs.py has some problems, and I am seriously thinking of going back to wusage: less attractive but … working.


#18

Another question that may help me to understand:

Where would a request like this one be counted in Piwik?

85.3.20.113 - - [16/Feb/2014:23:23:14 +0100] "GET /dilps/image.php?PHPSESSID=mt3bemjvgemu2rv1b8ct8jb1d7&id=12:53820&resolution=120x90 HTTP/1.1" 200 55774 "Base de données DILPS" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:13.0) Gecko/20100101 Firefox/13.0.1"

???
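Since the import above was run with --strip-query-string, it may help to see what the URL of such a request looks like once its query string is dropped. The sketch below (Python 3, standard library only) shows the idea in general; `strip_query` is a hypothetical helper, not the code used by import_logs.py, and it says nothing about whether Piwik files the request under pages or downloads.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Request path taken from the log line above.
path = "/dilps/image.php?PHPSESSID=mt3bemjvgemu2rv1b8ct8jb1d7&id=12:53820&resolution=120x90"
stripped = strip_query(path)
print(stripped)  # /dilps/image.php
```

With the query string removed, every image served through this script collapses into the single URL /dilps/image.php, which is worth keeping in mind when comparing per-URL counts against a grep on the raw log.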


#19

I went back and looked over the problems you are facing. It seems that getting back to basics may help solve this mystery…

You are using the WordPress plugin, correct?

Could you pick a small site from your group and, instead of using the WordPress plugin, put the tracking code on it manually, then use the stats from there to test the log importing? I'm concerned because the plugin is not developed by the Piwik team, so it adds an extra layer of uncertainty that we could then eliminate.


#20

Thanks for your reply, BUT the problem is not the WordPress plugin, which works perfectly on my individual WP sites …

The problem arises when I want to use Piwik with log files via the import_logs.py script. In that case I get crazy results… I noticed this when I compared the WP Piwik plugin results with the results from the log file for all my WP sites (I run a multisite instance of WordPress…). I got ridiculously low pageview counts for all my WP sites together, lower than the results for one individual WP site … That is what prompted me to run all those tests with the import_logs.py script…

In summary, this has nothing to do with the WP plugin. I ran tests on several independent log files from my Apache servers that are not related to WordPress at all.