How to show more accurate download stats using import_logs.py


#1

Hi,

This is my first post on these forums, so first off I’d like to give a big thanks to all the developers who work on Piwik for creating such a great product!

We’ve been using Piwik for around 6 months now, always using import_logs.py. We currently track 15 sites this way on a dedicated server. Everything is working very well.

However, when we view the download statistics, we notice two things that seem a little off:

1 - The downloads are tracking almost only png, jpg and js downloads (some PDFs and other formats as well, but a very short list of those, see #2 below). Since these are not in the DOWNLOAD_EXTENSIONS setting in import_logs.py, I’m guessing that download stats tracks the most popular extensions and ignores this param? Is this correct?

Hopefully to remedy this, we’ve just now added --download-extensions= and a comma separated list of extensions we would like to track. Is this the only way to track only download files we wish to track?

2 - When we view PDF downloads, we’re only seeing downloads for some of our PDFs. It is very important that we have a count of ALL PDF downloads. Again I’m guessing that Piwik only records the more popular downloads, and drops stats for the less popular downloads. How can I force Piwik to show stats for ALL downloads matching my --download-extensions= param?

This is my current EXTRA_PARAMS setting:

EXTRA_PARAMS='--enable-http-errors --enable-http-redirects --enable-static --enable-bots --recorders=8 --recorder-max-payload-size=500 --download-extensions=bin,doc,docx,exe,gz,gzip,mpg,mp3,mp4,mpeg,pdf,ppt,pptx,rar,tar,tbz,bz2,tbz,tgz,txt,wav,wma,wmv,xls,xlsx,xml,xsd,zip'
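
As an aside, the list above contains tbz twice. A duplicate is presumably harmless, but a long list is easier to maintain when it is generated. A minimal Python sketch, where build_download_extensions is a hypothetical helper and not part of import_logs.py:

```python
def build_download_extensions(extensions):
    """Return a comma-separated, de-duplicated, lower-cased extension list.

    Hypothetical helper for illustration; not part of import_logs.py.
    """
    seen = []
    for ext in extensions:
        # Accept "pdf", ".PDF" and " pdf " as the same extension.
        ext = ext.strip().lstrip(".").lower()
        if ext and ext not in seen:
            seen.append(ext)
    return ",".join(seen)


raw = "bin,doc,docx,exe,gz,gzip,pdf,tar,tbz,bz2,tbz,tgz,zip".split(",")
print("--download-extensions=" + build_download_extensions(raw))
```

The order of first appearance is kept, so the generated value stays easy to compare with the hand-written one.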

Is there anything I’m missing, or more I can tweak to get more accurate download stats?

Many thanks for any suggestions you might have.


(Matthieu Aubry) #2

Since these are not in the DOWNLOAD_EXTENSIONS setting in import_logs.py, I’m guessing that download stats tracks the most popular extensions and ignores this param? Is this correct?

Correct, but we often add new extensions to this list. Maybe we could add yours?

Hopefully to remedy this, we’ve just now added --download-extensions= and a comma separated list of extensions we would like to track. Is this the only way to track only download files we wish to track?

I would instead recommend opening a pull request against import_logs.py in Piwik,
if you think it makes sense for all Piwik users to track those files as downloads.

Again I’m guessing that Piwik only records the more popular downloads, and drops stats for the less popular downloads.

Yes, see: After the top 500 or top 1000 rows, Piwik automatically groups pages, keywords, websites, etc. under the label "Others"; How do I force Piwik to not limit the data? - Analytics Platform - Matomo
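
A rough illustration of that row limit, as a Python sketch. The function name, the limit of 3 and the sample counts are made up for the example; Piwik's real archiving code is PHP and more involved.

```python
def truncate_report(rows, max_rows):
    """Keep the max_rows most-hit labels and fold the rest into "Others".

    rows maps a label (for example a download URL) to its hit count.
    """
    ranked = sorted(rows.items(), key=lambda item: item[1], reverse=True)
    kept = dict(ranked[:max_rows])
    # Everything past the limit loses its own row and is only summed.
    others = sum(count for _, count in ranked[max_rows:])
    if others:
        kept["Others"] = others
    return kept


downloads = {
    "logo.png": 900,
    "app.js": 750,
    "guide.pdf": 40,
    "spec.pdf": 2,
    "ktx20.tgz": 1,
}
print(truncate_report(downloads, max_rows=3))
```

With a limit of 3, the rarely downloaded files are no longer listed individually, which matches the symptom of only some PDFs showing up in the report.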


#3

Since these are not in the DOWNLOAD_EXTENSIONS setting in import_logs.py, I’m guessing that download stats tracks the most popular extensions and ignores this param? Is this correct?

For clarification:
The above was referring to .png, .jpg, .gif extensions, as the extensions not in DOWNLOAD_EXTENSIONS. However, we are seeing mostly .png, .jpg, and .gif files in the top downloads when I view my download stats in Piwik. To me, Piwik appears to be ignoring what is actually in DOWNLOAD_EXTENSIONS in import_logs.py. We’ve tried to manually force what we want tracked as downloads by adding our preferred extensions to --download-extensions= when import_logs.py is called (see first post). We should see results later today hopefully once the current run is complete.

I guess what I’m asking is:
1 - Does import_logs.py really ignore DOWNLOAD_EXTENSIONS in import_logs.py, or have we misconfigured something on our end?
2 - Will adding our preferred extensions to --download-extensions= force import_logs.py to track ONLY those extensions, or will .PNG, .JPG and others still appear in our stats?

Again I’m guessing that Piwik only records the more popular downloads, and drops stats for the less popular downloads.
Yes, see: [piwik.org]

Ah, thank you so much! I do remember seeing this when I first set up Piwik, but then forgot it was there.

Many thanks for your patience, and apologies for the confusion in my original questions.


#4

Our logs just finished rotating, and unfortunately nothing has improved. Only some PDFs are listed under downloads and no .tgz files at all, but plenty of png, jpg, and js files. This is despite running import_logs.py with the following script:


#!/bin/bash
# bash rather than sh: the script relies on [[ ... =~ ]] and the <<< here-string.

PYTHON='/usr/bin/python'
IMPORT_LOGS='/var/www/html/misc/log-analytics/import_logs.py'
PIWIKURL='https://weblogs.oursite.com'
EXTRA_PARAMS='--enable-http-errors --enable-http-redirects --enable-static --enable-bots --recorders=8 --recorder-max-payload-size=500 --download-extensions=bin,doc,docx,exe,gz,gzip,mpg,mp3,mp4,mpeg,pdf,ppt,pptx,rar,tar,tbz,bz2,tbz,tgz,txt,wav,wma,wmv,xls,xlsx,xml,xsd,zip'
LOG_LOCATION='/var/www/logs'
LOG_BACKUPS='/var/www/logs_backup'

for i in "$LOG_LOCATION"/*.gz
do
        # Skip files already moved by an earlier iteration for the same site.
        [ -e "$i" ] || continue

        base=$(basename "$i")
        # The site name is everything before the first "-" in the file name.
        SITE_URL=$(echo "$base" | sed 's/-.*//')
        SITE_ID=$(mysql -s <<< "use piwik;select idsite from piwik_site where main_url='http://$SITE_URL'")

        # Create the site through the API if no numeric id was found.
        re='^[0-9]+$'
        if ! [[ $SITE_ID =~ $re ]]
        then
            ID=$(curl -# "$PIWIKURL/index.php?module=API&format=json&token_auth=XXXXXXXXXXXXXXXXXXXXXX&method=SitesManager.addSite&siteName=$SITE_URL&urls=$SITE_URL")
            SITE_ID=$(echo "$ID" | /root/bin/jsawk 'return this.value' | sed 's/\(\[\|\]\)//g')
        fi

        # Re-check that the id is numeric so the -lt test cannot fail on an empty value.
        if ! [[ $SITE_ID =~ $re ]] || [ "$SITE_ID" -lt 1 ]
        then
            echo "$SITE_URL couldn't be created"
        else
                echo "$SITE_URL ($SITE_ID) starting to parse logs";
                # $EXTRA_PARAMS is intentionally unquoted so it splits into separate options.
                $PYTHON $IMPORT_LOGS --url=$PIWIKURL $EXTRA_PARAMS --idsite=$SITE_ID $LOG_LOCATION/$SITE_URL-*
                mv -f $LOG_LOCATION/$SITE_URL-* $LOG_BACKUPS/
                echo "*****  $SITE_URL logged and backed up  *****";
        fi
done

If I run the following directly on our gzipped log files I get the following output:


# zgrep  ".tgz" oursite.com-ssl_log-Feb-20150210* | awk '{print $7}' | sed s#.*\/## | sort | uniq -c | sort -n
      1 ESCTS-3.0.3.0-20140530.tgz
      1 ESCTS-3.1.0.0-20140530.tgz
      1 ktx20.tgz
      2 tgz.png
      4 khronos_headers.tgz
     14 opencl-icd-1.2.11.0.tgz

We should have the above tgz files listed, and no png, jpg or js files listed.

Here is a sample line from our log file:


nnn.nnn.nnn.nnn - - [10/Feb/2015:02:33:16 -0800] "GET /path/to/files/KTX/downloads/ktx20.tgz HTTP/1.1" 200 349303 "https://oursite.com/path/to/files/KTX/" "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36"
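
The zgrep check above also counts tgz.png, because the pattern matches anywhere in the line. A small Python sketch that counts only requests whose path really ends in a given extension might look like this. It assumes the combined log format of the sample line, and the function names are hypothetical:

```python
import gzip
import posixpath
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches the request part of a combined-format line, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')


def count_downloads(lines, extension):
    """Count requested file names whose final extension equals `extension`."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        # Drop any query string before looking at the file name.
        path = urlsplit(match.group(1)).path
        name = posixpath.basename(path)
        if posixpath.splitext(name)[1].lower() == "." + extension:
            counts[name] += 1
    return counts


def count_downloads_in_file(log_path, extension):
    """Same as count_downloads, reading a gzipped log file."""
    with gzip.open(log_path, "rt", errors="replace") as handle:
        return count_downloads(handle, extension)
```

Calling count_downloads_in_file on one of the rotated logs with "tgz" gives per-file totals that can be compared with what Piwik reports for the same day.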

What have we missed?


#5

Adding an update today, the second day with our changes: still no difference. We’re seeing a lot of JPG, PNG and JS files under Actions > Downloads, even though we specifically left those out of our --download-extensions param.

Does anyone know how we might go about fixing this bug? It’s a bit of a show stopper for us.

Further research shows that what I have above is “supposed” to work:

From: Log Analytics: new parameter --download-extensions to override list of files tracked as downloads · Issue #6214 · matomo-org/piwik · GitHub

One solid use case is “Track only PDF and doc files and ignore all the rest”, which this new parameter will provide via --download-extensions=pdf,doc

In our case, the rest are NOT being ignored. Ours is, as shown above:


--download-extensions=bin,doc,docx,exe,gz,gzip,mpg,mp3,mp4,mpeg,pdf,ppt,pptx,rar,tar,tbz,bz2,tbz,tgz,txt,wav,wma,wmv,xls,xlsx,xml,xsd,zip


#6

This is still an issue for us. After testing for another couple of months, our download stats consistently show every file and image, seemingly ignoring our --download-extensions=bin,doc,docx,exe,gz,gzip,mpg,mp3,mp4,mpeg,pdf,ppt,pptx,rar,tar,tbz,bz2,tbz,tgz,txt,wav,wma,wmv,xls,xlsx,xml,xsd,zip setting.

Is this a known bug, or are we still missing something in our settings? Although we’d hate to do it, we’re about to drop Piwik, go back to GA, and search for a new way to track downloads.

Any advice would be much appreciated.


(Matthieu Aubry) #7

Hi there,

the doc says:

--download-extensions=DOWNLOAD_EXTENSIONS
By default Piwik tracks as Downloads the most popular
file extensions. If you set this parameter (format:
pdf,doc,…) then files with an extension found in the
list will be imported as Downloads, other file
extensions downloads will be skipped.
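
Read literally, that help text describes behaviour along the lines of the Python sketch below. This is only a model of the documented behaviour, not the code in import_logs.py; parse_extensions and is_download are hypothetical names.

```python
import posixpath


def parse_extensions(option_value):
    """Turn "pdf,doc,tgz" into the set {"pdf", "doc", "tgz"}."""
    return {ext.strip().lower() for ext in option_value.split(",") if ext.strip()}


def is_download(path, download_extensions):
    """True if the path's final extension is in the configured set."""
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in download_extensions


allowed = parse_extensions("pdf,doc,tgz")
for path in ("/files/ktx20.tgz", "/images/tgz.png", "/js/site.js"):
    # Only the first path should be classified as a download.
    print(path, is_download(path, allowed))
```

If png or js files still show up as downloads with such a list in place, that difference between documented and observed behaviour is worth stating explicitly in the bug report.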

If you find that it does not work as advertised, then it is certainly a bug in Piwik.
If so, please create a bug report so that the core developers will see it. The log analytics tool is managed here: GitHub - matomo-org/piwik-log-analytics: Import any kind of server logs in Piwik for powerful log analytics. Universal log file parsing and reporting.