Details on import_logs.py bug that was possibly unnoticed but fixed

I’m not a software developer nor have I had much experience with a bug tracking system, so I figured I would format the data as best I could in a forum post so at least someone could see the problem/solution.

I have Piwik 1.11.1 setup exclusively with Apache logs for ~6 websites on a dedicated Linux server running Debian 6, so there’s no Javascript tracking to worry about. I have every website setup to automatically rotate the logs, import the data into Piwik, have Piwik archive and then archive the log files in a tar ball, but when I came in today I noticed there was nothing in Piwik for the weekend. I looked through my scripts log file and found that Piwik was labeling nearly every line as “invalid” during the import but just with a few websites, not all of them. After researching all day, I found that it seems to be a bug in the import_logs.py script for 1.11.1, but not for the script in 1.10.1 or in the master branch off of PiWiks git.

Here’s the beginning of a log file that causes this issue in 1.11.1:


127.0.0.1 - - [24/Mar/2013:00:05:04 -0400] "GET / HTTP/1.0" 200 2873 "-" "Wget/1.12 (linux-gnu)"
127.0.0.1 - - [24/Mar/2013:00:10:03 -0400] "GET / HTTP/1.0" 200 2873 "-" "Wget/1.12 (linux-gnu)"
127.0.0.1 - - [24/Mar/2013:00:10:03 -0400] "GET / HTTP/1.0" 200 2873 "-" "Wget/1.12 (linux-gnu)"
69.181.99.999 - - [24/Mar/2013:00:14:48 -0400] "GET /ca_bested/ HTTP/1.1" 200 664 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; FunWebProducts; SearchToolbar 1.1; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; ClickPotato 1.0.546.0; CP_IT_E8790372B276545230AA99)"
69.181.99.999 - - [24/Mar/2013:00:14:55 -0400] "GET /ca_bested/index.php?cacheGood=false&shortName=ca_bested HTTP/1.1" 302 703 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; FunWebProducts; SearchToolbar 1.1; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; ClickPotato 1.0.546.0; CP_IT_E8790372B276545230AA99)"
69.181.99.999 - - [24/Mar/2013:00:14:55 -0400] "GET /ca_bested/dashboard/Interface.html HTTP/1.1" 304 210 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; FunWebProducts; SearchToolbar 1.1; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; ClickPotato 1.0.546.0; CP_IT_E8790372B276545230AA99)"
69.181.99.999 - - [24/Mar/2013:00:14:56 -0400] "GET /ca_bested/dashboard/history/history.css HTTP/1.1" 304 210 "http://blah.blahblah.com/ca_bested/dashboard/Interface.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; FunWebProducts; SearchToolbar 1.1; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; ClickPotato 1.0.546.0; CP_IT_E8790372B276545230AA99)"
69.181.99.999 - - [24/Mar/2013:00:14:56 -0400] "GET /ca_bested/dashboard/AC_OETags.js HTTP/1.1" 304 211 "http://blah.blahblah.com/ca_bested/dashboard/Interface.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; FunWebProducts; SearchToolbar 1.1; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; ClickPotato 1.0.546.0; CP_IT_E8790372B276545230AA99)"
69.181.99.999 - - [24/Mar/2013:00:14:56 -0400] "GET /ca_bested/dashboard/history/history.js HTTP/1.1" 304 211 "http://blah.blahblah.com/ca_bested/dashboard/Interface.html" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; FunWebProducts; SearchToolbar 1.1; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30618; ClickPotato 1.0.546.0; CP_IT_E8790372B276545230AA99)"

And this is what the output of import_logs.py looks like


Parsing log /home/mattw/stupidLogs/www_blah.blahblah.com:_apache_access.log...
6717 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
12044 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 1 records/sec (avg), 5 records/sec (current)
15245 lines parsed, 5 lines recorded, 1 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
15245 lines parsed, 5 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Purging Piwik archives for dates: 2013-03-24 2013-03-25
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary
-------------------

    578 requests imported successfully
    0 requests were downloads
    14667 requests ignored:
        14667 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    578 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:



Performance summary
-------------------

    Total time: 33 seconds
    Requests imported per second: 17.4 requests per second


The first three lines of the log file is what the script is getting caught up on. Upon removing all three of those lines (Not just one), the script handles the log just fine with the following output:


Parsing log /home/mattw/stupidLogs/www_blah.blahblah.com:_apache_access.log...
3741 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 10 records/sec (avg), 64 records/sec (current)
4634 lines parsed, 64 lines recorded, 9 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 7 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 7 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 6 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 5 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 5 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 4 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 4 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 64 lines recorded, 4 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 218 lines recorded, 13 records/sec (avg), 154 records/sec (current)
4634 lines parsed, 229 lines recorded, 13 records/sec (avg), 11 records/sec (current)
4634 lines parsed, 229 lines recorded, 12 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 229 lines recorded, 12 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 21 records/sec (avg), 199 records/sec (current)
4634 lines parsed, 428 lines recorded, 20 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 19 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 18 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 17 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 17 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 16 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 15 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 15 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 428 lines recorded, 14 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 562 lines recorded, 18 records/sec (avg), 134 records/sec (current)
4634 lines parsed, 562 lines recorded, 18 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 562 lines recorded, 17 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 562 lines recorded, 17 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 853 lines recorded, 25 records/sec (avg), 291 records/sec (current)
4634 lines parsed, 867 lines recorded, 24 records/sec (avg), 14 records/sec (current)
4634 lines parsed, 867 lines recorded, 24 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 867 lines recorded, 23 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 867 lines recorded, 22 records/sec (avg), 0 records/sec (current)
4634 lines parsed, 867 lines recorded, 22 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1250 lines recorded, 31 records/sec (avg), 383 records/sec (current)
5631 lines parsed, 1250 lines recorded, 30 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1250 lines recorded, 29 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1250 lines recorded, 29 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1250 lines recorded, 28 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1250 lines recorded, 27 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1250 lines recorded, 27 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1250 lines recorded, 26 records/sec (avg), 0 records/sec (current)
5631 lines parsed, 1330 lines recorded, 27 records/sec (avg), 80 records/sec (current)
5631 lines parsed, 1330 lines recorded, 27 records/sec (avg), 0 records/sec (current)
....

Once the import_logs.py script is replaced with the 1.10.1[/url] version or the [url=https://raw.github.com/piwik/piwik/master/misc/log-analytics/import_logs.py]master version from GitHub, it handles the log file fine without having to edit it, so the bug seems to be fixed as of now but I thought others might have the same frustrations as me.

I guess this is related to this one - 1.11 - import_logs.py broken parser · Issue #3805 · matomo-org/matomo · GitHub

Try to use the script from the latest GIT repository as we fixed this bug, then it should work. we will release new beta & version in a couple weeks

Where is that github version? If i browse in your signature i end 17 jan 2013 -> History for misc/log-analytics/import_logs.py - mattab/piwik · GitHub

My importlogs also stopped working after latest update and im not sure why. Whatever i type and try i always get the “option page” who say to me what i can do and how. Script worked since month without any problems or changes. And now i cant get it back to work.