import_log, log-format-regex, and invalid lines

[size=medium][/size]I posted this on stack overflow, but hopefully this is a better place to get an answer …

I’m new to piwik and trying to import a bunch of logs. I need help with the log-format-regex. A sample line from the log is:


"1.1.1.1" 2.2.2.2 - myuser [09/Dec/2012:04:03:29 -0500] "GET /signon.html HTTP/1.1" 304 "http://www.example.com/example" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1"

The command I’m running is here:


python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://ec2-1-1-1-1.compute-1.amazonaws.com/piwik/ /disk2/httpd_prod-psweb1/access_log.2012-11-09 --enable-static --idsite=1 --dry-run --log-format-regex='\\\\"(?P<ip>\\\\S+)\\\\" \\\\S+ \\\\S+ \\\\S+ \\\\[(?P<date>.*?) (?P<timezone>.*?)\\\\] \\\\"\\\\S+ (?P<path>.*?) \\\\S+\\\\" (?P<status>\\\\S+) (?P<length>\\\\S+) \\\\"(?P<referrer>.*?)\\\\" \\\\"(?P<user_agent>.*?)\\\\"'

I’m consistently getting all “requests ignored” and “invalid log lines”. For example:


Logs import summary
0 requests imported successfully
0 requests were downloads
236252 requests ignored:
    236252 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

How can I fix log-format-regex?

TIA,
dan

Edit: Environment is:
Piwik 1.9.2
Python 2.7.3
PHP 5.3.10
on Ubuntu 12.04

what server version? what php version? what python version?

Aw, sorry, I shoulda added the server info. Edited OP so its in there now.

Edit: Environment is:
Piwik 1.9.2
Python 2.7.3
PHP 5.3.10
on Ubuntu 12.04

Is there any error logs output outside of what you provided?

No other error logs.

what is the source of the logs?

Hmm the python version is ok as its at least 2.7.x

The php also any chance you could try php 5.4.x to see if that helps?

Here is another user with an issue does this shed any light?

I hope this is helpful! (This is still a work in progress and I encourage feed back or help! )

This was most useful in working the live regex custom log format option:

http://ksamuel.pythonanywhere.com/

if you know the valve variables from server.xml (tomcat), like:

common - %h %l %u %t “%r” %s %b
combined - %h %l %u %t “%r” %s %b “%{Referer}i” “%{User-Agent}i”

in my case I used:
pattern=’%h %S %t %s %b %D %m %U “%{User-Agent}i”’

I identified what was currently in the code pulling this from the import_log.py (so I had a clue about what I was attempting to do):

_HOST_PREFIX = ‘(?P[\w-.])(?::\d+)? ‘
_COMMON_LOG_FORMAT = (
’(?P\S+) \S+ \S+ [(?P.
?) (?P.?)] ‘
’"\S+ (?P.
?) \S+" (?P\S+) (?P\S+)’
)
_NCSA_EXTENDED_LOG_FORMAT = (_COMMON_LOG_FORMAT +
’ “(?P.?)" "(?P<user_agent>.?)”’
)
_S3_LOG_FORMAT = (
’\S+ (?P\S+) [(?P.?) (?P.?)] (?P\S+) ‘
’\S+ \S+ \S+ \S+ “\S+ (?P.?) \S+" (?P\S+) \S+ (?P\S+) ‘
’\S+ \S+ \S+ "(?P.
?)” “(?P<user_agent>.*?)”’
)
_ICECAST2_LOG_FORMAT = ( _NCSA_EXTENDED_LOG_FORMAT +
’ (?P<session_time>\S+)’
)

FORMATS = {
‘common’: RegexFormat(‘common’, _COMMON_LOG_FORMAT),
‘common_vhost’: RegexFormat(‘common_vhost’, _HOST_PREFIX + _COMMON_LOG_FORMAT),
‘ncsa_extended’: RegexFormat(‘ncsa_extended’, _NCSA_EXTENDED_LOG_FORMAT),
‘common_complete’: RegexFormat(‘common_complete’, _HOST_PREFIX + _NCSA_EXTENDED_LOG_FORMAT),
‘iis’: IisFormat(),
‘s3’: RegexFormat(‘s3’, _S3_LOG_FORMAT),
‘icecast2’: RegexFormat(‘icecast2’, _ICECAST2_LOG_FORMAT),
}

Then pieced this together:

(?P[\w-.])(?::\d+)? \S+ [(?P.?) (?P.?)] (?P\S+)? \S+ (?P\S+) (?P\S+) (?P.?) “(?P<user_agent>.*?)”

and looking at one log line:

Raw:
10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / “Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31”

match.group():
u’10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / “Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31”’

match.groupdict():
{u’date’: u’15/May/2013:19:55:38’, u’host’: u’10.88.168.198’, u’length’: u’64’, u’path’: u’/’, u’request’: u’GET’, u’status’: u’302’, u’timezone’: u’+0000’, u’user_agent’: u’Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31’}

and adding it back to --log-format-regex=’(?P[\w-.])(?::\d+)? \S+ [(?P.?) (?P.?)] (?P\S+)? \S+ (?P\S+) (?P\S+) (?P.?) “(?P<user_agent>.*?)”’

BOOM … Logs imported.

Can you post your valve pattern?

What about this log format
#LogFormat “%h %v %u %t “%r” %>s %b “%{Referer}i” “%{User-Agent}i” “%{Cookie}i”” best1

URCHIN log format

Guys, if you manage to detect a known format, and that would be useful to other users, it would be GREAT If you could contribute back the information to us. For example we’d like to add a test file in: https://github.com/piwik/piwik/tree/master/misc/log-analytics/tests/logs

Also, we’d like to update the README: https://github.com/piwik/piwik/tree/master/misc/log-analytics#piwik-server-log-analytics-import-your-server-logs-in-piwik if it can be improved!

thanks for letting me know