I hope this is helpful! (This is still a work in progress, and I encourage feedback or help!)
This live regex tester was the most useful tool for working out the custom log format option:
http://ksamuel.pythonanywhere.com/
It helps to know the valve pattern variables from server.xml (Tomcat), such as:
common - %h %l %u %t "%r" %s %b
combined - %h %l %u %t "%r" %s %b "%{Referer}i" "%{User-Agent}i"
In my case I used:
pattern='%h %S %t %s %b %D %m %U "%{User-Agent}i"'
To get a clue about what I was attempting to do, I first looked at what import_log.py currently defines:
_HOST_PREFIX = '(?P<host>[\w\-\.]*)(?::\d+)? '
_COMMON_LOG_FORMAT = (
    '(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    '"\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+)'
)
_NCSA_EXTENDED_LOG_FORMAT = (_COMMON_LOG_FORMAT +
    ' "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
)
_S3_LOG_FORMAT = (
    '\S+ (?P<host>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<ip>\S+) '
    '\S+ \S+ \S+ \S+ "\S+ (?P<path>.*?) \S+" (?P<status>\S+) \S+ (?P<length>\S+) '
    '\S+ \S+ \S+ "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
)
_ICECAST2_LOG_FORMAT = ( _NCSA_EXTENDED_LOG_FORMAT +
    ' (?P<session_time>\S+)'
)
FORMATS = {
    'common': RegexFormat('common', _COMMON_LOG_FORMAT),
    'common_vhost': RegexFormat('common_vhost', _HOST_PREFIX + _COMMON_LOG_FORMAT),
    'ncsa_extended': RegexFormat('ncsa_extended', _NCSA_EXTENDED_LOG_FORMAT),
    'common_complete': RegexFormat('common_complete', _HOST_PREFIX + _NCSA_EXTENDED_LOG_FORMAT),
    'iis': IisFormat(),
    's3': RegexFormat('s3', _S3_LOG_FORMAT),
    'icecast2': RegexFormat('icecast2', _ICECAST2_LOG_FORMAT),
}
Then I pieced this together:
(?P<host>[\w\-\.]*)(?::\d+)? \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<status>\S+)? \S+ (?P<length>\S+) (?P<request>\S+) (?P<path>.*?) "(?P<user_agent>.*?)"
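As a rough sketch of how the pieces line up, each valve code from my pattern can be paired with the regex fragment that stands in for it, and the fragments joined with single spaces. This is written for Python 3; the names VALVE_FRAGMENTS, valve_pattern and log_regex are made up for illustration and are not part of import_log.py.

```python
import re

# Hypothetical pairing of each Tomcat valve code with the regex fragment
# used for it. Group names follow the ones the importer formats use.
VALVE_FRAGMENTS = [
    ("%h", r"(?P<host>[\w\-\.]*)(?::\d+)?"),         # remote host, optional :port
    ("%S", r"\S+"),                                   # session ID, skipped
    ("%t", r"\[(?P<date>.*?) (?P<timezone>.*?)\]"),   # [date timezone]
    ("%s", r"(?P<status>\S+)?"),                      # HTTP status code
    ("%b", r"\S+"),                                   # bytes sent, skipped
    ("%D", r"(?P<length>\S+)"),                       # captured as "length"
    ("%m", r"(?P<request>\S+)"),                      # request method
    ("%U", r"(?P<path>.*?)"),                         # requested URL path
    ('"%{User-Agent}i"', r'"(?P<user_agent>.*?)"'),   # quoted user agent
]

# Joining both columns with spaces rebuilds the valve pattern and the regex.
valve_pattern = " ".join(code for code, _ in VALVE_FRAGMENTS)
log_regex = " ".join(fragment for _, fragment in VALVE_FRAGMENTS)

re.compile(log_regex)  # raises re.error if a fragment is malformed
print(valve_pattern)
print(log_regex)
```

One thing this layout makes visible: with this pattern, the length group seems to capture the %D field (processing time) while the %b field (bytes sent) is skipped, so that mapping may need another look.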
Looking at one log line:
Raw:
10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"
match.group():
u'10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"'
match.groupdict():
{u'date': u'15/May/2013:19:55:38', u'host': u'10.88.168.198', u'length': u'64', u'path': u'/', u'request': u'GET', u'status': u'302', u'timezone': u'+0000', u'user_agent': u'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31'}
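If anyone wants to repeat this check locally instead of in the online tester, something like the following should do it. It is only a minimal sketch for Python 3 (so the values print without the u prefix), with the regex written out using the group names from the groupdict output above; LOG_REGEX and SAMPLE_LINE are names I made up for the example.

```python
import re

# The pieced-together regex, as a raw string so the backslashes survive.
LOG_REGEX = (
    r'(?P<host>[\w\-\.]*)(?::\d+)? \S+ '
    r'\[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'(?P<status>\S+)? \S+ (?P<length>\S+) '
    r'(?P<request>\S+) (?P<path>.*?) "(?P<user_agent>.*?)"'
)

# The raw log line quoted above.
SAMPLE_LINE = (
    '10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / '
    '"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 '
    '(KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"'
)

# re.match anchors at the start of the line, like the tester output above.
match = re.match(LOG_REGEX, SAMPLE_LINE)
if match is None:
    print("no match")
else:
    # Print one captured field per line, sorted by group name.
    for name, value in sorted(match.groupdict().items()):
        print(f"{name:>10}: {value}")
```

Trying a few more lines this way (for example one with a real session ID or a non-zero byte count) is a cheap way to catch fields that shift position before running a full import.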
Then I added it back as --log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<status>\S+)? \S+ (?P<length>\S+) (?P<request>\S+) (?P<path>.*?) "(?P<user_agent>.*?)"'
BOOM... logs imported, although I'm still having an issue with the actual browser type. I'll post the final version once I have it.