Difficulty with IIS w3c format via stdin via syslog-ng


I have my Windows system logs, including IIS logs in W3C format, sent to a syslog-ng collector.
The syslog-ng collector uses a child process to import them automatically via the Python script import_logs.py.
I’ve downloaded the latest import_logs.py from git for the syslog server, but my Piwik installation is 2.4.1.
The syslog server’s Python is 2.7.5.
The log fields are: date time cs-method cs-uri-stem cs-uri-query cs-username c-ip cs(User-Agent) cs(Referer) cs-host sc-status sc-bytes time-taken

My syslog-ng script is:
exec python /tools/piwik/scripts/import_logs.py
--idsite-fallback=13 --url=https://myinternal.server.yo/piwik/
--config=/tools/piwik/config/config.ini.php --enable-http-errors
--enable-http-redirects --enable-static --enable-bots --token-auth=****
--log-format-name=w3c_extended --w3c-time-taken-millisecs
--log-format-regex='(?P^\d+[-\d+]+[\d+:]+) \S+ (?P/\S*) (?P<query_string>\S*) (?P\S+) (?P[\d*.]) (?P<user_agent>".?"|\S+) (?P\S+) (?P\S+) (?P\d+) (?P\S+) (?P<generation_time_secs>[.\d]+)' \

That regex is what I pieced together from the latest import_logs.py script looking at the w3c section.
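
A candidate regex can be checked offline, before it goes into the syslog-ng command, with a few lines of Python. This is only a minimal sketch under stated assumptions: the group names (date, path, query_string, userid, ip, user_agent, referrer, host, status, length, generation_time_secs) are assumed to be what the importer expects and should be compared with the import_logs.py actually in use; the sample line is made up to follow the field list above, with 192.0.2.10 as a documentation IP; and `parse_line` is a hypothetical helper, not part of the importer.

```python
import re

# Assumed group names; verify them against the import_logs.py you run.
W3C_LINE = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '  # date and time, space-separated
    r'\S+ '                                             # cs-method, matched but not captured
    r'(?P<path>/\S*) '                                  # cs-uri-stem
    r'(?P<query_string>\S*) '                           # cs-uri-query
    r'(?P<userid>\S+) '                                 # cs-username
    r'(?P<ip>[\d.]+) '                                  # c-ip
    r'(?P<user_agent>".*?"|\S+) '                       # cs(User-Agent)
    r'(?P<referrer>\S+) '                               # cs(Referer)
    r'(?P<host>\S+) '                                   # cs-host
    r'(?P<status>\d+) '                                 # sc-status
    r'(?P<length>\S+) '                                 # sc-bytes
    r'(?P<generation_time_secs>[.\d]+)'                 # time-taken
)

# Made-up sample line in the field order listed above (the IP is a placeholder).
SAMPLE = (
    '2015-01-07 19:47:39 POST /onebanana/sp3/rifd/ '
    'query=fuseaction=planData.manf_d johndoe 192.0.2.10 '
    'Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:31.0)+Gecko/20100101+Firefox/31.0 '
    'https://myinternal.server.yo/oranges/sp3/rifd/?fuseaction=planData.incView '
    'myinternal.server.yo 200 37080 370'
)


def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = W3C_LINE.match(line)
    return match.groupdict() if match else None


fields = parse_line(SAMPLE)
print(fields)
```

Note that the sketch puts an explicit space between the date and time parts and skips cs-method with `\S+`. A pattern like this also requires a c-ip value, so a line that has nothing between the username and the user agent would come back as `None`.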

And the error output is:

cat iis.log |./test_iis.sh

2015-01-07 16:21:34,188: [DEBUG] Accepted hostnames: all
2015-01-07 16:21:34,189: [DEBUG] Piwik URL is: https://myinternal.server.yo/piwik/
2015-01-07 16:21:34,189: [DEBUG] Authentication token token_auth is: *****
2015-01-07 16:21:34,189: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2015-01-07 16:21:34,190: [DEBUG] Launched recorder
Parsing log (stdin)…
2015-01-07 16:21:34,190: [DEBUG] Invalid line detected (line did not match): 2015-01-07 19:47:39 POST /onebanana/sp3/rifd/ query=fuseaction=planData.manf_d johndoe Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:31.0)+Gecko/20100101+Firefox/31.0 https://myinternal.server.yo/oranges/sp3/rifd/?fuseaction=planData.incView myinternal.server.yo 200 37080 370

2015-01-07 16:21:34,191: [DEBUG] Invalid line detected (line did not match): 2015-01-07 19:47:39 GET /twobanana/sp3/rifd/_scripts/showGrid.js _=1420660061830 johndoe Mozilla/5.0+(Windows+NT+6.1;+WOW64;+rv:31.0)+Gecko/20100101+Firefox/31.0 https://myinternal.server.yo/oranges/sp3/rifd/?fuseaction=planData.incView myinternal.server.yo 200 8602 17

In an earlier attempt using the Python script from 2.4.1, I was able to get close, except that it reported “invalid date”. That regex was:
--log-format-regex='(?P^\d+[-\d+]+[\d+:]+) \S+ (?P.?) (?P<query_string>\S) (?P\S+) (?P[\d*.]) (?P<user_agent>.?) (?P.?) ((?P[\w-.])(?::\d+)?) (?P\d+) (?P\d+) (?P<generation_time_secs>\d+)' - \

So… What I was wondering is… what am I doing wrong?
Or, is the regex-format method able to process the W3C date field at all? I looked at the Python code and I see entries like these:
self.date_format = '%d/%b/%Y:%H:%M:%S'
self.date_format = '%Y-%m-%dT%H:%M:%S'
super(W3cExtendedFormat, self).__init__('w3c_extended', None, '%Y-%m-%d %H:%M:%S')
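
To see which of those three patterns can actually read an IIS timestamp, they can be tried directly with `datetime.strptime`. A minimal sketch: the timestamp is taken from the sample lines in the debug output above, and `try_formats` is a hypothetical helper written for this check, not something in import_logs.py.

```python
from datetime import datetime

# Date and time exactly as they appear at the start of the IIS lines.
stamp = '2015-01-07 19:47:39'

formats = [
    '%d/%b/%Y:%H:%M:%S',   # Apache-style pattern
    '%Y-%m-%dT%H:%M:%S',   # pattern with a literal "T" between date and time
    '%Y-%m-%d %H:%M:%S',   # space-separated pattern from the W3cExtendedFormat line
]


def try_formats(value, candidates):
    """Map each format string to the parsed datetime, or None if it fails."""
    results = {}
    for fmt in candidates:
        try:
            results[fmt] = datetime.strptime(value, fmt)
        except ValueError:
            results[fmt] = None
    return results


results = try_formats(stamp, formats)
for fmt in formats:
    parsed = results[fmt]
    print('%s -> %s' % (fmt, parsed if parsed else 'does not parse'))
```

Only the space-separated pattern parses the value, which would be consistent with an “invalid date” message if a custom regex ends up paired with one of the other two patterns.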

And you can see I tried to tell it to expect the w3c_extended format before the regex… but I’m at a loss.
When I tried using just --log-format-name=w3c_extended, I got a whole different set of errors:
Parsing log (stdin)…
Traceback (most recent call last):
File "/tools/piwik/scripts/import_logs.py", line 1900, in
File "/tools/piwik/scripts/import_logs.py", line 1871, in main
File "/tools/piwik/scripts/import_logs.py", line 1689, in parse
File "/tools/piwik/scripts/import_logs.py", line 1234, in check_format
elif 'host' not in format.regex.groupindex and not config.options.log_hostname:
AttributeError: 'NoneType' object has no attribute 'groupindex'
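
The traceback says that `format.regex` was `None` when `.groupindex` was read. The same failure can be reproduced in isolation. This is a sketch only: `FakeFormat` and `has_host_group` are hypothetical stand-ins, not the importer's real classes, and why the regex is unset in the real script is not something the traceback shows. One plausible guess, and it is only a guess, is that the pattern for this format is built from a `#Fields:` header that lines relayed through syslog-ng do not carry.

```python
import re


class FakeFormat(object):
    """Hypothetical stand-in for a log format whose regex may not be set yet."""

    def __init__(self, name, regex=None):
        self.name = name
        # The regex stays None until a pattern is supplied.
        self.regex = re.compile(regex) if regex is not None else None


def has_host_group(fmt):
    """Guarded check: a missing regex counts as 'no host group'."""
    return fmt.regex is not None and 'host' in fmt.regex.groupindex


unready = FakeFormat('w3c_extended')
ready = FakeFormat('custom', r'(?P<host>\S+) (?P<status>\d+)')

# The unguarded access fails the same way as in the traceback.
try:
    'host' in unready.regex.groupindex
    error_text = None
except AttributeError as exc:
    error_text = str(exc)

print(error_text)
print(has_host_group(unready), has_host_group(ready))
```

The guard only avoids the crash; a format with no regex still cannot parse anything, so the pattern would have to come from somewhere before parsing starts.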


I really think the current import_logs.py script is missing some of the routines it needs to handle IIS/W3C logs.

If I try it using --log-format-name=w3c_extended, it fails with an error about the field setup: “AttributeError: 'NoneType' object has no attribute 'groupindex'”. There is a bug in git for this already.

If I use the following regex to override the log-format-name, it fails with “invalid date”. The regex is:
'(?P^\d+[-\d]+\s\d+[:\d+]+) (\S+) (?P.?) (?P<query_string>\S) (?P\S+) (?P[\d*.]) (?P<user_agent>.?) (?P.?) ((?P[\w-.])(?::\d+)?) (?P\d+) (?P\S+) (?P<generation_time_secs>\d+)' \

This regex passes the test at http://ksamuel.pythonanywhere.com.

I looked at the Python code (which is new to me) and I see classes that define two date field types:
#1, class BaseFormat(object), using pattern '%d/%b/%Y:%H:%M:%S', for the common/apache/etc. formats.
#2, class JsonFormat(BaseFormat), using pattern '%Y-%m-%dT%H:%M:%S'.

I am thinking of adding another class like the Json block whose date pattern has a space instead of the “T” between date and time.
But I’m at a loss on how to actually reference this class when using regex inputs. Hopefully someone can chime in here and point me in a good direction? Should this be filed as a bug at the git site?
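
For illustration, the idea of a format class that differs only in its date pattern could look roughly like this. It is a self-contained sketch: `SimpleRegexFormat` and `SpaceDateFormat` are hypothetical classes invented here, they do not mirror the real BaseFormat or JsonFormat signatures in import_logs.py, and the sketch does not show how the importer would select such a class for a custom regex.

```python
import re
from datetime import datetime


class SimpleRegexFormat(object):
    """Hypothetical stand-in, NOT the importer's real BaseFormat."""

    date_format = '%d/%b/%Y:%H:%M:%S'  # Apache-style default

    def __init__(self, name, regex):
        self.name = name
        self.regex = re.compile(regex)

    def parse_date(self, line):
        """Return the line's timestamp as a datetime, or None if no match."""
        match = self.regex.match(line)
        if match is None:
            return None
        return datetime.strptime(match.group('date'), self.date_format)


class SpaceDateFormat(SimpleRegexFormat):
    """Same matching logic, but date and time are separated by a space."""

    date_format = '%Y-%m-%d %H:%M:%S'


DATE_REGEX = r'(?P<date>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
LINE = '2015-01-07 19:47:39 GET /twobanana/sp3/rifd/_scripts/showGrid.js'

fmt = SpaceDateFormat('iis_space_date', DATE_REGEX)
parsed = fmt.parse_date(LINE)
print(parsed)

# The same regex with the default date pattern matches the line but
# cannot convert the captured timestamp.
try:
    SimpleRegexFormat('default_date', DATE_REGEX).parse_date(LINE)
    default_failed = False
except ValueError:
    default_failed = True
print(default_failed)
```

The point of the sketch is that the regex and the date pattern are separate concerns: a line can match the regex and still be rejected when the captured timestamp is converted.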

(Matthieu Aubry) #3

Hi there, thanks for the report. Sorry about the bug. We will fix it, but can you please create an issue in the tracker (matomo-org/piwik on GitHub) and paste 3-4 lines of your log there, along with the command used to reproduce?



Thanks! Created bug 6968.