Incremental import_logs.py runs?

dom · February 13, 2014, 9:16am

Is it possible to do incremental runs of import_logs.py? That is, can it be run repeatedly against a log file without starting from the beginning each time? This is a basic function in old-school analyzers like awstats and webalizer; without it, processing is very inefficient and starts to load the server heavily as the logs get bigger.

I took a quick look at the code but couldn’t see anything that would cater for this (like recording markers of the last run), but maybe there’s another way that piwik handles this?

If not, would it be something that piwik would be interested in, if I implemented it?

matthieu · February 16, 2014, 8:29pm

Some users have done it and they have updated the README to explain to others how to do it: https://github.com/piwik/piwik/tree/master/misc/log-analytics/#setup-apache-customlog-that-directly-imports-in-piwik

Please check this out. Maybe this is not what you’re looking for, but if you can send your feedback on this documentation (or send a pull request with changes), that will be welcome!

dom · February 21, 2014, 12:29am

This is a great solution for small localised installs, but it’s not viable or a good idea for larger centralised server farms for various reasons (runs as root attached to web processes, blocks http worker, doesn’t work with already-centralised logs).

It looks like the existing script can’t keep track of last runs, but I think it would be a valuable addition, so I’ll have a go at implementing it. My Python is a little rudimentary and I’ve never used git before, so I can’t guarantee I’ll come up with anything sane.

dom · February 24, 2014, 5:26pm

OK I’ve had a quick hack at this that seems to work surprisingly well, and it’s only a handful of extra lines of code. It piggybacks the existing lineno counter and --skip option, except it automates it using tracking files that just contain the last value of lineno. It’s implemented as a single boolean option --auto-skip (that is off by default). If enabled, it creates a ‘skip-markers’ directory in the same place the script lives, and then creates .marker files within that directory with the last value of lineno. It then tries to read this at the start of a run and if it contains a valid lineno it skips to that point of the logfile, in the same way that --skip does.
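
For anyone who wants to try the idea outside import_logs.py first, here is a minimal standalone sketch of the same marker-file approach. The function names (`read_marker`, `write_marker`, `process_new_lines`) are made up for illustration and are not part of the script, and this version stores the count of lines already handled rather than the script's own lineno variable.

```python
import os


def read_marker(markerfile):
    """Return the stored line count, or 0 if the marker is missing or invalid."""
    try:
        with open(markerfile) as f:
            return int(f.readline())
    except (IOError, OSError, ValueError):
        return 0


def write_marker(markerfile, count):
    """Store the number of lines handled so far, creating the directory if needed."""
    markerdir = os.path.dirname(markerfile)
    if markerdir and not os.path.isdir(markerdir):
        os.makedirs(markerdir)
    with open(markerfile, "w") as f:
        f.write(str(count))


def process_new_lines(logfile, markerfile, handle):
    """Pass only lines not seen on a previous run to handle(); return how many."""
    already_done = read_marker(markerfile)
    seen = 0
    with open(logfile) as f:
        for seen, line in enumerate(f, 1):
            if seen <= already_done:
                continue  # handled on an earlier run
            handle(line)
    write_marker(markerfile, seen)
    return max(seen - already_done, 0)
```

Calling `process_new_lines()` repeatedly against a growing file hands each line to `handle` once. Like the patch, it still reads through the skipped lines on every run; it only avoids re-importing them.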

This allows you to run the import_logs.py script against a live logfile continuously (so you can cron it to run every minute, if you want, rather than just once a day and being forced to rotate the log), and is a great alternative if you don’t want to (or can’t) use piped logging. For example I fire all my weblogs from a webfarm over rsyslog/logstash to a central server (and consolidate into a single central logfile for each vhost), and then just run this script continuously against the single central logfile. It’s also much less resource intensive on a piwik system to run this continuously (with lots of smaller hits) rather than one big hit that saturates the webserver/database less frequently.

For the future, it would be even simpler if the Piwik API was extended to include a field for this tracking value; then the local files could be done away with altogether.

Hope this is of use to others. If it’s something that you’d be interested in adding, the diff against 2.1-rc2 is below (sorry, this forum won’t allow me to attach a diff file), or let me know if you’d like me to attempt a git pull request:

454a455,458
>             '--auto-skip', dest='auto_skip', action='store_true', default=False,
>             help="Track logfile processing and automatically skip processed lines on the next run.  This allows multiple runs against an active logfile.",
>         )
>         option_parser.add_option(
1523a1528,1539
>         # If auto-skip and a real file, try and read the last marker
>         markerdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "skip-markers")
>         markerfile = os.path.join(markerdir, os.path.basename(filename) + ".marker")
>         skipmarker = None
>         if config.options.auto_skip and file != sys.stdin:
>             if os.path.exists(markerfile):
>                 with open(markerfile, 'r') as f:
>                     try:
>                         skipmarker = int(f.readline())
>                     except:
>                         skipmarker = None
>
1555c1571
<             if stats.count_lines_parsed.value <= config.options.skip:
---
>             if stats.count_lines_parsed.value <= config.options.skip or (skipmarker and (stats.count_lines_parsed.value <= skipmarker)):
1670,1671c1686,1693
<
<
---
>         # If auto-skip, write the file marker
>         if config.options.auto_skip and file != sys.stdin:
>             # First, make sure the directory exists, in the same place as this script
>             if not os.path.isdir(markerdir):
>                 os.makedirs(markerdir)
>             # Write the ending lineno to the marker file
>             with open(markerfile, 'w') as f:
>                     f.write(str(lineno))
1711a1734,1743
>         # If auto-skip, write the file marker
>         if config.options.auto_skip and file != sys.stdin:
>             markerdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "skip-markers")
>             markerfile = os.path.join(markerdir, os.path.basename(filename) + ".marker")
>             # First, make sure the directory exists, in the same place as this script
>             if not os.path.isdir(markerdir):
>                 os.makedirs(markerdir)
>             # Write the ending lineno to the marker file
>             with open(markerfile, 'w') as f:
>                     f.write(str(lineno))

matthieu · February 24, 2014, 10:41pm

Thanks for posting this here! I have commented on the ticket “Log analytics list of improvements” (Issue #3163, matomo-org/matomo on GitHub); maybe others will be able to test it and report!

Techwolf · March 3, 2014, 10:07pm

You can use the “Formatted Code” button on the editor to prevent the mangling of your diff. Also, please use diff -u next time. I am definitely interested in this, as I cannot get import_logs.py to work on a named pipe.

Techwolf · March 3, 2014, 10:20pm

You might want to add an md5sum check on the first line or first few lines to make sure the same log file is being used and has not been rotated out by logrotate.
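
A rough sketch of what that check could look like, assuming the marker file stores the line count together with an MD5 of the log's first line. The function names and the "count digest" marker layout are hypothetical; the posted patch stores only the number.

```python
import hashlib


def first_line_digest(logfile):
    """MD5 hex digest of the first line of the log (raw bytes)."""
    with open(logfile, "rb") as f:
        return hashlib.md5(f.readline()).hexdigest()


def write_checked_marker(markerfile, logfile, count):
    """Save the line count plus a fingerprint of the file it belongs to."""
    with open(markerfile, "w") as f:
        f.write("%d %s" % (count, first_line_digest(logfile)))


def read_checked_marker(markerfile, logfile):
    """Return the saved count, or 0 if the marker is missing, invalid,
    or was written for a file with a different first line."""
    try:
        with open(markerfile) as f:
            count, digest = f.read().split()
        count = int(count)
    except (IOError, OSError, ValueError):
        return 0
    if digest != first_line_digest(logfile):
        return 0  # same name, different content: start from the top
    return count
```

If the log has been rotated and a new file now carries the same name, the first line will almost certainly differ, so the stored count is ignored and the import starts again from line one.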

dom · March 3, 2014, 10:30pm

Ah, yes. Here’s the udiff:


--- import_logs.py	2014-02-24 18:00:18.424895491 +0000
+++ import_logs.working.py	2014-02-24 17:11:37.022719486 +0000
@@ -452,6 +452,10 @@
             help="Skip the n first lines to start parsing/importing data at a given line for the specified log file",
         )
         option_parser.add_option(
+            '--auto-skip', dest='auto_skip', action='store_true', default=False,
+            help="Track logfile processing and automatically skip processed lines on the next run.  This allows multiple runs against an active logfile.",
+        )
+        option_parser.add_option(
             '--recorders', dest='recorders', default=1, type='int',
             help="Number of simultaneous recorders (default: %default). "
             "It should be set to the number of CPU cores in your server. "
@@ -1521,6 +1525,18 @@
                     open_func = open
                 file = open_func(filename, 'r')
 
+        # If auto-skip and a real file, try and read the last marker
+        markerdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "skip-markers")
+        markerfile = os.path.join(markerdir, os.path.basename(filename) + ".marker")
+        skipmarker = None
+        if config.options.auto_skip and file != sys.stdin:
+            if os.path.exists(markerfile):
+                with open(markerfile, 'r') as f:
+                    try:
+                        skipmarker = int(f.readline())
+                    except:
+                        skipmarker = None
+
         if config.options.show_progress:
             print 'Parsing log %s...' % filename
 
@@ -1552,7 +1568,7 @@
                 continue
 
             stats.count_lines_parsed.increment()
-            if stats.count_lines_parsed.value <= config.options.skip:
+            if stats.count_lines_parsed.value <= config.options.skip or (skipmarker and (stats.count_lines_parsed.value <= skipmarker)):
                 continue
 
             match = format.match(line)
@@ -1667,8 +1683,14 @@
         if len(hits) > 0:
             Recorder.add_hits(hits)
 
-
-
+        # If auto-skip, write the file marker
+        if config.options.auto_skip and file != sys.stdin:
+            # First, make sure the directory exists, in the same place as this script
+            if not os.path.isdir(markerdir):
+                os.makedirs(markerdir)
+            # Write the ending lineno to the marker file
+            with open(markerfile, 'w') as f:
+                    f.write(str(lineno))
 
 def main():
     """
@@ -1709,6 +1731,16 @@
             'You can restart the import of "%s" from the point it failed by '
             'specifying --skip=%d on the command line.\n' % (filename, lineno)
         )
+        # If auto-skip, write the file marker
+        if config.options.auto_skip and file != sys.stdin:
+            markerdir = os.path.join(os.path.dirname(os.path.realpath(__file__)), "skip-markers")
+            markerfile = os.path.join(markerdir, os.path.basename(filename) + ".marker")
+            # First, make sure the directory exists, in the same place as this script
+            if not os.path.isdir(markerdir):
+                os.makedirs(markerdir)
+            # Write the ending lineno to the marker file
+            with open(markerfile, 'w') as f:
+                    f.write(str(lineno))
     os._exit(1)

dom · March 3, 2014, 10:34pm

Excellent point about using md5 to check the file. Beware that my hack will certainly fail on log files that are rotated under the same name (like a simple access_log); it depends on the logfile names being unique (for example, including the date). I fire all my apache logs through rsyslog to a central log server and name the resulting log files with the date:

$Template apachelog,"/srv/syslog/www/%syslogtag:8:50%/%syslogtag:8:50%.%$YEAR%%$MONTH%%$DAY%.access_log"
$Template apacheformat,"%msg:2:$:drop-last-lf%\r\n"
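
With names like that, a cron wrapper can work out the current day's file before calling import_logs.py. A small sketch, assuming the `%syslogtag:8:50%` part resolves to the vhost name (that depends on how the tag is set on the web servers); the helper name `dated_logfile` is made up:

```python
import datetime
import posixpath

LOG_ROOT = "/srv/syslog/www"  # root directory from the rsyslog template above


def dated_logfile(vhost, day=None, root=LOG_ROOT):
    """Build <root>/<vhost>/<vhost>.YYYYMMDD.access_log for the given day
    (today if no day is given)."""
    if day is None:
        day = datetime.date.today()
    name = "%s.%s.access_log" % (vhost, day.strftime("%Y%m%d"))
    return posixpath.join(root, vhost, name)
```

Because each day gets its own file name, each day also gets its own .marker file, which is why the line-count marker stays valid without any rotation check.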

Techwolf · March 5, 2014, 5:10am

Thanks.

After I posted that, I managed to figure out how to use a named pipe with rsyslog to do realtime imports. My config is nginx logging to a named pipe, with the rsyslog text input module reading that and exec’ing a helper script that calls import_logs.py. I’m thinking of doing a new post with all that info.
