import_logs.py import is VERY slow

Alf_Watt · July 25, 2014, 9:20am

Trying to import old logs and looking at setting up an Apache logging directive to replace the javascript.

When importing using a nice long command line sequence the process is painfully slow, here for e.g. is 1 second reporting:


python ./import_logs.py --debug --show-progress --idsite=1 --log-hostname=mysite.com --url=http://mysite.com/piwik/ --token-auth=mytoken --enable-http-errors --enable-static --enable-bots /var/log/apache2/mysite.log
2014-07-25 02:09:20,399: [DEBUG] Accepted hostnames: all
2014-07-25 02:09:20,400: [DEBUG] Piwik URL is: http://mysite.com/piwik/
2014-07-25 02:09:20,400: [DEBUG] Authentication token token_auth is: mytoken
2014-07-25 02:09:20,400: [DEBUG] Resolver: static
2014-07-25 02:09:20,400: [DEBUG] SitesManager.getSiteFromId: 1
2014-07-25 02:10:36,331: [DEBUG] SitesManager main_url: https://mysite.com
2014-07-25 02:10:36,332: [DEBUG] Import:Starting
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-07-25 02:10:36,332: [DEBUG] Launched recorder
Parsing log /var/log/apache2/my_log_file...
2014-07-25 02:10:36,333: [DEBUG] Detecting the log format
2014-07-25 02:10:36,333: [DEBUG] Check format icecast2
2014-07-25 02:10:36,333: [DEBUG] Format icecast2 does not match
2014-07-25 02:10:36,333: [DEBUG] Check format iis
2014-07-25 02:10:36,333: [DEBUG] Format iis does not match
2014-07-25 02:10:36,333: [DEBUG] Check format common
2014-07-25 02:10:36,333: [DEBUG] Format common matches
2014-07-25 02:10:36,333: [DEBUG] Check format common_vhost
2014-07-25 02:10:36,333: [DEBUG] Format common_vhost does not match
2014-07-25 02:10:36,333: [DEBUG] Check format nginx_json
2014-07-25 02:10:36,333: [DEBUG] Format nginx_json does not match
2014-07-25 02:10:36,334: [DEBUG] Check format s3
2014-07-25 02:10:36,334: [DEBUG] Format s3 does not match
2014-07-25 02:10:36,334: [DEBUG] Check format ncsa_extended
2014-07-25 02:10:36,334: [DEBUG] Format ncsa_extended matches
2014-07-25 02:10:36,334: [DEBUG] Check format common_complete
2014-07-25 02:10:36,334: [DEBUG] Format common_complete does not match
2014-07-25 02:10:36,334: [DEBUG] Format ncsa_extended is the best match
801 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
/* I deleted a bunch of lines here! */
801 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1001 lines parsed, 200 lines recorded, 2 records/sec (avg), 200 records/sec (current)

How do I go about debugging this?

Alf_Watt · July 25, 2014, 9:43am

My first clue is the time taken to lookup the --idsite (note my added debug logs):


2014-07-25 02:09:20,400: [DEBUG] SitesManager.getSiteFromId: 1
2014-07-25 02:10:36,331: [DEBUG] SitesManager main_url: https://mysite.com

That’s 76 seconds, which is roughly the interval of the 200 records batches are uploaded to the server.

I can see on the server that the records are showing up, slowly, and that the web UI has become unresponsive (which i expect to recover once the API is less taxed).

Just started a parallel import with --recorders=4 which improves the number of records uploaded but is still bursty on the order of the initial 76 second delay.

Edit: The more --recorders I have the more the rate improves, at 10 the server can make 22 records per second. Still very slow.

lesjokolat · July 28, 2014, 7:23pm

what version of pyhton are you using? Ideally at least vers 2.7.1

Alf_Watt · July 28, 2014, 7:29pm

2.7.5 (OS X default version)

I’m more suspicious that the server is IPv6 enabled, and there is some issue with looking up the name.

Honestly, I’m tempted to write a new import direct to the database using the php classes, instead of debugging this hacky script.

matthieu · August 10, 2014, 3:16pm

@Alf what’s hacky about it that your “quick rewrite” would solve? #lol

ps: some of us achieve 1000 req/s with this hacky script (or more)

lesjokolat · August 11, 2014, 8:57pm

Has there been any osX updates you could apply that might help python? 2.7.6 is released and fixes alot of bugs from 2.7.5

Alf_Watt · August 11, 2014, 9:20pm

@matt you have a python script driving a php web interface over http driving a database over it’s sql network protocol, if you are achieving high import rates it’s due to the ongoing march of moors law.

Arguable that it makes much more sense if you have multiple web servers feeding a central analytics box, but for my use case on a single server it’s a lot of overhead.

Pretty sure the python version isn’t the issue here, it’s a long timeout for each request or batch of requests (more recorders processes faster), I’m just wondering if it wouldn’t be faster to adapt the command line tool to import lines directly, since either solution means getting deep into the code.

matthieu · August 12, 2014, 12:08pm

@matt you have a python script driving a php web interface over http driving a database over it’s sql network protocol, if you are achieving high import rates it’s due to the ongoing march of moors law.

No it’s not… that’s basically how the Internet works and all major services out there… maybe it’s overkill for you but we don’t build the tool for you but for hundreds of thousands of people who need something reliable.

PS: import_logs is NOT VERY slow. It’s in fact fast since 2.4.0
see https://github.com/piwik/piwik/pull/300
https://github.com/piwik/piwik/pull/281

Alf_Watt · August 12, 2014, 4:47pm

I see. Clearly there is no helpful support here. Sorry for wasting everyone’s time.

matthieu · August 12, 2014, 11:17pm

I don’t think you asked for support, it sounds more like you started a discussion with no clear goals. What was your question? What is the performance of Log analytics for you? Likely it should be at least 200 req/ s imported on a solid server. What are specs of your server? etc. etc.

gggeek · October 19, 2014, 10:22pm

Hello - first-time user of the log import script.

I am running it in a Debian VM (default versions of python, php, mysql onboard), with 1vCPU, 4GB RAM.
The physical host is a dell xps15 laptop freshly bought, with a nice fast ssd and plenty of cpu power.

I set the recorders parameter at 2.

While importing, cpu usage stays below 10%, with occasional spikes at 33%. The spikes are from Apache, the python porcess does not even show up in the top “top” results.
I used innotop to monitor the db, but did not find a lot of wrong there either, queries per seconds fluctuate from 50 to 200.

All the while, the number of average records imported per second fluctuates between 25 and 30, which I find extremely low.

For comparison, awstats would import on the same VM 3M records in 3 minutes, pegging the CPU at 100% in the process.

Any tip on speeding things up would be appreciated.

For reference, here is the command line in use - the only “weird” thing is my log format has extra data at beginning of row:

python misc/log-analytics/import_logs.py --idsite=1 --enable-static --enable-bots --enable-http-redirects --strip-query-string --recorders=2 --url=http://localhost/piwik
–log-format-regex=’\S+ \S+ \S+ \S+ \S+ (?P\S+) \S+ \S+ [(?P.?) (?P.?)] “\S+ (?P.?) \S+" (?P\S+) (?P\S+) "(?P.?)” “(?P<user_agent>.*?)”’
/var/www/collectedlogs/apache2-access_log.log.3

matthieu · October 20, 2014, 8:33am

Mysql does most of the work, check your mysql is well configured?

Also check this tip: matomo/README.troubleshooting.md at master · matomo-org/matomo · GitHub

gggeek · October 20, 2014, 8:46am

Thanks for the tip, but I see no obvious mysql misconfiguration

Here s output from mysqltuner:


>>  MySQLTuner 1.3.0 - Major Hayden <major@mhtx.net>
 >>  Bug reports, feature requests, and downloads at http://mysqltuner.com/
 >>  Run with '--help' for additional options and output filtering
Please enter your MySQL administrative login: root
Please enter your MySQL administrative password:
[OK] Currently running supported MySQL version 5.5.38-0+wheezy1
[OK] Operating on 64-bit architecture

-------- Storage Engine Statistics -------------------------------------------
[--] Status: +ARCHIVE +BLACKHOLE +CSV -FEDERATED +InnoDB +MRG_MYISAM
[--] Data in MyISAM tables: 0B (Tables: 10)
[--] Data in InnoDB tables: 37M (Tables: 148)
[--] Data in PERFORMANCE_SCHEMA tables: 0B (Tables: 17)
[!!] Total fragmented tables: 148

-------- Security Recommendations  -------------------------------------------
[OK] All database users have passwords assigned

-------- Performance Metrics -------------------------------------------------
[--] Up for: 13h 38m 12s (1M q [32.516 qps], 2K conn, TX: 1B, RX: 542M)
[--] Reads / Writes: 47% / 53%
[--] Total buffers: 192.0M global + 2.7M per thread (151 max threads)
[OK] Maximum possible memory usage: 597.8M (15% of installed RAM)
[OK] Slow queries: 0% (1/1M)
[OK] Highest usage of available connections: 4% (7/151)
[OK] Key buffer size / total MyISAM indexes: 16.0M/127.0K
[OK] Key buffer hit rate: 99.2% (356 cached / 3 reads)
[OK] Query cache efficiency: 36.8% (345K cached / 939K selects)
[OK] Query cache prunes per day: 0
[OK] Sorts requiring temporary tables: 0% (0 temp sorts / 4K sorts)
[OK] Temporary tables created on disk: 16% (378 on disk / 2K total)
[OK] Thread cache hit rate: 99% (7 created / 2K connections)
[OK] Table cache hit rate: 24% (205 open / 840 opened)
[OK] Open file limit used: 6% (68/1K)
[OK] Table locks acquired immediately: 100% (1M immediate / 1M locks)
[OK] InnoDB buffer pool / data size: 128.0M/37.7M
[OK] InnoDB log waits: 0
-------- Recommendations -----------------------------------------------------
General recommendations:
    Run OPTIMIZE TABLE to defragment tables for better performance
    MySQL started within last 24 hours - recommendations may be inaccurate
    Enable the slow query log to troubleshoot bad queries

gggeek · October 20, 2014, 9:14am

ps: I am currently testing with 10 recorders, but the average records/sec is still 30

The only reason I see for the slow db could be the fact that the virtual disk is set to grow dynamically - but I would be surprised for this to have such a big impact.

ps: in fact I just checked on the host machine, and the size of the virtual hdd file is not growing during the import script. mtab tells me I am not running out of space either…

ClemD · January 5, 2015, 3:27pm

Hello,

Did you find any answer to this problem ? I am in the same situation (Debian on VM), I try with 4 vCPU, and changing recorders=4 or more don’t give better result… (from 20 to 30 rec/s, sometimes 40).

In mysql configuration, the only impactful parameter is “innodb_buffer_pool_size = 2800M” (10M by default). If I don’t put this parameter, mysql fails in some process, the CPU grows to 99% with 0 rec/s…
I had to put the MYSQLI adapter. Otherwise, import_logs.py somtimes finishes with a fatal error.

(I apologize for my poor english)