I’ll see what I can come up with for a procedural/white paper idea for you. My primary goal for this was to compare a year’s worth of actual access_log traffic to what Google Analytics had said we’d been doing for the past year, mostly because the numbers Google Analytics had been reporting were lower than the numbers of impressions that Google Adsense had said we were getting on those same pages, and Quantcast was reporting even lower metrics.
So we wanted a new solution, and the only way to compare was to see what the actual logs tracked, and see what the Piwik js tracked, then see how those both compare to GA and Quantcast, at least for November and December (comparing log data vs live tracking data).
What the import_logs.py did was crunch 25Gb of raw access_log data (in chunks, not one big bite) into about 3Gb of visitor data in the database. Not bad at all
As for the visitor log purging, I do have slow query tracking on, but it’s not telling me anything useful yet. I have dozens of queries like this:
# Query_time: 10 Lock_time: 0 Rows_sent: 0 Rows_examined: 0
LOAD DATA LOCAL INFILE
'/tmp/assets/piwik_archive_blob_2012_01-aa52ede77e7cabbda103a4042a9144f7.csv'
REPLACE
INTO TABLE
piwik_archive_blob_2012_01
FIELDS TERMINATED BY
' '
ENCLOSED BY
'"'
ESCAPED BY '\\'
LINES TERMINATED BY
'
'
(idarchive,idsite,date1,date2,period,ts_archived,name,value);
complete with the weird tabs, plus several like this:
# Query_time: 18 Lock_time: 0 Rows_sent: 0 Rows_examined: 0
DELETE
FROM piwik_archive_blob_2012_01
WHERE idarchive IN (803,802,801,800,820,819,818,817,816,815,814,813,804,805,806,807,808,809,810,811,812);
and this:
# Query_time: 16 Lock_time: 0 Rows_sent: 1 Rows_examined: 5969945
SELECT COUNT(*) FROM piwik_log_link_visit_action WHERE idvisit <= '1364813';
# Query_time: 202 Lock_time: 0 Rows_sent: 1 Rows_examined: 12910042
SELECT COUNT(*) FROM piwik_log_link_visit_action WHERE idvisit <= '2376138';
# Query_time: 17 Lock_time: 0 Rows_sent: 1 Rows_examined: 2324618
SELECT COUNT(*) FROM piwik_log_visit WHERE idvisit <= '2376138';
# Query_time: 200 Lock_time: 0 Rows_sent: 1 Rows_examined: 12810320
SELECT COUNT(*) FROM piwik_log_link_visit_action WHERE idvisit <= '2376138';
# Query_time: 213 Lock_time: 0 Rows_sent: 1 Rows_examined: 12641093
SELECT COUNT(*) FROM piwik_log_link_visit_action WHERE idvisit <= '2376138';
but there’s nothing in the error logs at all. Also, currently have archiving done by cron, every 7200 seconds.
Also, I see a lot of entries for DELETE FROM piwik_archive_blob_2012_somedate, but nothing for deleting visitor logs…
I thought that with the “regularly delete old visitor logs” set to “yes”, and “regularly delete old reports” set to “no”, that would clear out the visitor logs but leave the data (in the blob tables) be… is my understanding of that correct?