Yes - that is close to what I’m thinking of.
I think an option to roll up Apache logs would need those logs to carry roughly the same information we’re getting from a PIWIK_PIXEL fire.
My Apache logs come in using the following format:
LogFormat "%V %h %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
which is equivalent to saying:
$website_name $user_ip $time $full_page_request $http_status_code $response_size_in_bytes $http_referrer $user_agent
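To make the field mapping above concrete, here is a minimal sketch of a parser for one line in that format. It is written in Python purely for illustration, and the names LOG_PATTERN and parse_line are hypothetical (they are not part of Piwik or Apache). It assumes the timestamp uses Apache's default %t layout and that %b may log "-" when no bytes were sent.

```python
import re
from datetime import datetime

# One named group per field, in the same order as the LogFormat directive.
# Quoted fields allow backslash-escaped characters, since Apache escapes
# embedded quotes in %r and in header values.
LOG_PATTERN = re.compile(
    r'(?P<website_name>\S+) '
    r'(?P<user_ip>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<full_page_request>(?:[^"\\]|\\.)*)" '
    r'(?P<http_status_code>\d{3}) '
    r'(?P<response_size_in_bytes>\d+|-) '
    r'"(?P<http_referrer>(?:[^"\\]|\\.)*)" '
    r'"(?P<user_agent>(?:[^"\\]|\\.)*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    # Assumption: default %t layout, e.g. 10/Oct/2000:13:55:36 -0700
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    hit["http_status_code"] = int(hit["http_status_code"])
    size = hit["response_size_in_bytes"]
    hit["response_size_in_bytes"] = 0 if size == "-" else int(size)
    return hit
```

Lines that do not match come back as None, so a rollup script can count and skip them rather than abort on a malformed entry.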
Compared with what we get from a piwik pixel-fire, I think the only thing I’m missing is the user's Accept-Language header ($_SERVER['HTTP_ACCEPT_LANGUAGE'] in PHP). As it turns out, we can probably add that to the logs via the “%{Foobar}i” directive in apache, e.g. \"%{Accept-Language}i\" (see mod_log_config - Apache HTTP Server for reference).
For a PIWIK-centric rollup of my logs, I’m thinking of a script along these lines (in pseudo code):
$log_files = read_log_folders();
foreach ($log_files as $log_file) {
    open($log_file);
    // read one line at a time
    foreach ($line) {
        // parse the line and assign the fields to the variables above
        // create a unique identifier for the user, possibly an MD5 hash of their IP address and USER_AGENT, maybe with other inputs like MAC address if we can get it, etc...
        // check PIWIK for a site in the system under the $website_name
        // if it does not exist, create it - include some subdomain logic here
        // check the PIWIK database for a previous "hit" from the unique user
        // if the user exists, increment the hit count and visit length values in the database
        // if the user does not exist, create the user's first visit
        // add all the "additional information" we can about this particular user's visit (file name / server path, user agent, etc...)
    }
    close($log_file);
}
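The per-line steps in the pseudocode above can be sketched as follows. This is only an illustration under stated assumptions: it keeps everything in an in-memory dict instead of the PIWIK database, the function names visitor_id and roll_up are hypothetical, each hit is assumed to be a dict keyed by the field names listed earlier, and the 30-minute inactivity window that separates two visits is an assumed value, not something taken from PIWIK.

```python
import hashlib
from datetime import datetime, timedelta

# Assumption: a gap longer than this starts a new visit for the same user.
VISIT_TIMEOUT = timedelta(minutes=30)


def visitor_id(user_ip, user_agent):
    """Derive a stable identifier from the IP address and user agent."""
    raw = f"{user_ip}|{user_agent}".encode("utf-8")
    return hashlib.md5(raw).hexdigest()


def roll_up(hits):
    """Group hits into visits per site and per visitor.

    `hits` is an iterable of dicts with the keys website_name, user_ip,
    user_agent and time (a datetime), ordered by time.
    Returns {website_name: {visitor_id: [visit, ...]}}.
    """
    sites = {}
    for hit in hits:
        # "Create the site if it does not exist" becomes a dict default here.
        visitors = sites.setdefault(hit["website_name"], {})
        visits = visitors.setdefault(
            visitor_id(hit["user_ip"], hit["user_agent"]), []
        )
        last = visits[-1] if visits else None
        if last is not None and hit["time"] - last["last_hit"] <= VISIT_TIMEOUT:
            # Known user within the window: extend the current visit.
            last["hits"] += 1
            last["last_hit"] = hit["time"]
        else:
            # Unknown user, or the previous visit has timed out: new visit.
            visits.append(
                {"first_hit": hit["time"], "last_hit": hit["time"], "hits": 1}
            )
    return sites
```

The visit length falls out as last_hit minus first_hit. An IP plus user agent hash will merge different people behind the same proxy, so it is only a rough stand-in for a cookie-based visitor id.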
At this stage the idea is obviously still very early. I have the logs, I’ve got the database, and I’m happy to actually write the script that rolls things up. However, I’d ideally like to write the module in a way that ‘spoofs’ piwik as closely as possible,
i.e. one that uses as many of the piwik internals as I can. By using your internal systems, the script would be as ‘future proof’ as it can be (allowing for piwik to upgrade around it without breaking).
I’d also be honoured to share and contribute whatever I write back to the project as a whole.
In the short term though, I’d find some guidance on the ‘data flow’ process exceptionally valuable. What happens when a hit comes into PIWIK? Which classes are called and which methods are run? Maybe I’ve just got to open my Zend debugger, step through and map it myself. That way I’ll be able to “spoof it” using the fewest steps possible.
I hope that was clear.
Feel free to let me know where you think I should start looking.