Piwik stats rollup using Weblogs only? or Adding a response to the webserver?

Hi there,

I’ve just found Piwik this morning. I’ve checked out the dev trunk and had a bit of a poke around in the code, but I’m at the point where I’m wondering whether there’s a way to generate test data by creating a ‘spoof’ of the tracking pixel and rolling up my weblogs?

If I’ve got a web server handling two or three domains, with multiple subdomains and robust logs (including referrer, user agent, etc…) where would I start looking for the scripts that I would use to spoof these rollups? I’m certain there’s nothing written yet, but I’d be happy to do some of the grunt work and let people know what I come back with.

I’ll work through some reverse engineering and step-throughs in the meantime, but any advice that could steer me in the right direction would be really helpful.

You may be describing this ticket: Bulk load Piwik logs with documented API: improved tracking performance, allow performance testing (Issue #5554, matomo-org/matomo on GitHub)

If you are interested in working on it, feel free to give it a try, or post your thoughts here first so we can help.

Yes - that is close to what I’m thinking of.

I think the option to roll up Apache logs would have certain requirements, so that it generally matches the information we’re getting from a PIWIK_PIXEL fire.

In terms of my line of thinking, I’ve got the Apache logs for my server coming in using the following format:


LogFormat "%V %h %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

which is equivalent to saying:


$website_name $user_ip $time $full_page_request $http_status_code $response_size_in_bytes $http_referrer $user_agent 
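As a rough illustration of that mapping, here is a minimal sketch that splits one line written with the LogFormat above into the named fields. The function name parse_log_line() and the array keys are my own choices for the example, and it assumes the quoted fields contain no escaped double quotes.

```php
<?php
// Sketch only: parse one access-log line produced by
//   LogFormat "%V %h %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""
// Assumption: quoted fields contain no escaped double quotes (\").
function parse_log_line(string $line): ?array
{
    $pattern = '/^(\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (!preg_match($pattern, trim($line), $m)) {
        return null; // line does not follow the expected format
    }
    return [
        'website_name'           => $m[1],
        'user_ip'                => $m[2],
        'time'                   => $m[3],
        'full_page_request'      => $m[4],
        'http_status_code'       => (int) $m[5],
        // Apache logs "-" for %b when no body bytes were sent
        'response_size_in_bytes' => $m[6] === '-' ? 0 : (int) $m[6],
        // "-" means the header was absent
        'http_referrer'          => $m[7] === '-' ? null : $m[7],
        'user_agent'             => $m[8],
    ];
}
```

Lines that do not match return null, so a loader could count or log them instead of silently guessing.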

In comparison to what we get from a Piwik pixel fire, I think the only thing I’m missing is the browser’s Accept-Language header (what PHP exposes as $_SERVER['HTTP_ACCEPT_LANGUAGE']). As it turns out, we could add that to the logs via the “%{Foobar}i” directive in Apache, e.g. \"%{Accept-Language}i\" (see the mod_log_config page of the Apache HTTP Server documentation for reference).

For the purposes of a PIWIK-centric rollup of my logs, I’m thinking of a script that does the following (in pseudo code):


$log_files = read_log_folders();   // hypothetical helper: returns the list of log file paths
foreach ($log_files as $log_file) {
     $handle = fopen($log_file, 'r');
     // read one line at a time
     while (($line = fgets($handle)) !== false) {
          // read line and assign to variables as above
          // create a unique identifier code for the user, possibly based on an MD5 hash of their IP address and USER_AGENT if possible, maybe add other options like MAC address if we can get it, etc...

          // check PIWIK for a site in the system under the $website_name
          // if it does not exist, create it - include some subdomain logic here

          // check PIWIK database for a previous "hit" from the unique user
          // if the user exists, increment the hit and visit length values in the database
          // if the user does not exist, create the user's first visit
          // add all "additional information" that we can relative to this particular user's visit (file name/ server path, user agent, etc...)
     }
     fclose($handle);
}
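The “unique identifier” step in the pseudo code could look something like the sketch below. The function name visitor_id() is made up for the example, and truncating the MD5 hash to 16 hex characters is an assumption on my part about what a Piwik visitor id should look like; check it against the tracker code before relying on it.

```php
<?php
// Sketch only: derive a stable pseudo visitor id from the log fields.
// Assumption: a 16-character hex string is an acceptable id length.
function visitor_id(string $ip, string $userAgent): string
{
    // A separator avoids accidental collisions between ip/user-agent pairs
    return substr(md5($ip . '|' . $userAgent), 0, 16);
}
```

Visitors behind the same proxy with the same browser will collapse into one id, so counts built on this will be lower bounds rather than exact uniques.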

At this stage the idea is obviously very early. I have the logs, I’ve got the database, and I’m happy to actually write the script that rolls things up. However, I’d ideally like to write the module in a way that ‘spoofs’ Piwik as closely as possible.

That is, it would use as many of the Piwik internals as I can. By using your internal systems, the script would be as ‘future proof’ as it can be (allowing for Piwik to upgrade around it without breaking).

I’d also be honoured to share and contribute whatever I write back to the project as a whole.

In the short term though, I’d find some guidance on the ‘data flow’ process exceptionally valuable. What happens when a hit comes into PIWIK? Which classes are called and which methods run? Maybe I’ve just got to open my Zend debugger, step through and map it myself. That way I’ll be able to “spoof it” using the least number of steps possible.

I hope that was clear.

Feel free to let me know where you think I should start looking.

  • Alex

This is more of a “thinking out loud, & publicly” but I’d very much welcome further feedback from people who have got more information than me, or who can steer me in the right direction. I’ve never been an ‘Open Source’ developer so I really know nothing about the whole contribution process and social hierarchy of this stuff. (please please please forgive any social faux pas) … for example: is there a better place to have ‘developer discussions’? or is here fine?

As a sample of the log file:


DOMAIN	IP	DATE	TIMEZONE	REQUEST	RESPONSE	SIZE	REFERRER	USERAGENT
towerdefensegame.net	81.176.230.15	[08/Dec/2010:02:55:53	-0500]	GET / HTTP/1.1	302	147	-	Mozilla/5.0
towerdefensegame.net	208.80.194.29	[08/Dec/2010:12:28:25	-0500]	GET / HTTP/1.0	302	-	-	Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; Comcast; InfoPath.1; Comcast)
towerdefensegame.net	220.181.7.38	[08/Dec/2010:13:06:29	-0500]	GET /robots.txt HTTP/1.1	200	27	-	Baiduspider+(+http://www.baidu.com/search/spider.htm)
towerdefensegame.net	212.230.166.1	[08/Dec/2010:14:08:58	-0500]	GET /Alien-Hominid-22 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
towerdefensegame.net	180.180.241.84	[08/Dec/2010:14:14:52	-0500]	GET /Async-Racing-19 HTTP/1.0	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
towerdefensegame.net	72.225.154.117	[08/Dec/2010:14:47:32	-0500]	GET /Hujetower2-16 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
towerdefensegame.net	190.136.179.114	[08/Dec/2010:14:49:13	-0500]	GET /play-game/when-penguins-attack-td-25 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
towerdefensegame.net	190.152.146.74	[08/Dec/2010:14:49:32	-0500]	GET /Infectonator!-:-World-Dominator-15 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
towerdefensegame.net	111.1.32.56	[08/Dec/2010:14:57:38	-0500]	GET /fstr.net/Protector-IV-8 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
towerdefensegame.net	174.121.88.26	[08/Dec/2010:21:30:46	-0500]	GET /Reachin-Pichin-16 HTTP/1.0	302	-	-	GeoHasher/Nutch-1.0 (GeoHasher Web Search Engine; geohasher.gotdns.org; geo_hasher at yahoo * com)
www.towerdefensegame.net	72.94.249.38	[08/Dec/2010:00:24:30	-0500]	GET / HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
www.towerdefensegame.net	72.94.249.38	[08/Dec/2010:00:24:31	-0500]	GET / HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
www.towerdefensegame.net	72.30.161.220	[08/Dec/2010:14:47:01	-0500]	GET /robots.txt HTTP/1.0	200	27	-	Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
www.towerdefensegame.net	82.117.198.6	[08/Dec/2010:14:49:14	-0500]	GET /Reachin-Pichin-18 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
www.towerdefensegame.net	211.138.124.203	[08/Dec/2010:14:52:25	-0500]	GET /Flood-Filler-11 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
www.towerdefensegame.net	200.5.72.74	[08/Dec/2010:14:57:28	-0500]	GET /GlueFO-3:-Asteroid-Wars-2 HTTP/1.1	302	-	-	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)
www.towerdefensegame.net	208.80.194.37	[08/Dec/2010:22:18:58	-0500]	GET / HTTP/1.0	302	-	-	Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; YComp 5.0.2.6; YPC 3.2.0)

And now looking at the page tracking from the API documentation…


// note, this is still a *bit* pseudo code as I'm still trying to sort out the process to take on,
// there will likely be commas and things missing 

$requests = explode("\n",$logfile);
foreach ($requests as $request) { 
     $data = explode("\t",$request);
     $d['DOMAIN'] = $data[0];
     $d['IP'] = $data[1];
     $d['DATE'] = cleanup_date($data[2]);
     $d['TIME'] = cleanup_time($data[2]);
     $d['TIMEZONE'] =  $data[3];
     $d['REQUEST'] = cleanup_request($data[4]);  
     $d['HTTP_STATUS'] = $data[5];
     $d['SIZE'] = $data[6];
     $d['REFERRER'] = $data[7];
     $d['USERAGENT'] = $data[8];

     // Hypothetical helper (not part of Piwik, still to be written): parse the
     // root domain off the subdomain in $d['DOMAIN'] and return the matching
     // PIWIK site id, e.g. by running
     // "SELECT idsite FROM piwik_site WHERE name = '$root_domain'"
     $idSite = get_piwik_site_id($d['DOMAIN']);

     $t = new PiwikTracker($idSite);
     // Optional tracking
     $t->setUserAgent($d['USERAGENT']);
//   $t->setBrowserLanguage('fr');
     $t->setLocalTime( $d['TIME'] );

     $t->urlReferer =$d['REFERRER'];
     $t->ip =$d['IP'];     

//   $t->setResolution( 1024, 768 );
//   $t->setBrowserHasCookies(true);
//   $t->setCustomData( array('id' => 10, 'name' => 'test') );
//   $t->setPlugins($flash = true, $java = true, $director = false);
     // Mandatory
     $t->setUrl( $url = "http://{$d['DOMAIN']}{$d['REQUEST']}" );
     $t->doTrackPageView($d['REQUEST']);
}
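The cleanup_* helpers used above are not written yet. As a starting point, here is a sketch of two of them working on the tab-separated sample columns. The names parse_log_timestamp() and request_path() are placeholders of mine, and converting to UTC is just one possible normalisation.

```php
<?php
// Sketch only: combine the DATE and TIMEZONE columns of the sample,
// e.g. "[08/Dec/2010:02:55:53" and "-0500]", into one UTC timestamp string.
function parse_log_timestamp(string $date, string $timezone): ?string
{
    $raw = trim($date, '[] ') . ' ' . trim($timezone, '[] ');
    $dt = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $raw);
    if ($dt === false) {
        return null; // unexpected timestamp layout
    }
    return $dt->setTimezone(new DateTimeZone('UTC'))->format('Y-m-d H:i:s');
}

// Sketch only: pull the path out of a request line such as "GET /robots.txt HTTP/1.1".
function request_path(string $request): ?string
{
    if (!preg_match('#^[A-Z]+ (\S+)(?: HTTP/[\d.]+)?$#', trim($request), $m)) {
        return null;
    }
    return $m[1];
}
```

For the first sample row, parse_log_timestamp('[08/Dec/2010:02:55:53', '-0500]') gives '2010-12-08 07:55:53', and request_path('GET / HTTP/1.1') gives '/'.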

Still quite a bit of work to do to make this work right, but if I take it in chunks, then maybe I’ll get something working. Any guidance along the way is just gravy really :)

  • Alex

// TODO : figure out how to FORCE the DATE at time of hit
// TODO : figure out how to integrate maxmind’s MOD_GEOIP database to override location service
// TODO : figure out how to manually specify the ‘unique identifier’ that is normally set in the COOKIE (an MD5 hash of IP+UserAgent)
// TODO : figure out how to add sites that have not yet been created ‘on the fly’ (as well as separate subdomains from domains, while also addressing issues of “double dip” TLDs like ‘.co.uk’)
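For the last TODO, a naive sketch of separating the root domain from its subdomains might look like this. The function name root_domain() and the short suffix list are illustrative assumptions only; a complete solution would need a full list of two-level suffixes (the Public Suffix List is the usual source) rather than the three hard-coded here.

```php
<?php
// Sketch only: reduce a hostname to its registrable root domain.
// Assumption: $twoLevelSuffixes is a tiny illustrative list, not a complete one.
function root_domain(string $host, array $twoLevelSuffixes = ['co.uk', 'org.uk', 'com.au']): string
{
    $labels = explode('.', strtolower(rtrim($host, '.')));
    if (count($labels) <= 2) {
        return implode('.', $labels); // already a bare domain
    }
    $lastTwo = implode('.', array_slice($labels, -2));
    // Keep three labels when the last two form a "double dip" suffix
    $keep = in_array($lastTwo, $twoLevelSuffixes, true) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}
```

With this, both www.towerdefensegame.net and towerdefensegame.net from the sample log map to the same root domain.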

Hi Alex,
An incoming request to Piwik will hit the piwik.php script, which calls:
$process = new Piwik_Tracker();
$process->main();

I highly recommend you fire up your IDE and look at the code, which is a requirement for writing the Apache logs loader.
You can set $GLOBALS[‘PIWIK_TRACKER_DEBUG’] to true to see the debug output.

To fetch websites info or create new websites, you should use the Sites API. See http://dev.piwik.org/trac/wiki/API/Reference#SitesManager
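As a hedged sketch of what such a call might look like over HTTP, the helper below only builds the request URL for the SitesManager.getSitesIdFromSiteUrl method. The function name sites_api_url(), the base URL and the token value are placeholders for the example; check the API reference for the exact parameters your version expects.

```php
<?php
// Sketch only: build a Piwik HTTP API request URL for a SitesManager method.
// Assumptions: $piwikBaseUrl and $tokenAuth are placeholders supplied by the caller.
function sites_api_url(string $piwikBaseUrl, string $method, array $params, string $tokenAuth): string
{
    $query = array_merge(
        ['module' => 'API', 'method' => 'SitesManager.' . $method],
        $params,
        ['format' => 'JSON', 'token_auth' => $tokenAuth]
    );
    return rtrim($piwikBaseUrl, '/') . '/index.php?' . http_build_query($query);
}

// Example: look up the site id(s) registered for a URL seen in the logs
$url = sites_api_url(
    'http://piwik.example.org',
    'getSitesIdFromSiteUrl',
    ['url' => 'http://towerdefensegame.net'],
    'YOUR_TOKEN_AUTH'
);
```

If the lookup comes back empty, the same helper could build a SitesManager.addSite request (with siteName and urls parameters) to create the site on the fly.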

The challenges with loading Apache logs are:

We would be happy to include your log loader in the core. Let us know how it goes and whether you have questions.

Thanks for all the information, I needed that too :)

We’ve just released a new plugin which could be really helpful to you: Roll-Up Reporting aggregates data from multiple websites, mobile apps and shops into a Roll-Up site to gain new insights and save time.

Roll-Up Reporting lets you aggregate data from multiple websites and apps into one single site. It lets you answer questions like “How many visits happened on all of my websites and apps?” and “Which campaigns contributed the most across several of my websites?” or “How do my various Brands overall compare with each other?” When you have several shops (eg white-labels) it is very valuable to see how your ecommerce shops are performing overall. Or when you are a web agency and you are serving many customers and want to provide each customer with a single aggregate view of all their web properties. Roll-Up Reporting lets you analyze this aggregated data in one site so easily. It saves you lots of time and helps you gain the insights you need instantly.

The Roll-Up Reporting User Guide and the Roll-Up Reporting FAQ cover how to get the most out of this plugin.

For any other question feel free to contact us.