Automated Site ID lookup - how?

Hello. I’m new to Piwik. I’m just installing it on our system (well, on staging first, before production) which hosts several hundred websites, all served via a WordPress Multisite via a load-balanced cluster on Amazon EC2. All the sites are ‘ours’, rather than belonging to our customers or anything like that.

Now, firstly, I see that the web-site list has to be created manually by adding sites one-by-one. While a bulk import feature would have been nice, I believe I’ll be able to automate creation of sites using the API - for example by calling SitesManager.addSite() on all our existing sites and adding a call to that in our new site creation code (which is admin-back-end for our IT team only).

However, it appears that the tracking code needs the siteId rather than the domain. Is that correct, or is there a way to supply the JS tracker with the domain name instead of the siteId and have Piwik look it up? Or is there a way to have the JS tracker just obtain the domain from the DOM document?

If not, I’m guessing the other way to implement it would be for the server-side page rendering to make an API call to SitesManager.getAllSitesId() to obtain the siteId for a given request based on the HTTP Host header, and inject that into the tracking code in the response (obviously, the IDs will need to be cached for efficiency).
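To make the "inject the siteId into the tracking code" idea concrete, here is a minimal sketch in Python (the real site renders pages in PHP, so treat this as an illustration only). It assumes the asynchronous `_paq` form of the JS tracker. The function name `render_tracking_snippet` and the base URL `//piwik.example.com/` are made up for the example.

```python
import json


def render_tracking_snippet(site_id: int, piwik_base: str) -> str:
    """Build the tracking <script> for one page, with the siteId filled in server-side.

    piwik_base is a hypothetical protocol-relative URL such as "//piwik.example.com/",
    so the browser reuses whatever scheme (http or https) the page was loaded with.
    """
    calls = [
        ["setTrackerUrl", piwik_base + "piwik.php"],
        ["setSiteId", site_id],
        ["trackPageView"],
        ["enableLinkTracking"],
    ]
    # json.dumps gives valid JavaScript literals for the strings and numbers.
    pushes = "\n".join(f"_paq.push({json.dumps(call)});" for call in calls)
    loader = (
        "(function(){var d=document,g=d.createElement('script'),"
        "s=d.getElementsByTagName('script')[0];"
        f"g.async=true;g.src={json.dumps(piwik_base + 'piwik.js')};"
        "s.parentNode.insertBefore(g,s);})();"
    )
    return f"<script>\nvar _paq = _paq || [];\n{pushes}\n{loader}\n</script>"


if __name__ == "__main__":
    print(render_tracking_snippet(42, "//piwik.example.com/"))
```

The only per-site value is the number passed to `setSiteId`, which is why the lookup result is worth caching per domain.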

Is there another and/or better way? I don’t see any JS tracker functions to map a domain to the siteId.
If not, perhaps this could be a feature request? - to allow the JS tracker constructor to be passed a site main domain URL instead of siteId? (or even some way to have it obtain it from the DOM itself - such as passing -1 as the siteId?)

Apologies if I’ve missed something obvious or a common answer in the forums (I did search…) - since I’ve only just had contact with Piwik for a couple hours now… :slight_smile:

Help appreciated.
Cheers,
-David.

So, I’ve implemented this and it seems to work OK. The documented getAllSites() call no longer seems to be in the API. However, I was able to call getSitesIdFromSiteUrl() instead and cache the result. If it doesn’t find the site, I just call addSite() to create it on the fly.
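For anyone wanting to do the same, the lookup-then-create flow described above can be sketched roughly like this. It is written in Python for illustration and is not the actual plugin code. The endpoint URL, the token, and the class name `SiteIdCache` are placeholders, and the JSON response shapes (a list of `{"idsite": ...}` rows from the lookup, `{"value": id}` from addSite) are assumptions to verify against your Piwik version.

```python
import json
from typing import Callable, Dict
from urllib.parse import urlencode
from urllib.request import urlopen

PIWIK_API_URL = "https://piwik.example.com/index.php"  # hypothetical install URL
TOKEN_AUTH = "REPLACE_WITH_TOKEN"                      # hypothetical API token


def call_piwik(method: str, **params):
    """Call the Piwik HTTP API and return the decoded JSON response."""
    query = urlencode({
        "module": "API",
        "method": method,
        "format": "JSON",
        "token_auth": TOKEN_AUTH,
        **params,
    })
    with urlopen(f"{PIWIK_API_URL}?{query}") as resp:
        return json.load(resp)


class SiteIdCache:
    """Maps a request's Host header to a Piwik siteId, creating the site if needed."""

    def __init__(self, api: Callable = call_piwik):
        self._api = api
        self._ids: Dict[str, int] = {}  # in-process cache; a shared cache would also work

    def site_id_for_host(self, host: str) -> int:
        host = host.split(":", 1)[0].strip().lower()  # drop any port, normalise case
        if host in self._ids:
            return self._ids[host]
        url = f"http://{host}"
        found = self._api("SitesManager.getSitesIdFromSiteUrl", url=url)
        if found:
            # Assumed shape: [{"idsite": "7"}, ...]
            site_id = int(found[0]["idsite"])
        else:
            # Assumed shape: {"value": 8}
            created = self._api("SitesManager.addSite", siteName=host, urls=url)
            site_id = int(created["value"])
        self._ids[host] = site_id
        return site_id
```

Passing the API function into the constructor keeps the cache easy to exercise with a fake in tests, without touching a live Piwik instance.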

As soon as I went live, the sites quickly populated (though most of the hits were from spiders initially).

The only issue I have remaining is that the IPs are being logged as 10.x.x.x - the IP of the Amazon ELB load-balancers. I have the proxy_client_headers[] setting in [General] set to “HTTP_X_FORWARDED_FOR” but it doesn’t help.

The way you implemented it is the only way to do it right now. We don’t provide a feature for detecting the website ID from the URL, but as you showed here, it can be done relatively easily on the client side.

Your setup is definitely interesting and we would like to hear from you! If you need any help you can also email me directly. Hope Piwik will work perfectly for you! :slight_smile:

One minor question I had:

We don’t have SSL actively in use on our sites right now, though we’re about to implement it. The SSL connection will be terminated at the load-balancer and hence the connection to the web instances will be HTTP. The HTTP_X_FORWARDED_PROTO header will have the value ‘https’.

Although Piwik is installed on its own domain, it is still served by the same set of web instances behind the load-balancer as the tracked sites. So, even HTTPS requests to the Piwik domain will actually be passed through as HTTP to Piwik.
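To illustrate what "HTTPS at the load-balancer, HTTP to the instance" means for application code, here is a small Python sketch of how a backend might recover the scheme the visitor actually used. This is not how Piwik itself does it. The function name `request_scheme` is made up, and the header is trusted only on the assumption that the instances are reachable solely through the load-balancer.

```python
def request_scheme(environ: dict) -> str:
    """Return "https" or "http" as seen by the visitor, not by the backend socket.

    environ is a CGI/WSGI-style mapping in which the X-Forwarded-Proto header
    appears as HTTP_X_FORWARDED_PROTO.
    """
    forwarded = environ.get("HTTP_X_FORWARDED_PROTO", "")
    # With chained proxies the header can be a comma-separated list; the first
    # entry is the hop closest to the visitor.
    proto = forwarded.split(",")[0].strip().lower()
    if proto in ("http", "https"):
        return proto
    # No usable header: fall back to what the backend connection itself used.
    return environ.get("wsgi.url_scheme", "http")
```

For example, `request_scheme({"HTTP_X_FORWARDED_PROTO": "https", "wsgi.url_scheme": "http"})` gives `"https"` even though the instance itself received plain HTTP.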

Do I need to do anything special to keep Piwik functioning? I believe the JS tracking code will appropriately render https links since that is what the client sees. When I create sites using the API addSite() function, I passed both the http:// and https:// URLs - was that the appropriate thing to do? (We don’t need to distinguish from a tracking perspective.)

The only other task I haven’t performed yet is to get Piwik using our global memcached-compatible cache (actually, Amazon’s AWS ElastiCache). The web instances do have APC installed, though I don’t know if Piwik uses that explicitly or not (aside from the PHP source caching that it does).

Thanks.

Although Piwik is installed on its own domain, it is still served by the same set of web instances behind the load-balancer as the tracked sites. So, even HTTPS requests to the Piwik domain will actually be passed through as HTTP to Piwik.

I remember previously users using https → http proxy, and I think it’s working fine. But only testing can tell.

Do I need to do anything special to keep Piwik functioning? I believe the JS tracking code will appropriately render https links since that is what the client sees. When I create sites using the API addSite() function, I passed both the http:// and https:// URLs - was that the appropriate thing to do? (We don’t need to distinguish from a tracking perspective.)

You don’t need to pass both http and https; one is enough (Piwik only matches on the domain name).

The only other task I didn’t perform yet, was to get Piwik using our global memcached compatible cache (actually, Amazon’s AWS ElastiCache).

Piwik does not cache data this way; it only uses MySQL for fast lookups of pre-processed reports.

Make sure you read: Piwik & High traffic websites

Thanks for the informative post.

David: Have you checked out WP-Piwik? We have recently rolled it out on our multisite installation and it’s worked pretty well for us.
It’s still a bit buggy but shows promise. It also allows each multisite user to see their own stats.

It seems to be working well for us since I excluded IPs 10.*.*.* from the forward proxy headers. Haven’t gotten SSL set up yet though (unrelated to Piwik). Thanks for the info.
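For readers hitting the same 10.x.x.x problem, the idea behind that fix can be sketched in Python as follows. This shows the general technique of skipping the load-balancer's private addresses when reading X-Forwarded-For. It is not Piwik's actual implementation, and the function name `client_ip` is made up for the example.

```python
import ipaddress

# The load-balancers in this setup live in the private 10.0.0.0/8 range.
PROXY_NETWORK = ipaddress.ip_network("10.0.0.0/8")


def client_ip(forwarded_for: str, remote_addr: str) -> str:
    """Pick the visitor's IP from an X-Forwarded-For value, ignoring proxy hops.

    forwarded_for is the raw header value ("client, proxy1, proxy2");
    remote_addr is the address of the socket peer, used as a fallback.
    """
    candidates = [part.strip() for part in forwarded_for.split(",") if part.strip()]
    # Walk from the hop nearest to us back towards the visitor and return the
    # first address that is not one of our own proxies.
    for candidate in reversed(candidates):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # skip malformed entries such as "unknown"
        if addr not in PROXY_NETWORK:
            return str(addr)
    return remote_addr
```

With a header of `203.0.113.7, 10.0.1.5` and a socket peer of `10.0.1.9`, this returns `203.0.113.7` rather than either load-balancer address.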

@littleguy - thanks for the tip. We actually have our own custom plugin that contains a lot of logic specific to our application, and I just threw the tracking code in there (especially since it needs to use the Piwik API to look up site IDs and poke them into our database of site domains and properties). Currently, we only have 1 user - us. We’re not using WP MS in a way that allows any users to sign up or create sites, but more as a platform for creating hundreds of sites that have partly static content and partly dynamic content that varies according to the specific site (we’ve implemented our own in-page WP templating engine for the designers to use).