Performance effects of archiving 22000+ custom variables


#1

Hi Piwik forums.

I am looking for feedback regarding FAQ54, which states "For performance reasons, Piwik will by default only archive the first top 500 pages, top 1000 keywords, top 1000 websites, Custom Variables, etc."
http://piwik.org/faq/how-to/faq_54/#faq_54

Does anyone have experience with increasing the archive size for “top downloads” or “custom variables”, and the effect this has on the speed of various calls to the Piwik API? I need to get “views” and “downloads” based around a code that might be stored in a page’s custom variables. It looks like the ecommerce module/plugin, which lets you track views of products, works in a similar way, by storing the product code in custom variable 3.

It might be that I should simply add an index to the custom variable being used instead, assuming this would have no negative effect whilst collecting the raw data and would speed up certain requests to the reporting API.

Background

A bit of background on why I’m asking this, plus a question about whether a plugin already exists and a possible alternative. Apologies if it is a bit long for my first post to the Piwik forum.

I have a dataset of 20,000+ codes I need to track, increasing by at least 5,000 per year over the next few decades. I can’t predict which of these will be queried. For example, one report would have required querying the API 1300*4 times. Another requirement is that, for a given code, I need total downloads for all associated files broken down by month over a 12-48 month period, arranged by academic year (e.g. Aug 2013 - Jul 2014).
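The academic-year grouping can be sketched as a small helper (my own illustration, assuming years run August to July and a "2013/14"-style label):

```python
from datetime import date

def academic_year(d: date) -> str:
    """Return the academic-year label for a date, assuming years run Aug-Jul."""
    start = d.year if d.month >= 8 else d.year - 1
    return f"{start}/{str(start + 1)[-2:]}"
```

For example, both September 2013 and July 2014 fall into "2013/14", so monthly totals can be bucketed by this label before sorting.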

We currently use the default URL stored by Piwik when recording data. This means our “codes” are embedded in at least 2 different URLs, depending on how the user accessed the item. This requires at least 4 calls to the API to get stats for views and downloads.
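To make the “4 calls” concrete, here is a sketch of how the requests could be built. The host, site id, token, item code, and repository URLs are all hypothetical; Actions.getPageUrl and Actions.getDownload are the standard reporting methods for looking up a single page or download URL:

```python
from urllib.parse import urlencode

PIWIK = "https://stats.example.org/index.php"  # hypothetical Piwik instance
TOKEN = "anonymous"                            # replace with a real token_auth

def api_url(method: str, **params) -> str:
    """Build a Piwik Reporting API request URL."""
    query = {"module": "API", "method": method, "idSite": 1,
             "period": "year", "date": "today", "format": "JSON",
             "token_auth": TOKEN, **params}
    return f"{PIWIK}?{urlencode(query)}"

code = "uuid:1c5c4d13"  # example code embedded in two different URL forms
requests_needed = [
    api_url("Actions.getPageUrl", pageUrl=f"https://repo.example.org/item/{code}"),
    api_url("Actions.getPageUrl", pageUrl=f"https://repo.example.org/objects/{code}"),
    api_url("Actions.getDownload", downloadUrl=f"https://repo.example.org/item/{code}/file.pdf"),
    api_url("Actions.getDownload", downloadUrl=f"https://repo.example.org/objects/{code}/file.pdf"),
]
```

With two URL forms per item and separate methods for views and downloads, one file already costs four requests, which is why the per-code totals below get expensive.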

Possible plugin

My current solution is to pre-compile views and downloads per code offline so totals for both can be retrieved with a single query when required.

A prototype solution (which you could argue should be implemented as a plugin) runs a cronjob that caches views and downloads that have occurred since it was last run. I currently store these in a separate DB that has a lot of indexes. Whilst this does produce pre-sorted results very quickly, it doesn’t use the API.
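The cache-update step can be sketched roughly like this (my own illustration, assuming the cronjob has already extracted raw (code, action type, date) rows from the Piwik log tables since the last run):

```python
from collections import defaultdict
from datetime import date

def update_cache(rows):
    """Aggregate raw (code, action_type, date) rows into per-code, per-month totals.

    `rows` is an iterable of (code, action_type, date) tuples pulled from the
    Piwik log tables; action_type is 'view' or 'download'.
    """
    totals = defaultdict(int)
    for code, action_type, d in rows:
        totals[(code, action_type, d.strftime("%Y-%m"))] += 1
    return dict(totals)

# Example run over three raw rows for one code:
rows = [("X", "download", date(2014, 1, 5)),
        ("X", "download", date(2014, 1, 9)),
        ("X", "view", date(2014, 2, 1))]
totals = update_cache(rows)
```

In the real prototype these totals are merged into the indexed cache DB, so a single indexed lookup answers “views and downloads for code X”.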

Does anyone know of a plugin that does a similar job? Perhaps adding 2 new API calls:

  1. For code X, return page views and total downloads for all files associated with this code.
    A page like this would use it: ORA Conference: "INSTITUTIONAL WORK TAKEN LITERALLY: HOW INSTITUTIONAL LOGICS SHIFT AS BANKING LAWYERS ‘GET THE DEAL DONE’" - uuid:1c5c4d13-04ac-4cb1-9d63-ee1fc313144f

My worst test case had 10 files, requiring at least 20 API calls to get “total downloads”. An earlier prototype ended up enabling a single query using something like this, where a cronjob would update custom_var_v2 based upon a field taken from the action table:
select count(*) from log_link_visit_action where custom_var_v1 = 'code' and custom_var_v2 = 'download'

  2. For code X, give me total downloads of all associated files per month over the last M months.
    The separate DB has a lot of derived variables that can be recreated from the raw Piwik data when needed, which means I can roughly do this (pseudocode):
    select count(*) from cache_table where code = 'code' and type = 'download' group by month, sorted by academic year and month, limited to the last 4 years

The time taken to return results does not seem to vary much whether I request 12 or 48 months’ worth of figures, or with the number of files associated with a code.

I am not sure how the archives Piwik creates per month work, or if and how they affect calls to the views and downloads API. Does getting multiple months’ worth of data just query the log tables directly, since this is usually quick enough? Or does it extract the totals saved in the archive tables?

Change the URLs stored instead?

Could I replace the default URLs stored by Piwik when recording page views and downloads so that they only contain our “code”, without the different host names, paths, or filenames? I could then call the views and downloads API with the same code. Looking at the reporting API, it appears I can get breakdowns by month by adding this to the query: &date=last48&period=month
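With &date=last48&period=month the API returns one entry per month, keyed by "YYYY-MM", so the monthly totals can be read out of a single response. A sketch of the shape (this is a made-up sample, not a captured response; nb_hits is the hit-count metric in Actions reports):

```python
import json

# Hypothetical fragment of a period=month&date=last48 JSON response,
# where the rewritten page URL is just the bare code.
sample = json.loads('{"2013-08": [{"label": "uuid:1c5c4d13", "nb_hits": 7}],'
                    ' "2013-09": [{"label": "uuid:1c5c4d13", "nb_hits": 12}]}')

# Collapse each month's rows into a single per-month total.
monthly = {month: sum(row["nb_hits"] for row in rows)
           for month, rows in sample.items()}
```

If this works as hoped, one request per code replaces the 4-plus calls described above.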

Does anyone have experience of how quickly the Piwik API returns multiple months’ worth of data?

Looking at how the data gets recorded in the table, as described on the following page, the “name” field would be duplicated, but the “type” field would let the system distinguish between views and downloads when I call the API.
http://developer.piwik.org/guides/persistence-and-the-mysql-backend#action-types

My impression is this would work, but we would lose the ability to track downloads for individual files if a code had multiple files associated with it.

Nearly finished :slight_smile:

I’m writing a script that will take our existing data and prepare it in various ways for loading into a more recent version of Piwik, meaning I can try out different options. If you can answer any of my questions, it might save me trying something that would have no impact.

Thanks for your time.

Martin


(Matthieu Aubry) #2

The reports are pre-processed if you set up: How to Set up Auto-Archiving of Your Reports - Analytics Platform - Matomo

so the API should be very fast. However, archiving will be slower. Yes, it should work with 22,000 custom variables. Please let me know here how it goes for you! enjoy


#3

Thanks Matt for the quick reply.

Will try upping the totals and see what happens. But first I need to populate the custom variables from our existing data by extracting codes from URLs.

Martin