Requesting VisitsSummary.get by a custom variable segment uses the log tables?

Hello,

I’m requesting data from piwik with lots of generated requests.
A segment with the customVariablePageValue1 is in use.

The request against the API is

?module=API&method=VisitsSummary.get&idSite=1Site&period=day&date=yesterday&format=JSON&token_auth=&segment=customVariablePageValue1==16419

As the request got quite slow I’m checking the database queries here and find stuff like the following:


SELECT
				count(distinct log_inner.idvisitor) as nb_uniq_visitors,
				count(*) as nb_visits,
				sum(log_inner.visit_total_actions) as nb_actions,
				max(log_inner.visit_total_actions) as max_actions,
				sum(log_inner.visit_total_time) as sum_visit_length,
				sum(case log_inner.visit_total_actions when 1 then 1 when 0 then 1 else 0 end) as bounce_count,
				sum(case log_inner.visit_goal_converted when 1 then 1 else 0 end) as nb_visits_converted
			
			FROM
				(
			SELECT
				log_visit.idvisitor,
				log_visit.visit_total_actions,
				log_visit.visit_total_time,
				log_visit.visit_goal_converted
			FROM
				piwik_log_visit AS log_visit
				LEFT JOIN piwik_log_link_visit_action AS log_link_visit_action ON log_link_visit_action.idvisit = log_visit.idvisit
			WHERE
				( log_visit.visit_last_action_time >= '2013-02-18 23:00:00'
				AND log_visit.visit_last_action_time <= '2013-02-19 22:59:59'
				AND log_visit.idsite = '1'
			 )
				AND
				( log_link_visit_action.custom_var_v1 = '16419' )
			GROUP BY log_visit.idvisit
				) AS log_inner


So that API request is using the log tables which are not supposed to hold data for too long (max. 30 days) here.

Is that the way its supposed to be? Why is it not using the archived data? If so, is it a problem with segmentation at all or where does this come from? If its a limitation of segmentation its a big drawback here.

Yes it’s a limitation and yes it’s a big drawback, but it’s by design, to be able to “segment” across any custom dimensions, this requires the logs… The “archived data” contains only aggregates.

Please keep the logs in place. I know it’s a problem to scale. Maybe we could improve this aspect, to move older logs to another table (eg. yearly logs) that will be queried instead of the “current” log tables if the date is old.

Previously there was also archiving of segments if I remember correctly. Was that removed on purpose?

Its already aching slow with data from 2 days - that spoils the whole concept for the use case here :(.
I guess I have to aggregate the data from the logs into another table with manual processing to have it available in the future without the lots of logs and have it quickly available.

It’s possible to setup Segments, so they are pre-archived. But, one needs to know the segment before hand (to have pre-processed history).

To pre-process segment you can add them to config.ini.php see: matomo/global.ini.php at master · matomo-org/matomo · GitHub

It is slow, that’s for sure. I guess many people will complain about this, and will be unhappy.
Unfortunately, unless we use column based database engine, these requests are just slow to run on big websites.

It should work really fast on sites with < 1000 pages per day or so, but will run slower as traffic is bigger.