Usage of RegEx in Custom Dimensions

Hello All,

I have a question about content grouping with RegEx URL extractions. (We use Matomo On-premise.)

The goal: we have a multilingual site, and I want to create a custom action dimension so that I can group the traffic of publications that are spread across different language pages.

The URLs are built like this:

www.sitename.com/two-letter-language-identifier/section-name/translated-article-name/article-ID

Here are three example URLs covering a single article published in three different languages:

sitename/en/news/article-on-horses/12305
sitename/de/news/artikel-über-pferde/12305
sitename/fr/news/cikula-ata-lolalala/12305

So I need to be able to group and report on any content by any of the following:

  1. Language (en, de, fr, etc)
  2. Content category (section name, like news, articles, videos, etc)
  3. Content ID (various digits)

and to combine these groupings in reports, like the following:

ID 12305 (63 visits)

  • en (30 visits)
  • fr (12 visits)
  • de (21 visits)

or

fr (French) (135000 visits)

  • ID 1263 (123 visits)
  • ID 124 (1241 visits)
  • ID 4236 (1114 visits)

or

News

  • en (3500 visits)
  • fr (12000 visits)
  • de (213400 visits)
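To make the desired rollups concrete, here is a rough Python sketch of the nested grouping. The visit rows and the helper name `rollup` are made up for illustration; it only assumes that each visit has already been reduced to a (language, section, content ID) triple.

```python
from collections import Counter, defaultdict

# Hypothetical (language, section, content_id) triples, one per visit,
# as they might come out of a URL extraction step.
visits = [
    ("en", "news", "12305"),
    ("en", "news", "12305"),
    ("fr", "news", "12305"),
    ("de", "news", "12305"),
    ("de", "videos", "777"),
]

FIELD_INDEX = {"language": 0, "section": 1, "content_id": 2}

def rollup(rows, outer, inner):
    """Count visits grouped by the outer field, then by the inner field."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[FIELD_INDEX[outer]]][row[FIELD_INDEX[inner]]] += 1
    return table

# Report shaped like "ID 12305 -> visits per language".
by_id = rollup(visits, "content_id", "language")
for content_id, per_language in by_id.items():
    print(f"ID {content_id} ({sum(per_language.values())} visits)")
    for language, count in per_language.items():
        print(f"  • {language} ({count} visits)")
```

Swapping the arguments, for example `rollup(visits, "section", "language")`, would give the "News by language" style of report.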

I’ve successfully used the following RegEx (grouping only by content ID) to extract data from URLs (containing the news section) in Google Analytics, and it worked fine:

\/news\/.*?\/([^&|?]+)

When I tested this in Matomo via Custom Dimensions (as an action dimension, and after invalidating past reports), it worked in some cases this January. Even then, however, the results were not grouped by the capturing group in the regex but by the “translated article name”, which means there were as many entries as there are translations. That defeats the purpose. Is this a bug or a syntax error on my side? After a few weeks it stopped working altogether, even though I know there is related traffic.

I then continued testing and extended my regex to group by the two-letter language code (group 1), the content category (group 2) and the content ID (group 3), like this:

sitename.com\/(\w{2})\/(news)\/.*?\/(\d{1,})
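For reference, a minimal Python sketch of what the three capturing groups hold when a standard regex engine runs this kind of pattern over the example URLs. It says nothing about how Matomo itself evaluates the pattern. Assumptions: the host is literally `sitename.com`, the section is widened from the literal `news` to any path segment, and the function name `extract_parts` is hypothetical.

```python
import re

# Based on the pattern from the post. Changes (assumptions for this sketch):
# the dot in the host is escaped, and the section group accepts any path
# segment instead of only the literal "news".
PATTERN = re.compile(r"sitename\.com/(\w{2})/([^/]+)/.*?/(\d+)")

def extract_parts(url):
    """Return (language, section, content_id), or None if the URL does not match."""
    match = PATTERN.search(url)
    return match.groups() if match else None

urls = [
    "www.sitename.com/en/news/article-on-horses/12305",
    "www.sitename.com/de/news/artikel-über-pferde/12305",
    "www.sitename.com/fr/news/cikula-ata-lolalala/12305",
]

for url in urls:
    print(extract_parts(url))
```

Splitting the match into three separate values mirrors the option of using one dimension per grouping; whether a single Matomo dimension can make use of more than one group is exactly the open question here.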

Could this logic with multiple capturing groups (subexpressions) work for the goal explained above, or do I need to create a separate custom dimension for each of these groupings?

Would you please advise me how to solve this issue?
Is there a specific RegEx syntax to be used here?

Many thanks,
Aladar

Does anyone know the answer?

In the meantime, I’ve experimented with different regex extractions and had partial results.

Again, the URL looks like this example:
https://www.sitename/en/news/article-on-horses/12305

.*?\/(\d{1,})
This gives results but, interestingly, it returns ANY number found ANYWHERE in the page URL, not only what the capturing group requests. When I test it on regex101.com, on the other hand, it works as desired: group 1 is the ID at the end of the URL. So it is Matomo that interprets the pattern differently.
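One thing that can be checked outside Matomo: in a standard engine, an unanchored pattern like this captures the first slash-plus-digits it meets, which is only the trailing ID when no earlier path segment starts with a digit. The sketch below uses Python's `re` module; the second URL and the anchored variant are hypothetical, and it assumes URLs without a query string or fragment. Whether this explains the Matomo behaviour is not confirmed.

```python
import re

url = "https://www.sitename/en/news/article-on-horses/12305"
# Hypothetical URL whose article slug starts with digits.
tricky = "https://www.sitename/en/news/2024-horse-show/12305"

unanchored = re.compile(r".*?/(\d{1,})")
# Variant tied to the end of the URL (an optional trailing slash is allowed).
anchored = re.compile(r"/(\d+)/?$")

def first_group(pattern, text):
    """Return capturing group 1 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None

print(first_group(unanchored, url))     # 12305
print(first_group(unanchored, tricky))  # 2024
print(first_group(anchored, tricky))    # 12305
```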

\/(news)\/.*?\/(\d{1,})
When I enter this stricter extraction pattern, it returns no results at all (even after using Invalidate Reports).

Additionally, another issue has appeared: even after using Invalidate Reports, the system only shows newly collected data from the past 24 hours and nothing from the earlier period, even though there was traffic then, and that traffic is visible in the normal page reports.