New Index Need For piwik_log_visit Table And Code Change for PHP as well


(Borrenjeng) #1

Hi,

We have close to 83 millions rows of records for the table “piwik_log_visit” at our Matomo (v3.7.0) with MySQL (v5.7.24).

It takes more than 21 minutes to execute the following SQL (see bewlo #1).
Then I created an index with “force index” to the query (see below # 2) . the same query goes from 21 minutes to 86 seconds.

Can the new index and “force index” to be included in the future Matomo release (Visit.php or other php files)? Thank you~

Borren

#1: Before the new Index was created

mysql> explain SELECT
    -> count(distinct log_visit.idvisitor) AS `1`,
    -> count(distinct log_visit.user_id) AS `39`
    -> FROM
    -> piwik_log_visit AS log_visit
    -> WHERE
    -> log_visit.visit_last_action_time >= '2019-01-01 05:00:00'
    -> AND log_visit.visit_last_action_time <= '2019-02-01 04:59:59'
    -> AND log_visit.idsite IN ('1');
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
| id | select_type | table     | partitions | type | possible_keys                                                             | key                          | key_len | ref   | rows     | filtered | Extra                 |
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
|  1 | SIMPLE      | log_visit | NULL       | ref  | index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | index_idsite_config_datetime | 4       | const | 36788570 |    11.11 | Using index condition |
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
1 row in set, 1 warning (0.00 sec)

SELECT
COUNT(DISTINCT log_visit.idvisitor) AS `1`,
COUNT(DISTINCT log_visit.user_id) AS `39`
FROM
piwik_log_visit AS log_visit
WHERE
log_visit.visit_last_action_time &gt;= '2019-01-01 05:00:00'
AND log_visit.visit_last_action_time &lt;= '2019-02-01 04:59:59'
AND log_visit.idsite IN ('1')

**+----------+----+**
**| 1 | 39 |**
**+----------+----+**
**| 32957523 | 11 |**
**+----------+----+**
**1 row in set (21 min 5.10 sec)**

#2
create index ind_piwik_log_visit_01 on fdsys_mysql_piwik.piwik_log_visit(visit_last_action_time,idsite,idvisitor,user_id);

After: (the same query took only 86 seconds)

SELECT 
    COUNT(DISTINCT log_visit.idvisitor) AS `1`,
    COUNT(DISTINCT log_visit.user_id) AS `39`
FROM
    piwik_log_visit AS log_visit
    FORCE INDEX (ind_piwik_log_visit_01)
WHERE
    log_visit.visit_last_action_time >= '2019-01-01 05:00:00'
        AND log_visit.visit_last_action_time <= '2019-02-01 04:59:59'
        AND log_visit.idsite IN ('1')
| 1        | 39 |
+----------+----+
| 34514368 | 11 |
+----------+----+
1 row in set (1 min 26.17 sec)


#2

Hi Borrenjeng, this is a remarkable improvement! Would you be able to say how big the new index is compared to your old index? And do you use the User ID feature at all?


(Borrenjeng) #3

Hi,

  1. No. We don’t use User_Id feature at all.
  2. The new index size is only 78% of the old index (see below).
  3. So are you able to include this change to the new release?

Thank you~

+-----------------+------------------------------+------------+
| table_name      | index_name                   | size_in_mb |
+-----------------+------------------------------+------------+
| piwik_log_visit | index_idsite_config_datetime |    3437.64 |
| piwik_log_visit | ind_piwik_log_visit_01       |    2674.98 |
+-----------------+------------------------------+------------+


#4

That would depend on multiple factors, including whether the size of the index is still reasonable when the user ID feature is used heavily.


#5

How big is your log_visit table, and do you know where this query is generated / from which report or API?


#6

Actually, we know where the query is coming from. The question is: How many entries do you have in log-visit table in total, and can you let us know what numbers the query returns?


(Fabian Dellwing) #7

We have close to 83 millions rows of records for the table “piwik_log_visit” at our Matomo (v3.7.0) with MySQL (v5.7.24).
[…]

**+----------+----+**
**| 1 | 39 |**
**+----------+----+**
**| 32957523 | 11 |**
**+----------+----+**
**1 row in set (21 min 5.10 sec)**

(Borrenjeng) #8

If you are asking size of the table. here you go. yes it is truth with the following
“We have close to 83 millions rows of records for the table “piwik_log_visit” at our Matomo (v3.7.0) with MySQL (v5.7.24).”

Of course, I’m still waiting the answer for the following question.

“3. So are you able to include this change to the new release?”

SELECT
table_name AS `Table`,
round(((data_length + index_length) / 1024 / 1024), 2) `Size in MB`
FROM information_schema.TABLES
WHERE table_name = 'piwik_log_visit';
+-----------------+------------+
| Table           | Size in MB |
+-----------------+------------+
| piwik_log_visit |   28979.45 |
+-----------------+------------+

#9

I reckon we won’t include this in core. We ran the queries on similar DB sizes and it was fairly fast. For most Matomo’s this index would make tracking etc slower without bringing too much of a benefit. I ran it before on a 20GB table and it took 5 seconds. I reckon best would be to update this FAQ: How do I enable “Unique Visitors” metric for Yearly reports and for Custom Date Ranges? - Analytics Platform - Matomo and mention the performance can be improved for high traffic matomo instances if there is a need or if they still want to monthly unique visitors.Alternative is to disable the monthly unique visitors (not a great alternative)


#10

Can you double check whether the force index is needed? Just realising otherwise it won’t be useful to mention the index in the FAQ if it’s not being used.


(Borrenjeng) #11

the “force index” is needed. Without it, it takes index " index_idsite_config_datetime" over the new index.

So I decided to do another experiment.

  1. I dropped the new index ind_piwik_log_visit_01, explain the query without using the “force index”. the 2nd query is without column “idsite” in the WHERE clause. And MySQL gave it a “FULL TABLE SCAN”. it doesn’t even recognize the other indice “index_idsite_datetime or index_idsite_idvisitor” with column “visit_last_action_time”

mysql> explain SELECT       COUNT(log_visit.idvisitor) AS `1`,      COUNT(log_visit.user_id) AS `39`  
FROM     piwik_log_visit log_visit          
WHERE     visit_last_action_time >= '2019-01-0                 1 00:00:00'      
and visit_last_action_time <= '2019-02-00 00:00:00'           
AND idsite IN ('1');
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----                 ------+-----------------------+
| id | select_type | table     | partitions | type | possible_keys                                                             | key                          | key_len | ref   | rows     | fil                 tered | Extra                 |
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----                 ------+-----------------------+
|  1 | SIMPLE      | log_visit | NULL       | ref  | index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | index_idsite_config_datetime | 4       | const | 42417112 |                     11.11 | Using index condition |
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----                 ------+-----------------------+
1 row in set, 2 warnings (0.37 sec)

mysql> mysql> explain SELECT       COUNT(log_visit.idvisitor) AS `1`,      COUNT(log_visit.user_id) AS `39`  
FROM     piwik_log_visit log_visit          
WHERE     visit_last_action_time >= '2019-01-01 00      
and visind visit_last_action_time <= '2019-02-00 00:00:00'    ;
+----+-------------+-----------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | log_visit | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 84834762 |    11.11 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 2 warnings (0.00 sec)

  1. I created the new index again and analyzed the table. Then re-ran the same 2 queries from above.
mysql> create index ind_piwik_log_visit_01 on  piwik_log_visit(visit_last_action_time,idsite,idvisitor);                                                                                                        
Query OK, 0 rows affected (7 min 25.46 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> analyze table piwik_log_visit;
+-----------------------------------+---------+----------+----------+
| Table                             | Op      | Msg_type | Msg_text |
+-----------------------------------+---------+----------+----------+
| fdsys_mysql_piwik.piwik_log_visit | analyze | status   | OK       |
+-----------------------------------+---------+----------+----------+
1 row in set (3.32 sec)

mysql> explain SELECT       COUNT(log_visit.idvisitor) AS `1`,      COUNT(log_visit.user_id) AS `39`  
FROM     piwik_log_visit log_visit          
WHERE     visit_last_action_time >= '2019-01-0                 1 00:00:00'      
and visit_last_action_time <= '2019-02-00 00:00:00'           
AND idsite IN ('1');
+----+-------------+-----------+------------+------+--------------------------------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
| id | select_type | table     | partitions | type | possible_keys                                                                                    | key                          | key_len | ref   | rows     | filtered | Extra                 |
+----+-------------+-----------+------------+------+--------------------------------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
|  1 | SIMPLE      | log_visit | NULL       | ref  | index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor,ind_piwik_log_visit_01 | index_idsite_config_datetime | 4       | const | 38831407 |    50.00 | Using index condition |
+----+-------------+-----------+------------+------+--------------------------------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
1 row in set, 6 warnings (0.81 sec)

mysql> explain SELECT       COUNT(log_visit.idvisitor) AS `1`,      COUNT(log_visit.user_id) AS `39`  
FROM     piwik_log_visit log_visit          
WHERE     visit_last_action_time >= '2019-01-0                 1 00:00:00'      
and visit_last_action_time <= '2019-02-00 00:00:00';
+----+-------------+-----------+------------+-------+------------------------+------------------------+---------+------+----------+----------+--------------------------+
| id | select_type | table     | partitions | type  | possible_keys          | key                    | key_len | ref  | rows     | filtered | Extra                    |
+----+-------------+-----------+------------+-------+------------------------+------------------------+---------+------+----------+----------+--------------------------+
|  1 | SIMPLE      | log_visit | NULL       | range | ind_piwik_log_visit_01 | ind_piwik_log_visit_01 | 5       | NULL | 38831407 |   100.00 | Using where; Using index |
+----+-------------+-----------+------------+-------+------------------------+------------------------+---------+------+----------+----------+--------------------------+
1 row in set, 4 warnings (0.00 sec)

  1. I dropped and re-created the new index with different order of the columns
mysql> drop index ind_piwik_log_visit_01 on piwik_log_visit;
Query OK, 0 rows affected (0.57 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> create index ind_piwik_log_visit_01 on  piwik_log_visit(idsite,visit_last_action_time,idvisitor);

Query OK, 0 rows affected (33 min 52.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql>
mysql> analyze table piwik_log_visit;                                                                                                                                                                                             +-----------------------------------+---------+----------+----------+
| Table                             | Op      | Msg_type | Msg_text |
+-----------------------------------+---------+----------+----------+
| fdsys_mysql_piwik.piwik_log_visit | analyze | status   | OK       |
+-----------------------------------+---------+----------+----------+
1 row in set (0.81 sec)

mysql> explain SELECT       COUNT(log_visit.idvisitor) AS `1`,      COUNT(log_visit.user_id) AS `39`  FROM     piwik_log_visit log_visit          WHERE     visit_last_action_time >= '2019-01-01 00:00:00'  AND idsite IN ('1');
+----+-------------+-----------+------------+------+--------------------------------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
| id | select_type | table     | partitions | type | possible_keys                                                                                    | key                          | key_len | ref   | rows     | filtered | Extra                 |
+----+-------------+-----------+------------+------+--------------------------------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
|  1 | SIMPLE      | log_visit | NULL       | ref  | index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor,ind_piwik_log_visit_01 | index_idsite_config_datetime | 4       | const | 38873071 |    33.33 | Using index condition |
+----+-------------+-----------+------------+------+--------------------------------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+-----------------------+
1 row in set, 1 warning (0.15 sec)

mysql> explain SELECT       COUNT(log_visit.idvisitor) AS `1`,      COUNT(log_visit.user_id) AS `39`  FROM     piwik_log_visit log_visit          WHERE     visit_last_action_time >= '2019-01-01 00:00:00' ;
+----+-------------+-----------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | log_visit | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 77746143 |    33.33 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

Based on my observation, MySQL database utilizes only the index starting with a column that appears in the WHERE clause. If there are multiple options, then it takes what’s available first based on the indices creation sequence. MySQL doesn’t recognize a index if the WHERE clause columns are not at the beginning of the column’s order. For example, The Index “index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor” are all started with column “idsite”.

In case 3, when I re-created index with different column order => ind_piwik_log_visit_01 “idsite,visit_last_action_time,idvisitor”, then 2nd explain at case3 has a full table scan. it doesn’t utilize ind_piwik_log_visit_01

Conclusion: it will be easier for Matomo to include the “force index” into the query. So the query will utilize the best fit index, instead of picking “first available” index.

@thomas_matomo I don’t know what’s your server specs. But when you said “5 seconds” of the query, I assumed you ran it without creating the new index. Also we don’t use “user id” since we don’t track “user id” at all. And all users are anonymous.

here is my specs.

# lscpu|grep -E '^Thread|^Core|^Socket|^CPU'
CPU op-mode(s):        32-bit, 64-bit
CPU(s):                8
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
CPU family:            6
CPU MHz:               2394.231


# free -m
             total       used       free     shared    buffers     cached
Mem:         32109      29548       2560          0        131       6351
-/+ buffers/cache:      23065       9043
Swap:         1023          0       1023

/etc/my.cnf

innodb_buffer_pool_size = 20G
innodb_log_buffer_size = 512M
innodb_log_file_size = 1G
innodb_write_io_threads = 32
innodb_read_io_threads = 32
innodb_io_capacity = 5000
innodb_flush_log_at_trx_commit = 2
innodb_file_per_table = 1
max_allowed_packet = 256M
innodb_flush_method = O_DIRECT
innodb_doublewrite = 0
innodb_support_xa = 0
innodb_checksums = 0
wait_timeout = 30000
max_connections = 250


I’m still hoping you can re-consider including the new index and “force index” to the new release. Thank you~


#12

Thanks for this. We’ll discuss it in the team in around 2.5 weeks and I’ll keep you updated.


(Borrenjeng) #13

@thomas_matomo Really appreciate it. Another reason is the behavior of how MySQL utilizes the index. MySQL picks index by the first available index, not by the best fit index. That’s why “force index” is so important to the new index.

My other solution is to drop and recreate all the table indices with my new index be the first one. But I don’t know what’s the impact to Matomo. What’s your take on this suggestion? Thank you.


#14

My other solution is to drop and recreate all the table indices with my new index be the first one. But I don’t know what’s the impact to Matomo. What’s your take on this suggestion? Thank you.

It’s really hard to tell as there are so many different queries fired. I reckon it should be fine but I would make sure to have slow query log enabled and check if some new queries pop up that are slow but weren’t slow before.


#15

Can you test how long these queries take with your index?

SELECT
COUNT(DISTINCT log_visit.idvisitor) AS `1`,
COUNT(DISTINCT log_visit.user_id) AS `39`
FROM
piwik_log_visit AS log_visit
LEFT JOIN 
piwik_log_link_visit_action AS lva ON lva.idvisit = log_visit.idvisit
WHERE
log_visit.visit_last_action_time > '2019-01-01 05:00:00'
AND log_visit.visit_last_action_time < '2019-02-01 04:59:59'
AND log_visit.idsite IN ('1')
AND lva.idaction_name_ref != 1

and


SELECT
COUNT(DISTINCT log_visit.idvisitor) AS `1`,
COUNT(DISTINCT log_visit.user_id) AS `39`
FROM
piwik_log_visit AS log_visit
WHERE
log_visit.visit_last_action_time > '2019-01-01 05:00:00'
AND log_visit.visit_last_action_time < '2019-02-01 04:59:59'
AND log_visit.idsite IN ('1')
AND log_visit.config_os = 'MAC'

(Borrenjeng) #16

See the results below. Thank you.

mysql> explain SELECT
    -> COUNT(DISTINCT log_visit.idvisitor) AS `1`,
    -> COUNT(DISTINCT log_visit.user_id) AS `39`
    -> FROM
    -> piwik_log_visit AS log_visit
    -> WHERE
    -> log_visit.visit_last_action_time > '2019-01-01 05:00:00'
    -> AND log_visit.visit_last_action_time < '2019-02-01 04:59:59'
    -> AND log_visit.idsite IN ('1')
    -> AND log_visit.config_os = 'MAC'
    -> ;
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+------------------------------------+
| id | select_type | table     | partitions | type | possible_keys                                                             | key                          | key_len | ref   | rows     | filtered | Extra                              |
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+------------------------------------+
|  1 | SIMPLE      | log_visit | NULL       | ref  | index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | index_idsite_config_datetime | 4       | const | 42414052 |     1.11 | Using index condition; Using where |
+----+-------------+-----------+------------+------+---------------------------------------------------------------------------+------------------------------+---------+-------+----------+----------+------------------------------------+
1 row in set, 1 warning (0.00 sec)


mysql> SELECT
    -> COUNT(DISTINCT log_visit.idvisitor) AS `1`,
    -> COUNT(DISTINCT log_visit.user_id) AS `39`
    -> FROM
    -> piwik_log_visit AS log_visit
    -> WHERE
    -> log_visit.visit_last_action_time > '2019-01-01 05:00:00'
    -> AND log_visit.visit_last_action_time < '2019-02-01 04:59:59'
    -> AND log_visit.idsite IN ('1')
    -> AND log_visit.config_os = 'MAC'
    -> ;
+-------+----+
| 1     | 39 |
+-------+----+
| 19716 |  0 |
+-------+----+
1 row in set (6 min 52.84 sec)


mysql> explain SELECT
    -> COUNT(DISTINCT log_visit.idvisitor) AS `1`,
    -> COUNT(DISTINCT log_visit.user_id) AS `39`
    -> FROM
    -> piwik_log_visit AS log_visit
    -> LEFT JOIN
    -> piwik_log_link_visit_action AS lva ON lva.idvisit = log_visit.idvisit
    -> WHERE
    -> log_visit.visit_last_action_time > '2019-01-01 05:00:00'
    -> AND log_visit.visit_last_action_time < '2019-02-01 04:59:59'
    -> AND log_visit.idsite IN ('1')
    -> AND lva.idaction_name_ref != 1
    -> ;
+----+-------------+-----------+------------+------+-----------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+----------+----------+-----------------------+
| id | select_type | table     | partitions | type | possible_keys                                                                     | key                          | key_len | ref                                 | rows     | filtered | Extra                 |
+----+-------------+-----------+------------+------+-----------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+----------+----------+-----------------------+
|  1 | SIMPLE      | log_visit | NULL       | ref  | PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | index_idsite_config_datetime | 4       | const                               | 42414052 |    11.11 | Using index condition |
|  1 | SIMPLE      | lva       | NULL       | ref  | index_idvisit                                                                     | index_idvisit                | 8       | fdsys_mysql_piwik.log_visit.idvisit |        1 |    90.00 | Using where           |
+----+-------------+-----------+------------+------+-----------------------------------------------------------------------------------+------------------------------+---------+-------------------------------------+----------+----------+-----------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> SELECT
    -> COUNT(DISTINCT log_visit.idvisitor) AS `1`,
    -> COUNT(DISTINCT log_visit.user_id) AS `39`
    -> FROM
    -> piwik_log_visit AS log_visit
    -> LEFT JOIN
    -> piwik_log_link_visit_action AS lva ON lva.idvisit = log_visit.idvisit
    -> WHERE
    -> log_visit.visit_last_action_time > '2019-01-01 05:00:00'
    -> AND log_visit.visit_last_action_time < '2019-02-01 04:59:59'
    -> AND log_visit.idsite IN ('1')
    -> AND lva.idaction_name_ref != 1
    -> ;
+----------+----+
| 1        | 39 |
+----------+----+
| 42404139 | 12 |
+----------+----+
1 row in set (35 min 54.32 sec)



#17

As suspected the queries take long because they are not fully index covered compared to the other query. We won’t be adding the index to core as the benefit is minimal and only applies to a few high traffic instances while it makes it worse for > 99% of other Matomo instances because of the additional tracking overhead etc.

If it made the segment queries faster as well this would be a different story. In the end, even if you have only 10 segments, you make this particular query only 10% faster. That might be only a 1% improved of the overall archiving (guessed, depends on data, config, segments, etc).

Hope you understand. Maybe you can still add the index for you and rearrange the order of the index or so and see if it makes things better, or if needed apply a patch after each new release to force the index.


(Borrenjeng) #18

What I don’t understand it the MySQL optimizer choose index index_idsite_config_datetime (idsite,config_id,visit_last_action_time) over the new index ind_piwik_log_visit_01(idsite,visit_last_action_time,idvisitor).
The index index_idsite_config_datetime has a column “config_id”, which is not even in the SELECT statement. It probably picks it because it is the first available index with column “idsite”, and the rest of columns are not important. And I think that’s why the “force index” need to be included.

The % is relevant only if the table has hundred of millions records. I’m still hoping you can re-consider my suggestion. Without the “force index”, MySQL doesn’t know how to make the BEST index choice.