When finally running under load on an East-West Coast (USA) MariaDB Galera Cluster, I pretty regularly get:
[Warning] WSREP: Failed to report last committed 66266a7c-f62f-11ec-86bf-bfd29572294b:33450999, -60 (Operation timed out)
(According to MariaDB/Galera … this is OK: it is just a warning, the operation is supposed to be retried, and the retry must succeed, since no error is logged.)
The config is:
- Baremetal/“internal” cloud (Xen XCP-NG), own colos.
** Cogent Transit, 10Gbps, over IPSEC VPN via Juniper SRX
** ping over IPSEC ~ 70 ms
- MariaDB (10.6.10) with Galera Cluster 26.4.12
** five nodes: two WEST, three EAST
** Using HAProxy to distribute load, but by connection, so 90%+ of DB connections go to node1.
- FreeBSD 13-stable (as a jail) on bare metal
- Web servers (6x, three each coast)
** FreeBSD 13-stable
** running under Xen (XCP-NG)
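As a rough sanity check on what the ~70 ms round trip costs: Galera replicates each write set to the other nodes at commit time, so a commit cannot complete much faster than one round trip to the farthest node. The sketch below only does that arithmetic. The one-round-trip-per-commit model is a simplifying assumption, and `max_sequential_commits_per_sec` is a hypothetical helper name, not a Galera or MariaDB API.

```python
def max_sequential_commits_per_sec(rtt_ms: float) -> float:
    """Upper bound on back-to-back commits from ONE connection,
    assuming every commit waits for at least one network round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return 1000.0 / rtt_ms


if __name__ == "__main__":
    rtt_ms = 70.0  # ping over IPSEC, taken from the setup above
    rate = max_sequential_commits_per_sec(rtt_ms)
    print(f"{rate:.1f} commits/s per connection")  # prints "14.3 commits/s per connection"
```

Under that assumption a single connection doing autocommit inserts tops out at roughly 14 commits per second, so throughput has to come from many parallel connections, each of which is held open longer per write.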
At some point, some process blocks, locks, or runs long, and then ultimately I get:
[Warning] Aborted connection 0 to db: 'unconnected' user: 'unauthenticated' host: 'connecting host' (Too many connections)
And the entire cluster locks up.
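To catch this before connections run out, one option is to poll each node's `SHOW GLOBAL STATUS` and flag early signs. The sketch below is a minimal example that works on an already-fetched dict of status rows, so it needs no DB driver. `Threads_connected`, `wsrep_flow_control_paused` and `wsrep_local_state_comment` are real status variables; the function name `cluster_warnings` and both thresholds are assumptions chosen for illustration, not recommended values.

```python
def cluster_warnings(status, max_connections, conn_ratio=0.8, paused_ratio=0.1):
    """Return a list of human-readable warnings for one node.

    status: dict of SHOW GLOBAL STATUS rows (name -> value; values may be strings).
    conn_ratio / paused_ratio: arbitrary illustrative thresholds (assumptions).
    """
    warnings = []

    # Connection headroom: "Too many connections" appears once this hits the limit.
    connected = int(status.get("Threads_connected", 0))
    if max_connections > 0 and connected >= conn_ratio * max_connections:
        warnings.append(f"connections at {connected}/{max_connections}")

    # Fraction of time replication was paused by flow control
    # since the last FLUSH STATUS.
    paused = float(status.get("wsrep_flow_control_paused", 0.0))
    if paused >= paused_ratio:
        warnings.append(f"flow control paused {paused:.0%} of the time")

    # A healthy, caught-up node reports "Synced".
    state = status.get("wsrep_local_state_comment", "Synced")
    if state != "Synced":
        warnings.append(f"node state is {state}")

    return warnings
```

Feeding it the status of every node on a short interval should show whether flow control pauses start before the connection count climbs, which would point at a slow or blocked node rather than at raw load.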
The workaround for now, which has been running for about 7-8 days without issue, is that I redirected all traffic to the three EAST Coast web servers, so that all inserts happen on the EAST Coast DB nodes. This is obviously not a solution.
I suspect this is a lock or block in …
As I write this, it occurs to me that I had core:archive running in both colos, and thus against different DB nodes, every 5 minutes. It is possible that some of these jobs ran longer than 5 minutes, that core:archive may not have locking, and that I should ensure only one core:archive process is running at a time.
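A minimal sketch of guarding the cron job so that a new run is skipped while the previous one is still going. It uses a non-blocking `flock` on a lock file; the lock path and the name `run_exclusively` are hypothetical. Note the assumption: a file lock only serializes runs on a single host, so it does not stop the two colos from overlapping with each other. For that, the job would have to run in one colo only.

```python
import fcntl
import subprocess

LOCK_PATH = "/tmp/matomo-archive.lock"  # hypothetical path, pick your own


def run_exclusively(cmd, lock_path=LOCK_PATH):
    """Run cmd unless another run on this host still holds the lock.

    Returns the command's exit code, or None if the run was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # previous run still active, skip this one
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(cmd).returncode
```

In the crontab, the wrapper would be called with whatever core:archive command line the job uses today, passed as a list of arguments. A skipped run returns `None`, which makes it easy to log how often archiving actually takes longer than 5 minutes.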
I’m also looking at possibly moving to plugin-QueuedTracking, though I would love this to be RabbitMQ based, with active pub/sub instead of polling/batch.
Is anyone else using Matomo + MariaDB Galera Cluster across medium- to high-latency connections? Any tips or hints, or could you share your config/setup?