High Availability Cluster with MariaDB Galera

Hi all,

I would like to share the following idea with you and would appreciate feedback from you experts.

  • 3 nodes in a cluster connected via MariaDB Galera (multi-master replication)
  • Unfortunately the nodes communicate via public IP; I guess internal IPs would make more sense in terms of latency (if the VPS provider offers internal IPs)?
  • DNS round robin on track.example.com so that every node receives roughly the same load
  • Config files, DB sessions, and tmp clearing shared between the 3 nodes
  • If one node fails, the other 2 should handle the requests
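For reference, a minimal sketch of what the per-node Galera configuration for such a 3-node cluster might look like (all IPs, paths, and names below are placeholders, and `rsync` is only one possible SST method):

```ini
# /etc/mysql/conf.d/galera.cnf -- minimal sketch, not a complete tuning guide
[mysqld]
binlog_format            = ROW
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2

[galera]
wsrep_on              = ON
wsrep_provider        = /usr/lib/galera/libgalera_smm.so
wsrep_cluster_name    = "tracking_cluster"
# prefer internal/private IPs here if your VPS provider offers them
wsrep_cluster_address = "gcomm://10.0.0.1,10.0.0.2,10.0.0.3"
wsrep_node_address    = "10.0.0.1"   # this node's own address
wsrep_node_name       = "node1"
wsrep_sst_method      = rsync
```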

Any thoughts on this setup? Do you think this is reliable? In my first tests I saw that the nodes need a bit more RAM due to the replication, but I am still testing…

Cheers
Michael

Hi,
Because of the synchronous replication, latency can increase if you write to all servers simultaneously, and additionally there may be row conflicts. You can check your cluster status with tools like jayjanssen/myq_gadgets on GitHub (note that myq_gadgets has been deprecated in favour of myq-tools).
Network latency is also a limiting factor for write throughput, so it is best to use internal connections (some VPS providers do offer internal IPs).
To simplify the configuration, you can also run two database nodes plus an additional arbitrator.
What do you use for the DNS configuration?
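The two-nodes-plus-arbitrator idea mentioned above can be sketched with garbd, the Galera Arbitrator daemon, which takes part in quorum voting without storing any data. Assuming a Debian-style defaults file (addresses and cluster name below are placeholders):

```shell
# /etc/default/garb -- sketch; run garbd on a third, cheap machine
# Nodes of the cluster (the two real database nodes)
GALERA_NODES="10.0.0.1:4567 10.0.0.2:4567"
# Must match wsrep_cluster_name on the database nodes
GALERA_GROUP="tracking_cluster"
```

With the arbitrator present, the cluster still has 3 votes, so losing one database node does not cost you quorum.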

Thanks @jlubzinski, I should test other VPS providers with internal IPs and compare the latency.

Galera guarantees no row conflicts due to its synchronous replication. But I face the issue that if one node is down, the round robin rotation will still route traffic to it. I configure the DNS via a kind of load balancer from my VPS hosting provider.

Or do you think I should put just one server live, collect all the inbound traffic there, and monitor it via heartbeat? If the primary server fails, I could switch to the hot standby server. This would mean a downtime of a few seconds and maybe also some data loss… what do you think?
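The hot-standby idea with a floating IP could be sketched with keepalived (VRRP); whether a floating public IP works at all depends on your VPS provider, and the interface name and addresses below are placeholders:

```shell
# /etc/keepalived/keepalived.conf -- sketch of the primary/standby idea
vrrp_instance tracking {
    state MASTER            # set to BACKUP on the standby node
    interface eth0
    virtual_router_id 51
    priority 100            # lower value (e.g. 90) on the standby
    advert_int 1
    virtual_ipaddress {
        203.0.113.10        # floating IP that track.example.com points at
    }
}
```

If the MASTER stops advertising, the BACKUP takes over the virtual IP within a few seconds, which matches the "some seconds of downtime" you describe.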

I have read something about queued tracking. Is there any way to collect inbound traffic into CSV files on a simple but stable server and push the CSV files every 10 minutes to a PHP/SQL server to let it aggregate?
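The spool-and-push idea could look like this minimal Python sketch. Everything here is hypothetical: the event fields, the spool path, and the `push` callback (which would stand in for whatever upload you do to the aggregating PHP/SQL server every 10 minutes):

```python
import csv
import os

FIELDS = ["timestamp", "url", "visitor_id"]  # hypothetical event schema

def spool_event(spool_path, event):
    """Append one tracking event to a local CSV spool file."""
    new_file = not os.path.exists(spool_path)
    with open(spool_path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(FIELDS)  # write header once
        writer.writerow([event[k] for k in FIELDS])

def flush_spool(spool_path, push):
    """Read all spooled rows, hand them to `push` (e.g. an HTTP POST to
    the aggregation server), and delete the spool only on success."""
    if not os.path.exists(spool_path):
        return 0
    with open(spool_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if rows:
        push(rows)              # placeholder for the real upload
        os.remove(spool_path)   # only reached if push() did not raise
    return len(rows)
```

Because the spool is only deleted after a successful push, a short outage of the aggregation server just makes the next batch bigger instead of losing events.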

Cheers

Synchronous replication is exactly why conflicts may occur: in your setup the same rows can be updated at the same time on multiple nodes. Galera can manage this in most cases, but a conflict still affects (rolls back) the whole transaction.

As load balancer you can also use glb or haproxy (both with additional stats about connections etc. and fast failover).
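An HAProxy setup with health checks would also solve the round-robin-to-a-dead-node problem, since a failed node is removed from rotation automatically. A sketch, with placeholder IPs and ports:

```shell
# /etc/haproxy/haproxy.cfg (fragment) -- sketch, not a tuned config
listen galera
    bind 0.0.0.0:3306
    mode tcp
    balance leastconn
    # a node that fails 3 consecutive checks is taken out of rotation
    server node1 10.0.0.1:3306 check inter 2000 rise 2 fall 3
    server node2 10.0.0.2:3306 check inter 2000 rise 2 fall 3
    server node3 10.0.0.3:3306 check inter 2000 rise 2 fall 3

listen stats
    bind 0.0.0.0:8080
    mode http
    stats enable
    stats uri /stats
```

The `stats` page gives you the per-backend connection counts and health status mentioned above.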

The QueuedTracking plugin gives you the possibility to handle peaks and even a little downtime (e.g. during a master DB failover).

I am able and willing to tell you a lot about your ideas, because I did exactly the same over the last few months. It started as a hobby, building some Social Rss Feeds, but now it is 'big business' and the few servers I was renting are not enough anymore.

So I searched and searched and finally found some really awesome solutions. Cheap, very reliable, fast, and no complaining clients. And really, those clients are not even clients, because it is still a hobby of mine and not my work.

You may contact me for sure. One of my sites, powered by Varnish Cache, HAProxy, MariaDB Galera 10.1 clusters, etc., is running on ~800 of the cheapest VPSs available. It cost a lot of time to find out how everything works, but now, finally, after a few years I am able to sleep normally again. Let me know when you need some ideas.

You may take a look at http://mySocialFeeds.com/, which is not the most beautiful site for sure, but technically, 'in the background', it works great. It handles (and no, that is not a joke) 700+ million Social Rss Feeds. You are welcome, and please enjoy your day!

I cut the text in two, otherwise you would not be able to read it anymore. I am using Piwik to generate statistics/analytics for my Social Rss Feeds. I searched a lot for solutions; most people told me that what I wanted was not possible, but with the great Piwik tools and a lot of time it worked out very well. It is not perfect yet, but I am only 41, so there is still some time to optimize it.