Scaling Piwik


(KBergsoe) #1

I have read many of the posts about using non-SQL databases for Piwik. It does not seem like anyone is taking the lead on such a project - and I get the feeling that it would be a significant project to migrate all of Piwik to a non-SQL environment.

We are talking about an idea where we move the piwik_log_action, piwik_log_visit and piwik_log_link_visit_action tables onto a MongoDB server, while keeping all of the rest in MySQL.

If we are able to change the tracking script to insert into MongoDB and develop a Piwik plug-in that could redirect all queries on these tables from Piwik to MongoDB, we could potentially move the most sensitive parts of the equation away from MySQL. These tables are responsible for the vast majority of the DB storage needs and the massive number of inserts. All archived data would be preserved in MySQL - so all reporting except Live Visits should be untouched.
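To make the idea concrete, here is a minimal sketch (Python for illustration only - the real tracker is PHP) of how a hit could be written as a MongoDB document instead of a MySQL row. The `build_visit_action_doc` helper and its field names are assumptions, not the actual Piwik log schema:

```python
import time

def build_visit_action_doc(idsite, visitor_id, url, action_name):
    # Build a log_link_visit_action-style document for MongoDB.
    # Field names are illustrative, not the exact Piwik columns.
    return {
        "idsite": idsite,
        "visitor_id": visitor_id,
        "url": url,
        "action_name": action_name,
        "server_time": int(time.time()),
    }

# With a live MongoDB, the tracker would then do something like:
#   from pymongo import MongoClient
#   MongoClient()["piwik"]["log_link_visit_action"].insert_one(doc)
doc = build_visit_action_doc(1, "ab12cd34", "http://example.org/", "index")
```

The actual insert call is left as a comment so the sketch stands alone; the point is that the write path becomes a single schemaless document insert, with no locking on a shared table.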

Has anybody tried anything in this direction?

Best regards,
KBergsoe


(Matthieu Aubry) #2

Not much work has been done regarding NoSQL database support. See the thread http://dev.piwik.org/trac/ticket/386 and the links to the piwik-hackers discussion for more details.

We were thinking of trying InfiniDB, as it is a MySQL engine and sounds like it would require less work than other tools. However, feel free to try MongoDB and report your experience.


(KBergsoe) #3

[quote=matthieu @ Jul 13 2010, 07:39 AM]Not much work has been done regarding NoSQL database support. See the thread dev.piwik.org/trac/ticket/386 and the links to the piwik-hackers discussion for more details.

We were thinking of trying InfiniDB, as it is a MySQL engine and sounds like it would require less work than other tools. However, feel free to try MongoDB and report your experience.[/quote]

Do you agree that the main issue in scaling Piwik is the inserts into, and the volume of, the mentioned tables?

My problem with InfiniDB is that it would only scale Piwik to the capabilities of a single box. Moving beyond that point you would have to go with InfiniDB Enterprise, which just delivers all high-end Piwik users to Calpont. I also expect that some rewriting of the Piwik core would be needed in order for everything to play smoothly.

MongoDB combined with MySQL would preserve the existing core while opening up a route to scale the tables that grow and push MySQL to its limits. MongoDB will feature auto-sharding by July - so that would take care of some of the scaling proposals already described in ticket 471.

As far as I can see - MongoDB can take the inserts as it does no locking. MySQL would only contain archived data - which should make it capable of handling an enormous amount of pageviews. When you need to scale - another Mongo box will be added and the tables autosharded.

Any other thoughts?

Best regards,


(Thomas Seifert) #4

I assume that using some non-standard database by default will make Piwik useless to shared-hosting customers.
Not a problem for me but probably something to keep in mind. I don’t have a problem with having more scalable solutions ;-).


(Fabian Becker) #5

@Thomas: Well, no-one is going to replace the current MySQL based backend. Just a pluggable alternative with MongoDB / Cassandra would be really nice, although quite hard to implement right now.


(KBergsoe) #6

Don’t you agree that it would make sense to move only the three tables that are the root cause of all performance-related issues while they remain in MySQL?

That should be possible with a redirection strategy. I would love to work it out in a plug-in, but core changes are probably needed.

Best regards.


(Fabian Becker) #7

[quote=KBergsoe @ Jul 15 2010, 01:10 PM]Don’t you agree that it would make sense to move only the three tables that are the root cause of all performance-related issues while they remain in MySQL?

That should be possible with a redirection strategy. I would love to work it out in a plug-in, but core changes are probably needed.

Best regards.[/quote]

I think you got me wrong - I was just pointing out that his concern about losing standard shared-hosting users is unnecessary, as there will always be the option to choose MySQL. I totally agree with you that moving parts or everything to a faster NoSQL backend would most likely increase the scalability of Piwik. Have a look at the trac ticket matthieu gave you and work out a proposal of the required core changes to make NoSQL databases possible; we’ll then see if/when they can be integrated.


(vipsoft) #8

This isn’t mentioned in ticket 386 or 471, but we are planning to further abstract the data access layer to support other databases. If you want to follow development, it’s http://dev.piwik.org/trac/ticket/1368 . The milestone “Features requests - after Piwik 1.0” simply means it isn’t a blocker for the 1.0 release.


(KBergsoe) #9

I have yet to understand why we are pursuing a DB abstraction just to support more SQL servers that all have scalability issues, rather than addressing the core problem of scaling out rather than up.

On another note, a Japanese developer has built a DB engine for MySQL that supports sharding at the DB level. I cannot tell how much would have to be changed in the Piwik core to support it - but this engine looks like a potential way of addressing sharding at the DB level rather than inside the application.

You can read more about the Spider engine at spiderformysql.com, but documentation is scarce (I cannot link here, so please Google the Spider MySQL engine).

There are also a couple of blurry presentations on YouTube where the developer walks through common issues and tasks.

If this engine actually delivers it could be the easiest way to scale Piwik horizontally.

What is your take?


(DriverDan) #10

While I would also prefer a key-value or document store for Piwik instead of SQL, have you considered how far it can go with MySQL? A MySQL database set up for inserts can handle a LOT of traffic. Across our sites we’re seeing well over a million visit actions logged a month, with some days having huge traffic spikes. We have master-master replication set up on 2 lower-end quad-core Xeon servers running everything (DBs, Apache, etc.) and never see Piwik having a noticeable impact. On average our server load is 0.1-0.2 on one and 0.4-0.5 on the other, with very infrequent spikes up to 2. This is running all our sites alongside Piwik, with the majority of our DB transactions for all sites being inserts.

You would have to have a top-1000 site running on old hardware to have any real issues with MySQL. As I said, I’d like to see it support other storage options, but I think there are other, higher-priority items that need to be addressed first.


(KBergsoe) #11

[quote=DriverDan @ Jul 21 2010, 02:14 PM]While I would also prefer a key-value or document store for Piwik instead of SQL, have you considered how far it can go with MySQL? A MySQL database set up for inserts can handle a LOT of traffic. Across our sites we’re seeing well over a million visit actions logged a month, with some days having huge traffic spikes. We have master-master replication set up on 2 lower-end quad-core Xeon servers running everything (DBs, Apache, etc.) and never see Piwik having a noticeable impact. On average our server load is 0.1-0.2 on one and 0.4-0.5 on the other, with very infrequent spikes up to 2. This is running all our sites alongside Piwik, with the majority of our DB transactions for all sites being inserts.

You would have to have a top-1000 site running on old hardware to have any real issues with MySQL. As I said, I’d like to see it support other storage options, but I think there are other, higher-priority items that need to be addressed first.[/quote]

How have you set up Piwik to take that amount of traffic?

Are you running multiple frontend servers?

I have worked with a distributed stats setup of the OpenX adserver that utilizes multiple frontends that regularly perform a mass insert into a backend DB (which replicates back to the frontends). This setup should be possible with Piwik, and simpler, as there should be no need for the replication part.

I am curious about how you have implemented Piwik…

Best regards
Kresten


(KBergsoe) #12

[quote=DriverDan @ Jul 21 2010, 02:14 PM]While I would also prefer a key-value or document store for Piwik instead of SQL, have you considered how far it can go with MySQL? A MySQL database set up for inserts can handle a LOT of traffic. Across our sites we’re seeing well over a million visit actions logged a month, with some days having huge traffic spikes. We have master-master replication set up on 2 lower-end quad-core Xeon servers running everything (DBs, Apache, etc.) and never see Piwik having a noticeable impact. On average our server load is 0.1-0.2 on one and 0.4-0.5 on the other, with very infrequent spikes up to 2. This is running all our sites alongside Piwik, with the majority of our DB transactions for all sites being inserts.

You would have to have a top-1000 site running on old hardware to have any real issues with MySQL. As I said, I’d like to see it support other storage options, but I think there are other, higher-priority items that need to be addressed first.[/quote]
One very interesting option is to take a deeper look at the Japanese Spider DB engine for MySQL. If adopted, it would probably be the least invasive to the existing Piwik code, as the main structures would seem to remain intact.
Spider enables vertical partitioning and sharding - the latter being the real key to long-term scalability of Piwik - and having the sharding done at the DB level would be preferable to coding it into the application.

I have asked about Spider in this forum before - but nobody seems to have any experience with it, and the documentation on the Spider website is really limited.


(vipsoft) #13

I already made a note to investigate using the Spider storage engine as an alternative to the Sharding plugin in http://dev.piwik.org/trac/ticket/471 .

Spider has some inherent restrictions that make it hard to use with a schema that is subject to change. There are also many, many tuning parameters. And yes, the documentation is a little sparse…


(DriverDan) #14

[quote=KBergsoe @ Jul 31 2010, 12:37 PM]How have you set up Piwik to take that amount of traffic?

Are you running multiple frontend servers?[/quote]
I described our server setup in my post. 2 servers running everything.


(KBergsoe) #15

Hmmm - sorry I missed that.

I wonder whether master-master replication works beyond two servers…?

Anyway - I think we will follow a track where we try to separate Apache and MySQL, with multiple Apache frontends behind a load balancer. The frontends would then run a cron job to do a bulk insert every minute - thereby taking the load away from the backend MySQL server. This setup should scale beyond 5-10 frontends.
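As a rough sketch of that per-minute cron flush (Python for illustration; the table and column names are assumptions), the job would drain the locally buffered hits and send them to the backend as one multi-row INSERT:

```python
def bulk_insert_sql(table, columns, rows):
    # Render one multi-row INSERT for the per-minute cron flush.
    # Values are quoted naively for illustration; real code must
    # escape properly (e.g. via parameterized queries).
    def quote(v):
        if isinstance(v, (int, float)):
            return str(v)
        return "'%s'" % str(v).replace("'", "''")

    values = ",".join(
        "(%s)" % ",".join(quote(v) for v in row) for row in rows
    )
    return "INSERT INTO %s (%s) VALUES %s;" % (table, ",".join(columns), values)

# Example: flush two buffered visits in a single round-trip.
sql = bulk_insert_sql(
    "piwik_log_visit",
    ["idsite", "location_ip"],
    [(1, "10.0.0.1"), (1, "10.0.0.2")],
)
```

One multi-row statement per minute per frontend keeps the backend's insert rate bounded by the number of frontends, not by the number of pageviews.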

The same approach has been used in OpenX with its distributed stats setup - however in that setup, the backend replicates a number of tables back to the frontends in order for them to deliver the right ads. With Piwik it should be easier as the Apache frontends would not need any information from the backend server.

But we will try and see what happens.

Regards,

PS. I still think that if Piwik is to be a serious contender as a foundation for other web services, it really needs a simple, linearly scalable structure - something really missing in its current state. MySQL is really bad at taking inserts - beyond bulk imports, the options dry up quickly.


(DriverDan) #16

Scaling MySQL beyond 2 servers for something like this without sharding complicates the process significantly and is beyond what I have time to post here. There are plenty of good articles available on the topic.

How much traffic are you expecting to deal with? Based on my experience, I still believe that 2 dedicated MySQL servers with some front-end servers can handle a TON of Piwik traffic.


(KBergsoe) #17

[quote=DriverDan @ Aug 2 2010, 03:28 PM]Scaling MySQL beyond 2 servers for something like this without sharding complicates the process significantly and is beyond what I have time to post here. There are plenty of good articles available on the topic.

How much traffic are you expecting to deal with? Based on my experience, I still believe that 2 dedicated MySQL servers with some front-end servers can handle a TON of Piwik traffic.[/quote]

Double digit million pageviews/day…?


(DriverDan) #18

Well, like I said, a top-1000 site might have scaling issues, and that many page views could be a top-100 site. At that level I’d think you’d have the internal resources to develop your own solution from scratch or add NoSQL support to Piwik.


(KBergsoe) #19

We are building a new web service using the Piwik core for the tracking service. The challenge is that we need to be sure that we can scale Piwik very fast and without a major rewrite. We are inspired by how OpenX (the adserver hosting this forum) has scaled itself using distributed stats on MySQL. We have had it working in a production environment for a year and it scaled very well. As you can imagine, we unfortunately do not have the resources for a major rewrite.

My point is - that for Piwik to become a true platform for other webservices - it has to have scalability beyond what it has today. I am just looking for the easiest way to accomplish that without building everything from scratch.

Maybe the OpenX guys could chip in here?


(KBergsoe) #20

OK - just a small status report:

We have successfully changed the tracking code to post the tracking data into a MongoDB collection instead of MySQL. This makes it possible to receive an obscene number of tracks/s.

We have successfully scripted an export of the 3 log tables from MongoDB into MySQL (via a file-based insert), accelerating the import of data into MySQL significantly.
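For anyone wanting to try the same export step, here is a minimal sketch (Python; the column list and file path are assumptions, not our exact script) that flattens MongoDB documents into a tab-separated file suitable for MySQL's LOAD DATA INFILE:

```python
def docs_to_tsv(docs, columns):
    # Flatten MongoDB documents into tab-separated lines for
    # MySQL's LOAD DATA INFILE. \N marks a missing field as SQL NULL.
    lines = []
    for doc in docs:
        fields = []
        for col in columns:
            v = doc.get(col)
            fields.append("\\N" if v is None else str(v))
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"

# Example: one complete document and one with a missing field.
tsv = docs_to_tsv(
    [{"idsite": 1, "url": "http://example.org/"}, {"idsite": 2}],
    ["idsite", "url"],
)
# The file would then be loaded with something like:
#   LOAD DATA INFILE '/tmp/log_visit.tsv' INTO TABLE piwik_log_visit ...
```

Loading one file per batch is what gives the speed-up over row-by-row inserts: MySQL ingests the whole file in a single bulk operation.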

We have sharded the Piwik MySQL DB using the Spider storage engine (two shards), and we see that the data is now stored on the Spider data nodes rather than in the main Piwik MySQL DB.

Our last challenge is getting the archiving process to work - and we think it will. If we are lucky, the archiving will also be sped up significantly, as the queries run in parallel on each data node rather than on a single table.

Will keep you updated on our progress.