Large deployments with Hadoop and Hypertable support for Piwik

Hi there,

I am looking to use Piwik as the analytics solution for a large number of websites and mobile websites at a national ISP, and my page views will likely be in excess of 250 million per month. I am aware that MySQL will die with these kinds of volumes, so I was wondering if the developers would consider abstracting the database engine to support the Hadoop filesystem over multiple machines and use Hypertable as the database backend.

Piwik could greatly benefit from the implementation of Big Data technologies like Hadoop and Hypertable, and the product could definitely become a mainstream analytics alternative, as companies would be incentivised to take control of their own analytics and gain the flexibility to query it in ways which Google Analytics cannot.

Thank you,
Charles

We would like to work on this, but this is not scheduled in the next 6 months, because we are already under-staffed, over-committed and have a vision for Piwik 2.0 which we will stick to.

However if you have plenty of funding for Piwik improvements, or internal coding ninjas who could help us, then we’d love to hear from you :slight_smile:

Thanks for the response Matt.

I provide high-level digital consulting services to a national ISP in South Africa. As a die-hard open-source advocate and sworn enemy of monopolist organisations, I see the stranglehold which Google has on the web becoming increasingly problematic for me.

As I’ve been using Piwik on a few of my own web projects for some time now, it dawned on me that Piwik could potentially provide a way for companies to regain control of their data and stop being so beholden to the grip of Google as they currently are.

My proposal would perhaps require that the Piwik project split its distro model into a community edition (free and running on MySQL, which is more than adequate for 99% of users out there) and an enterprise edition (free, but you pay for support; this version would provide Big Data integration features with Hadoop/Hypertable, etc.).

Now I do have some experience setting up Hadoop clusters using Cloudera, but my experience with Hypertable is limited. Also, I'm not quite sure what an undertaking like this would cost to develop. Perhaps we could look at a crowdfunding model to raise the necessary funds to have the Hadoop/Hypertable features added…that way the community would still own it and retain full control of its destiny.

Piwik is definitely ripe for a Big Data make-over. Browsing the posts in the forum, where users have large installs with database sizes running into the terabytes, this raw data is begging for a solid platform that can breathe new life into it and let it be analysed in new and as yet unimagined ways.

Raising funding for the development of the Big Data features should really not be too big a challenge…but it would be interesting to hear your views on this.

Thanks Matt.

Hi Matt

I have just received a quote from a company that specialises in Hadoop/Hypertable deployments. They say that they can get everything done to spec for the low cost of only $9K.

Now, I’m not sure about you, but I’m sure we can raise this amount using some crowdfunding mechanism…if every person that uses Piwik could donate $10, I’m sure we could cover this cost and have Piwik become a serious enterprise analytics tool…

Just thought I should let you know…your views?

Well, you are welcome to help us crowdfund. Get in touch at matt@piwik.org if you have time and can work with the Piwik team.

Wouldn’t it be a good idea to set up a little crowdfunding platform like Kickstarter for all the features that need crowdfunding?

I fully agree with you Fabian…perhaps crowdfunding is the way to go with Piwik. Besides Hadoop/Hypertable, how many other features would we be looking at funding @ Matt? How much would we need to raise to get 100% of the items on the feature list done and dusted?

Another solution is to use many MySQL instances. Each Piwik instance connects once to a centralized database (of whatever kind) to get the MySQL server info for the relevant idsite (the one being tracked or worked on; there is almost always one set in Piwik) and sets it as THE Piwik connection info, application-wide.

It may of course not be convenient to run many MySQL instances, but this way load balancing on the HTTP servers is made easy, since each physical Piwik instance is able to handle all idsites while each MySQL instance handles only a few.
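The lookup step described above could be sketched roughly as follows. This is only an illustration of the proposed routing scheme, not anything Piwik actually ships: the `site_db` registry table, the shard hostnames, and the `connection_info` helper are all hypothetical, and an in-memory SQLite database stands in for the centralized registry so the sketch is self-contained.

```python
import sqlite3

# Hypothetical central registry mapping each idsite to the MySQL
# server holding its tracking data (SQLite in-memory stand-in).
registry = sqlite3.connect(":memory:")
registry.execute(
    "CREATE TABLE site_db (idsite INTEGER PRIMARY KEY,"
    " host TEXT, port INTEGER, dbname TEXT)"
)
registry.executemany(
    "INSERT INTO site_db VALUES (?, ?, ?, ?)",
    [
        (1, "mysql-a.internal", 3306, "piwik_shard_a"),
        (2, "mysql-a.internal", 3306, "piwik_shard_a"),
        (3, "mysql-b.internal", 3306, "piwik_shard_b"),
    ],
)

def connection_info(idsite):
    """Return the MySQL connection info for one idsite; a Piwik
    instance would set this as its application-wide DB config."""
    row = registry.execute(
        "SELECT host, port, dbname FROM site_db WHERE idsite = ?",
        (idsite,),
    ).fetchone()
    if row is None:
        raise KeyError(f"no MySQL server registered for idsite {idsite}")
    host, port, dbname = row
    return {"host": host, "port": port, "dbname": dbname}

# A request for idsite 3 is routed to shard B, while every HTTP
# frontend stays identical and interchangeable.
print(connection_info(3)["dbname"])  # piwik_shard_b
```

Because the mapping lives in one small, rarely written table, it is cheap to cache per request, and rebalancing a site onto a new MySQL server only means updating its registry row.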

Any updates on this topic?

How do I use this? Please tell me, thanks.

Are there any developments in this direction?