How to handle extreme peaks of traffic

Hi Guys,

We are now able to handle extreme peaks of traffic using Piwik.
An extreme peak is dozens of millions of tracker requests per second for one or two hours.
We started counting on February 1st, and everything is going well so far.
We called it phase 1; the next step will be to improve the archiving.

The first draft of the how-to is here: Extending Piwik At R7.com

cheers,
-lorieri

Good work, thanks for your info.

our pleasure :slight_smile:

btw, the files are here: GitHub - lorieri/piwik-presentation: piwik-presentation code examples

thanks, good work

small code change:

    public function disconnect()
    {
        // @ suppresses the warning if the connection is already closed
        @mysqli_close($this->connection);
        $this->connection = null;
    }

in core/Tracker/Db.php

Hi,

First, I’m very happy to say we have had an entire month on Piwik and everything is going well.

The end of the month came and the data grew a lot. I had to change PHP’s configuration to handle more data, in piwik.ini at /etc/php5/conf.d/:
memory_limit = 35G

This is driven by the biggest website you have. If you are storing many more actions, but spread across lots of medium-sized websites, it will not be a problem for you.
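
To check that the CLI PHP (the one the cron jobs use) actually picks the override up:

    php -r 'echo ini_get("memory_limit"), "\n";'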

The archive.php is great, but since my data does not arrive in order, archive.sh is the better choice for me: it kind of fixes some late visits. With archive.sh it is also easier to disable some segments and scheduled tasks that are unnecessary (for me) without changing Piwik’s code. So I went back to it, running two different versions:

I’ve removed some lines of archive.sh to create a file called archive_daily.sh, which archives daily visits only:


# find a usable PHP CLI binary
for TEST_PHP_BIN in php5 php php-cli php-cgi; do
  if which $TEST_PHP_BIN >/dev/null 2>/dev/null; then
    PHP_BIN=`which $TEST_PHP_BIN`
    break
  fi
done
if test -z "$PHP_BIN"; then
  echo "php binary not found. Make sure php5 or php exists in PATH." >&2
  exit 1
fi

# resolve an absolute path with whichever tool is available
act_path() {
  local pathname="$1"
  readlink -f "$pathname" 2>/dev/null || \
  realpath "$pathname" 2>/dev/null || \
  type -P "$pathname" 2>/dev/null
}

ARCHIVE=`act_path ${0}`
PIWIK_CRON_FOLDER=`dirname ${ARCHIVE}`
PIWIK_PATH="$PIWIK_CRON_FOLDER"/../../index.php
PIWIK_CONFIG="$PIWIK_CRON_FOLDER"/../../config/config.ini.php

# extract the superuser login and md5 password from config.ini.php
PIWIK_SUPERUSER=`sed '/^\[superuser\]/,$!d;/^login[ \t]*=[ \t]*"*/!d;s///;s/"*[ \t]*$//;q' "$PIWIK_CONFIG"`
PIWIK_SUPERUSER_MD5_PASSWORD=`sed '/^\[superuser\]/,$!d;/^password[ \t]*=[ \t]*"*/!d;s///;s/"*[ \t]*$//;q' "$PIWIK_CONFIG"`

# ask the API for the superuser's token_auth
CMD_TOKEN_AUTH="$PHP_BIN -q $PIWIK_PATH -- module=API&method=UsersManager.getTokenAuth&userLogin=$PIWIK_SUPERUSER&md5Password=$PIWIK_SUPERUSER_MD5_PASSWORD&format=php&serialize=0"
TOKEN_AUTH=`$CMD_TOKEN_AUTH`

# list every site id registered in this Piwik
CMD_GET_ID_SITES="$PHP_BIN -q $PIWIK_PATH -- module=API&method=SitesManager.getAllSitesId&token_auth=$TOKEN_AUTH&format=csv&convertToUnicode=0"
ID_SITES=`$CMD_GET_ID_SITES`

echo "Starting Piwik reports archiving..."
echo ""
for idsite in $ID_SITES; do
  TEST_IS_NUMERIC=`echo $idsite | egrep '^[0-9]+$'`
  if test -n "$TEST_IS_NUMERIC"; then
#   for period in day week month year; do
    for period in day; do
      echo ""
      echo "Archiving period = $period for idsite = $idsite..."
      CMD="$PHP_BIN -q $PIWIK_PATH -- module=API&method=VisitsSummary.getVisits&idSite=$idsite&period=$period&date=last52&format=xml&token_auth=$TOKEN_AUTH"
      $CMD
    done
    echo ""
    echo "Archiving for idsite = $idsite done!"
  fi
done

Now I run archive_daily.sh every 2 hours, and during the night I run the regular archive.sh.
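
For reference, that schedule as an /etc/cron.d sketch (the paths and the user are assumptions, adjust them to your install):

    # daily-only archiving every 2 hours, full archive.sh once a night
    0 */2 * * * www-data /var/www/piwik/misc/cron/archive_daily.sh > /dev/null 2>&1
    30 3 * * * www-data /var/www/piwik/misc/cron/archive.sh > /dev/null 2>&1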

I also found something that can be a problem for huge websites: there is an “optimize table” in the archiving process, and it is dangerous for huge installations. It could be made explicit in the admin UI, and maybe made possible to disable.
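
If you want to see where your version runs it, a quick search from the Piwik root should find the spots (the exact files vary between versions):

    grep -rn "OPTIMIZE TABLE" core/ plugins/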

cheers,
-lorieri

I also found something that can be a problem for huge websites: there is an “optimize table” in the archiving process, and it is dangerous for huge installations. It could be made explicit in the admin UI, and maybe made possible to disable.

Is it the optimize on the piwik_log_* tables that is dangerous?
Please create a ticket and we will fix it so you don’t have to hack your Piwik.

35G of memory limit. WOOOOOOOOW!!! :)

First, I’m very happy to say we have had an entire month on Piwik and everything is going well.

I’m amazed myself at that. Could you please PM me the total visits/pages? :wink:

We are trying to get some public numbers to show, but it has been hard, since most are not accurate.

The biggest site measured is ranked #29 in Brazil, according to Top Sites in Brazil - Alexa,
and ranked #995 globally; sony.com is #1,102.

Hi Guys,

I came back from vacation and Piwik had 1.2 billion records in the piwik_log_link_visit_action table, with the total DB at around 440 GB.
So a friend of mine removed the Python Twisted portion of the code and called the Piwik PHP functions directly from the shell in a stripped-down tracker.php script. It made the import much more efficient than calling Apache to do it, and it reuses the MySQL connections.
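
The shape of the idea is a single long-lived CLI process instead of one Apache request per hit, so the MySQL connection is opened once and reused for every log line. A minimal sketch, where stripped_tracker.php is just an illustrative name and not the real script:

    # stream the downloaded S3 logs into one long-lived PHP process;
    # the stripped tracker keeps one MySQL connection open for all lines
    zcat /var/log/s3/*.gz | php stripped_tracker.php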

The first action was to reduce it to ~700 million records by deleting old logs.

Mostly because of disk space, I split the database across two machines and partitioned the piwik_log_visit table (by id) and piwik_log_link_visit_action (by date); now I have two Piwik installations (a rough partitioning sketch is below).
Then I stripped archive.sh to run only 1 day every 2 hours, and only 2 days, 3 weeks, 2 months and 2 years during the night (a sketch of that nightly loop is further down).
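
A rough sketch of the date partitioning, assuming the 1.x schema where server_time is the action timestamp; MySQL requires the partition column to be part of the primary key, hence the key change, and the partition boundaries here are only illustrative:

    mysql piwik -e "
      ALTER TABLE piwik_log_link_visit_action
        DROP PRIMARY KEY, ADD PRIMARY KEY (idlink_va, server_time);
      ALTER TABLE piwik_log_link_visit_action
        PARTITION BY RANGE (TO_DAYS(server_time)) (
          PARTITION p201201 VALUES LESS THAN (TO_DAYS('2012-02-01')),
          PARTITION p201202 VALUES LESS THAN (TO_DAYS('2012-03-01')),
          PARTITION pmax VALUES LESS THAN MAXVALUE
        );"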

It is still collecting data, and the data is still being processed in time, but the interface freezes if I choose a very high-traffic website for today’s statistics. Not a problem for yesterday’s.

One server is able to register 2,300 log lines per second, the other 1,500 (not counting the download from Amazon S3). Of course it does not have that many lines to register all the time :slight_smile: we see it when we let the logs pile up during maintenance windows.

It takes 15 minutes to process one day for the biggest website (not processing week, month and year).

My archive.sh looks like this for the hourly cronjob:


....
#all the same until this line
echo "Starting Piwik reports archiving..."
echo ""
for idsite in $ID_SITES; do
  TEST_IS_NUMERIC=`echo $idsite | egrep '^[0-9]+$'`
  if test -n "$TEST_IS_NUMERIC"; then
    period=day
    last=last1
    CMD="$PHP_BIN -q $PIWIK_PATH -- module=API&method=VisitsSummary.getVisits&idSite=$idsite&period=$period&date=$last&format=xml&token_auth=$TOKEN_AUTH"
    $CMD &
  fi
done
wait

note two things:

the trailing “&” after $CMD makes all the websites run in parallel,
and the command ‘wait’ makes the shell script wait until they all finish before continuing (for example, if you want to run the week period after that).
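
The nightly version keeps the same structure but loops over the periods with short windows; a sketch of the inner loop (it goes inside the same idsite loop):

    # nightly sketch: 2 days, 3 weeks, 2 months, 2 years per site
    for period in day week month year; do
      case $period in
        week) last=last3 ;;
        *)    last=last2 ;;
      esac
      CMD="$PHP_BIN -q $PIWIK_PATH -- module=API&method=VisitsSummary.getVisits&idSite=$idsite&period=$period&date=$last&format=xml&token_auth=$TOKEN_AUTH"
      $CMD
    done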

that is it for now :slight_smile:

btw, now we are celebrating: we became the 4th Brazilian portal in number of visits.

cheers,
-lorieri

I’m interested in how you improved tracking.
How did you reuse MySQL connections?
How did you strip down tracker.php to increase performance? Can we do it in Piwik core? :slight_smile:

@lorieri: why do you prefer the REST API instead of the Log Import solution?

What plugins did you disable to reach this amazing performance of 1,500 lines per second?

Because most users can only do 400 requests per second with our log import script. By the way, would you consider using our script rather than yours, or is yours better in some way?

What I have seen about this is that when you use the log importer, there is a lot of reading (the log file) and writing (MySQL). That on the same disk is killing. I have my log files on a slow disk and my MySQL on an SSD. That gives an improvement of about 150 lines/sec: from around 350 to 500 lines/sec. The slow disk is fast enough for reading the log file, and the SSD is just fast :smiley:

ok, someone has been playing archaeologist here, deleting my post