Special character are not correctly saved as UTF8 in Piwik Table “log_action”

Micha · February 26, 2013, 12:17pm

Hello,

all the page titles with special characters (like umlauts äöü) are not correctly saved in piwik table “log_action”. (The tracked page is utf-8 and the piwik db is utf-8.)

For example:


piwikTracker.setDocumentTitle("Hello World! ä ö ü Ä Ö Ü");

=> Result in db: “Hallo World! Ã¤ Ã¶ Ã¼ Ã„ Ã– Ãœ”

At the piwik frontend, everything is fine: The page is correctly displayed as “Hello World! ä ö ü Ä Ö Ü”.

Is this default behavior or an error at my piwik installation?
How can I fix it?

Thanks,
Micha

Micha · February 27, 2013, 4:44pm

Hello,
could someone pleeease check if it is an error or default behavior.

Thanks,
Micha

canajun2eh · February 27, 2013, 11:22pm

Your problem isn’t with Piwik or the database. The tool you’re using to inspect the database contents is displaying the data in ASCII instead of UTF-8.

Here’s the Wikipedia article that should explain what you’re seeing:

Micha · February 28, 2013, 11:42am

Thank you for your help. I’am using an UTF-8 console and so this isn’t the reason.

I do the following test which shows that the special characters not correctly saved in piwik table “log_action”.
1. I add the following to an utf-8 html page:


piwikTracker.setDocumentTitle("Ä");

2. Than I checked the data in table “log_action”:


select name, hex(name) from piwik_log_action where idaction=<id from the new entry>;

Result: name: Ã¤, hex(name): C383C2A4

=> It’s wrong - the umlaut “Ä” is not saved correctly.

(UTF-8 hex(‘Ä’) is “C384”. If I add the ‘Ä’ with an table-insert, it’s saved as UTF-8: insert into piwik_log_action set name=‘Ä’; )

Could you please check this at your piwik db.
Thanks, Micha

matthieu · March 2, 2013, 12:12am

I just tested and it works 100% fine.

Micha · March 4, 2013, 3:59pm

Hallo Matt,
thanks a lot for testing! (tu) Umlaut “Ä” is saved as hex: C384 in your piwik environment?

What could be the reason for the behaviour as described? (It’s the same error in our development and production environment.)

The charset configs:

Piwik Table log_action is: utf8_general_ci
HTML-Page is: UTF-8 -

Thanks, Micha

matthieu · March 4, 2013, 9:33pm

I didn’t look how it’s recorded but it displays correctly in the API and UI, so it’s fine for me

lesjokolat · March 4, 2013, 9:34pm

Could it be your OSs language setting causing the problem in the dev and and prod environment?

Also see this article below there are a few points(if applicable) from the web server side you should check to how they are set as if they are different may be a source as to why your console is “mysteriously” changing formats somehow. Its not exactly your issue but may give you some ideas as to where to check some things

httpd.conf:
AddCharset UTF-8 .utf8
AddDefaultCharset UTF-8

php.ini
default_charset = “utf-8”

my.cnf
character-set-server=utf8
default-collation=utf8_unicode_ci

http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html

good luck.

Micha · March 6, 2013, 2:43pm

Hello,
thanks a lot for your help.

@matt: Okay, you checked the UI. Here’s everything fine. ‘Ä’ is displayed as ‘Ä’. The error is only in the DB. Could you please check the piwik table “log_action”.


select name, hex(name) from piwik_log_action where idaction=<id from the new entry>;

@lesjokolat: I checked the configs. UTF-8 everywhere. Thank you. The CMS-System use the same DB, same Webserver. Here’s everything fine - the charset error is only in the piwik tables. Is umlaut “Ä” saved as hex “C384” in your piwik table “log_action”?

Micha

lesjokolat · March 6, 2013, 9:27pm

My page titles show up fine in my page titles report but I wonder what is your CMS? When you say your CMS shares the same database does that mean it shares the same Mysql database that piwik does?

Micha · March 11, 2013, 4:47pm

Hello,
you can see the charset error only in the DB. (Yes in page reports / UI everything is fine. ‘Ä’ is displayed as ‘Ä’. )

Could someone pleeease :)-D track a page with only an umlaut ‘Ä’ in the page title and then check the DB. You can check it with phpMyAdmin or SQL-Console:


SELECT name, hex(name) FROM log_action WHERE hex(name) = 'C383C2A4' OR hex(name) = 'C384' ORDER BY idaction DESC LIMIT 0,10;

What’s your result?
C384 = fine! - utf8-'Ä’
C383C2A4 = wrong!

Thanks, Micha

@lesjokolat : Yes, my cms and piwik are using the same mysql-db. This is the default of typo3-extension “piwikintegration”. (It’s not the cause of the charset error.)

lesjokolat · March 11, 2013, 5:25pm

Is it possible the typo3 settings is the problem? I am guessing here and not sure if its entirely applicable to your setup but the issue of a sort of double encoding sounded similar to the issue you are experiencing.

Particluarly the section on

TYPO3 settings
localconf.php

http://wiki.typo3.org/UTF-8_support#Don.27t_use_SET_CHARACTER_SET_utf8.3B

canajun2eh · March 11, 2013, 7:07pm

Let’s get back to basics here.

The funny characters that seem to represent the letters with Umlauts in UTF-8 text are entirely normal. You get this when the tool used to inspect the UTF-8 encoded string assumes that it’s looking at ASCII instead of UTF-8.

To settle the matter once and for all, you need to inspect the text using a Hex editor that does not attempt to produce a graphic representation of the text. You will, in all likelihood, see that the letters with Umlaut actually occupy two octets, where the letters without Umlaut occupy only one octet. When the paired octets are displayed as a pair of ASCII letters instead of being combined to form a single character from the UTF-8 character set, you’ll see those funny characters.

The description of UTF-8 encoding that I pointed you at right near the beginning of this discussion is quite clear that UTF-8 encoding is variable length for each character, with the encoding method self-defining the length of each character. The method is also self-synchronizing, so that if you pick up the UTF-8 string in the middle of a variable-length character, you can easily find the correct end of that character.

In short: don’t assume the tool you’re using is telling the truth about what it sees. You need to look at the actual hex representation of the string.

Micha · March 18, 2013, 2:02pm

Hello,
thanks a lot for your postings.
Unfortunately, these aren’t the reasons.

The facts:

I use the mysql command hex() – see above. …and so I get the same result as with an hex editor.
In all other utf8-tables (not piwik-tables) at this DB the umlauts were saved correctly. (Ä as hex C384 – see http://www.fileformat.info/info/charset/UTF-8/list.htm )
The charset error is only in the piwik – tables. (Ä saved as hex C383C2A4)

Please: Could someone do the easy test, which I described above (my post March 11, 2013).
@matt: Could you open a ticket for this issue please?

Micha

halfdan · March 18, 2013, 5:16pm

@Micha, feel free to open a ticket at dev.piwik.org.

I tracked a new request for Piwik in the Chrome console:


_paq.push(["setDocumentTitle", "Ä"])
_paq.push(["trackPageView"])

The result in the DB is:


mysql> SELECT name, hex(name) FROM piwik_log_action WHERE hex(name) = 'C383C2A4' OR hex(name) = 'C384' ORDER BY idaction DESC LIMIT 0,10;
+------+-----------+
| name | hex(name) |
+------+-----------+
| Ä    | C384      |
+------+-----------+
1 row in set (0.00 sec)

You can even see the correct character in the tracking request:


http://localhost:8080/piwik.php?action_name=%C3%84&idsite=3&rec=1&r=808189&h=18&m=18&s=29&url=http%3A%2F%2Flocalhost%3A8080%2Findex.html&_id=1d5a94014da8618c&_idts=1363627110&_idvc=1&_idn=1&_refts=0&_viewts=1363627110&pdf=1&qt=1&realp=0&wma=1&dir=0&fla=1&java=1&gears=0&ag=0&cookie=1&res=1920x1200

See the action_name=%C3%84?

I think that your content management system already broke the “Ä” or the tools you use are not producing correct UTF-8 characters.

chiappa · March 20, 2013, 4:01pm

I have a similar problem. When trying to import a mysql database with Piwik tables on it (using Sequl Pro for mac) I get a “File read error: An error occurred when reading the file, as it could not be read in the encoding you selected (Unicode (UTF-8)).”

The dump I’m trying to move around contains an install of Joomla & Piwik, both which are set as “utf-8” in the database. The Joomla tables move around fine. Some of the Piwik content that does manage to get imported is showing those ascii characters: Ã¤ Ã¶ Ã¼ Ã„ Ã– Ãœ

I tired exporting the dump using mysqldump command line and phpmyadmin. The phpmyadmin dump can execute 33querys of unreadable ascii while the mysqldump none.

Micha · March 26, 2013, 11:56am

Hello Fabian,

thanks a lot for testing. (tu)
I analyzed the piwik tracking request (with a network trace).
Here everything is fine: action_name=%C3%84 - as described by you.
…but the result in the DB (piwik_log_action) is still wrong: hex C383C2A4.

So I can say:

With the content management system everything is fine. Correct UTF-8 send to the client.
The tracking request is okay. Correct UTF-8 send to piwik.
=> The error occurred after that.

Do you have any idea how to find the reason? (piwik_log_action = utf8_general_ci)

Thank you!
Micha

Micha · April 9, 2013, 12:59pm

Hello,

has someone an idea, why the umlauts (like ‘ä’) not saved correctly as UTF8 in piwik table “log_action” at my environments?

table piwik_log_action = utf8_general_ci
correct character in the tracking request send to piwik (‘ä’ as ‘action_name=%C3%84’ - like described in the post of Fabian Becker - March 18, 2013)

Hope someone could help.

Micha