About WordPress, emojis, MySQL and latin1, utf8 and utf8mb4 character sets

PSA: the MySQL utf8 character set is not real Unicode utf8. Instead use utf8mb4.

So you landed here because some parts of your website are garbled. And this happened after a server or website migration. You exported your database and imported this export or dump on the new server. And now your posts look like this:

Strange characters!

When they should look like this:

Emojis!

These are screenshots from this website. This website was an old WordPress installation that still used the latin1 character set. The WordPress installation was up to date, but automatic WordPress updates will never update or change the database character set. So this will always remain what it was on initial installation (which was latin1 in my case).

And latin1 can not store (real) Unicode characters e.g. emojis. You need a Unicode character set. So, just use utf8, right?

Problem 1: exporting and importing the database with the wrong character set

When exporting/dumping a database with mysqldump, this will use the default MySQL servers’ character set (set in my.cnf). In my case this was set to utf8. But by not explicitly telling the mysqldump to use the correct character set for a particular database (which was latin1) my dumped data was messed up.

So when I restored the dump (on the new server) some text was garbled and emojis had completely disappeared from blog posts.

I fixed this with help of these links. Key here is: explicitly set the correct character for a database when exporting this database. Then: change all instances in the dump file from the old character set to the new character set and import this file.

https://theblogpress.com/blog/seeing-weird-characters-on-blog-how-to-fix-wordpress-character-encoding-latin1-to-utf8/

Problem 2: some emojis work, some don’t

After this my text was fine. I exported using the old character set and imported using utf8, what could go wrong! But some emojis were still missing, but others were not?! This was a head-scratcher.

There is a question mark instead of an emoji

How can this be, I had set my database character set to utf8 (with utf8_general_ci collation). This is Unicode, right? Wrong!

MySQL utf8 does not support complete Unicode. MySQL utf8 uses only 3 bytes per character.

Full Unicode support needs 4 bytes per character. So your MySQL installation needs to use the utf8mb4 character set (and utf8mb4_unicode_ci collation) to have real and full Unicode support.

Some strange decisions were made in the 2002. 🤯 Which has given a lot of people headaches.

So, MySQL utf8 only uses three bytes per character, but the explanation for the image with the one missing emoji is, that *some* emojis (like the❤️) will be expressed correctly with only three bytes.

Problem 3: but I used emojis before, in my latin1 WordPress installation?!

Yes, things worked fine, right? And here is the thing: if you dumped your database with the correct (old) character set and imported correctly with this old character set, things would still work!

But you said latin1 does not support (4 byte character) emojis!?

Correct!

However WordPress is smart enough to figure this out: when using emojis in post titles or posts it will check the database character set and change the emoji to (hexadecimal) HTML code — which can be expressed and stored just fine in latin1.

But how do you explain this image? This was after incorrectly dumping and importing a database.

Wait, what?! The URL has the correct emoji but the post title does not!?!

The post URL has two emojis, but one emoji is missing from the title?! I already explained why the emoji is missing from the post title, so how can that emoji be present in the post URL? This is because WordPress stores the post titles differently from the URLs.

How the title and the post name (url) are stored

So the post title field stores real Unicode symbols and the post_name (URL) field stores them encoded. So there you have it 🤓!

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *