Mastering UTF-8

Prerequisites

If you have ever had a problem displaying UTF-8 characters properly, for example on a web page, then you are the target audience for this short tutorial. Understanding and applying the practical tips below does not require any special knowledge.

Should you expect more theory - and understand German - you may want to read the gory details in my German tutorial about character encoding & UTF‑8.

Why deal with UTF-8?

Why Unicode

The Unicode character set includes more than 100,000 characters (and the underlying scheme allows it to expand further). Using a character encoding that is based on Unicode allows you to write and quote in any language you can imagine, and gives access to many useful non-alphanumeric symbols. Even when writing in English the extended character set makes sense: currency symbols look quite professional (£, ¥, €); it is good style to write people's names (Lech Wałęsa, Søren Kierkegaard), cities (Haßfurt) or brands (Citroën) correctly; there is elegance in «quote» symbols; and you may want to show useful symbols without messing around with images. The next line is just text!
© ¾ ☎ ☮ ☯ ♬
See more examples in the list of the author's most used Unicode symbols.

Why UTF-8

UTF-8 is the most elegant encoding for the Unicode character set, because it combines significant advantages over its predecessors and alternatives (UCS-2, UTF-16, UTF-32): it stays byte-compatible with plain ASCII, it has no byte-order ambiguity, and it encodes texts in Latin-based scripts compactly.
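If you like to see this for yourself, a minimal Python sketch can make the size difference tangible; the characters chosen here are arbitrary examples:

# Compare the storage cost of a few sample characters in UTF-8 and UTF-32.
for ch in ["A", "ß", "€", "☮"]:
    utf8 = ch.encode("utf-8")
    utf32 = ch.encode("utf-32-be")
    print(f"{ch}: UTF-8 = {len(utf8)} byte(s) [{utf8.hex()}], "
          f"UTF-32 = {len(utf32)} bytes [{utf32.hex()}]")

# Plain ASCII characters keep their single-byte values in UTF-8,
# so an ASCII-only file is already valid UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))  # True

The ASCII letter takes one byte, "ß" two, "€" and "☮" three each, while UTF-32 always spends four bytes per character.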

What are the problems with UTF-8?

Reader Software/Device Limitations

Many special characters require specific fonts to be available to the reader (browser or other display). With UTF-8 becoming the state-of-the-art character encoding these limitations simply go away for most of what you may need. For quite a while now I have not run into any problems with writing or reading web pages, even when they contained graphical symbols and characters of exotic languages like Arabic, Japanese, or Thai. However, there are still limits: at the time of writing my contemporary browsers do not display “New Tai Lue” or “Old Turkic”. It could be worse: if my target audience were using VT100-compatible devices they would not even see the letter “ß” in my last name. So we must not expect all readers' devices or applications to be able to display every UTF-8 character.

Rule #1:
Know your audience's display tools and use only characters that they all support!

Wrong Encoding

To use UTF-8, the page needs to be encoded in UTF-8. This sounds silly, but often your text editor is a monster and does not really show you what it is doing with your valuable input. The solution: Master your tools! Every good editor allows you to set a specific encoding, and UTF-8 (without BOM) might be the way to go. (If you wonder about the BOM thingy: Just read on; we will get to that as well.)

You may want to test whether your file is indeed encoded in UTF-8. For this, get a good hex editor, for example XVI32. Your normal text editor might also offer a hex editing mode, but don't trust it! Now, in your standard editor, copy/paste the following characters to the end of your file:
Ä Ç ß
Open the file with the above-mentioned hex editor and scroll to the end. You should see the following sequence:
C3 84 20 C3 87 20 C3 9F
If your file does not end like that now, it simply is not UTF-8.
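If you have a Python interpreter at hand, you can also inspect the tail of the file without a hex editor. A minimal sketch; the file name page.html is just a placeholder:

# Compare the last bytes of the file with the expected UTF-8 sequence
# for "Ä Ç ß". Adjust the placeholder file name to your own file.
expected = bytes.fromhex("c3 84 20 c3 87 20 c3 9f")

with open("page.html", "rb") as f:
    data = f.read()

# Ignore a trailing newline that the editor may have added.
if data.rstrip(b"\r\n").endswith(expected):
    print("The test characters are stored as UTF-8.")
else:
    print("The file does not end with the expected UTF-8 bytes.")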

Rule #2:
Make sure your file is indeed a UTF-8 encoded file!

Unspecified Character Encoding

Computer systems are not very good at guessing the character encoding. If it is not specified, most devices or applications fall back to a default encoding, which is most often a classic one with a fixed length of 8 bits (1 byte) per character, for example ISO-8859-1 in Central Europe. Authors of applications and websites should make sure that the character encoding is communicated to the end user's device or application.

For web designers this means that each page's head section should show a meta tag like:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
This is good practice and highly recommended even if the web server already sends an adequate Content-Type header. That may sound redundant, but the file itself is the place where the encoding is best known. You don't want to delegate the responsibility for the character encoding to your server administrator or hosting provider.
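To illustrate the server-side half, here is a minimal Python sketch of an HTTP handler that declares the charset in its Content-Type header; it is only an illustration, and the file name page.html as well as the port are placeholders:

# Serve one UTF-8 encoded HTML file and announce the encoding
# in the Content-Type header. File name and port are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        with open("page.html", "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), Utf8Handler).serve_forever()

Even with such a header in place, keeping the meta tag inside the file means the encoding information travels with the file itself.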

Some applications (but not web browsers) extract the encoding information from a Byte Order Mark (BOM). A BOM is composed of two to four bytes at the beginning of each file. Good editing software allows the user to specify whether to add a BOM or not.

Depending on what is done with the data the BOM can be the source of the problem (browser display) or it can be the solution (e.g. for some data importers). More about this in the next section.

Rule #3:
In the head section of web pages always specify the content type including the encoding!

Byte Order Mark (BOM)

Problems caused by a missing BOM:
As briefly explained above, some software expects your UTF-8 file to start with a corresponding BOM. Such systems can run into problems if your UTF-8 file does not have a UTF-8 BOM. When creating the data you need to configure your editor to add a BOM at the beginning of each file.
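If your editor cannot do that, Python's built-in "utf-8-sig" codec can write the BOM for you. A minimal sketch; the file name and content are placeholders:

# Write a file as UTF-8 with a leading BOM (EF BB BF) using the
# "utf-8-sig" codec. File name and content are placeholders.
with open("export.csv", "w", encoding="utf-8-sig") as f:
    f.write("Name;City\nLech Wałęsa;Gdańsk\n")

# Verify: the first three bytes should be the UTF-8 BOM.
with open("export.csv", "rb") as f:
    print(f.read(3).hex())  # ef bb bf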

Problems caused by a present BOM:
Other software is not capable of handling BOMs and may run into problems if it finds one. The most prominent example: web pages that show the following characters in some browsers, usually in the upper left corner:
ï»¿

Whether your UTF-8 file must or must not have a BOM: you may want to check if it has one, i.e. validate whether your generator or editor has added a BOM or not. Do not use sophisticated software for such a validation, because it will hide the BOM and might show wrong information - even in hex mode! To read individual bytes I recommend using a very basic hex editor. On Linux you can use vi in hex mode (on: :%!xxd, off: :%!xxd -r) or use Tweak. On Windows my favourite is XVI32.

When opening the file with your hex editor you can see if it contains a BOM at the very beginning. The BOM for UTF-8 is:
0xEF 0xBB 0xBF

In case you see some other "strange characters" in front of your actual data, this might be a different BOM. This can be either because your file is actually not encoded in UTF-8 or because it simply has a wrong BOM. The most popular other BOMs are:
FE FF (UTF-16, big endian)
FF FE (UTF-16, little endian)
00 00 FE FF (UTF-32, big endian)
FF FE 00 00 (UTF-32, little endian)
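If you would rather let a script do the looking, here is a small Python sketch that reports which of these marks (if any) a file starts with; page.html is again a placeholder name:

# Report which BOM, if any, a file starts with. The UTF-32 marks are
# checked before the UTF-16 marks because FF FE is also the prefix of
# the UTF-32 little endian BOM.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\x00\x00\xfe\xff", "UTF-32 big endian"),
    (b"\xff\xfe\x00\x00", "UTF-32 little endian"),
    (b"\xfe\xff", "UTF-16 big endian"),
    (b"\xff\xfe", "UTF-16 little endian"),
]

with open("page.html", "rb") as f:
    start = f.read(4)

for mark, name in BOMS:
    if start.startswith(mark):
        print(f"Found a {name} BOM: {mark.hex()}")
        break
else:
    print("No BOM found.")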

If your file has a BOM but must not have one, the most direct way to remove it is in the hex editor. After removing it just save and close the file and re-open it with your normal editor. Make sure that your editor now continues to encode your input as UTF-8.
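Alternatively, Python's "utf-8-sig" codec can do the stripping: it silently drops a leading UTF-8 BOM when reading. A minimal sketch, assuming the file is otherwise valid UTF-8 and page.html is a placeholder name:

# Read the file with "utf-8-sig" (which removes a leading UTF-8 BOM
# if present) and write it back as plain UTF-8 without a BOM.
# newline="" keeps the existing line endings untouched.
path = "page.html"

with open(path, "r", encoding="utf-8-sig", newline="") as f:
    text = f.read()

with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(text)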

Rule #4:
Make sure the correct BOM is available when it is required!
Rule #5:
When dealing with web pages make sure they do not contain a BOM!

© Hermann Faß, 2011