Back in 2005 I had no idea what character encoding was, why I saw � in text files, or what the difference between Unicode and UTF-8 was. It all looked so complicated that I felt like a blind man every time I had to deal with it. In this article you will learn the nuances of encodings, why UTF-8 has become the standard, and what unreadable characters are called in Japan.
What Is Encoding?
The term character encoding stands for a table that maps a set of characters to their representations. More specifically, in our case it is the binary representation of a single character. For example:
The letter A is represented by the binary value 01000001 in the UTF-8 character encoding.
In this case ‘A’ is the letter to encode and ‘01000001’ is its binary representation in one specific character encoding, ‘UTF-8’.
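The mapping above can be checked in a couple of lines. This is only a small sketch in Python (the variable names are made up for illustration):

```python
# Encode the letter 'A' with UTF-8 and show its bits.
letter = "A"
encoded = letter.encode("utf-8")    # a single byte
bits = format(encoded[0], "08b")    # zero-padded 8-bit binary string
print(bits)  # 01000001
```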
Another well-known (non-binary) encoding is Morse code. It represents characters as a series of on-off tones, usually written as dots and dashes. So our letter ‘A’ in Morse code looks like .-
Why Should You Worry About It?
Let’s get back to how machines use encodings. When you give text to an application to save it (to a file or a database), every letter must be encoded to its representation. Sometimes you can choose the encoding, sometimes the app has its own default.
A character can have different values in different encodings, so when the data is retrieved, the same encoding must be used to decode the values back to characters. Why is that important?
Let’s say that you typed the cent sign ¢ and saved the file with UTF-8 encoding. When you open the same file with ISO8859-1 chosen as the encoding, you will see Â¢. That’s because the bytes in the file are interpreted differently: UTF-8 stores ¢ as two bytes, and ISO8859-1 reads each byte as a separate character. An excess character can always be deleted, but the other side of the coin is worse.
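The cent sign mix-up is easy to reproduce. A minimal sketch in Python, assuming the file content is just the single character (variable names are illustrative):

```python
original = "¢"
stored = original.encode("utf-8")        # two bytes: 0xC2 0xA2
misread = stored.decode("iso-8859-1")    # each byte becomes its own character
print(misread)  # Â¢
```

Decoding `stored` with UTF-8 again would give back the original character, which is why the writer and the reader have to agree on the encoding.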
If you save user input with an encoding that doesn’t cover all of its characters, the result is lossy. For example, ISO8859-1 doesn’t have the euro sign, so saving a file containing € in that encoding will lose it. Mismatches like these are the reason why you sometimes see unknown � characters in text files. A database working with a mismatched encoding will suffer the same way.
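Both failure modes can be sketched in Python. The sample strings are invented for illustration, and the exact behaviour of other tools may differ: some refuse to save, some substitute a placeholder.

```python
text = "Price: 5 €"

# Strict encoding refuses to drop data silently.
try:
    text.encode("iso-8859-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace" the euro sign is swapped for '?' and is gone for good.
lossy = text.encode("iso-8859-1", errors="replace")

# Decoding bytes that are invalid in the chosen encoding yields U+FFFD (the � sign).
# Here "café" was saved as ISO8859-1 and is read back as UTF-8.
broken = b"caf\xe9".decode("utf-8", errors="replace")
print(lossy, broken)
```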
There are many encodings out there. Frankly, there are tons of them! So what options do you have when planning to set up your application?
Historically, one of the first encodings was ASCII, developed in 1963. It’s 52 years old, but 0.1% of websites still use it. It’s a 7-bit encoding that can hold 128 different values, sufficient to exchange information in modern English.
Another well-known encoding (that used to be a standard) is ISO8859. It’s divided into parts numbered from ISO8859-1 to ISO8859-16, and each of them contains a different subset of characters. The full specification is available on the wiki page.
As usual, Windows had its own approach to encodings. It introduced slightly modified ISO8859 character tables named CP-1250, CP-1251 and so on.
There were also many encodings designed to handle locale-specific character sets: Big5 and GB18030 for Chinese, KOI8-R for Russian, and JIS, Shift-JIS and EUC-JP for Japanese.
Let’s analyze the Japanese case. In 1969 JIS X 0201 was developed as the first widely used single-byte encoding. It covered only katakana (a Japanese syllabary) but not kanji (about 3,000 commonly used logographic characters).
The need for kanji caused the first split among the encodings, and Shift-JIS was introduced. It was backwards compatible with JIS X 0201 but had the unfortunate property of breaking parsers that were not aware of it. So EUC, backwards compatible with ASCII, was developed to solve that problem. But it still wasn’t compatible with JIS X 0201.
Further problems arose because the original internet e-mail standards were designed to handle 7-bit encodings only (so none of the Japanese encodings above), and 7-bit JIS was introduced.
All of that made data interchange within Japan difficult. Sending a document from one computer to another always carried a risk of incompatibility (they had four different encodings!). As a result, garbled and unreadable characters are so common in Japan that there is even a word for them.
文字化け /modʑibake/ (mojibake) is the name for the misconverted garbage characters shown when computer software fails to display text correctly.
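To get a feeling for how this happens, here is a rough sketch in Python. It assumes that the codec names `shift_jis`, `euc_jp` and `iso2022_jp` from Python's standard library are reasonable stand-ins for Shift-JIS, EUC and 7-bit JIS as described above.

```python
text = "文字化け"

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")
jis7 = text.encode("iso2022_jp")   # 7-bit variant, assumed to match "7-bit JIS"

# The same four characters produce three different byte sequences.
for name, data in [("Shift-JIS", sjis), ("EUC-JP", euc), ("7-bit JIS", jis7)]:
    print(name, data.hex(" "))

# Reading Shift-JIS bytes as EUC-JP gives garbage instead of the original text.
garbled = sjis.decode("euc_jp", errors="replace")
print(garbled)
```

Only a reader that picks the same codec as the writer gets the original word back.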
Nowadays, with the World Wide Web, documents are sent to and received from everywhere around the world. Is there any solution to the Japanese problems?
The Unicode Consortium is the remedy for all these encoding problems. It maintains a character specification table that at the moment covers more than 120,000 characters! It includes Japanese, Chinese, Arabic, Russian and other characters used all over the world. The goal is to cover all languages that exist or have existed.
So Unicode is a standard that assigns a code point to each character, but its implementation as a computer encoding is something slightly different. There are implementations such as UTF-8, UTF-16 and UTF-32 that cover all Unicode characters. What are the differences?
The first difference is the number of bytes used to encode a single Unicode character. UTF-32 always uses four bytes per character, UTF-16 uses two or four bytes, and UTF-8 uses from one to four bytes.
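The byte counts can be compared directly. A small sketch in Python, where the helper name `encoded_lengths` and the sample characters are chosen only for illustration:

```python
samples = ["A", "¢", "€", "😀"]   # U+0041, U+00A2, U+20AC, U+1F600

def encoded_lengths(ch):
    """Return the byte count of one character in UTF-8, UTF-16 and UTF-32."""
    # The -le variants are used so that no byte order mark is added to the count.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```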
The second is that UTF-16 and UTF-32 are not byte-oriented encodings, so a byte order must be specified when transmitting them over a byte-oriented medium.
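Why the byte order matters can be shown with a single character. A minimal sketch in Python using the standard `codecs` module (the variable names are illustrative):

```python
import codecs

ch = "€"                              # U+20AC
big = ch.encode("utf-16-be")          # 0x20 0xAC, most significant byte first
little = ch.encode("utf-16-le")       # 0xAC 0x20, least significant byte first

# Without a declared order a reader can't tell U+20AC from U+AC20.
swapped = big.decode("utf-16-le")     # a Hangul syllable instead of the euro sign

# A byte order mark (BOM) at the start of the stream settles the question.
with_bom = codecs.BOM_UTF16_BE + big
decoded = with_bom.decode("utf-16")   # the BOM is consumed and '€' comes back
print(swapped, decoded)
```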
And lastly, UTF-8 is backwards compatible with ASCII. That makes migrating to the newer standard much easier.
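The compatibility claim is simple to verify. A short Python sketch, with an arbitrary sample string:

```python
text = "Plain ASCII text"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")

# For code points below 128 both encodings produce identical bytes,
# so an existing ASCII file is already a valid UTF-8 file.
same_bytes = as_ascii == as_utf8
round_trip = as_ascii.decode("utf-8")
print(same_bytes, round_trip)
```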
UTF-8 Is The Best!
UTF-8 is variable-length, and the number of bytes used depends on the value to encode. It’s implemented in a pretty clever way. If the value is lower than 128, it’s encoded the same as in ASCII, as a single byte. When more bytes are needed, the first byte acts as a header and tells how many bytes encode that specific character.
So a two-byte sequence starts with 110, a three-byte sequence with 1110, and so forth. The following bytes always have the same continuation header, 10.
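| Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 7 | U+0000 | U+007F | 1 | 0xxxxxxx | | | |
| 11 | U+0080 | U+07FF | 2 | 110xxxxx | 10xxxxxx | | |
| 16 | U+0800 | U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+10000 | U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |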
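The header scheme described above can be written out by hand. The function below is a hypothetical helper for illustration only: it covers sequences of up to four bytes and skips the validation a real encoder performs (for example rejecting surrogates and out-of-range values). In practice you would simply call `str.encode("utf-8")`.

```python
def encode_utf8(code_point):
    """Encode one Unicode code point to UTF-8 by hand (illustrative only)."""
    if code_point < 0x80:                      # up to 7 bits: plain ASCII byte
        return bytes([code_point])
    if code_point < 0x800:                     # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:                   # up to 16 bits: 1110xxxx + 2 continuation bytes
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # up to 21 bits: 11110xxx + 3 continuation bytes
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])

for ch in "A¢€😀":
    print(ch, [format(b, "08b") for b in encode_utf8(ord(ch))])
```

Printing the bits makes the 110, 1110 and 11110 headers and the 10 continuation prefix visible in each sequence.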
What’s great is that it doesn’t need a byte order to be specified, and text can be read from any point in the file and the parser will still decode it. Each character is self-describing!
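Reading from an arbitrary position could look roughly like this. The helper `decode_from` is a made-up name, and the sketch assumes the data is valid UTF-8 that is not cut off at the end:

```python
def decode_from(data, offset):
    """Decode UTF-8 bytes starting at an arbitrary offset.

    Continuation bytes all look like 10xxxxxx, so we skip them until we
    reach the first byte of the next character.
    """
    while offset < len(data) and (data[offset] & 0b11000000) == 0b10000000:
        offset += 1
    return data[offset:].decode("utf-8")

data = "¢€😀".encode("utf-8")     # c2 a2 | e2 82 ac | f0 9f 98 80
print(decode_from(data, 3))       # offset 3 is in the middle of '€'
```

Starting inside the euro sign, the reader drops the two leftover continuation bytes and resumes cleanly at the next character.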
Finally we’re getting to the conclusion. What are the advantages of UTF-8 in your application?
- UTF-8 doesn’t limit you in any way: you can encode all Unicode characters,
- It saves space by using fewer bytes than UTF-32, and fewer than UTF-16 for text that is mostly ASCII,
- Even though Microsoft introduced its own Windows code pages, it recommends using Unicode encodings! (In its case, the default is UTF-16),
- It is backwards compatible with ASCII,
- It is natively used by the XML standard,
- UTF-8 is the most used encoding on web pages around the world (almost 85% at the moment).
And just to convince you, please consider this chart showing the usage of different character encodings on the web.
As you can see, UTF-8 is leading and everything is going that way. If you’re not using it, consider migrating now, while it’s still not too costly.
I hope that character encodings are now much clearer to you. Have you seen mojibake recently? Are you protected against it?
If you like this article, please subscribe or share it with your friends!