Tuesday, July 23, 2013

Java and Unicode

Programming to support languages that use anything other than the Latin character set has always been a major problem. There are a variety of 8-bit character sets defined for national languages, but if you want to combine the Latin and Cyrillic alphabets in the same text, for example, things get awkward because each 8-bit set assigns its own meanings to the same 256 codes. If you want to handle Japanese as well, an 8-bit character set becomes impossible: with 8 bits you have only 256 different codes, so there just aren't enough character codes to go round. Unicode is a standard character set that was developed to allow the characters of almost all written languages to be encoded. In Java a character is stored as a 16-bit code (so each char occupies 2 bytes), and 16 bits can distinguish up to 65,535 non-zero character codes. With so many codes available, each major national character set can be allocated its own range, including sets such as the Kanji used for Japanese, which alone requires thousands of character codes. It doesn't end there though. Unicode also defines supplementary characters beyond this basic 16-bit range, making room for roughly a million additional characters.
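To make this concrete, here is a minimal sketch (the string content is just an illustrative example) showing that a single Java String can mix Latin, Cyrillic, and Japanese characters, with each char holding a 16-bit Unicode code:

// Each char in a Java String is a 16-bit Unicode (UTF-16) code unit,
// so different scripts can live side by side in one string.
public class MixedScripts {
    public static void main(String[] args) {
        String mixed = "Az\u0416\u65E5";   // 'A', 'z', Cyrillic Zhe, CJK "sun/day"

        for (int i = 0; i < mixed.length(); i++) {
            char c = mixed.charAt(i);
            // Print the character and its 16-bit code in hexadecimal.
            System.out.printf("'%c' -> U+%04X%n", c, (int) c);
        }
    }
}

Running it prints each character alongside its numeric code, so you can see that the Latin, Cyrillic, and Japanese characters are all just different 16-bit values.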

I say each Unicode character usually occupies 2 bytes because Java supports Unicode 4.0, which defines characters beyond the 16-bit range; in a Java String such a character is stored as a pair of 16-bit char values called a surrogate pair. You might think that the 64K characters you can represent with 16 bits would be sufficient, but it isn't so. Far-Eastern languages such as Japanese, Korean, and Chinese alone involve more than 70,000 ideographs, so surrogate pairs are used to represent characters that are not contained within the Basic Multilingual Plane, the set of characters that fit in a single 16-bit code.
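Here is a small sketch of what a surrogate pair looks like in practice. The particular code point (U+20B9F, a CJK ideograph outside the Basic Multilingual Plane) is just one example of a supplementary character:

// A supplementary character occupies two chars (a surrogate pair) in a String.
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+20B9F lies outside the Basic Multilingual Plane.
        String s = new String(Character.toChars(0x20B9F));

        System.out.println("length()         = " + s.length());            // 2 chars
        System.out.println("codePointCount() = "
                + s.codePointCount(0, s.length()));                        // 1 character

        char high = s.charAt(0);
        char low  = s.charAt(1);
        System.out.printf("high surrogate = U+%04X, low surrogate = U+%04X%n",
                (int) high, (int) low);

        // Recombine the pair back into the original code point.
        int codePoint = Character.toCodePoint(high, low);
        System.out.printf("code point     = U+%X%n", codePoint);
    }
}

Note the difference between length(), which counts 16-bit chars, and codePointCount(), which counts actual characters; that distinction is exactly where surrogates show up.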
