Java and Unicode

From NovaOrdis Knowledge Base

Overview

Character information is maintained in Java by the primitive type char, which was designed based on the original Unicode 1.0 specification. That specification allowed only 2^16 (65,536) code points, so char was defined as a fixed-width 16-bit (2-byte) entity. Since then, the Unicode standard has evolved to allow for characters whose representation requires more than 16 bits. Java 5, which supports Unicode 4.0, introduced enhancements to correctly handle Unicode supplementary characters. For details on how characters are represented internally by Java, see the "Character Representation" section.

U+n Notation Support

U+n notation is supported in Java as follows:

String s = "\u0041";

System.out.println(s);

will display "A".

Note that a "\u" escape takes exactly four hexadecimal digits, so it can directly express only Basic Multilingual Plane (BMP) characters, not supplementary characters. For example:

String s = "\u10400";

does not represent "𐐀" (U+10400), but "၀" (U+1040) followed by the digit "0".
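A supplementary character can still be placed in a String, either by writing its two surrogate code units as separate escapes or by building the String from the code point. The following is a minimal sketch; the class name UnicodeEscapeSketch and the helper fromCodePoint are hypothetical names chosen for illustration.

```java
class UnicodeEscapeSketch {

    // Hypothetical helper: builds a String from any code point, BMP or supplementary.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+10400 written as its surrogate pair, one four-digit escape per code unit.
        String viaEscapes = "\uD801\uDC00";

        // The same character built from the code point itself.
        String viaCodePoint = fromCodePoint(0x10400);

        System.out.println(viaEscapes.equals(viaCodePoint));                 // true
        System.out.println(Integer.toHexString(viaEscapes.codePointAt(0)));  // 10400

        // The five-digit form is parsed as U+1040 followed by the digit '0'.
        String fiveDigits = "\u10400";
        System.out.println(Integer.toHexString(fiveDigits.charAt(0)));       // 1040
        System.out.println(fiveDigits.charAt(1));                            // 0
    }
}
```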

Character Representation

The Java platform uses the UTF-16 representation in its character sequences: char[], java.lang.CharSequence (an interface that String implements), java.text.CharacterIterator, and the StringBuffer and StringBuilder classes. UTF-16 uses 16-bit code units.

The Basic Multilingual Plane characters are represented as char instances, which correspond to one code unit, as the char data type provides sufficient storage capacity for the entire BMP range.

The supplementary characters are represented as a pair of char values, or two code units. The first code unit belongs to the high-surrogates range (U+D800 to U+DBFF) and the second belongs to the low-surrogates range (U+DC00 to U+DFFF). The values U+D800 to U+DFFF are reserved for use in UTF-16 and no characters are assigned to them as code points. As a result, software can tell, for each individual code unit in a string, whether it represents a one-code-unit character or whether it is the first or the second unit of a two-code-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
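The self-identifying property of code units can be sketched as follows. The class name CodeUnitClassifier and the method classify are hypothetical; the sketch relies only on Character.isHighSurrogate and Character.isLowSurrogate, and inspects each code unit in isolation.

```java
class CodeUnitClassifier {

    // Hypothetical helper: describes the role of one UTF-16 code unit,
    // looking at that unit alone and not at its neighbors.
    static String classify(char unit) {
        if (Character.isHighSurrogate(unit)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(unit)) {
            return "low surrogate";
        }
        return "BMP character";
    }

    public static void main(String[] args) {
        // "A" followed by the supplementary character U+10400.
        String s = "A" + new String(Character.toChars(0x10400));

        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);
            System.out.println(String.format("0x%04X", (int) unit) + ": " + classify(unit));
        }

        System.out.println(s.length());                      // 3 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
    }
}
```

Note that length() counts code units, while codePointCount() counts characters in the Unicode sense.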

For example, the UTF-16 representation of "A" (U+0041) is 0x0041, that of "ß" (U+00DF) is 0x00DF, and that of "東" (U+6771) is 0x6771. All of these are BMP characters. The UTF-16 representation of "𐐀" (U+10400), which is a supplementary character, is 0xD801 0xDC00.
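These values can be checked with a short program. This is a sketch only; the class name Utf16Dump and the method codeUnits are hypothetical names introduced here.

```java
class Utf16Dump {

    // Hypothetical helper: returns the UTF-16 code units of s as space-separated hex values.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // The cast to int is needed because the %X conversion does not accept a char.
            sb.append(String.format("0x%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            System.out.println(String.format("U+%04X -> %s", cp, codeUnits(s)));
        }
    }
}
```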

Code Sample

https://github.com/NovaOrdis/playground/blob/master/java/utf16/src/main/java/io/novaordis/playground/java/utf16/Main.java

Relevant API

Conversion of a Supplementary Code Point to UTF-16 Surrogate Code Units

Character.toChars(int codePoint)

Example:

char[] surrogateCodeUnits = Character.toChars(0x10400);

System.out.println("Surrogate Code Units: " +
       Integer.toHexString((int)surrogateCodeUnits[0]) + " " +
       Integer.toHexString((int)surrogateCodeUnits[1]));
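
For reference, the arithmetic defined by UTF-16 for this conversion can be written out by hand. The sketch below is not the JDK implementation; SurrogateMath and toSurrogates are hypothetical names, and the result is compared against Character.toChars only as a sanity check.

```java
import java.util.Arrays;

class SurrogateMath {

    // Hypothetical helper: splits a supplementary code point into its surrogate pair
    // using the UTF-16 rule (subtract 0x10000, then split the remaining 20 bits in two).
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a supplementary code point: " + codePoint);
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] manual = toSurrogates(0x10400);
        System.out.println(Integer.toHexString(manual[0]) + " " + Integer.toHexString(manual[1]));
        System.out.println(Arrays.equals(manual, Character.toChars(0x10400)));
    }
}
```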

Conversion of two UTF-16 Surrogate Code Units to a Supplementary Code Point

The following method converts a surrogate code unit pair to a supplementary code point, returning the code point as an int. Note that the input values are not verified to be valid surrogate code units.

Character.toCodePoint(char high, char low)

Example:

int codePoint = Character.toCodePoint((char)0xD801, (char)0xDC00);
System.out.println("Supplementary character code point: U+" + Integer.toHexString(codePoint));
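
Because Character.toCodePoint() does not validate its arguments, a caller that cannot trust its input may want to check the pair first. The following is one possible sketch, using Character.isSurrogatePair(); the class name SafeCodePoint and the method toCodePointChecked are hypothetical.

```java
class SafeCodePoint {

    // Hypothetical helper: converts a surrogate pair to a code point,
    // rejecting input that is not a valid high/low surrogate pair.
    static int toCodePointChecked(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException(
                "Not a surrogate pair: 0x" + Integer.toHexString(high)
                    + " 0x" + Integer.toHexString(low));
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int codePoint = toCodePointChecked((char) 0xD801, (char) 0xDC00);
        System.out.println("U+" + Integer.toHexString(codePoint)); // U+10400

        try {
            // 'A' and 'B' are ordinary BMP characters, not surrogates.
            toCodePointChecked('A', 'B');
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```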