Java and Unicode

External

Internal

Overview

Character information is maintained in Java by the primitive type char, which was designed based on the original Unicode 1.0 specification that allowed only 2^16 (65,536) code points, so it was defined as a fixed-width 16-bit/2-byte entity. Since then, the Unicode standard has evolved to allow characters whose representation requires more than 16 bits. Java 5, which supports Unicode 4.0, introduced enhancements to correctly handle Unicode supplementary characters. For details on how characters are represented internally by Java, see the "Character Representation" section.

U+n Notation Support

U+n notation is supported in Java as follows:
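As a minimal sketch, assuming standard Java syntax: a BMP code point U+n can be written directly with the \uXXXX Unicode escape, and a supplementary code point can be written as its surrogate pair or handled as an int code point.

<syntaxhighlight lang='java'>
// U+n notation mapped onto Java's \u Unicode escapes (illustrative sketch)
char a = '\u0041';                // U+0041, "A", one BMP code unit
String deseret = "\uD801\uDC00";  // U+10400, "𐐀", written as its surrogate pair
int codePoint = 0x10400;          // a supplementary code point held as an int
String same = new String(Character.toChars(codePoint)); // also "𐐀"
</syntaxhighlight>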

Character Representation

The Java platform uses the UTF-16 representation for character sequences: char[], java.lang.CharSequence (an interface implemented by String), java.text.CharacterIterator, and the StringBuffer and StringBuilder classes.

UTF-16 uses 16-bit code units.

Basic Multilingual Plane (BMP) characters are represented as single char values, each corresponding to one code unit, since the char data type provides sufficient storage capacity for the entire BMP range.
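A minimal sketch of this in code, using the BMP character "東" (U+6771):

<syntaxhighlight lang='java'>
// A BMP character fits in a single char / single UTF-16 code unit
String s = "\u6771";                               // "東", U+6771
System.out.println(s.length());                    // 1 - one code unit
System.out.println((int) s.charAt(0) == 0x6771);   // true - the char carries the code point value
</syntaxhighlight>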

Supplementary characters are represented as a pair of char values, i.e. two code units: the first code unit belongs to the high-surrogates range (U+D800 to U+DBFF) and the second to the low-surrogates range (U+DC00 to U+DFFF). The values U+D800 to U+DFFF are reserved for use by UTF-16 and no characters are assigned to them as code points, which means software can tell, for each individual code unit in a string, whether it represents a one-code-unit character or whether it is the first or second unit of a two-code-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean either the letter "A" or the second byte of a two-byte character.
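A minimal sketch of such a per-code-unit check, using the surrogate tests on java.lang.Character:

<syntaxhighlight lang='java'>
// Classifying individual UTF-16 code units
char high = '\uD801';   // in the high-surrogates range U+D800 to U+DBFF
char low  = '\uDC00';   // in the low-surrogates range U+DC00 to U+DFFF
System.out.println(Character.isHighSurrogate(high));  // true
System.out.println(Character.isLowSurrogate(low));    // true
System.out.println(Character.isSurrogate('A'));       // false - a one-code-unit character
</syntaxhighlight>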

For example, "A" representation (U+0041) in UTF-16 is 0x0041, "ß" representation (U+00DF) in UTF-16 is 0x00DF, "東" representation (U+6771) in UTF-16 is 0x6771. All these are BMP characters. "𐐀" (U+10400), which is a supplementary character, representation in UTF-16 is 0xD801 0xDC00.

Relevant API

Conversion of two UTF-16 Surrogate Code Units to a Supplementary Code Point

The following method converts a surrogate code unit pair to a supplementary code point:

<syntaxhighlight lang='java'>
Character.toCodePoint(char high, char low)
</syntaxhighlight>
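A usage sketch, assuming the surrogate pair for "𐐀" (U+10400) from the example above:

<syntaxhighlight lang='java'>
int codePoint = Character.toCodePoint('\uD801', '\uDC00'); // combine the surrogate pair
System.out.println(Integer.toHexString(codePoint));        // 10400, i.e. U+10400
// Character.toChars(0x10400) performs the reverse conversion, yielding {'\uD801', '\uDC00'}
</syntaxhighlight>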