Java and Unicode

From NovaOrdis Knowledge Base
Latest revision as of 20:08, 26 June 2018

Overview

Character information is maintained in Java by the primitive type char, which was designed based on the original Unicode 1.0 specification. That specification allowed only 2^16 (65,536) code points, so char was defined as a fixed-width 16-bit (2-byte) entity. Since then, the Unicode standard has evolved to allow for characters whose representation requires more than 16 bits. Java 5, which supports Unicode 4.0, introduced enhancements to correctly handle these Unicode supplementary characters. For details on how characters are represented internally by Java, see the "Character Representation" section.

U+n Notation Support

Java supports the U+n notation in source code through the \u Unicode escape, which is followed by four hexadecimal digits:

String s = "\u0041";

System.out.println(s);

will display "A".

Note that a "\u" escape takes exactly four hexadecimal digits, so it can directly denote only Basic Multilingual Plane (BMP) characters, not supplementary characters. For example:

String s = "\u10400";

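does not represent "𐐀" (U+10400), but the two-character string "၀0": the escape \u1040 (MYANMAR DIGIT ZERO) followed by the literal digit 0. A supplementary character can still be written in a string literal as its surrogate pair, in this case "\uD801\uDC00".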
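The difference between the mis-parsed escape and a real supplementary character can be checked directly. The following is a minimal sketch, not part of the original code sample; the class name UnicodeEscapeDemo and its method names are made up for illustration. It builds U+10400 in two ways (a pair of escapes, one per surrogate code unit, and Character.toChars) and compares the results.

```java
class UnicodeEscapeDemo {

    // The escape below consumes only the four hex digits 1040;
    // the trailing 0 is an ordinary digit character.
    static String misparsed() {
        return "\u10400";
    }

    // U+10400 written as its UTF-16 surrogate pair, one escape per code unit.
    static String viaSurrogateEscapes() {
        return "\uD801\uDC00";
    }

    // U+10400 built from its numeric code point value.
    static String viaCodePoint() {
        return new String(Character.toChars(0x10400));
    }

    public static void main(String[] args) {
        String wrong = misparsed();
        String right = viaSurrogateEscapes();

        // Both strings hold two char values, but only one of them
        // holds a single supplementary code point.
        System.out.println(wrong.length() + " chars, "
                + wrong.codePointCount(0, wrong.length()) + " code points");
        System.out.println(right.length() + " chars, "
                + right.codePointCount(0, right.length()) + " code points");
        System.out.println(right.equals(viaCodePoint()));
    }
}
```

Both strings have length 2, which is why String.length() alone cannot be used to count characters once supplementary characters are involved; codePointCount reports 2 for the first string and 1 for the second.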

Character Representation

The Java platform uses the UTF-16 representation in character sequences: char[], implementations of the java.lang.CharSequence interface (such as String, StringBuffer and StringBuilder) and java.text.CharacterIterator. UTF-16 uses 16-bit code units.

Each Basic Multilingual Plane character is represented by a single char value, which corresponds to one code unit, as the char data type provides sufficient storage capacity for the entire BMP range.

The supplementary characters are represented as a pair of char values, or two code units. The first code unit belongs to the high-surrogates range (U+D800 to U+DBFF) and the second belongs to the low-surrogates range (U+DC00 to U+DFFF). The values U+D800 to U+DFFF are reserved for use in UTF-16, and no characters are assigned to them as code points. As a result, software can tell for each individual code unit in a string whether it represents a one-code-unit character, or whether it is the first or the second unit of a two-code-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
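The self-identifying property of code units described above can be sketched with the standard Character.isHighSurrogate and Character.isLowSurrogate methods. The class name CodeUnitClassifier and the label strings are illustrative assumptions, not part of the original code sample.

```java
class CodeUnitClassifier {

    // Labels a UTF-16 code unit by inspecting that unit alone,
    // without looking at its neighbours in the string.
    static String classify(char unit) {
        if (Character.isHighSurrogate(unit)) {
            return "high surrogate";   // U+D800 to U+DBFF
        }
        if (Character.isLowSurrogate(unit)) {
            return "low surrogate";    // U+DC00 to U+DFFF
        }
        return "BMP character";
    }

    public static void main(String[] args) {
        // "A" followed by the supplementary character U+10400.
        String s = "A" + new String(Character.toChars(0x10400));
        for (int i = 0; i < s.length(); i++) {
            char unit = s.charAt(i);
            System.out.println(Integer.toHexString(unit) + ": " + classify(unit));
        }
    }
}
```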

For example, "A" (U+0041) is represented in UTF-16 as 0x0041, "ß" (U+00DF) as 0x00DF, and "東" (U+6771) as 0x6771. All of these are BMP characters. "𐐀" (U+10400), which is a supplementary character, is represented in UTF-16 as the surrogate pair 0xD801 0xDC00.
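The surrogate pair for U+10400 follows from the UTF-16 encoding arithmetic: subtract 0x10000 from the code point, add the upper 10 bits of the result to 0xD800, and add the lower 10 bits to 0xDC00. The sketch below works this out by hand. The class name SurrogateMath and its methods are hypothetical, and the code assumes its argument is a valid supplementary code point (0x10000 to 0x10FFFF) without checking it.

```java
class SurrogateMath {

    // High (first) surrogate: 0xD800 plus the upper 10 bits of the offset.
    // Assumes codePoint is in the supplementary range; no validation is done.
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (second) surrogate: 0xDC00 plus the lower 10 bits of the offset.
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int codePoint = 0x10400;
        // Offset is 0x400: upper 10 bits are 1, lower 10 bits are 0.
        System.out.println(Integer.toHexString(highSurrogate(codePoint)) + " "
                + Integer.toHexString(lowSurrogate(codePoint)));
    }
}
```

In application code, Character.toChars performs the same conversion and also validates its input, so the manual version is useful mainly for understanding the encoding.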

Code Sample

https://github.com/NovaOrdis/playground/blob/master/java/utf16/src/main/java/io/novaordis/playground/java/utf16/Main.java

Relevant API

Conversion of a Supplementary Code Point to UTF-16 Surrogate Code Units

Character.toChars(int codePoint)

Example:

// U+10400 lies outside the BMP, so toChars returns two code units.
char[] surrogateCodeUnits = Character.toChars(0x10400);

// Prints "Surrogate Code Units: d801 dc00".
System.out.println("Surrogate Code Units: " +
       Integer.toHexString(surrogateCodeUnits[0]) + " " +
       Integer.toHexString(surrogateCodeUnits[1]));
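Character.toChars also accepts BMP code points, in which case the returned array has a single element, so indexing element 1 as in the example above is only safe for supplementary code points. A minimal sketch that handles both cases follows; the class name CodeUnitFormatter and the method toHexUnits are made up for illustration.

```java
class CodeUnitFormatter {

    // Returns the UTF-16 code units of a code point as space-separated hex.
    // Character.toChars yields one char for a BMP code point and two for a
    // supplementary one; it throws IllegalArgumentException if the value
    // is not a valid Unicode code point.
    static String toHexUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHexUnits(0x0041));   // one code unit
        System.out.println(toHexUnits(0x10400));  // two code units
    }
}
```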

Conversion of two UTF-16 Surrogate Code Units to a Supplementary Code Point

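The following method converts a surrogate code unit pair to a supplementary code point, returning the code point as an int. Note that the method does not verify that the input values are valid surrogate code units. Callers that cannot guarantee this can check the pair with Character.isSurrogatePair(char high, char low) first.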

Character.toCodePoint(char high, char low)

Example:

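// 0xD801 is a high surrogate and 0xDC00 is a low surrogate.
int codePoint = Character.toCodePoint((char)0xD801, (char)0xDC00);
// Prints "Supplementary character code point: U+10400".
System.out.println("Supplementary character code point: U+" + Integer.toHexString(codePoint));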
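Because Character.toCodePoint does not validate its arguments, a caller may want to guard the call. The following is one possible sketch, assuming that rejecting invalid input with an exception is acceptable; the class name SafeSurrogateDecoder and the method decode are hypothetical and not part of the Java API.

```java
class SafeSurrogateDecoder {

    // Converts a surrogate pair to a code point, rejecting anything that is
    // not a (high surrogate, low surrogate) pair in that order.
    static int decode(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException(String.format(
                    "not a valid surrogate pair: 0x%04x 0x%04x", (int) high, (int) low));
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        int codePoint = decode((char) 0xD801, (char) 0xDC00);
        System.out.println("U+" + Integer.toHexString(codePoint));
    }
}
```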