Character Encoding: Difference between revisions
* https://en.wikipedia.org/wiki/Character_encoding | * https://en.wikipedia.org/wiki/Character_encoding | ||
* https://unicode.org/glossary/ | |||
* http://www.cl.cam.ac.uk/~mgk25/unicode.html | |||
=Internal= | =Internal= | ||
* [[Java_Language#char|Java char]] | |||
* [[Go_Strings#Overview|Go Strings]] | |||
=Overview= | =Overview= | ||
Character encoding is the process through which characters within a text document are represented by numeric codes and ultimately translated into a sequence of bits stored on persistent storage or sent over the wire. Depending on the character encoding convention used, the same text will end up with different binary representations. Common character encoding standards are [[#US-ASCII|US-ASCII]], [[#Extended_ASCII|Extended ASCII]], [[#Unicode|Unicode]] and [[#UCS|UCS]].
=Character=
A character is the smallest component of a written language that has semantic value. In the context of an encoding convention, the term "character" refers to the abstract meaning, rather than a specific shape ([[#Glyph|glyph]]), though in code tables some form of visual representation is also essential for the reader’s understanding. It is also referred to as '''abstract character'''. In [[#Unicode|Unicode]], the character is the basic unit of encoding. Depending on the encoding scheme used, a character may be represented by one or more [[#Code_Unit|code units]], which are the fixed-size bit sequences the encoding works with.
=Character Repertoire= | |||
A character repertoire is the full set of abstract characters a system supports. The repertoire may be closed, where no additions are allowed without creating a new standard, as it is the case with [[#US-ASCII|ASCII]] and most of the [[#Latin-1_.28ISO-8859-1.29|ISO-8859 series]], or it may be open, allowing additions, as it is the case with [[#Unicode|Unicode]]. | |||
=<span id='Code_Set'></span><span id='Character_Set'></span><span id='Coded_Character_Set'></span>Coded Character Set (CCS)=
A collection of [[#Character|characters]] used to represent textual information, in which each [[#Character|character]] is assigned a numeric [[#Code_Point|code point]]. The concept of "coded character set" can also be thought of as a function that maps [[#Character|characters]] to [[#Code_Point|code points]]. For example, the capital letter "A" in the Latin alphabet might be represented by the code point 65. "Coded character set" is frequently abbreviated as '''character set''', '''charset''', or '''code set'''. The acronym CCS is also used. A character set might be used by multiple languages. | |||
=<span id='Character_Encoding_Form'></span><span id='CEF'></span>Character Encoding Form (CEF)= | |||
A character encoding form (CEF) is the mapping of [[#Code_Point|code points]] to [[#Code_Unit|code units]] to facilitate storage or transmission in a system that represents numbers as bit sequences of fixed length. For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (for example, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence is defined by a CEF. | |||
=<span id='Character_Encoding_Scheme'></span><span id='CES'></span>Character Encoding Scheme (CES)=
A character encoding scheme (CES) is the mapping of [[#Code_Unit|code units]] to a sequence of bits to facilitate storage on an octet-based file system or transmission over an octet-based network. Character encoding schemes include [[#UTF-8|UTF-8]], [[#UTF-16|UTF-16]] and [[#UTF-32|UTF-32]]. | |||
=<span id='Code_Position'></span>Code Point=
A '''code point''', or a '''code position''', is any of the integral numeric values that make up the [[#Code_Space|code space]]. | |||
Some character encoding schemes, such as ASCII, have a fixed relationship between any of the represented characters and the sequence of bits used to represent that character. For example, the [[#ASCII|ASCII]] [[#Code_Space|code space]] consists of 2<sup>7</sup> = 128 code points and each of the represented characters corresponds to a code point and to a predefined bit sequence. [[#Extended_ASCII|Extended ASCII]] has 2<sup>8</sup> = 256 code points. | |||
[[#Unicode|Unicode]] breaks the assumption that each character should always directly correspond to a particular sequence of bits. Instead, the characters are first mapped to a universal intermediate numeric representation. In this context, the code points are the numeric values characters are mapped onto. Unicode allows 1,114,112 code points in the range 0x0 to 0x10FFFF. Code points could then be represented in a variety of ways and with various numbers of bits per character ([[#Code_Unit|code unit]]), depending on the context. | |||
The concept of code point is part of Unicode's solution for representing larger character sets without adding more bits to every character. Adding bits would have wasted storage on Latin-script content, which made up the vast majority of content at that time, because the extra bits would always have been zero.
In addition to allowing flexibility in the number of bits used to represent a character, code points have a second advantage: the same character can be rendered with different graphical representations ([[#Glyph|glyphs]]).
Code points are normally assigned to abstract [[#Character|characters]]. An abstract character is not a graphical glyph but a unit of textual data. The distinction between a code point and the corresponding abstract character is not pronounced in [[#Unicode|Unicode]], but is evident for many other encoding schemes, where numerous [[#Code_Page|code pages]] may exist for a single [[#Code_Space|code space]]. | |||
In [[#Unicode|Unicode]], not all code points are assigned to encoded [[#Character|characters]]. They can be of the following types: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter and Reserved. | |||
=<span id='Code_Value'></span>Code Unit=
The ''code unit'' (or ''code value'') is a bit sequence used to encode each character of a repertoire within a given encoding form. A characteristic of the code unit is its size: the number of bits used to represent a unit of encoded text within a certain encoding scheme. For example, [[#ASCII|ASCII]] has a code unit size of 7, [[#UTF-8|UTF-8]] a code unit size of 8, [[#UCS-4|UCS-4]] a code unit size of 32, etc. | |||
=Code Space= | |||
A '''code space''' is a range of numerical values available for encoding characters. For US-ASCII, the code space is 0x0 to 0x7F; for Unicode, it is 0x0 to 0x10FFFF.
=Code Page= | |||
A table of values that describes the [[#Character_Set|character set]] used for encoding a particular set of [[#Character|characters]].
=Combining Character= | |||
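A combining character is a character that modifies the preceding base character instead of standing alone, such as U+0301 (combining acute accent). It is not the same thing as a "pre-composed character", which is a single code point that already includes the diacritic, such as U+00E9 (é). A pre-composed character is typically equivalent to a base character followed by one or more combining characters.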
=Glyph= | |||
A graphical representation of a [[#Character|character]]. | |||
=Character Code= | |||
Unicode is referred to as a character code. | |||
<font color=darkkhaki>Are [[#Character_Encoding_Standard|character encoding standards]] and character codes equivalent?</font> | |||
=<span id='Character_Encoding_Standard'></span>Character Encoding Standards= | |||
==<span id='ASCII'></span>US-ASCII== | |||
ASCII stands for American Standard Code for Information Interchange and it is a seven-bit encoding scheme used to encode letters, numerals, symbols, and [[ANSI Escape Sequences#Overview|device control codes]] as fixed-length codes using integers. ASCII consists of 2<sup>7</sup> = 128 [[#Code_Point|code points]] in the range [[Common ASCII Codes#Null_Character|0x0]] to [[Common ASCII Codes#Delete|0x7F]]. ASCII is specified by ISO 646 IRV. | |||
{{Internal|Common ASCII Codes#Overview|Common ASCII Codes}} | {{Internal|Common ASCII Codes#Overview|Common ASCII Codes}} | ||
In Java, US_ASCII can be referred to as a constant: | |||
<syntaxhighlight lang='java'> | |||
public final class StandardCharsets { | |||
... | |||
/** | |||
* Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the | |||
* Unicode character set | |||
*/ | |||
public static final Charset US_ASCII = sun.nio.cs.US_ASCII.INSTANCE; | |||
... | |||
} | |||
</syntaxhighlight> | |||
==Extended ASCII== | |||
Extended ASCII consists of 2<sup>8</sup> = 256 code points in the range [[Common ASCII Codes#Null_Character|0x0]] to 0xFF. | |||
==Western== | |||
==Latin-US== | |||
==Latin-1 (ISO-8859-1)== | |||
ISO 8859-1 is a single-byte encoding standard whose 256 code points coincide with the first 256 [[#Unicode|Unicode]] code points (U+0000 to U+00FF). ISO 8859-1 and Unicode assign the ASCII characters exactly the same values.
In Java, ISO_8859_1 can be referred to as a constant: | |||
<syntaxhighlight lang='java'> | |||
public final class StandardCharsets { | |||
... | |||
/** | |||
* ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1 | |||
*/ | |||
public static final Charset ISO_8859_1 = sun.nio.cs.ISO_8859_1.INSTANCE; | |||
... | |||
} | |||
</syntaxhighlight> | |||
==Unicode== | ==Unicode== | ||
{{External|https://en.wikipedia.org/wiki/Unicode}}
{{External|http://www.unicode.org/standard/standard.html}} | |||
The Unicode standard is a character encoding system designed to support worldwide interchange, processing and display of the written texts of the diverse languages and technical disciplines. It is essentially a code table that assigns integer numbers ([[#Code_Point|code points]]) to [[#Characters|characters]], thereby defining which characters are available.
'''Unicode 1.0''' was originally designed as a fixed-width 16-bit character encoding that allowed only 2<sup>16</sup> = 65,536 code points, and when Java was initially designed, it relied on this specification [[Java and Unicode#Overview|to represent <tt>char</tt> on 2 bytes]]. However, it turned out that 65,536 code points are not sufficient to represent all characters that are or have been used on planet Earth.
The Unicode standard was extended with '''Unicode 2.0''', which introduced the surrogate mechanism and expanded the code space beyond 16 bits, up to U+10FFFF. This allows up to 1,112,064 encodable characters: 1,114,112 code points minus the 2,048 surrogate code points. All Unicode versions since 2.0 are compatible: only new characters are added, and no existing characters will be removed or renamed in the future. Moreover, the Unicode and [[#UCS|UCS (ISO 10646)]] code tables are compatible. More details about Unicode and ISO 10646 compatibility are available in the "[[#Relationship_between_Unicode_and_UCS|Relationship between Unicode and UCS]]" section.
===Unicode Code Points=== | |||
Unicode supports a much larger character set than [[#ASCII|ASCII]]. It comprises 17 x 65,536 = 1,114,112 [[#Code_Point|code points]] in the range 0x0 to 0x10FFFF. A code point that is not a surrogate (that is, any code point outside the range U+D800 to U+DFFF) is known as a ''Unicode scalar value''.
===Unicode Planes=== | |||
The Unicode [[#Code_Space|code space]] is divided into seventeen planes: the [[#BMP|Basic Multilingual Plane]] (plane 0), which contains the BMP characters, and sixteen [[#Supplementary_Planes|supplementary planes]], which contain supplementary characters. Thus, each Unicode character is either a BMP character or a supplementary character.
====<span id='BMP'></span><span id='Basic_Multilingual_Plane'></span>Basic Multilingual Plane (BMP)==== | |||
Plane 0 is referred to as the Basic Multilingual Plane (BMP) and contains the set of characters from U+0000 to U+FFFF (see below for more details on [[#U.2Bn_Notation|U+n notation]]). | |||
====Supplementary Planes==== | |||
Unicode supports 16 supplementary planes, where each plane supports 2<sup>16</sup> = 65,536 [[#Code_Point|code points]]. Characters whose code points are greater than U+FFFF, and thus exceed the original Unicode 1.0 16-bit limit, are called ''supplementary characters''. Such characters are generally rare, but some are used, for example, as part of Chinese and Japanese personal names.
===U+n Notation=== | |||
A hexadecimal number that represents a Unicode or [[#UCS|UCS]] value is commonly preceded by "U+". For example U+0041 is the character "Latin capital letter A". The range of valid code points is U+0000 to U+10FFFF. More details on how U+n notation can be used in Java are available in the "[[Java_and_Unicode#Overview|Java and Unicode]]" section. | |||
===Unicode Ranges=== | |||
{| | |||
| 0020 — 007F || Basic Latin | |||
|- | |||
|00A0 — 00FF || Latin-1 Supplement | |||
|- | |||
| 0100 — 017F || Latin Extended-A | |||
|- | |||
| 0180 — 024F || Latin Extended-B | |||
|- | |||
| ... | |||
|- | |||
| D800 — DB7F || High Surrogates | |||
|- | |||
| DB80 — DBFF || High Private Use Surrogates | |||
|- | |||
| DC00 — DFFF || Low Surrogates | |||
|- | |||
| E000 — F8FF || Private Use Area | |||
|- | |||
| ... | |||
|- | |||
| FFF0 — FFFF || Specials | |||
|} | |||
===Common Unicode Codes=== | |||
{{Internal|Common Unicode Codes|Common Unicode Codes}} | |||
===Unicode Transformation Format (UTF)=== | ===Unicode Transformation Format (UTF)=== | ||
====UTF-8==== | ====UTF-8==== | ||
Most systems working with Unicode use UTF-8. | |||
UTF-8 is a Unicode variable-length [[#CES|character encoding scheme]] that uses 8-bit [[#Code_Unit|code units]]. Unicode [[#Code_Point|code points]] map to a sequence of one, two, three or four [[#Code_Unit|code units]]. The UTF-8 encoding is defined in [[#UCS|ISO 10646-1:2000 Annex D]] and also described in RFC 3629. | |||
UTF-8 is backward compatible with fixed-width [[#US-ASCII|ASCII]]: every ASCII character is encoded as a single byte with the same value it has in ASCII, as explained below:
U+0000 to U+007F are encoded in one byte. The byte values 0x00 to 0x7F always represent code points U+0000 to U+007F, the Basic Latin block, which corresponds to the ASCII character set. These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient to use in software that assigns special meanings to certain ASCII characters. For example, the letter "A" representation (U+0041) in UTF-8 is 0x41.
U+0080 to U+07FF are encoded in two bytes. For example, "ß" representation (U+00DF) in UTF-8 is 0xC39F. | |||
U+0800 to U+FFFF are encoded in three bytes. For example, "東" representation (U+6771) in UTF-8 is 0xE69DB1. | |||
U+10000 to U+10FFFF are encoded in four bytes. For example, "𐐀" representation (U+10400) in UTF-8 is 0xF0909080. | |||
In Java, UTF-8 can be referred to as a constant: | |||
<syntaxhighlight lang='java'> | |||
public final class StandardCharsets { | |||
... | |||
/** | |||
* Eight-bit UCS Transformation Format | |||
*/ | |||
public static final Charset UTF_8 = sun.nio.cs.UTF_8.INSTANCE; | |||
} | |||
</syntaxhighlight> | |||
====UTF-16==== | ====UTF-16==== | ||
UTF-16 is a Unicode [[#CES|character encoding scheme]] that uses 16-bit [[#Code_Unit|code units]]. UTF-16 can represent every Unicode code point, so it covers both Western and Eastern letters and symbols. Any code point between U+0000 and U+FFFF inclusive is encoded with a single code unit, except for the surrogate range U+D800 to U+DFFF, which is not assigned to characters. Code points with a value of U+10000 or higher require two code units each, known as a surrogate pair. UTF-16 is the character encoding scheme used internally by Java. More details about the encoding standard are available in the "[[Java_and_Unicode#Character_Representation|Character Representation]]" section of the [[Java_and_Unicode#Overview|Java and Unicode]] page.
In Java, UTF-16 can be referred to as a constant: | |||
<syntaxhighlight lang='java'> | |||
public final class StandardCharsets { | |||
... | |||
/** | |||
* Sixteen-bit UCS Transformation Format, big-endian byte order | |||
*/ | |||
public static final Charset UTF_16BE = Charset.forName("UTF-16BE"); | |||
/** | |||
* Sixteen-bit UCS Transformation Format, little-endian byte order | |||
*/ | |||
public static final Charset UTF_16LE = Charset.forName("UTF-16LE"); | |||
/** | |||
* Sixteen-bit UCS Transformation Format, byte order identified by an | |||
* optional byte-order mark | |||
*/ | |||
public static final Charset UTF_16 = Charset.forName("UTF-16"); | |||
} | |||
</syntaxhighlight> | |||
====UTF-32==== | ====UTF-32==== | ||
UTF-32 is a Unicode [[#CES|character encoding scheme]] that uses 32-bit [[#Code_Unit|code units]]. The code unit is large enough that every code point is represented as a single code unit. This is clearly a convenient representation, but it uses significantly more memory than necessary if used as a general string representation. | |||
For example, "A" representation (U+0041) in UTF-32 is 0x00000041, "ß" representation (U+00DF) is 0x000000DF, "東" representation (U+6771) is 0x00006771. All these are BMP characters. "𐐀" (U+10400), which is a supplementary character, is represented as 0x00010400.
===Java and Unicode=== | |||
{{Internal|Java and Unicode#Overview|Java and Unicode}} | |||
==<span id='UCS'></span>Universal Character Set (UCS) ISO 10646== | ==<span id='UCS'></span>Universal Character Set (UCS) ISO 10646== | ||
UCS stands for Universal Character Set. It is defined by the international standard ISO 10646. Essentially, it is a code table that assigns integer numbers to characters. UCS is a superset of all other character set standards. It also covers a large number of graphical, typographical, mathematical and scientific symbols. UCS can specify 2<sup>31</sup> characters. UCS assigns to each character a code number and an official name. The "[[#Unicode|U+]]" representation is identical with that used by Unicode.
A UCS plane is a subset of 2<sup>16</sup> characters where the elements differ only in the 16 least-significant bits. The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD). This plane is also known as Plane 0.
The UCS characters U+0000 to U+007F are identical to those in [[#US-ASCII|US-ASCII]]. The UCS character range U+0000 to U+00FF is identical to ISO 8859-1. | |||
The two most obvious encodings store Unicode text as sequences of either 2-byte or 4-byte units. These encodings are named [[#UCS-2|UCS-2]] and [[#UCS-4|UCS-4]], respectively. Unless otherwise specified, the big-endian convention is used: the most significant byte comes first.
====UCS-2==== | |||
====UCS-4==== | |||
In the UCS-4 encoding, any [[#Code_Point|code point]] is encoded as a 4-byte binary number. The [[#Code_Unit|code unit]] has 4 x 8 = 32 bits.
==Relationship between Unicode and UCS== | |||
In the late 1980s, there were two independent attempts to create a single unified character set. One was ISO 10646 project of the International Organization for Standardization (ISO), and the other was the Unicode Project, set up by a consortium of manufacturers of multi-lingual software. The projects joined efforts around 1991 and produced a single code table. Both projects still exist and publish their standards independently, however they have agreed to keep the code tables compatible. The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards. Unicode 5.0 corresponds to ISO 10646:2003. | |||
The Unicode Standard defines many more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (e.g. Arabic), handling of bi-directional texts that mix for instance Latin and Hebrew, algorithms for sorting and string comparison, and much more. The ISO 10646 standard on the other hand is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and it contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022.
=Operations= | |||
==Determining the Character Set for a File== | |||
<syntaxhighlight lang='bash'> | |||
file -I <file-name> | |||
</syntaxhighlight> | |||
==Character Set Conversion== | |||
<syntaxhighlight lang='bash'> | |||
cat <file> | iconv -f utf-8 -t utf-8 -c > <new-file> | |||
</syntaxhighlight> | |||
More details: {{Internal|iconv|iconv}} |
Latest revision as of 01:24, 20 August 2023
External
- https://en.wikipedia.org/wiki/Character_encoding
- https://unicode.org/glossary/
- http://www.cl.cam.ac.uk/~mgk25/unicode.html
Internal
Overview
Character encoding is the process through which characters within a text document are represented by numeric codes and ultimately translated into a sequence of bits stored on persistent storage or sent over the wire. Depending on the character encoding convention used, the same text will end up with different binary representations. Common character encoding standards are US-ASCII, Extended ASCII, Unicode and UCS.
Character
A character is the smallest component of a written language that has semantic value. In the context of an encoding convention, the term "character" refers to the abstract meaning, rather than a specific shape (glyph), though in code tables some form of visual representation is also essential for the reader’s understanding. It is also referred to as abstract character. In Unicode, the character is the basic unit of encoding. Depending on the encoding scheme used, a character may be represented by one or more code units, which are the fixed-size bit sequences the encoding works with.
Character Repertoire
A character repertoire is the full set of abstract characters a system supports. The repertoire may be closed, where no additions are allowed without creating a new standard, as it is the case with ASCII and most of the ISO-8859 series, or it may be open, allowing additions, as it is the case with Unicode.
Coded Character Set (CCS)
A collection of characters used to represent textual information, in which each character is assigned a numeric code point. The concept of "coded character set" can also be thought of as a function that maps characters to code points. For example, the capital letter "A" in the Latin alphabet might be represented by the code point 65. "Coded character set" is frequently abbreviated as character set, charset, or code set. The acronym CCS is also used. A character set might be used by multiple languages.
Character Encoding Form (CEF)
A character encoding form (CEF) is the mapping of code points to code units to facilitate storage or transmission in a system that represents numbers as bit sequences of fixed length. For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (for example, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence is defined by a CEF.
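The multi-unit case described above can be made concrete with a small sketch. It assumes the UTF-16 encoding form, which is covered later on this page, and the class and method names are hypothetical: a code point above 65,535 is split into two 16-bit code units.

```java
class SurrogateSketch {
    // Splits a supplementary code point (U+10000 to U+10FFFF) into two
    // 16-bit code units, following the UTF-16 encoding form.
    static char[] toUtf16Units(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = toUtf16Units(0x10400);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // D801 DC00
    }
}
```

The result matches what the JDK produces with Character.toChars(0x10400), which is a convenient way to cross-check the arithmetic.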
Character Encoding Scheme (CES)
A character encoding scheme (CES) is the mapping of code units to a sequence of bits to facilitate storage on an octet-based file system or transmission over an octet-based network. Character encoding schemes include UTF-8, UTF-16 and UTF-32.
Code Point
A code point, or a code position, is any of the integral numeric values that make up the code space.
Some character encoding schemes, such as ASCII, have a fixed relationship between any of the represented characters and the sequence of bits used to represent that character. For example, the ASCII code space consists of 2^7 = 128 code points and each of the represented characters corresponds to a code point and to a predefined bit sequence. Extended ASCII has 2^8 = 256 code points.
Unicode breaks the assumption that each character should always directly correspond to a particular sequence of bits. Instead, the characters are first mapped to a universal intermediate numeric representation. In this context, the code points are the numeric values characters are mapped onto. Unicode allows 1,114,112 code points in the range 0x0 to 0x10FFFF. Code points could then be represented in a variety of ways and with various numbers of bits per character (code unit), depending on the context.
The concept of code point is part of Unicode's solution for representing larger character sets without adding more bits to every character. Adding bits would have wasted storage on Latin-script content, which made up the vast majority of content at that time, because the extra bits would always have been zero.
In addition to allowing flexibility in the number of bits used to represent a character, code points have a second advantage: the same character can be rendered with different graphical representations (glyphs).
Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. The distinction between a code point and the corresponding abstract character is not pronounced in Unicode, but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
In Unicode, not all code points are assigned to encoded characters. They can be of the following types: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter and Reserved.
Code Unit
The code unit (or code value) is a bit sequence used to encode each character of a repertoire within a given encoding form. A characteristic of the code unit is its size: the number of bits used to represent a unit of encoded text within a certain encoding scheme. For example, ASCII has a code unit size of 7, UTF-8 a code unit size of 8, UCS-4 a code unit size of 32, etc.
Code Space
A code space is a range of numerical values available for encoding characters. For US-ASCII, the code space is 0x0 to 0x7F; for Unicode, it is 0x0 to 0x10FFFF.
Code Page
A table of values that describes the character set used for encoding a particular set of characters.
Combining Character
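A combining character is a character that modifies the preceding base character instead of standing alone, such as U+0301 (combining acute accent). It is not the same thing as a "pre-composed character", which is a single code point that already includes the diacritic, such as U+00E9 (é). A pre-composed character is typically equivalent to a base character followed by one or more combining characters.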
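As a hedged illustration of how a character with a diacritic can be stored, the sketch below uses java.text.Normalizer to move between a single pre-composed code point (U+00E9) and the equivalent sequence of a base letter followed by a combining character (U+0065 U+0301). The class and helper names are hypothetical.

```java
import java.text.Normalizer;

class CombiningSketch {
    // NFD splits pre-composed characters into base + combining characters.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // NFC recombines base + combining characters where a pre-composed form exists.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // one code point
        String combining = "e\u0301";   // base letter + combining acute accent
        System.out.println(precomposed.equals(combining));            // false
        System.out.println(decompose(precomposed).equals(combining)); // true
        System.out.println(compose(combining).equals(precomposed));   // true
    }
}
```

Both strings typically render the same way, which is why text is usually normalized to one form before comparison.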
Glyph
A graphical representation of a character.
Character Code
Unicode is referred to as a character code.
Are character encoding standards and character codes equivalent?
Character Encoding Standards
US-ASCII
ASCII stands for American Standard Code for Information Interchange and it is a seven-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers. ASCII consists of 2^7 = 128 code points in the range 0x0 to 0x7F. ASCII is specified by ISO 646 IRV.
In Java, US_ASCII can be referred to as a constant:
public final class StandardCharsets {
...
/**
* Seven-bit ASCII, a.k.a. ISO646-US, a.k.a. the Basic Latin block of the
* Unicode character set
*/
public static final Charset US_ASCII = sun.nio.cs.US_ASCII.INSTANCE;
...
}
Extended ASCII
Extended ASCII consists of 2^8 = 256 code points in the range 0x0 to 0xFF.
Western
Latin-US
Latin-1 (ISO-8859-1)
ISO 8859-1 is a single-byte encoding standard whose 256 code points coincide with the first 256 Unicode code points (U+0000 to U+00FF). ISO 8859-1 and Unicode assign the ASCII characters exactly the same values.
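A short sketch of the single-byte behavior, using the StandardCharsets constants from the JDK. The class and the hex helper are hypothetical names introduced for the example. It shows that ISO 8859-1 and US-ASCII agree on "A", and that "ß" (U+00DF) fits in one ISO 8859-1 byte whose value equals its code point, while US-ASCII cannot represent it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Sketch {
    // Encodes text with the given charset and renders the bytes as uppercase hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A", StandardCharsets.US_ASCII));        // 41
        System.out.println(hex("A", StandardCharsets.ISO_8859_1));      // 41
        System.out.println(hex("\u00DF", StandardCharsets.ISO_8859_1)); // DF
        // String.getBytes substitutes '?' (0x3F) for unmappable characters.
        System.out.println(hex("\u00DF", StandardCharsets.US_ASCII));   // 3F
    }
}
```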
In Java, ISO_8859_1 can be referred to as a constant:
public final class StandardCharsets {
...
/**
* ISO Latin Alphabet No. 1, a.k.a. ISO-LATIN-1
*/
public static final Charset ISO_8859_1 = sun.nio.cs.ISO_8859_1.INSTANCE;
...
}
Unicode
The Unicode standard is a character encoding system designed to support worldwide interchange, processing and display of the written texts of the diverse languages and technical disciplines. It is essentially a code table that assigns integer numbers (code points) to characters, thereby defining which characters are available.
Unicode 1.0 was originally designed as a fixed-width 16-bit character encoding that allowed only 2^16 = 65,536 code points, and when Java was initially designed, it relied on this specification to represent char on 2 bytes. However, it turned out that 65,536 code points are not sufficient to represent all characters that are or have been used on planet Earth.
The Unicode standard was extended with Unicode 2.0, which introduced the surrogate mechanism and expanded the code space beyond 16 bits, up to U+10FFFF. This allows up to 1,112,064 encodable characters: 1,114,112 code points minus the 2,048 surrogate code points. All Unicode versions since 2.0 are compatible: only new characters are added, and no existing characters will be removed or renamed in the future. Moreover, the Unicode and UCS (ISO 10646) code tables are compatible. More details about Unicode and ISO 10646 compatibility are available in the "Relationship between Unicode and UCS" section.
Unicode Code Points
Unicode supports a much larger character set than ASCII. It comprises 17 x 65,536 = 1,114,112 code points in the range 0x0 to 0x10FFFF. A code point that is not a surrogate (that is, any code point outside the range U+D800 to U+DFFF) is known as a Unicode scalar value.
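The two totals mentioned on this page (1,114,112 code points and 1,112,064 encodable characters) can be reproduced with a small sketch. It assumes the standard definition of a scalar value as any code point outside the surrogate range U+D800 to U+DFFF. The class and method names are hypothetical, while the Character constants and isValidCodePoint come from the JDK.

```java
class ScalarValueSketch {
    // A scalar value is a valid code point that is not a surrogate.
    static boolean isScalarValue(int codePoint) {
        return Character.isValidCodePoint(codePoint)
                && (codePoint < Character.MIN_SURROGATE || codePoint > Character.MAX_SURROGATE);
    }

    // Walks the whole code space and counts the scalar values.
    static int countScalarValues() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isScalarValue(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(Character.MAX_CODE_POINT + 1); // 1114112
        System.out.println(countScalarValues());          // 1112064
    }
}
```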
Unicode Planes
The Unicode code space is divided into seventeen planes: the Basic Multilingual Plane (plane 0), which contains the BMP characters, and sixteen supplementary planes, which contain supplementary characters. Thus, each Unicode character is either a BMP character or a supplementary character.
Basic Multilingual Plane (BMP)
Plane 0 is referred to as the Basic Multilingual Plane (BMP) and contains the set of characters from U+0000 to U+FFFF (see below for more details on U+n notation).
Supplementary Planes
Unicode supports 16 supplementary planes, where each plane supports 2^16 = 65,536 code points. Characters whose code points are greater than U+FFFF, and thus exceed the original Unicode 1.0 16-bit limit, are called supplementary characters. Such characters are generally rare, but some are used, for example, as part of Chinese and Japanese personal names.
U+n Notation
A hexadecimal number that represents a Unicode or UCS value is commonly preceded by "U+". For example U+0041 is the character "Latin capital letter A". The range of valid code points is U+0000 to U+10FFFF. More details on how U+n notation can be used in Java are available in the "Java and Unicode" section.
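A minimal sketch of the notation, with hypothetical helper names: a code point is formatted as "U+" followed by at least four uppercase hexadecimal digits, and the reverse operation parses the digits back into an integer.

```java
class UPlusSketch {
    // Formats a code point in U+n notation, padding to at least four hex digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Parses U+n notation back into the numeric code point.
    static int fromUPlus(String notation) {
        if (!notation.startsWith("U+")) {
            throw new IllegalArgumentException("expected U+n notation: " + notation);
        }
        return Integer.parseInt(notation.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus('A'));         // U+0041
        System.out.println(toUPlus(0x10400));     // U+10400
        System.out.println(fromUPlus("U+6771"));  // 26481
    }
}
```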
Unicode Ranges
0020 — 007F | Basic Latin |
00A0 — 00FF | Latin-1 Supplement |
0100 — 017F | Latin Extended-A |
0180 — 024F | Latin Extended-B |
... | |
D800 — DB7F | High Surrogates |
DB80 — DBFF | High Private Use Surrogates |
DC00 — DFFF | Low Surrogates |
E000 — F8FF | Private Use Area |
... | |
FFF0 — FFFF | Specials |
Common Unicode Codes
Unicode Transformation Format (UTF)
The binary representation of text encoded in Unicode depends on the "transformation format" used. UTF stands for "Unicode Transformation Format", and the number after the dash in the transformation format name is the size, in bits, of one code unit. It is not the number of bits per character: in UTF-8 and UTF-16, a character may take more than one code unit.
UTF-8
Most systems working with Unicode use UTF-8.
UTF-8 is a Unicode variable-length character encoding scheme that uses 8-bit code units. Unicode code points map to a sequence of one, two, three or four code units. The UTF-8 encoding is defined in ISO 10646-1:2000 Annex D and also described in RFC 3629.
UTF-8 is backward compatible with fixed-width ASCII: every ASCII character is encoded as a single byte with the same value it has in ASCII, as explained below:
U+0000 to U+007F are encoded in one byte. The byte values 0x00 to 0x7F always represent code points U+0000 to U+007F, the Basic Latin block, which corresponds to the ASCII character set. These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient to use in software that assigns special meanings to certain ASCII characters. For example, the letter "A" representation (U+0041) in UTF-8 is 0x41.
U+0080 to U+07FF are encoded in two bytes. For example, "ß" representation (U+00DF) in UTF-8 is 0xC39F.
U+0800 to U+FFFF are encoded in three bytes. For example, "東" representation (U+6771) in UTF-8 is 0xE69DB1.
U+10000 to U+10FFFF are encoded in four bytes. For example, "𐐀" representation (U+10400) in UTF-8 is 0xF0909080.
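The four ranges above map onto a small amount of bit manipulation. The following is a minimal sketch of a UTF-8 encoder written for illustration only; the class Utf8Encoder and its methods are hypothetical names, and production code would normally rely on the JDK's own charset support instead.

```java
class Utf8Encoder {
    // Encodes one code point using the one- to four-byte ranges listed above.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    // Renders bytes as upper-case hexadecimal, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(0x0041)));   // 41
        System.out.println(hex(encode(0x00DF)));   // C39F
        System.out.println(hex(encode(0x6771)));   // E69DB1
        System.out.println(hex(encode(0x10400)));  // F0909080
    }
}
```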
In Java, UTF-8 can be referred to as a constant:
public final class StandardCharsets {
...
/**
* Eight-bit UCS Transformation Format
*/
public static final Charset UTF_8 = sun.nio.cs.UTF_8.INSTANCE;
}
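A short usage sketch of the constant, assuming nothing beyond the standard JDK: it encodes the sample characters from the list above and decodes them back. The class name Utf8ConstantDemo and the helper utf8Hex are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class Utf8ConstantDemo {
    // Returns the UTF-8 bytes of the text as upper-case hexadecimal.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        System.out.println(utf8Hex("A"));             // 41
        System.out.println(utf8Hex("\u00DF"));        // C39F
        System.out.println(utf8Hex("\u6771"));        // E69DB1
        System.out.println(utf8Hex("\uD801\uDC00"));  // F0909080 (U+10400)

        // Decoding with the same charset restores the original string.
        byte[] bytes = "\u6771".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals("\u6771")); // true
    }
}
```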
UTF-16
UTF-16 is a Unicode variable-length character encoding scheme that uses 16-bit code units. UTF-16 supports a large enough character set to represent both Western and Eastern letters and symbols. Any code point between U+0000 and U+FFFF inclusive is encoded with a single code unit. Code points U+10000 and higher require two code units each, known as a surrogate pair. UTF-16 is the character encoding scheme used internally by Java. More details about the encoding standard are available in the "Character Representation" section of the Java and Unicode page.
In Java, UTF-16 can be referred to as a constant:
public final class StandardCharsets {
...
/**
* Sixteen-bit UCS Transformation Format, big-endian byte order
*/
public static final Charset UTF_16BE = Charset.forName("UTF-16BE");
/**
* Sixteen-bit UCS Transformation Format, little-endian byte order
*/
public static final Charset UTF_16LE = Charset.forName("UTF-16LE");
/**
* Sixteen-bit UCS Transformation Format, byte order identified by an
* optional byte-order mark
*/
public static final Charset UTF_16 = Charset.forName("UTF-16");
}
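The two-code-unit case for code points at or above U+10000 can be sketched as follows. The class Utf16SurrogateDemo and its encode method are hypothetical and exist only to show the arithmetic; Character.toChars is the standard JDK method that performs the same conversion.

```java
import java.util.Arrays;

class Utf16SurrogateDemo {
    // Splits a code point into one or two 16-bit code units.
    // Like Character.toChars, this passes lone surrogate values through unchanged.
    static char[] encode(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (cp <= 0xFFFF) {
            return new char[] {(char) cp};
        }
        int offset = cp - 0x10000;                      // 20 significant bits
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] units = encode(0x10400);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);       // D801 DC00
        System.out.println(Arrays.equals(units, Character.toChars(0x10400)));  // true

        String s = new String(units);
        System.out.println(s.length());                       // 2 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
    }
}
```

The last two lines show why String.length() in Java counts code units rather than characters.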
UTF-32
UTF-32 is a Unicode character encoding scheme that uses 32-bit code units. The code unit is large enough that every code point is represented as a single code unit. This is clearly a convenient representation, but it uses significantly more memory than necessary if used as a general string representation.
For example, "A" representation (U+0041) in UTF-32 is 0x00000041, "ß" representation (U+00DF) is 0x000000DF, and "東" representation (U+6771) is 0x00006771. All these are BMP characters. "𐐀" (U+10400), which is a supplementary character, is represented as 0x00010400.
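Because every code point occupies exactly one 32-bit code unit, a big-endian UTF-32 encoder reduces to writing each code point as a 4-byte integer. The sketch below does this with a ByteBuffer rather than a named charset; the class Utf32Demo and its methods are hypothetical names chosen for illustration.

```java
import java.nio.ByteBuffer;

class Utf32Demo {
    // Writes each code point of the text as one big-endian 32-bit value.
    static byte[] encodeUtf32BE(String text) {
        int count = text.codePointCount(0, text.length());
        ByteBuffer buf = ByteBuffer.allocate(4 * count);  // ByteBuffer is big-endian by default
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            buf.putInt(cp);
            i += Character.charCount(cp);  // skip both halves of a surrogate pair
        }
        return buf.array();
    }

    // Renders bytes as upper-case hexadecimal, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encodeUtf32BE("A")));             // 00000041
        System.out.println(hex(encodeUtf32BE("\u00DF")));        // 000000DF
        System.out.println(hex(encodeUtf32BE("\u6771")));        // 00006771
        System.out.println(hex(encodeUtf32BE("\uD801\uDC00")));  // 00010400
    }
}
```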
Java and Unicode
Universal Character Set (UCS) ISO 10646
UCS stands for Universal Character Set. It is defined by the international standard ISO 10646. Essentially, it is a code table that assigns integer numbers to characters. UCS is a superset of all other character set standards. It also covers a large number of graphical, typographical, mathematical and scientific symbols. UCS can specify 2^31 characters. UCS assigns to each character a code number and an official name. The "U+" representation is identical with that used by Unicode.
A UCS plane is a subset of 2^16 characters whose elements differ only in the 16 least-significant bits. The most commonly used characters, including all those found in major older encoding standards, have been placed into the first plane (0x0000 to 0xFFFD). This plane is also known as Plane 0 or the Basic Multilingual Plane (BMP).
The UCS characters U+0000 to U+007F are identical to those in US-ASCII. The UCS character range U+0000 to U+00FF is identical to ISO 8859-1.
The two most obvious encodings store Unicode text as sequences of 2-byte or 4-byte units. These encodings are named UCS-2 and UCS-4, respectively. Unless otherwise specified, the big-endian convention is used: the most significant byte comes first.
UCS-2
UCS-4
In the UCS-4 encoding, any code point is encoded as a 4-byte binary number. The code unit has 4 x 8 = 32 bits.
Relationship between Unicode and UCS
In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), and the other was the Unicode Project, set up by a consortium of manufacturers of multi-lingual software. The projects joined efforts around 1991 and produced a single code table. Both projects still exist and publish their standards independently; however, they have agreed to keep the code tables compatible. The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards. Unicode 5.0 corresponds to ISO 10646:2003.
The Unicode Standard defines many more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (e.g. Arabic), handling of bi-directional texts that mix, for instance, Latin and Hebrew, algorithms for sorting and string comparison, and much more. The ISO 10646 standard, on the other hand, is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022.
Operations
Determining the Character Set for a File
file -I <file-name>
Character Set Conversion
cat <file> | iconv -f utf-8 -t utf-8 -c > <new-file>
More details: