Character Encoding
External
Internal
Overview
Character encoding is the process through which characters within a text document are represented by numeric codes and ultimately translated into sequences of bits that are stored or sent over the wire. Depending on the character encoding used, the same text ends up with different binary representations. Common character encoding standards are ASCII, Unicode and UCS.
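As a small illustration of the same text producing different binary representations, the sketch below encodes one string under three encodings using Python's built-in codecs. The helper name to_hex and the sample text are assumptions made for this example only.

```python
def to_hex(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")


# The same two characters, three different byte sequences.
for encoding in ("ascii", "utf-16-be", "utf-32-be"):
    print(f"{encoding:<10} {to_hex('Hi', encoding)}")
```

The characters are identical in all three cases; only the number of bytes spent on each one changes.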
Concepts
Character
Depending on the encoding scheme used, a character may be represented with code units of different sizes, where a code unit is the fixed-size group of bits the encoding scheme works with.
Character Set
Character Code
Unicode is a character code.
Code Point
A code point, or a code position, is any of the numeric values that make up the code space.
Some character encoding schemes, such as ASCII, have a fixed relationship between any of the represented characters and the sequence of bits used to represent that character. For example, the ASCII code space consists of 2^7 = 128 code points and each of the represented characters corresponds to a code point and to a predefined bit sequence. Extended ASCII has 2^8 = 256 code points.
Unicode breaks the assumption that each character should always directly correspond to a particular sequence of bits. Instead, the characters are first mapped to a universal intermediate numeric representation. The numeric values characters are mapped to are called code points. Unicode allows 1,114,112 code points in the range 0x0 to 0x10FFFF. Code points could then be represented in a variety of ways and with various numbers of bits per character (code unit), depending on the context.
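The mapping from characters to code points can be inspected directly in Python, whose built-in ord and chr convert between a one-character string and its code point. This is only a sketch; the helper code_point and the constant MAX_CODE_POINT are names chosen for this example.

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space


def code_point(ch: str) -> str:
    """Format the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


print(code_point("A"))        # U+0041
print(code_point("\u20ac"))   # U+20AC (euro sign)
print(MAX_CODE_POINT + 1)     # 1114112 code points in total
print(sys.maxunicode == MAX_CODE_POINT)  # True
```

Asking for a value outside the code space, for example chr(0x110000), raises ValueError.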
The concept of a code point is part of Unicode's solution for representing larger character sets without adding more bits to every character. Doing so would have been an unacceptable waste of storage space for Latin-script content, which constituted the vast majority of content at that time, because the extra bits would always have been zeroed out in those cases.
In addition to allowing flexibility in the number of bits a character is represented with, a second advantage of code points is that the same character can be rendered with different graphical representations (glyphs).
The distinction between a code point and the corresponding abstract character is not pronounced in Unicode, but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
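To illustrate several code pages sharing one code space, the sketch below decodes the same single byte under two 8-bit code pages that ship with Python's standard codecs. The choice of byte value and of the two code pages (latin-1 and cp437) is an assumption made for the example.

```python
raw = bytes([0xE9])  # one value from a 256-value code space

as_latin1 = raw.decode("latin-1")  # ISO 8859-1
as_cp437 = raw.decode("cp437")     # original IBM PC code page

# Same numeric value, two different abstract characters.
print(f"latin-1: U+{ord(as_latin1):04X}")
print(f"cp437:   U+{ord(as_cp437):04X}")
```

Under latin-1 the value maps to "é" (U+00E9), while under cp437 it maps to a Greek capital theta (U+0398).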
Combining Character
Code Unit
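The number of bits in the basic unit an encoding scheme uses to represent characters. For example, ASCII has a code unit of 7 bits, UTF-8 a code unit of 8 bits, and UCS-4 a code unit of 32 bits. In variable-length encodings such as UTF-8, a single character may take more than one code unit.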
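A rough way to see code units in practice is to count how many of them a character needs under each Unicode encoding. The sketch below assumes Python's standard codecs; the table CODE_UNIT_BITS and the function code_unit_count are hypothetical names introduced here.

```python
# Code unit width, in bits, for each encoding (little-endian variants are
# used so that no byte order mark is added to the output).
CODE_UNIT_BITS = {"utf-8": 8, "utf-16-le": 16, "utf-32-le": 32}


def code_unit_count(ch: str, encoding: str) -> int:
    """Number of code units needed to encode `ch` in `encoding`."""
    unit_bytes = CODE_UNIT_BITS[encoding] // 8
    return len(ch.encode(encoding)) // unit_bytes


for ch in ("A", "\u20ac", "\U0001F600"):
    counts = {enc: code_unit_count(ch, enc) for enc in CODE_UNIT_BITS}
    print(f"U+{ord(ch):04X}", counts)
```

The width of the code unit is fixed per encoding, but the number of code units per character is not: U+1F600 takes four 8-bit units in UTF-8, two 16-bit units in UTF-16 and one 32-bit unit in UTF-32.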
Code Space
Character Encoding Standards
ASCII
ASCII stands for American Standard Code for Information Interchange and it is a seven-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers. ASCII consists of 2^7 = 128 code points in the range 0x0 to 0x7F. ASCII is specified by ISO 646 IRV.
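The seven-bit limit can be checked with a few lines of Python. This is a minimal sketch; is_ascii_code is a helper name invented for the example, and the "ascii" codec is the one built into Python.

```python
def is_ascii_code(value: int) -> bool:
    """True if `value` fits in seven bits, i.e. lies in 0x00..0x7F."""
    return 0 <= value <= 0x7F


encoded = "A".encode("ascii")
print(encoded[0], format(encoded[0], "07b"))  # 65 1000001

print(is_ascii_code(0x7F))  # True: the last ASCII code point
print(is_ascii_code(0x80))  # False: needs an eighth bit
```

Trying to encode a character outside this range, such as "é".encode("ascii"), raises UnicodeEncodeError.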
Extended ASCII
Extended ASCII consists of 2^8 = 256 code points in the range 0x0 to 0xFF.
Unicode
Unicode supports a much larger character set than ASCII. It comprises 17 x 65,536 = 1,114,112 code points in the range 0x0 to 0x10FFFF. The Unicode code space is divided into seventeen planes: plane 0, the Basic Multilingual Plane, and 16 supplementary planes. Each plane holds 2^16 = 65,536 code points.
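Because every plane holds the same number of code points, the plane a code point belongs to follows from an integer division. The sketch below assumes nothing beyond the figures above; PLANE_SIZE and plane_of are names made up for the example.

```python
PLANE_SIZE = 2 ** 16  # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0..16) that contains `code_point`."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point // PLANE_SIZE


print(plane_of(ord("A")))   # 0, the Basic Multilingual Plane
print(plane_of(0x1F600))    # 1, a supplementary plane
print(plane_of(0x10FFFF))   # 16, the last plane
print(17 * PLANE_SIZE)      # 1114112
```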
Unicode Transformation Format (UTF)
The binary representation of text expressed in Unicode depends on the "transformation format" used. UTF stands for "Unicode Transformation Format", and the number after the dash in the format name is the number of bits in each code unit, not necessarily in each character. A character may take one or more code units: for example, UTF-8 uses one to four 8-bit code units per code point.
UTF-8
UTF-16
UTF-16 supports a large enough character set to represent both Western and Eastern letters and symbols.
UTF-32
Universal Character Set (UCS) ISO 10646
UCS-4
In the UCS-4 encoding, every code point is encoded as a 4-byte binary number. The code unit has 4 x 8 = 32 bits.
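The fixed 4-byte layout can be sketched by packing a code point as a 32-bit unsigned integer with Python's struct module. Big-endian byte order is an assumption made for this example, since the text above does not specify one, and ucs4_encode is a hypothetical helper name.

```python
import struct


def ucs4_encode(code_point: int) -> bytes:
    """Pack one code point as a 4-byte big-endian unsigned integer."""
    return struct.pack(">I", code_point)


print(ucs4_encode(ord("A")).hex(" "))   # 00 00 00 41
print(ucs4_encode(0x1F600).hex(" "))    # 00 01 f6 00
```

For valid Unicode code points this produces the same bytes as Python's "utf-32-be" codec.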