Character Encoding

From NovaOrdis Knowledge Base
Revision as of 22:56, 25 June 2018 by Ovidiu

External

Internal

Overview

Character encoding is the process through which the characters of a text document are represented by numeric codes and ultimately translated into the sequences of bits that are stored or sent over the wire. Depending on the character encoding used, the same text ends up with different binary representations. Common character encoding standards are ASCII, Unicode and UCS.
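As a minimal illustration of the point about differing binary representations, the Python sketch below encodes one character with three encodings and prints the resulting bytes. The sample character and the three encodings are chosen here for illustration only.

```python
# The same text yields different byte sequences under different encodings.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

encoded = {name: text.encode(name) for name in ("latin-1", "utf-8", "utf-16-be")}

for name, data in encoded.items():
    # Print the encoding name followed by the bytes in hexadecimal.
    print(f"{name:10} {data.hex(' ')}")
```

The character takes one byte in Latin-1 and two bytes in both UTF-8 and UTF-16, and the byte values differ in each case.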

Concepts

Character

A character is the smallest component of written language that has semantic value. The term refers to the abstract meaning and shape rather than to a specific shape (glyph), though in code tables some form of visual representation is essential for the reader’s understanding. It is also referred to as an abstract character. In Unicode, it is the basic unit of encoding. Depending on the encoding scheme used, a character may be represented with one or more code units, and the size of a code unit in bits differs from one scheme to another.

Character Set

A character set is a collection of characters used to represent textual information, in which each character is assigned a numeric code point. The full term is coded character set, frequently abbreviated as character set, charset, or code set. The acronym CCS is also used. A character set might be used by multiple languages.

Character Code

Unicode is sometimes referred to as a character code.

Code Point

A code point, or a code position, is any of the numeric values that make up the code space.

Some character encoding schemes, such as ASCII, have a fixed relationship between any of the represented characters and the sequence of bits used to represent that character. For example, the ASCII code space consists of 2^7 = 128 code points and each of the represented characters corresponds to a code point and to a predefined bit sequence. Extended ASCII has 2^8 = 256 code points.

Unicode breaks the assumption that each character should always directly correspond to a particular sequence of bits. Instead, the characters are first mapped to a universal intermediate numeric representation. The numeric values that characters are mapped to are called code points. Unicode allows 1,114,112 code points in the range 0x0 to 0x10FFFF. Code points can then be represented in a variety of ways, using code units of various sizes, depending on the context.
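The mapping between characters and code points can be inspected directly in Python, whose built-in ord() and chr() convert between a one-character string and its code point. The sketch below is only an illustration; the sample characters are arbitrary.

```python
import sys

# ord() maps a character to its code point; chr() does the reverse.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}")

# The Unicode code space runs from 0x0 to 0x10FFFF inclusive.
code_space_size = 0x10FFFF + 1
assert code_space_size == 1_114_112
assert sys.maxunicode == 0x10FFFF  # highest code point Python accepts

# Values outside the code space are rejected.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```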

The concept of the code point is part of Unicode's solution for representing larger character sets without adding more bits to every character. Doing so would have been an unacceptable waste of storage space for Latin script content, which made up the vast majority of content at the time, because the extra bits would always have been zeroed out in those cases.

In addition to allowing flexibility in the number of bits used to represent a character, a second advantage of code points is that the same character can be rendered with different graphical representations (glyphs).

Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. The distinction between a code point and the corresponding abstract character is not pronounced in Unicode, but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
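The remark about several code pages sharing one code space can be sketched in Python by decoding the same byte under two different 8-bit code pages. The byte value 0xE9 and the two code pages (ISO 8859-1 and the IBM PC code page 437) are picked here only as an example.

```python
# One byte value from the 8-bit code space, interpreted under two code pages.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")  # ISO 8859-1
as_cp437 = raw.decode("cp437")     # original IBM PC code page

# Print the resulting Unicode code points rather than the characters,
# so the output does not depend on the terminal's own encoding.
print(f"latin-1: U+{ord(as_latin1):04X}")
print(f"cp437:   U+{ord(as_cp437):04X}")
```

The same numeric value maps to two different abstract characters, which is why a byte stream in such an encoding cannot be interpreted without knowing the code page.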

In Unicode, not all code points are assigned to encoded characters. They can be of the following types: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter and Reserved.
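Some of these types can be observed through Python's unicodedata module, which reports the Unicode general category of a character. This is a rough sketch: the sample code points are chosen for illustration, and the general category is only an approximation of the types listed above (for instance, noncharacters and reserved code points both report Cn).

```python
import unicodedata

# One sample code point per type, with the general category it reports.
samples = {
    "graphic": "A",            # Lu, uppercase letter
    "format": "\u200d",        # Cf, ZERO WIDTH JOINER
    "control": "\n",           # Cc, line feed
    "private-use": "\ue000",   # Co
    "surrogate": "\ud800",     # Cs
    "noncharacter": "\uffff",  # Cn
}

categories = {kind: unicodedata.category(ch) for kind, ch in samples.items()}

for kind, category in categories.items():
    print(f"{kind:13} U+{ord(samples[kind]):04X} {category}")
```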

Code Unit

The code unit (or code value) is a bit sequence used to encode each character of a repertoire within a given encoding form. A characteristic of the code unit is its size: the number of bits used to represent a unit of encoded text within a certain encoding scheme. For example, ASCII has a code unit size of 7 bits, UTF-8 a code unit size of 8 bits, and UCS-4 a code unit size of 32 bits.
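To make the distinction between characters and code units concrete, the following Python sketch counts how many code units a single character occupies in three Unicode encoding forms. The helper code_unit_count is a hypothetical name introduced for this example, and the sample characters are arbitrary.

```python
def code_unit_count(text: str, encoding: str, unit_bytes: int) -> int:
    """Return the number of code units `text` occupies in `encoding`.

    `unit_bytes` is the code unit size in bytes (1, 2 or 4).
    """
    return len(text.encode(encoding)) // unit_bytes


for ch in ("A", "\u20ac", "\U0001F600"):
    utf8 = code_unit_count(ch, "utf-8", 1)
    utf16 = code_unit_count(ch, "utf-16-le", 2)
    utf32 = code_unit_count(ch, "utf-32-le", 4)
    print(f"U+{ord(ch):04X}: UTF-8={utf8} UTF-16={utf16} UTF-32={utf32}")
```

The explicit little-endian codec names are used so that Python does not prepend a byte order mark, which would otherwise inflate the counts.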

Code Space

A code space is a range of numerical values available for encoding characters. For Unicode, the code space is the range of integers from 0x0 to 0x10FFFF.

Code Page

A table of values that describes the character set used for encoding a particular set of characters.

Combining Character

Character Encoding Standards

US-ASCII

ASCII stands for American Standard Code for Information Interchange and it is a seven-bit encoding scheme used to encode letters, numerals, symbols, and device control codes as fixed-length codes using integers. ASCII consists of 2^7 = 128 code points in the range 0x0 to 0x7F. ASCII is specified by ISO 646 IRV.
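The seven-bit limit is easy to observe with Python's built-in "ascii" codec. The sketch below is illustrative only; is_ascii_encodable is a hypothetical helper name introduced for this example.

```python
# Every ASCII code fits in seven bits, so each encoded byte is below 0x80.
data = "Hi!".encode("ascii")
print(list(data))
assert all(b < 0x80 for b in data)


def is_ascii_encodable(text: str) -> bool:
    """Return True if every character of `text` lies in 0x0..0x7F."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised for any character outside the 128 ASCII code points.
        return False


print(is_ascii_encodable("abc"))
print(is_ascii_encodable("\u00e9"))
```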

Common ASCII Codes

Extended ASCII

Extended ASCII consists of 2^8 = 256 code points in the range 0x0 to 0xFF.

Unicode

https://en.wikipedia.org/wiki/Unicode
http://www.unicode.org/standard/standard.html

The Unicode standard is a character encoding system designed to support the worldwide interchange, processing and display of the written texts of diverse languages and technical disciplines. It is essentially a code table that assigns integer numbers (code points) to characters. All Unicode versions since 2.0 are compatible: only new characters will be added, and no existing characters will be removed or renamed in the future. Moreover, the Unicode and UCS (ISO 10646) code tables are compatible. More details about Unicode and ISO 10646 compatibility are available in the "Relationship between Unicode and UCS" section.

Unicode supports a much larger character set than ASCII. It comprises 17 x 65,536 = 1,114,112 code points in the range 0x0 to 0x10FFFF. The Unicode code space is divided into seventeen planes: plane 0, the Basic Multilingual Plane, and 16 supplementary planes. Each plane holds 2^16 = 65,536 code points.
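Because each plane holds 2^16 code points, the plane number of a code point is simply its value shifted right by 16 bits. A minimal Python sketch, in which plane_of and PLANE_SIZE are names made up for this example:

```python
PLANE_SIZE = 1 << 16  # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane (0..16) that contains `code_point`."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    return code_point >> 16


# Seventeen planes cover the whole code space.
assert 17 * PLANE_SIZE == 1_114_112

for cp in (0x41, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```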

Common Unicode Codes

Unicode Transformation Format (UTF)

The binary representation of a text represented in Unicode depends on the "transformation format" used. UTF stands for "Unicode Transformation Format", and the number after the dash in the transformation format name is the size of the code unit in bits, not the number of bits used for every character: a single character may take one or more code units.

UTF-8

UTF-8 is a Unicode encoding format that uses 8-bit code units. Unicode code points map to a sequence of one, two, three or four code units.
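The mapping from a code point to one, two, three or four 8-bit code units can be sketched by hand. The bit layout used below comes from the UTF-8 definition rather than from this page, and utf8_encode is a hypothetical name; in practice Python's built-in str.encode("utf-8") does the same work.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # One code unit: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # Two code units: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three code units: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four code units: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check the sketch against Python's own UTF-8 codec.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
```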

UTF-16

UTF-16 is a Unicode encoding format that uses 16-bit code units. UTF-16 supports a large enough character set to represent both Western and Eastern letters and symbols. Any code point with a scalar value less than U+10000 is encoded with a single code unit. Code points with a value of U+10000 or higher require two code units each.
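The two-code-unit case works by splitting the code point into a pair of surrogate code units. The arithmetic below follows the UTF-16 definition, which this page does not spell out, and utf16_code_units is a hypothetical helper name used only for this sketch.

```python
def utf16_code_units(cp: int) -> list:
    """Return the 16-bit code units that encode one code point in UTF-16."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x10000:
        # Below U+10000: the code point is its own single code unit.
        return [cp]
    # U+10000 and above: subtract 0x10000, then split the remaining
    # 20 bits into a high and a low surrogate of 10 bits each.
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)
    low = 0xDC00 | (offset & 0x3FF)
    return [high, low]


for cp in (0x20AC, 0x1F600):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
```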

UTF-32

Unicode uses 32-bit code units in the UTF-32 encoding form. The code unit is large enough that every code point is represented as a single code unit.
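Since every code point fits in one 32-bit code unit, UTF-32 encoding amounts to writing each code point as a 4-byte integer. The sketch below assumes big-endian byte order without a byte order mark; utf32_be is a name introduced only for this example.

```python
import struct


def utf32_be(text: str) -> bytes:
    """Encode `text` as big-endian UTF-32, one 4-byte unit per code point."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


sample = "A\U0001F600"
encoded = utf32_be(sample)

# Two code points always produce exactly two 32-bit code units.
assert len(encoded) == 4 * len(sample)
# The result matches Python's built-in codec.
assert encoded == sample.encode("utf-32-be")
print(encoded.hex(" "))
```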

Universal Character Set (UCS) ISO 10646

UCS-4

In the UCS-4 encoding, any code point is encoded as a 4-byte binary number. The code unit has 4 x 8 = 32 bits.

Relationship between Unicode and UCS

In the late 1980s, there were two independent attempts to create a single unified character set. One was the ISO 10646 project of the International Organization for Standardization (ISO), and the other was the Unicode Project, set up by a consortium of manufacturers of multi-lingual software. The projects joined efforts around 1991 and produced a single code table. Both projects still exist and publish their standards independently, but they have agreed to keep the code tables compatible. The Unicode Standard published by the Unicode Consortium corresponds to ISO 10646 at implementation level 3. All characters are at the same positions and have the same names in both standards. Unicode 5.0 corresponds to ISO 10646:2003.

The Unicode Standard defines many more semantics associated with some of the characters and is in general a better reference for implementors of high-quality typographic publishing systems. Unicode specifies algorithms for rendering presentation forms of some scripts (e.g. Arabic), handling of bi-directional texts that mix, for instance, Latin and Hebrew, algorithms for sorting and string comparison, and much more. The ISO 10646 standard, on the other hand, is not much more than a simple character set table, comparable to the old ISO 8859 standards. It specifies some terminology related to the standard, defines some encoding alternatives, and contains specifications of how to use UCS in connection with other established ISO standards such as ISO 6429 and ISO 2022.

Western

Latin-US