Character Encoding: Difference between revisions

From NovaOrdis Knowledge Base

Revision as of 21:14, 25 June 2018

External

Internal

Overview

Character encoding is the process through which characters within a text document are represented by numeric codes and ultimately translated into a sequence of bits that is stored or sent over the wire. Depending on the character encoding used, the same text will end up with different binary representations. Common character encoding standards are ASCII, Unicode and UCS.

Concepts

Character

Depending on the encoding scheme used, a character may be represented by one or more code units, where a code unit is the smallest group of bits the scheme uses to encode text.

Character Set

Character Code

Unicode is a character code.

Code Point

A code point, or a code position, is any of the numeric values that make up the code space.

Some character encoding schemes, such as ASCII, have a fixed relationship between a character and the sequence of bits used to represent that character. The ASCII code space consists of 2^7 = 128 code points. Unicode breaks the assumption that each character should always directly correspond to a particular sequence of bits. Instead, characters are first mapped to a universal intermediate numeric representation. The numeric values that characters are mapped to are called code points. Code points can then be represented in a variety of ways, using code units of different sizes and different numbers of code units per character, depending on the context.
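As a minimal sketch of the character-to-code-point mapping, the Python snippet below uses the built-in ord() and chr() functions. The helper names to_code_point and u_notation and the constant ASCII_CODE_SPACE_SIZE are made up for this illustration and are not part of any standard.

```python
def to_code_point(ch: str) -> int:
    """Return the Unicode code point (an integer) of a single character."""
    return ord(ch)


def u_notation(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX hexadecimal form."""
    return f"U+{ord(ch):04X}"


# ASCII uses seven bits, so its code space holds 2^7 code points.
ASCII_CODE_SPACE_SIZE = 2 ** 7

# Sample characters, written as escapes: LATIN CAPITAL LETTER A,
# LATIN SMALL LETTER E WITH ACUTE, EURO SIGN.
for ch in ["A", "\u00e9", "\u20ac"]:
    # ascii() keeps the printed output ASCII-only regardless of the terminal.
    print(ascii(ch), to_code_point(ch), u_notation(ch))
```

The code point is the same integer no matter how the text is later stored; chr() maps it back to the character.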

Code Unit

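The smallest group of bits that an encoding scheme uses as a building block to represent characters. For example, ASCII uses 7-bit code units, UTF-8 uses 8-bit code units, and UTF-16 uses 16-bit code units. A single character may require more than one code unit: UTF-8, for instance, uses between one and four code units per code point.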
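The relationship between characters and code units can be checked with a short Python sketch. It assumes a code unit is the fixed-size bit group an encoding works in (8, 16 or 32 bits for the UTF family). The helper code_unit_count and the ENCODINGS table are hypothetical names introduced only for this example.

```python
def code_unit_count(ch: str, encoding: str, unit_bits: int) -> int:
    """Return how many code units of `unit_bits` bits `encoding` needs for `ch`."""
    encoded = ch.encode(encoding)
    return len(encoded) * 8 // unit_bits


# (codec name, code unit size in bits). The "-be" codecs fix the byte order,
# so no byte order mark is added to the output.
ENCODINGS = [("utf-8", 8), ("utf-16-be", 16), ("utf-32-be", 32)]

# Sample characters: U+0041, U+20AC (euro sign), U+1F600 (an emoji).
for ch in ["A", "\u20ac", "\U0001F600"]:
    counts = {name: code_unit_count(ch, name, bits) for name, bits in ENCODINGS}
    print(ascii(ch), counts)
```

The code unit size is fixed per encoding, while the number of code units per character varies in UTF-8 and UTF-16.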

Code Space

Character Encoding Standards

ASCII

ASCII stands for American Standard Code for Information Interchange. It is a seven-bit encoding scheme that encodes letters, numerals, symbols, and device control codes as fixed-length integer codes. ASCII corresponds to the International Reference Version (IRV) of ISO/IEC 646.
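The seven-bit, fixed-length nature of ASCII can be illustrated with a small Python sketch. The function names is_ascii and to_seven_bit_strings are invented for this example. Python's "ascii" codec stores each code in an eight-bit byte, so the seven-bit view is produced here by formatting.

```python
from typing import List


def is_ascii(text: str) -> bool:
    """True if every character's code point fits in seven bits (0 to 127)."""
    return all(ord(ch) < 128 for ch in text)


def to_seven_bit_strings(text: str) -> List[str]:
    """Return each character's ASCII code as a seven-digit binary string."""
    # encode() raises UnicodeEncodeError if the text contains non-ASCII characters.
    encoded = text.encode("ascii")
    return [format(byte, "07b") for byte in encoded]


print(is_ascii("Hi"), to_seven_bit_strings("Hi"))
```

Every character maps to exactly one seven-bit code, which is the fixed relationship between character and bit sequence that ASCII relies on.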

Common ASCII Codes

Unicode

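Unicode supports a much larger character set than ASCII: its code space runs from U+0000 to U+10FFFF, and its first 128 code points coincide with ASCII.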

Unicode Transformation Format (UTF)

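The binary representation of text represented in Unicode depends on the "transformation format" used. UTF stands for "Unicode Transformation Format". The number after the dash in the format name is the size in bits of the format's code unit, not the number of bits used for every character: UTF-8 encodes a code point as one to four 8-bit code units, UTF-16 as one or two 16-bit code units, and UTF-32 as exactly one 32-bit code unit.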
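To see how the same text yields different binary representations under each transformation format, the following Python sketch dumps the encoded bytes in hexadecimal. The helper utf_hex is a hypothetical name. The big-endian codec variants are chosen here only so that no byte order mark appears in the output.

```python
from typing import Dict


def utf_hex(text: str) -> Dict[str, str]:
    """Encode `text` with each UTF codec and return the bytes as spaced hex."""
    codecs = ("utf-8", "utf-16-be", "utf-32-be")
    # bytes.hex(sep) requires Python 3.8 or later.
    return {name: text.encode(name).hex(" ") for name in codecs}


# "A" followed by the euro sign (U+20AC).
for name, hex_bytes in utf_hex("A\u20ac").items():
    print(f"{name:10} {hex_bytes}")
```

The two code points U+0041 and U+20AC stay the same in every case; only the byte sequences that carry them differ.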

UTF-8

UTF-16

UTF-16 supports a large enough character set to represent both Western and Eastern letters and symbols.

UTF-32

Universal Character Set (UCS) ISO 10646

Western

Latin-US