Character Encoding
Revision as of 20:56, 25 June 2018
External
Internal
Overview
Character encoding is the process through which characters within a text document are represented by numeric codes. Depending on the character encoding used, the same text will end up with different binary representations. Common character encoding standards are ASCII, Unicode and UCS.
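To make the "same text, different binary representation" point concrete, here is a minimal Python sketch. The sample string and the choice of the big-endian UTF variants (picked only so that no byte order mark is prepended) are assumptions made for illustration.

```python
# Encode the same text with three Unicode transformation formats
# and compare the resulting bytes.
text = "caf\u00e9"  # "café": the last character is U+00E9

encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    # hex(" ") prints the bytes as space-separated hex pairs (Python 3.8+)
    print(f"{name:10} {len(data):2} bytes  {data.hex(' ')}")
```

The four characters occupy 5, 8 and 16 bytes respectively, even though the text itself has not changed.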
Concepts
Character Set
Character Code
Unicode is a character code.
Code Point
A code point is a ...
Unicode breaks the assumption that each character should always directly correspond to a particular sequence of bits. Instead, characters are first mapped to a universal intermediate numeric representation. The numbers that characters are mapped to are called code points. Code points can then be represented in a variety of ways, using one or more fixed-size code units per character, depending on the context.
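The two-step mapping can be sketched in Python, where a string is a sequence of code points and the byte representation only appears once an encoding is chosen. The three sample characters are arbitrary choices for illustration.

```python
# Step 1: characters map to code points (plain integers).
text = "A\u20ac\U0001F600"  # LATIN CAPITAL LETTER A, EURO SIGN, GRINNING FACE

code_points = [ord(ch) for ch in text]
labels = [f"U+{cp:04X}" for cp in code_points]
print(labels)  # ['U+0041', 'U+20AC', 'U+1F600']

# Step 2: each code point maps to bytes, and the byte count
# depends on the encoding chosen (UTF-8 here).
utf8_lengths = [len(ch.encode("utf-8")) for ch in text]
print(utf8_lengths)  # [1, 3, 4]
```

The code points are fixed by Unicode; only the second step changes when a different encoding is used.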
Code Unit
Code Space
Character Encoding Standards
ASCII
ASCII stands for American Standard Code for Information Interchange. It is a seven-bit encoding scheme that represents letters, numerals, symbols, and device control codes as fixed-length integer codes in the range 0 to 127. ASCII is equivalent to the International Reference Version (IRV) of ISO/IEC 646.
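As a small illustration of the fixed-length, seven-bit codes, the following Python sketch prints the ASCII code of each character in a short string. The sample string is an assumption; it mixes letters, a symbol and one control code (line feed).

```python
# Every ASCII character has an integer code that fits in seven bits.
text = "Hi!\n"

codes = [ord(ch) for ch in text]
print(codes)  # [72, 105, 33, 10]

# Render each code as a fixed-length, seven-bit binary string.
bits = [format(c, "07b") for c in codes]
print(bits)  # ['1001000', '1101001', '0100001', '0001010']

# Encoding as ASCII yields exactly one byte per character.
data = text.encode("ascii")
print(len(data))  # 4
```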
Unicode
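Unicode supports a far larger character set than ASCII: its code points range from U+0000 to U+10FFFF, and the first 128 of them coincide with the ASCII codes.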
Unicode Transformation Format (UTF)
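The binary representation of text represented in Unicode depends on the "transformation format" used. UTF stands for "Unicode Transformation Format", and the number specified after the dash in the transformation format name is the size, in bits, of one code unit. It is not necessarily the number of bits used per character: UTF-8 uses one to four 8-bit code units per code point, UTF-16 uses one or two 16-bit code units, and UTF-32 always uses a single 32-bit code unit.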
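A short Python sketch shows how the number in a UTF name relates to encoded size: it is the width of one code unit, and a single character may need several of them. The helper name `code_units` and the sample characters are hypothetical, chosen for illustration; the big-endian codecs are used only to avoid a byte order mark.

```python
def code_units(ch: str, bits: int) -> int:
    """Return how many code units of the given width are needed to encode ch."""
    codec = {8: "utf-8", 16: "utf-16-be", 32: "utf-32-be"}[bits]
    return len(ch.encode(codec)) // (bits // 8)

# U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# For each character: code units needed in UTF-8, UTF-16 and UTF-32.
table = {ch: [code_units(ch, b) for b in (8, 16, 32)] for ch in samples}

for ch, counts in table.items():
    print(f"U+{ord(ch):04X}", counts)
```

Only UTF-32 uses exactly one code unit for every code point; in UTF-8 and UTF-16 the count varies with the character.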
UTF-8
UTF-16
UTF-16 supports a large enough character set to represent both Western and Eastern letters and symbols.