Character encoding, ASCII, Unicode
1. Character encoding
Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.
The numerical values that make up a character encoding are known as code points and collectively comprise a code space, a code page, or a character map. [1]
1.1. Terminology related to character encoding
- A character is a minimal unit of text that has semantic value.
- A character set is a collection of characters that might be used by multiple languages. Example: the Latin character set is used by English and most European languages, whereas the Greek character set is used only by the Greek language.
- A coded character set is a character set in which each character corresponds to a unique number.
- A code point of a coded character set is any allowed value in the character set or code space.
- A code space is a range of integers whose values are code points.
- A code unit is the "word size" of the character encoding scheme, such as 7-bit, 8-bit, or 16-bit. In some schemes, some characters are encoded using multiple code units, resulting in a variable-length encoding (see the sketch below).
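As a rough illustration of code points versus code units, here is a minimal Python sketch (the choice of Python and of the sample string is incidental):

```python
# "café" is four characters, hence four Unicode code points.
text = "café"
print(len(text))                           # 4 code points

# With 8-bit code units (UTF-8), "é" takes two code units,
# so the encoding is variable-length: 5 code units in total.
print(len(text.encode("utf-8")))           # 5

# With 16-bit code units (UTF-16, fixed byte order, no BOM),
# every character here fits in a single code unit: 4 code units.
print(len(text.encode("utf-16-le")) // 2)  # 4
```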
1.2. Character repertoire (the abstract set of characters)
The character repertoire (/ˈrepətwɑː(r)/) is an abstract set of more than one million characters found in a wide variety of scripts including Latin, Cyrillic, Chinese, Korean, Japanese, Hebrew, and Aramaic. Other symbols such as musical notation are also included in the character repertoire.
Both the Unicode and GB 18030 standards have a character repertoire. As new characters are added to one standard, the other standard also adds those characters, to maintain parity.
The code unit size is equivalent to the bit measurement for the particular encoding: UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units, and UTF-32 uses 32-bit code units.
1.3. Glyphs, graphemes and characters
In most languages written in a variety of the Latin alphabet other than English, the use of diacritics to signify a sound mutation is common. For example, the grapheme ⟨à⟩ requires two glyphs: the base letter a and the grave accent `. [2]
In computing as well as typography, the term "character" refers to a grapheme or grapheme-like unit of text, as found in natural language writing systems (scripts).
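One way to see the gap between graphemes and encoded characters is that the grapheme ⟨à⟩ can be written either as a single precomposed character or as a base letter followed by a combining accent. A minimal Python sketch using the standard unicodedata module shows the two spellings and how Unicode normalization relates them:

```python
import unicodedata

precomposed = "\u00E0"    # "à" as one precomposed character
combining   = "a\u0300"   # letter "a" followed by a combining grave accent

# Same grapheme on screen, but different code point sequences.
print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True (composed form)
print(unicodedata.normalize("NFD", precomposed) == combining)  # True (decomposed form)
```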
1.4. Example of a code unit
Consider a string of the letters "ab̲c𐐀", that is, a string containing a Unicode combining character (U+0332 ̲) as well as a supplementary character (U+10400 𐐀).
This string has several representations which are logically equivalent, yet each is suited to a different set of circumstances or range of requirements:
- Four composed characters: a, b̲, c, 𐐀
- Five graphemes: a, b, _, c, 𐐀
- Five Unicode code points: U+0061, U+0062, U+0332, U+0063, U+10400
- Five UTF-32 code units (32-bit integer values): 0x00000061, 0x00000062, 0x00000332, 0x00000063, 0x00010400
- Six UTF-16 code units (16-bit integers): 0x0061, 0x0062, 0x0332, 0x0063, 0xD801, 0xDC00
- Nine UTF-8 code units (8-bit values, or bytes): 0x61, 0x62, 0xCC, 0xB2, 0x63, 0xF0, 0x90, 0x90, 0x80
Note in particular the last character, which is represented with either one 32-bit value, two 16-bit values, or four 8-bit values. Although each of those forms uses the same total number of bits (32) to represent the character, the actual numeric byte values and their arrangement appear entirely unrelated.
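The breakdown above can be reproduced directly; this short Python sketch (spelling the same string with escape sequences) prints the code points and counts the code units in each encoding form:

```python
s = "ab\u0332c\U00010400"                  # the string "ab̲c𐐀"

print([f"U+{ord(c):04X}" for c in s])      # ['U+0061', 'U+0062', 'U+0332', 'U+0063', 'U+10400']

print(len(s.encode("utf-32-be")) // 4)     # 5 UTF-32 code units
print(len(s.encode("utf-16-be")) // 2)     # 6 UTF-16 code units (U+10400 needs a surrogate pair)
print(len(s.encode("utf-8")))              # 9 UTF-8 code units (bytes)

# The individual UTF-8 bytes: 61 62 cc b2 63 f0 90 90 80
print(" ".join(f"{b:02x}" for b in s.encode("utf-8")))
```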
1.5. Code point
The convention to refer to a character in Unicode is to start with U+ followed by the code point value in hexadecimal.
- The range of valid code points for the Unicode standard is U+0000 to U+10FFFF, inclusive, divided into 17 planes, identified by the numbers 0 to 16.
- Characters in the range U+0000 to U+FFFF are in plane 0, called the Basic Multilingual Plane (BMP). This plane contains the most commonly used characters.
- Characters in the range U+10000 to U+10FFFF, in the other planes, are called supplementary characters.
The following table shows examples of code point values:
| Character | Unicode code point | Glyph |
|---|---|---|
| Latin A | U+0041 | A |
| Latin sharp S | U+00DF | ß |
| Han for East | U+6771 | 東 |
| Ampersand | U+0026 | & |
| Inverted exclamation mark | U+00A1 | ¡ |
| Section sign | U+00A7 | § |
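A code point and its plane can be computed from a character directly. The following Python sketch (the describe helper is purely illustrative) prints the U+ notation and the plane for some of the characters above:

```python
def describe(char):
    cp = ord(char)              # the character's code point
    plane = cp // 0x10000       # plane 0 is the BMP; planes 1-16 are supplementary
    return f"{char}: U+{cp:04X}, plane {plane}"

print(describe("A"))   # A: U+0041, plane 0
print(describe("ß"))   # ß: U+00DF, plane 0
print(describe("東"))  # 東: U+6771, plane 0
print(describe("𐐀"))   # 𐐀: U+10400, plane 1
```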
A code point is represented by a sequence of code units. The mapping is defined by the encoding. Thus, the number of code units required to represent a code point depends on the encoding:
- UTF-8: code points map to a sequence of one, two, three, or four code units.
- UTF-16: code units are 16 bits, twice as long as 8-bit code units. Therefore, any code point with a scalar value less than U+10000 is encoded with a single code unit. Code points with a value of U+10000 or higher require two code units each. These pairs of code units have a unique term in UTF-16: "Unicode surrogate pairs" (the arithmetic is sketched after this list).
- UTF-32: the 32-bit code unit is large enough that every code point is represented as a single code unit.
- GB 18030: multiple code units per code point are common, because of the small code units. Code points are mapped to one, two, or four code units.
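The UTF-16 surrogate-pair rule mentioned above can be worked out by hand: subtract 0x10000 from the code point, then split the remaining 20 bits between a lead surrogate (based at 0xD800) and a trail surrogate (based at 0xDC00). A minimal Python sketch, ignoring error handling for out-of-range values:

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units for one code point (illustrative sketch)."""
    if cp < 0x10000:
        return [cp]                   # BMP code points need a single code unit
    cp -= 0x10000                     # 20 bits remain
    high = 0xD800 + (cp >> 10)        # lead (high) surrogate
    low  = 0xDC00 + (cp & 0x3FF)      # trail (low) surrogate
    return [high, low]

print([hex(u) for u in utf16_code_units(0x0061)])   # ['0x61']
print([hex(u) for u in utf16_code_units(0x10400)])  # ['0xd801', '0xdc00']
```

The second result matches the UTF-16 code units listed for 𐐀 in the earlier example.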
1.6. Unicode encoding model
Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding.
Rather than mapping characters directly to octets (bytes), they separately define what characters are available, corresponding natural numbers (code points), how those numbers are encoded as a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets.
The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways.
To describe this model correctly requires more precise terms than "character set" and "character encoding." The terms used in the modern model follow:
A character repertoire is the full set of abstract characters that a system supports.
- The repertoire may be closed, i.e. no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and, to a limited extent, the Windows code pages).
- The characters in a given repertoire reflect decisions that have been made about how to divide writing systems into basic information units. The basic variants of the Latin, Greek and Cyrillic alphabets can be broken down into letters, digits, punctuation, and a few special characters such as the space, which can all be arranged in simple linear sequences that are displayed in the same order they are read. But even with these alphabets, diacritics pose a complication: they can be regarded either as part of a single character containing a letter and diacritic (known as a precomposed character), or as separate characters. The former allows a far simpler text handling system, but the latter allows any letter/diacritic combination to be used in text.
- A coded character set (CCS) is a function that maps characters to code points (each code point represents one character). For example, in a given repertoire, the capital letter "A" in the Latin alphabet might be represented by the code point 65, the character "B" by 66, and so on. Multiple coded character sets may share the same repertoire; for example, ISO/IEC 8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map the characters to different code points.
- A character encoding form (CEF) is the mapping of code points to code units to facilitate storage in a system that represents numbers as bit sequences of fixed length (i.e. practically any computer system). For example, a system that stores numeric information in 16-bit units can only directly represent code points 0 to 65,535 in each unit, but larger code points (say, 65,536 to 1.4 million) could be represented by using multiple 16-bit units. This correspondence is defined by a CEF.
- Next, a character encoding scheme (CES) is the mapping of code units to a sequence of octets to facilitate storage on an octet-based file system or transmission over an octet-based network. Simple character encoding schemes include UTF-8, UTF-16BE, UTF-32BE, UTF-16LE, and UTF-32LE; compound character encoding schemes, such as UTF-16, UTF-32 and ISO/IEC 2022, switch between several simple schemes by using a byte order mark or escape sequences; compressing schemes try to minimize the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
The byte order mark (BOM) is a particular usage of the special Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text (a detection sketch appears at the end of this section):
- The byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings;
- The fact that the text stream's encoding is Unicode, to a high level of confidence;
- Which Unicode character encoding is used.
- The Unicode model uses the term character map for historical systems which directly assign a sequence of characters to a sequence of bytes, covering all of the CCS, CEF and CES layers.
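As a rough sketch of how a reader might use the byte order mark described above, the following Python example (the sniff_bom helper is illustrative, built on the standard codecs module's BOM constants) reports which encoding a leading BOM suggests:

```python
import codecs

# Order matters: the UTF-32 BOMs begin with the same bytes as the UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF8,     "UTF-8"),
]

def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None (illustrative helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

# Python's "utf-16" codec prepends a BOM in the platform's native byte order,
# so this prints UTF-16LE on most machines (UTF-16BE on big-endian ones).
print(sniff_bom("hi".encode("utf-16")))
print(sniff_bom(b"plain ASCII"))   # None: no BOM present
```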