Definitions for UTF-8 encoding and decoding of character sequences.
UTF-8 is a variable-width character encoding scheme.
Each character is encoded with between 1 and 4 bytes.
Specifications for encoding and decoding characters to their UTF-8 byte sequences are given by encode_utf8 and decode_utf8, respectively.
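As a rough illustration of the 1-to-4-byte range, the sketch below uses the Rust standard library's `char::encode_utf8` rather than this module's `encode_utf8` specification. The helper name `utf8_bytes_sketch` is hypothetical and not part of the module.

```rust
/// Hypothetical helper (not part of this module): returns the UTF-8 bytes
/// of a single `char`, using the standard library's encoder.
fn utf8_bytes_sketch(c: char) -> Vec<u8> {
    // Four bytes is the maximum length of any UTF-8 encoding.
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: decodes the first character of a byte slice,
/// returning `None` if the bytes are not valid UTF-8 or are empty.
fn first_char_sketch(bytes: &[u8]) -> Option<char> {
    std::str::from_utf8(bytes).ok()?.chars().next()
}
```

For instance, `utf8_bytes_sketch('A')` has length 1, while `utf8_bytes_sketch('\u{1F600}')` has length 4, and `first_char_sketch` recovers the original character from either result.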
Characters in the ASCII character set are encoded in UTF-8 with 1-byte encodings identical to those used by ASCII.
Thus, some UTF-8 byte sequences can also be considered ASCII byte sequences, as defined in is_ascii_chars.
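The ASCII property can be sketched as a simple byte check. The function below is an illustrative stand-in written in plain Rust. It is an assumption about what such a check could look like, not the module's `is_ascii_chars` definition.

```rust
/// Hypothetical check (not this module's `is_ascii_chars`): every byte is
/// below 0x80, so each byte is a complete 1-byte UTF-8 encoding on its own.
fn all_ascii_sketch(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| *b < 0x80)
}
```

Under this sketch, the bytes of `"hello"` pass the check, while the bytes of `"h\u{e9}"` do not, because U+00E9 needs a 2-byte encoding whose bytes are both at least 0x80.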
UTF-8 encodes numerical values called Unicode scalars (see below), which assign a unique value to each Unicode character.
A scalar value is encoded in UTF-8 using a leading byte and between 0 and 3 continuation bytes, where larger scalar values require more continuation bytes.
The first part of the bit pattern in the leading byte is reserved for describing the number of bytes in the scalar’s encoding (e.g., is_leading_byte_width_1).
The rest of the leading byte contains data bits corresponding to the scalar’s value (e.g., leading_bits_width_1).
The continuation bytes also follow a specific bit pattern (is_continuation_byte) and contain the remainder of the data bits (continuation_bits).
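The bit layout described above can be sketched in executable Rust. This is a minimal illustration of the standard UTF-8 layout, not the module's specification functions. The names `encode_scalar_sketch` and `is_continuation_byte_sketch` are hypothetical.

```rust
/// Hypothetical check: continuation bytes have the bit pattern 0b10xxxxxx.
fn is_continuation_byte_sketch(b: u8) -> bool {
    (b & 0xC0) == 0x80
}

/// Hypothetical encoder: returns the UTF-8 bytes for `v`, or `None` if `v`
/// is not a Unicode scalar (a surrogate or a value above 0x10FFFF).
fn encode_scalar_sketch(v: u32) -> Option<Vec<u8>> {
    // A continuation byte is the 0b10 prefix plus 6 data bits taken at `shift`.
    let cont = |shift: u32| 0x80 | ((v >> shift) & 0x3F) as u8;
    match v {
        // Width 1: 0xxxxxxx (7 data bits in the leading byte).
        0..=0x7F => Some(vec![v as u8]),
        // Width 2: 110xxxxx + 1 continuation byte.
        0x80..=0x7FF => Some(vec![0xC0 | (v >> 6) as u8, cont(0)]),
        // Surrogates are excluded before the general width-3 range.
        0xD800..=0xDFFF => None,
        // Width 3: 1110xxxx + 2 continuation bytes.
        0x800..=0xFFFF => Some(vec![0xE0 | (v >> 12) as u8, cont(6), cont(0)]),
        // Width 4: 11110xxx + 3 continuation bytes.
        0x1_0000..=0x10_FFFF => {
            Some(vec![0xF0 | (v >> 18) as u8, cont(12), cont(6), cont(0)])
        }
        _ => None,
    }
}
```

Choosing the width from the smallest range that contains the value is what rules out overlong encodings in this sketch: a value below 0x80 is never given a 2-byte form.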
This module makes use of terminology from the Unicode standard.
A Unicode scalar is a numerical value (represented in this module as a u32) corresponding to a character that can be encoded in UTF-8.
All Rust chars correspond to Unicode scalars (char_is_scalar),
and every numerical value encoded in a UTF-8 byte sequence must fall within the range defined for Unicode scalars (is_scalar).
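The scalar range can be written as a small predicate. The sketch below is an assumption about the shape of such a check, based on the Unicode definition of scalar values. It is not the module's `is_scalar`, and the name `is_scalar_sketch` is hypothetical.

```rust
/// Hypothetical predicate: a Unicode scalar is any value in
/// 0..=0x10FFFF that is not in the surrogate range 0xD800..=0xDFFF.
fn is_scalar_sketch(v: u32) -> bool {
    v <= 0x10FFFF && !(0xD800..=0xDFFF).contains(&v)
}
```

This predicate agrees with the standard library's `char::from_u32`, which returns `Some` exactly for the values that correspond to a Rust `char`.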
The Unicode standard also defines a codepoint to be any numerical value in the Unicode codespace (0 through 0x10FFFF).
This may sound similar to the definition of a scalar.
However, the definition of codepoint is more permissive than that of scalar:
it includes some values that the UTF-8 bit layout can technically represent
but that are not legal Unicode scalar values,
namely the high-surrogate and low-surrogate ranges (0xD800 through 0xDFFF).
To align with the Unicode terminology, this module uses the term “scalar” for the numerical values
that can be encoded in valid UTF-8 byte sequences, and the term “codepoint” for numerical values that
are obtained by decoding a byte sequence and may or may not be legal Unicode scalars.
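To make the distinction concrete, the sketch below extracts the data bits of a 3-byte sequence without checking whether the result is a scalar. The function name `decode_width_3_codepoint_sketch` is hypothetical, and it assumes its input already has the width-3 leading and continuation bit patterns. It is not one of this module's definitions.

```rust
/// Hypothetical decoder: combines the 4 data bits of a width-3 leading byte
/// with the 6 data bits of each of the two continuation bytes.
/// Assumes the bytes already match the patterns 1110xxxx 10xxxxxx 10xxxxxx.
fn decode_width_3_codepoint_sketch(b: [u8; 3]) -> u32 {
    ((b[0] & 0x0F) as u32) << 12
        | ((b[1] & 0x3F) as u32) << 6
        | ((b[2] & 0x3F) as u32)
}
```

Applied to the bytes `[0xED, 0xA0, 0x80]`, this yields the codepoint 0xD800, which lies in the surrogate range and is therefore not a scalar. The standard library's `std::str::from_utf8` rejects that same byte sequence as invalid UTF-8.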
Functions
- char_is_scalar
- char_u32_cast
- codepoint_width_1
- codepoint_width_2
- codepoint_width_3
- codepoint_width_4
- continuation_bits
- decode_first_codepoint
- decode_first_scalar
- decode_last_scalar
- decode_utf8
- decode_utf8_encode_utf8
- decode_utf8_first_scalar
- decode_utf8_split
- encode_scalar
- encode_utf8
- encode_utf8_decode_utf8
- encode_utf8_first_scalar
- encode_utf8_valid_utf8
- group_utf8_lib
- has_width_1_encoding
- has_width_2_encoding
- has_width_3_encoding
- has_width_4_encoding
- is_ascii_chars
- is_ascii_chars_concat
- is_ascii_chars_encode_utf8
- is_ascii_chars_nat_bound
- is_char_boundary
- is_char_boundary_iff_is_leading_byte
- is_char_boundary_iff_not_is_continuation_byte
- is_char_boundary_start_end_of_seq
- is_continuation_byte
- is_leading_byte_width_1
- is_leading_byte_width_2
- is_leading_byte_width_3
- is_leading_byte_width_4
- is_scalar
- last_continuation_byte
- leading_bits_width_1
- leading_bits_width_2
- leading_bits_width_3
- leading_bits_width_4
- leading_byte_width_1
- leading_byte_width_2
- leading_byte_width_3
- leading_byte_width_4
- length_of_first_codepoint
- length_of_first_scalar
- length_of_last_scalar
- not_overlong_encoding
- not_surrogate
- partial_valid_partial_invalid_utf8
- partial_valid_utf8
- partial_valid_utf8_extend
- partial_valid_utf8_extend_ascii_block
- pop_first_scalar
- second_last_continuation_byte
- take_first_scalar
- take_last_scalar
- third_last_continuation_byte
- utf8_byte_ranges_bitwise
- valid_first_scalar
- valid_leading_and_continuation_bytes_first_codepoint
- valid_utf8
- valid_utf8_concat
- valid_utf8_last
- valid_utf8_split