Module utf8


Definitions for UTF-8 encoding and decoding of character sequences.

UTF-8 is a variable-width character encoding scheme in which each character is encoded with between 1 and 4 bytes. The specifications for encoding characters to their UTF-8 byte sequences and for decoding byte sequences back to characters are given by encode_utf8 and decode_utf8, respectively. Characters in the ASCII character set have 1-byte UTF-8 encodings identical to their ASCII encodings. Thus, some UTF-8 byte sequences can also be considered ASCII byte sequences, as defined in is_ascii_chars.
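To make the 1-to-4-byte widths concrete, here is a minimal sketch that uses only the Rust standard library (`char::encode_utf8` and `std::str::from_utf8`), not this module's specification functions, whose signatures are not shown on this page. The helper names `utf8_bytes` and `decode_bytes` are hypothetical and exist only for illustration.

```rust
/// Hypothetical helper: encode one `char` with the standard library
/// and return its UTF-8 bytes (between 1 and 4 of them).
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Hypothetical helper: decode a byte slice with the standard library.
/// Returns `None` when the bytes are not valid UTF-8.
fn decode_bytes(bytes: &[u8]) -> Option<String> {
    std::str::from_utf8(bytes).ok().map(|s| s.to_string())
}
```

Under these assumptions, `utf8_bytes('A')` yields the single byte `0x41`, the same value ASCII assigns to `A`, while `utf8_bytes('€')` yields the three bytes `0xE2 0x82 0xAC`. Passing those bytes to `decode_bytes` recovers the original character.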

UTF-8 encodes numerical values called Unicode scalars (see below); each Unicode character is assigned a unique scalar value. A scalar value is encoded in UTF-8 using one leading byte followed by between 0 and 3 continuation bytes, where larger scalar values require more continuation bytes. The high-order bits of the leading byte are reserved for describing the number of bytes in the scalar’s encoding (e.g., is_leading_byte_width_1). The rest of the leading byte contains data bits corresponding to the scalar’s value (e.g., leading_bits_width_1). The continuation bytes also follow a specific bit pattern (is_continuation_byte) and contain the remainder of the data bits (continuation_bits).
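The bit layout described above can be sketched in plain Rust. This is an illustration of the standard UTF-8 bit patterns, not the module's own definitions: the `sketch_*` names are hypothetical, and the module's functions (such as is_continuation_byte) may be stated differently.

```rust
/// Hypothetical sketch: read the encoding width from a leading byte's prefix.
/// This checks the bit pattern only; some matching bytes (for example 0xC0)
/// can never start a valid sequence because of the overlong-encoding rule.
fn sketch_leading_width(b: u8) -> Option<usize> {
    if b & 0x80 == 0x00 {
        Some(1) // 0xxxxxxx: 7 data bits
    } else if b & 0xE0 == 0xC0 {
        Some(2) // 110xxxxx: 5 data bits
    } else if b & 0xF0 == 0xE0 {
        Some(3) // 1110xxxx: 4 data bits
    } else if b & 0xF8 == 0xF0 {
        Some(4) // 11110xxx: 3 data bits
    } else {
        None // 10xxxxxx is a continuation byte; 11111xxx is never used
    }
}

/// Hypothetical sketch: continuation bytes have the form 10xxxxxx (6 data bits).
fn sketch_is_continuation(b: u8) -> bool {
    b & 0xC0 == 0x80
}

/// Hypothetical sketch: encode a scalar value, or return `None` for
/// surrogates (0xD800..=0xDFFF) and values above 0x10FFFF.
fn sketch_encode(v: u32) -> Option<Vec<u8>> {
    // Build one continuation byte from the 6 bits starting at `shift`.
    let cont = |shift: u32| 0x80 | ((v >> shift) & 0x3F) as u8;
    match v {
        0..=0x7F => Some(vec![v as u8]),
        0x80..=0x7FF => Some(vec![0xC0 | (v >> 6) as u8, cont(0)]),
        0x800..=0xD7FF | 0xE000..=0xFFFF => {
            Some(vec![0xE0 | (v >> 12) as u8, cont(6), cont(0)])
        }
        0x1_0000..=0x10_FFFF => {
            Some(vec![0xF0 | (v >> 18) as u8, cont(12), cont(6), cont(0)])
        }
        _ => None,
    }
}
```

For example, the scalar 0x20AC falls in the 3-byte range, so `sketch_encode(0x20AC)` produces a leading byte `0xE2` (prefix `1110`, data bits `0010`) followed by the continuation bytes `0x82` and `0xAC`.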

This module makes use of terminology from the Unicode standard. A Unicode scalar is a numerical value (represented in this module as a u32) corresponding to a character that can be encoded in UTF-8. All Rust chars correspond to Unicode scalars (char_is_scalar), and every numerical value encoded in a UTF-8 byte sequence must fall within the range defined for Unicode scalars (is_scalar). The Unicode standard also defines a codepoint to be a numerical value which falls in the range available for encoding characters in UTF-8. This may sound similar to the definition of scalar. However, the definition of codepoint is more permissive than that of scalar: it includes some values which are technically possible to encode in the UTF-8 scheme but are not legal Unicode values (namely, the high-surrogate and low-surrogate ranges). To align with the Unicode terminology, this module uses the term “scalar” for the numerical values which can be encoded in valid UTF-8 byte sequences, and the term “codepoint” for numerical values which are obtained by decoding a byte sequence but may or may not be legal Unicode values.
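The difference between a scalar and a codepoint shows up when decoding bytes that follow the UTF-8 bit patterns but carry a surrogate value. The following is a minimal sketch with hypothetical `sketch_*` helpers; it assumes the standard scalar range (0 to 0x10FFFF, excluding the surrogates 0xD800 to 0xDFFF) and is not the module's own is_scalar or decode_first_codepoint definition.

```rust
/// Hypothetical sketch: a value is a scalar if it is at most 0x10FFFF
/// and lies outside the surrogate range 0xD800..=0xDFFF.
fn sketch_is_scalar(v: u32) -> bool {
    v <= 0x10_FFFF && !(0xD800..=0xDFFF).contains(&v)
}

/// Hypothetical sketch: extract the data bits of a 3-byte sequence without
/// checking that the result is a legal scalar. The value returned is a
/// codepoint in the sense used above. Assumes the bytes already have the
/// 1110xxxx 10xxxxxx 10xxxxxx shape.
fn sketch_decode_width_3(bytes: [u8; 3]) -> u32 {
    let lead = (bytes[0] & 0x0F) as u32; // 4 data bits
    let c1 = (bytes[1] & 0x3F) as u32; // 6 data bits
    let c2 = (bytes[2] & 0x3F) as u32; // 6 data bits
    (lead << 12) | (c1 << 6) | c2
}
```

With these helpers, the bytes `0xED 0xA0 0x80` decode to the codepoint 0xD800, which `sketch_is_scalar` rejects. The standard library agrees: `char::from_u32(0xD800)` returns `None`, and `std::str::from_utf8` reports an error for that byte sequence.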

Functions§

char_is_scalar
char_u32_cast
codepoint_width_1
codepoint_width_2
codepoint_width_3
codepoint_width_4
continuation_bits
decode_first_codepoint
decode_first_scalar
decode_last_scalar
decode_utf8
decode_utf8_encode_utf8
decode_utf8_first_scalar
decode_utf8_split
encode_scalar
encode_utf8
encode_utf8_decode_utf8
encode_utf8_first_scalar
encode_utf8_valid_utf8
group_utf8_lib
has_width_1_encoding
has_width_2_encoding
has_width_3_encoding
has_width_4_encoding
is_ascii_chars
is_ascii_chars_concat
is_ascii_chars_encode_utf8
is_ascii_chars_nat_bound
is_char_boundary
is_char_boundary_iff_is_leading_byte
is_char_boundary_iff_not_is_continuation_byte
is_char_boundary_start_end_of_seq
is_continuation_byte
is_leading_byte_width_1
is_leading_byte_width_2
is_leading_byte_width_3
is_leading_byte_width_4
is_scalar
last_continuation_byte
leading_bits_width_1
leading_bits_width_2
leading_bits_width_3
leading_bits_width_4
leading_byte_width_1
leading_byte_width_2
leading_byte_width_3
leading_byte_width_4
length_of_first_codepoint
length_of_first_scalar
length_of_last_scalar
not_overlong_encoding
not_surrogate
partial_valid_partial_invalid_utf8
partial_valid_utf8
partial_valid_utf8_extend
partial_valid_utf8_extend_ascii_block
pop_first_scalar
second_last_continuation_byte
take_first_scalar
take_last_scalar
third_last_continuation_byte
utf8_byte_ranges_bitwise
valid_first_scalar
valid_leading_and_continuation_bytes_first_codepoint
valid_utf8
valid_utf8_concat
valid_utf8_last
valid_utf8_split
