B Definitions for Character Normalization
This appendix contains the necessary definitions for character normalization.
For additional background information and examples, see [Charmod].
Text is said to be in a Unicode encoding form if it is encoded in UTF-8, UTF-16 or UTF-32.
Legacy encoding is taken to mean any character encoding not based on Unicode.
A normalizing transcoder is a transcoder that converts from a legacy encoding to a Unicode encoding form and ensures that the result is in Unicode Normalization Form C (see UAX #15 [Unicode]).
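As an illustration only, a normalizing transcoder can be sketched in Python as a decode step followed by NFC normalization and serialization to UTF-8. The function name `normalizing_transcode` and the choice of windows-1258 (`cp1258`) as the legacy encoding are assumptions made for this example and are not defined by this specification.

```python
import unicodedata


def normalizing_transcode(data: bytes, legacy_encoding: str) -> bytes:
    """Convert bytes in a legacy encoding to NFC-normalized UTF-8.

    Hypothetical helper for illustration; not defined by this specification.
    """
    text = data.decode(legacy_encoding)
    # Decoding alone can leave combining sequences in place, so normalize
    # before serializing to a Unicode encoding form.
    return unicodedata.normalize("NFC", text).encode("utf-8")


# windows-1258 can represent an accent as a separate combining character.
legacy_bytes = "e\u0301".encode("cp1258")
result = normalizing_transcode(legacy_bytes, "cp1258")
```

A plain transcoder would stop after the decode step and could therefore emit the two-character sequence unchanged; the extra normalization step is what makes the transcoder "normalizing" in the sense defined above.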
A character escape is a syntactic device defined in a markup or programming language that allows one or more of:
- expressing syntax-significant characters while disregarding their significance in the syntax of the language, or
- expressing characters not representable in the character encoding chosen for an instance of the language, or
- expressing characters in general, without use of the corresponding character codes.
Certified text is text which satisfies at least one of the following conditions:
- it has been confirmed through inspection that the text is in normalized form, or
- the source text-processing component is identified and is known to produce only normalized text.
Text is, for the purposes of this specification, Unicode-normalized if it is in a Unicode encoding form and is in Unicode Normalization Form C, according to a version of Unicode Standard Annex #15: Unicode Normalization Forms [Unicode] at least as recent as the oldest version of the Unicode Standard that contains all the characters actually present in the text, but no earlier than version 3.2.
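A minimal sketch of a Normalization Form C check, assuming Python's standard `unicodedata` module, is shown below. The helper name `is_nfc` is hypothetical. Note that the result reflects the Unicode version bundled with the Python runtime (reported by `unicodedata.unidata_version`), which may or may not be the version the definition above calls for.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


# The Unicode version behind these results is the one bundled with Python.
bundled_version = unicodedata.unidata_version

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
checks = (is_nfc(composed), is_nfc(decomposed))
```

The check operates on decoded text, so it covers only the Normalization Form C half of the definition; whether the serialized form is UTF-8, UTF-16 or UTF-32 has to be established separately.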
Text is include-normalized if:
- the text is Unicode-normalized and does not contain any character escape or Include whose expansion would cause the text to become no longer Unicode-normalized; or
- the text is in a legacy encoding and, if it were transcoded to a Unicode encoding form by a normalizing transcoder, the resulting text would satisfy clause 1 above.
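The first clause can be illustrated with a simplified sketch. It assumes that the only character escapes in play are XML-style numeric character references (`&#xHHHH;` and `&#DDDD;`), expands all of them at once, and does not model Includes at all. The function names are hypothetical, and the code is an approximation of the definition rather than a conforming check.

```python
import re
import unicodedata

# XML-style numeric character references: &#xHHHH; or &#DDDD;
_NCR = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")


def is_nfc(text: str) -> bool:
    """Return True if NFC normalization leaves the text unchanged."""
    return unicodedata.normalize("NFC", text) == text


def expand_escapes(text: str) -> str:
    """Replace numeric character references with the characters they denote."""
    def repl(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _NCR.sub(repl, text)


def is_include_normalized(text: str) -> bool:
    """Approximate clause 1: NFC both before and after escape expansion."""
    return is_nfc(text) and is_nfc(expand_escapes(text))
```

For example, `e&#x301;` is in Normalization Form C as written, but expanding the escape yields "e" followed by a combining acute accent, which is not; the sketch therefore reports it as not include-normalized.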
A composing character is a character that is one or both of the following:
- the second character in the canonical decomposition mapping of some primary composite (as defined in D3 of UAX #15 [Unicode]), or
- of non-zero canonical combining class (as defined in the Unicode Standard [Unicode]).
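The two conditions can be approximated with Python's `unicodedata` module, as sketched below. The names `is_composing` and `_build_composing_seconds` are hypothetical. The sketch assumes that a character with a two-character canonical decomposition is a primary composite exactly when NFC leaves it unchanged (characters excluded from composition are rewritten by NFC), and it adds the Hangul vowel and trailing-consonant jamo by hand because Hangul syllables decompose algorithmically. Results depend on the Unicode version bundled with Python.

```python
import sys
import unicodedata


def _build_composing_seconds() -> frozenset:
    """Collect second characters of canonical decompositions of primary composites."""
    seconds = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        mapping = unicodedata.decomposition(ch)
        # Skip characters with no mapping or with a compatibility mapping ("<tag> ...").
        if not mapping or mapping.startswith("<"):
            continue
        parts = mapping.split()
        # Assumption: a character that NFC leaves alone is not excluded from
        # composition, so it is treated here as a primary composite.
        if len(parts) == 2 and unicodedata.normalize("NFC", ch) == ch:
            seconds.add(chr(int(parts[1], 16)))
    # Hangul syllables decompose algorithmically; add the vowel jamo (V) and
    # trailing-consonant jamo (T) that appear second in those decompositions.
    seconds.update(chr(cp) for cp in range(0x1161, 0x1176))
    seconds.update(chr(cp) for cp in range(0x11A8, 0x11C3))
    return frozenset(seconds)


_COMPOSING_SECONDS = _build_composing_seconds()


def is_composing(ch: str) -> bool:
    """Return True if ch meets either condition of the definition above."""
    return unicodedata.combining(ch) != 0 or ch in _COMPOSING_SECONDS
```

The second condition alone would miss characters such as U+09BE BENGALI VOWEL SIGN AA, which has canonical combining class zero but is the second character in the decomposition of U+09CB; the table built above is what catches such cases.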
Text is fully-normalized if:
- the text is in a Unicode encoding form, is include-normalized and none of the constructs comprising the text begin with a composing character or a character escape representing a composing character; or
- the text is in a legacy encoding and, if it were transcoded to a Unicode encoding form by a normalizing transcoder, the resulting text would satisfy clause 1 above.