I thought XML-DEVers might be interested in a summary of what the new draft W3C "Character Model for the World Wide Web"[1] says.

Keep in mind that the primary readership of the character model is standards developers first, and implementers and users of standards second. If your software does not follow these guidelines, it cannot claim to conform fully to the "Character Model for the WWW". (Of course, there may be good reasons not to: the model sets the bar for W3C specs and a goal for implementers.)

Overview
--------

1-7 deal with ASCII assumptions that are no longer appropriate for Unicode programs. This is basic I18n; I would say that in Java now it is hard not to do this.

8-14 deal with issues that are built into XML or are best practice for XML.

15-18 deal with handling legacy data. This is the contentious one.

19-23 deal with strings, indexing and matching.

Details
-------

1) Specifications and software MUST NOT assume that there is a one-to-one correspondence between characters and the sounds of a language.

2) Specifications and software MUST NOT assume a one-to-one mapping between character codes and units of displayed text.

3) Specifications and software MUST NOT assume that a single keystroke results in a single character, nor that a single character can be input with a single keystroke (even with modifiers), nor that keyboards are the same all over the world.

4) Software that sorts or searches text for users MUST do so on the basis of appropriate collation units and ordering rules for the relevant language and/or application.

5) Software that allows users to sort or search text SHOULD allow the user to select alternative rules for collation units and ordering.

6) When sorting and searching in the context of a particular language, it MUST be possible to deal gracefully with strings being compared that contain Unicode characters not normally associated with that language.
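Guidelines 4-6 are the ones a checklist catches most often: in Java they amount to sorting with java.text.Collator for an appropriate Locale rather than String.compareTo, which compares raw UTF-16 code units. A minimal sketch (the German word list and locale are just illustrative, not from the spec):

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollationDemo {
    public static void main(String[] args) {
        String[] words = { "Zebra", "\u00C4pfel", "Banane" };

        // Naive sort by code unit: "Äpfel" (U+00C4 > 'Z') lands last.
        String[] byCodeUnit = words.clone();
        Arrays.sort(byCodeUnit);
        System.out.println(Arrays.toString(byCodeUnit));

        // Locale-aware sort: German collation groups "Ä" with "A",
        // so "Äpfel" sorts first, as a German reader expects.
        Collator german = Collator.getInstance(Locale.GERMAN);
        String[] byCollation = words.clone();
        Arrays.sort(byCollation, german);
        System.out.println(Arrays.toString(byCollation));
    }
}
```

Guideline 5 then falls out naturally, since selecting alternative rules is just a matter of constructing the Collator for a different Locale.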
7) Specifications and software MUST NOT assume a one-to-one relationship between characters and units of physical storage.

8) Receiving software MUST determine the encoding of data from available information according to appropriate specifications. When an IANA-registered charset name is recognized, receiving software MUST interpret the received data according to the encoding associated with the name in the IANA registry. When no charset is provided, receiving software MUST adhere to the default encoding(s) specified in the specification.

9) Receiving software MAY recognize as many encodings (names and aliases) as appropriate.

10) Software MUST completely implement the mechanisms for character encoding identification and SHOULD implement them in such a way that they are easy to use (for instance in HTTP servers). On interfaces to other protocols, software SHOULD support conversion between Unicode encoding forms as well as any other necessary conversions.

11) Software and content MUST carefully follow conflict-resolution mechanisms where there is multiple or conflicting information about character encoding.

12) Escapes SHOULD be avoided when the characters to be expressed are representable in the character encoding of the document.

13) Since character set standards usually list character numbers as hexadecimal, content SHOULD use the hexadecimal form of character escapes when there is one.

14) Choose an encoding for the document that maximizes the opportunity to directly represent characters and minimizes the need to represent characters by markup means such as character escapes. In general, if the first encoding choice is not satisfactory, Unicode is the next best choice, for its large character repertoire and its wide base of support.
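In Java terms, guideline 8 means decoding received bytes with the charset actually declared, looked up by its IANA name via java.nio.charset.Charset, rather than trusting the platform default. A small illustration (the byte sequence is invented for the example):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    public static void main(String[] args) {
        // Bytes for "café" as sent in ISO-8859-1: 0xE9 is "é" there.
        byte[] received = { 'c', 'a', 'f', (byte) 0xE9 };

        // Look up the declared IANA charset name and decode with it.
        String declared = "ISO-8859-1";
        String text = new String(received, Charset.forName(declared));
        System.out.println(text); // café

        // Guessing UTF-8 instead would corrupt the data: a lone 0xE9
        // is an incomplete UTF-8 sequence and decodes to U+FFFD.
        String wrong = new String(received, StandardCharsets.UTF_8);
        System.out.println(wrong.endsWith("\uFFFD"));
    }
}
```

Charset.forName also accepts registered aliases, which is the MAY of guideline 9.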
15) A text-processing component that receives suspect text MUST NOT perform any normalization-sensitive operations unless it has first confirmed through inspection that the text is in normalized form, and MUST NOT normalize the suspect text. Private agreements MAY, however, be created within private systems which are not subject to these rules, but any externally observable results MUST be the same as if the rules had been obeyed.

16) A text-processing component which modifies text and performs normalization-sensitive operations MUST behave as if normalization took place after each modification, so that any subsequent normalization-sensitive operations always behave as if they were dealing with normalized text.

17) Authoring tool implementations for a (formal) language that does not mandate full-normalization SHOULD prevent users from creating content with composing characters at the beginning of constructs that may be significant, such as at the beginning of an entity that will be included, immediately after a construct that causes inclusion or immediately after markup, or SHOULD warn users when they do so.

18) Implementations which transcode text from a legacy encoding to a Unicode encoding form MUST use a normalizing transcoder.

19) String identity matching MUST be performed as if the following steps were followed:

* Early uniform normalization to fully-normalized form, as defined in 4.2.3 Fully-normalized text. In accordance with section 4 Early Uniform Normalization, this step MUST be performed by the producers of the strings to be compared.
* Conversion to a common encoding of UCS, if necessary.
* Expansion of all recognized character escapes and includes.
* Testing for bit-by-bit identity.

20) Forms of string matching other than identity matching SHOULD be performed as if the following steps were followed:

* Steps 1 to 3 for string identity matching.
* Matching the strings in a way that is appropriate to the application.
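The normalization guidelines map onto java.text.Normalizer in current JDKs (at the time the draft appeared one would have reached for ICU4J instead). A sketch of the inspection guard of 15 and the bit-by-bit identity test of 19 — note that NFC here stands in for Charmod's fully-normalized form, which is based on but not identical to NFC:

```java
import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        // Two different code-point sequences for the same text "é":
        String composed = "\u00E9";     // U+00E9 LATIN SMALL LETTER E WITH ACUTE
        String decomposed = "e\u0301";  // 'e' + U+0301 COMBINING ACUTE ACCENT

        // Bit-by-bit comparison fails even though the text is the same.
        System.out.println(composed.equals(decomposed)); // false

        // Guideline 15's check: inspect before doing anything
        // normalization-sensitive with suspect text.
        System.out.println(Normalizer.isNormalized(decomposed,
                Normalizer.Form.NFC)); // false

        // After both producers normalize, identity matching succeeds.
        String a = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}
```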
21) The character string is RECOMMENDED as a basis for string indexing.

22) A code unit string MAY be used as a basis for string indexing if this results in a significant improvement in the efficiency of internal operations when compared to the use of character string.

23) Users of specifications (software developers, content developers) SHOULD whenever possible prefer ways other than string indexing to identify substrings or point within a string.

So how does an implementer respond to Charmod? We use code audits at Topologi: we have had them for internationalization, accessibility, font metrics, design and unit tests. I think Charmod provides a useful checklist for code inspection and awareness-raising: almost all of it can be achieved simply by using the most recent versions of the standard APIs, and the guidelines suggest which ones to use.

On the other hand, we deliberately chose to violate 20 for our searches, because that is appropriate: we use a third-party regular expression library, so it is outside our capability to expand numeric character references in the text first. That is something the regex library developer should think about.

If you look at requirement 2, "Specifications and software MUST NOT assume a one-to-one mapping between character codes and units of displayed text", you can see it does not go very far: handling a "u" followed by a combining umlaut is one thing, but handling Indic languages, where accents can go before the character, is another. The requirement as specified is actually a really *low* bar: hence my characterization of it as being aimed more at challenging ASCII assumptions than at guaranteeing universal applications. Note, for instance, that the Character Model places no requirement on implementations to handle discontinuous selection in Arabic/Latin texts.

Cheers
Rick Jelliffe

[1] http://www.w3.org/TR/charmod/