Re: Case sensitivity
At 10:27 AM -0400 4/3/00, Eric Bohlman wrote:
>On Mon, 3 Apr 2000, Stefan van den Oord wrote:
>
>> I have a simple question, I think: is XML case sensitive? In other words,
>> are the tags case sensitive? I also mean the <?XML... tag and the <!DOCTYPE
>> tag.
>
>Yes, XML names are case-sensitive (remember that they're not restricted to
>being English names, and many non-Western languages don't even have a
>concept of case-folding).

Your answer is of course correct (XML is case-sensitive); it is also true that "many non-Western languages don't even have a concept of case-folding". However, the second is not the reason for the first (granted, you didn't actually say it is -- but a reader might well take it that way).

Languages with no need for case folding are not much of a problem: the case-folding table or program would merely have no effect on characters belonging to such languages: it would change 26 of our 26 alphabetic code points, and no others. After all, in English we already use lots of characters that don't get case-folded (like digits).

The serious problems are subtler:

The practical problem is that with Unicode your folding table gets really big, on the order of 128 Kbytes instead of 256 bytes (barring compression): this is a pain on small devices (like a cell-phone browser), a pain to load, a pain to implement compression for, etc.

The theoretical problem is more important: it's not the caseless languages that pose problems, but the ones with complicated case-folding. For example, lots of languages only apply diacritical marks to lower-case letters: "a" may come with 6 different accents, but "A" takes none -- which makes case-folding irreversible. If there are languages that operate the other way as well, then neither fold-to-upper nor fold-to-lower can work for all languages: either way would trash some languages.
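A minimal sketch of the irreversibility point, assuming Python 3's built-in Unicode case mappings (which are not part of the original message). The sample characters (German "ß", Turkish dotless "ı") are swapped in as illustrations and are not the accent example described above; `round_trips` is a hypothetical helper name.

```python
def round_trips(text: str) -> bool:
    """Return True if upper-casing then lower-casing gives back `text`."""
    return text.upper().lower() == text


# Caseless characters pass through a case mapping unchanged.
assert "2000".upper() == "2000"
assert "日本語".lower() == "日本語"

# Plain ASCII letters survive the round trip.
assert round_trips("xml")

# Some cased characters do not: information is lost in one direction.
assert "ß".upper() == "SS"    # one code point becomes two
assert not round_trips("ß")   # "SS".lower() is "ss", not "ß"
assert "ı".upper() == "I"     # Turkish dotless i
assert not round_trips("ı")   # "I".lower() is "i", not "ı"
```

Neither direction of mapping can be undone in general, so folding names inside a parser would silently merge names that their authors consider distinct.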
That said, I think it incumbent on XML *search engines* to support case-folding (as well as decent treatment of accents, types of hyphens, etc.) for text content searches: making English speakers search for "the" | "thE" | "tHe" | "tHE" | "The" | "ThE" | "THe" | "THE" or "[tT][hH][eE]" is patently absurd; and besides, there is no user cost to (say) a Japanese speaker if an engine *does* case-fold. Also, many languages use Roman characters occasionally, as for acronyms; so their speakers also pay a price if searches aren't smart enough. And the primary problems with case-folding do not weigh so heavily in the search engine world (for example, AltaVista isn't likely to run their main servers on cell phones for a while yet).

Steven_DeRose@B...; http://www.stg.brown.edu/~sjd
Chief Scientist, Scholarly Technology Group, and Adjunct Associate Professor, Brown University
North American Editor, the Text Encoding Initiative

***************************************************************************
This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@x...&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
***************************************************************************