Re: Well-formed Blueberry
Thanks. But I still don't understand. jcowan@r... clarified: [snipped]

> No, I think Elliotte is right here. There are XML 1.0 documents, which
> lack the Magic Blueberry Mark (whatever it's going to be), and then there
> are Blueberry documents. XML-1.0-only parsers MUST reject Blueberry
> documents: they are not well-formed. Blueberry parsers SHOULD accept
> both Blueberry and 1.0 documents, but MUST apply the 1.0 well-formedness
> rules to 1.0 documents.

This is where I get confused. You say MUST, and I still don't understand why. Is NEL the culprit?

> If a document lacks the Magic Blueberry Mark but
> contains Blueberry names, it is not well-formed and must be rejected.
>
> Therefore, Blueberry parsers have to keep both sets of tables. Luckily,
> the Blueberry table is a strict superset of the 1.0 table,

I read "strict superset" as meaning that anything that passes an XML 1.0 parser should also pass a Blueberry parser. Is that correct? If it is, why should a Blueberry-capable parser care if a document that labels itself XML 1.0 slips in a blueberry? I missed the posts that explained the specific damage. (Or maybe I'm just brain-dead anyway; it's been a hot, humid summer here.)

Okay, I can see that developers will want to have the wall available to check against when developing for a context in which some users may be restricted to XML 1.0. End users won't need the wall, however?

> so it suffices
> to have four tables (or one table that maps Unicode values to one of
> four enumerated values): xml10_name_start, xml10_name_part,
> blueberry_only_name_start, blueberry_only_name_part.

Were we to use four separate tables, I assume we would want to pack them. But bit addressing is a choice that can slow things down a bit, so we might prefer the option of a single table with four bits per entry instead. There should be some long runs of identical values, so we should be able to apply sparse-table redundancy-reduction techniques without much of a performance hit.
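To make the "one table, four bits per entry" idea concrete, here is a minimal sketch. The flag names follow the four categories quoted above, but the packing scheme, helper names, and the character ranges used to populate the table are my own illustrative assumptions, not anything from a spec or from John's post; real XML 1.0 and Blueberry tables would of course cover far more ranges.

```python
# Sketch of a single classification table with four bits per entry.
# Each code point's nibble holds four membership flags, matching the
# four categories named in the quoted text.  Ranges below are purely
# illustrative, NOT the real XML 1.0 / Blueberry name tables.

XML10_NAME_START = 0x1
XML10_NAME_PART  = 0x2
BB_NAME_START    = 0x4  # blueberry_only_name_start
BB_NAME_PART     = 0x8  # blueberry_only_name_part

TABLE_SIZE = 0x10000  # BMP only for the sketch: 64K entries

def build_table():
    # Two nibbles per byte, so 64K entries pack into 32K bytes --
    # consistent with "well less than 64K" before any sparse-table
    # redundancy reduction is even applied.
    table = bytearray(TABLE_SIZE // 2)

    def set_range(lo, hi, flags):
        for cp in range(lo, hi + 1):
            byte, shift = divmod(cp, 2)
            table[byte] |= flags << (4 * shift)

    # A few XML 1.0 ranges everyone agrees on (ASCII letters, digits):
    set_range(ord('A'), ord('Z'), XML10_NAME_START | XML10_NAME_PART)
    set_range(ord('a'), ord('z'), XML10_NAME_START | XML10_NAME_PART)
    set_range(ord('0'), ord('9'), XML10_NAME_PART)
    # A hypothetical Blueberry-only script block (placeholder range):
    set_range(0x1200, 0x1206, BB_NAME_START | BB_NAME_PART)
    return table

def classify(table, cp):
    """Return the 4-bit flag value for code point cp."""
    byte, shift = divmod(cp, 2)
    return (table[byte] >> (4 * shift)) & 0xF

table = build_table()
# An XML 1.0 letter is valid in both vocabularies; the Blueberry-only
# character must not carry either xml10_* flag.
assert classify(table, ord('A')) == (XML10_NAME_START | XML10_NAME_PART)
assert classify(table, 0x1200) == (BB_NAME_START | BB_NAME_PART)
```

A parser checking a 1.0 document would mask with the `XML10_*` bits only, while a Blueberry document would accept either set, which is one way to "keep both sets of tables" in a single structure.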
We should be able to end up with well under 64K consumed by these tables. So there would probably be no need to remove the XML 1.0 tables from the parsers that end users will use, and that would keep things a little more orderly. We aren't _required_ to keep both tables for end users, but we might as well. Would it be appropriate to suggest that said table could handle two more extensions to Unicode without physically growing?

[snipped]
> IMHO the snag here would be getting an absolutely authoritative and
> permanent list of such character sets,
[snipped]

This is definitely going to be a problem. If I read it one way, I want to push the jump to Blueberry now, before best practices really muddy things up.

But I will admit to this: I personally would prefer to handle Kanji with a small standard set of radicals and something like the ideographic description sequences. I am pretty sure this would be enough to uniquely identify every current Kanji, and it would open the way for creative non-standard writings, comparable to the ability we have in English for creative spellings. How many implementations of such a scheme have I seen? Zero.

(The present overabundance of code points dedicated to Kanji makes more sense as a set of internal references to a pre-rendered font than as standardized character code points. One early reference for the JIS character set in my possession indicates that the JIS committee originally assumed that something like ideographic description sequences would be the ultimate approach for general information encoding, and that the JIS character set was intended primarily as an internal reference set for predefined font tables for the printing industry. At least, that's the way I read it.)