On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
> I want to be able to record the
> position of elements as byte offsets in an original source file and use
> those to extract well-formed fragments as text

You can only do this if you check that no-one else has touched the file since you last made your index. The "XML Promise" (as I call it) is that any XML tool is licensed to process any XML document.

> (think extracting
> snippets and highlighting in search results). This can't be done
> reliably in a SAX or StaX handler if the parser alters text in a
> non-reversible manner: you can make a guess if you know what the
> original line endings were, but if they're mixed all bets are off.
> Currently one has to use HTML parsers for this.

HTML parsers also normalise line endings though, no? Both HTML and XML inherit some of this from SGML.

> One more mini-addition: would it be possible to have parsers ignore the
> BOM at the start of a UTF-8 file? Some editors seem to insist on
> creating them, they are allowed by the UTF-8 spec, and probably ought to
> be considered external to the actual file content. Also, maybe if we're
> going to allow multiple root elements we could also allow whitespace in
> the prolog? People often put it there, and it seems like something
> that could be tolerated easily enough.

I have always felt it was a bug in the XML spec that the XML declaration becomes a regular processing instruction if there's a blank line in front of it.

> Yeah, I disagree about entities (and therefore DTDs). Let me try to
> explain why, briefly, and then I promise to stop whining about it. The
> problem w/DTDs (and entity decls defined in them) as I see it is they
> introduce a dependence on an external file.

They don't have to - you can put everything in one file.

> If entities were defined by
> the standard (and built in to parsers), or were required to be defined
> inline, that would remove my objections.

You can't really define &productName; in the XML spec to be, say, "Internet Explorer 12.1" :-), and the Unicode long character names are all in English, which is obviously not OK. The ISO SGML entities are insane.

You're right that a goal of XInclude was to reduce the need for entities; there are still places where they're used and XInclude can't be, e.g.
    href="&server;&docroot;intro/chapter6.xml"

> On restriction to UTF-8 (16 if we insist, but really do folks store
> *files* as UTF-16?)

Yes. Frequently.

> : is this really a problem for non-western
> languages?

If you manufacture memory and hard drives, then utf-8 is truly delightful in countries where most characters will be 3 or more bytes/octets in length in utf-8.

It's also a common misconception that Unicode is a 16-bit character set; it defines more than 65536 characters, and "surrogate pairs" in languages like Java make utf-16 as complex as utf-8. Processing characters in either utf-8 or utf-32 (ucs-4) is the most common choice outside the Java world, as far as I can tell.

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org
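
To make the size and surrogate-pair points above concrete, here is a minimal Python sketch. The sample strings and the helper name encoded_sizes are assumptions chosen for illustration; they do not come from the thread. It counts, for each string, the number of Unicode code points, the number of bytes in utf-8, and the number of 16-bit code units in utf-16.

```python
# Illustrative samples only (assumptions, not taken from the thread).
SAMPLES = {
    "ASCII": "XML",
    "CJK, inside the BMP": "\u65e5\u672c\u8a9e",  # three CJK ideographs
    "outside the BMP": "\U0001D11E",              # MUSICAL SYMBOL G CLEF
}


def encoded_sizes(text):
    """Return (code points, utf-8 bytes, utf-16 code units) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    # utf-16-le avoids the BOM that the plain "utf-16" codec prepends;
    # each utf-16 code unit is two bytes.
    utf16_units = len(text.encode("utf-16-le")) // 2
    return len(text), utf8_bytes, utf16_units


if __name__ == "__main__":
    for label, text in SAMPLES.items():
        points, utf8_bytes, utf16_units = encoded_sizes(text)
        print(f"{label}: {points} code points, "
              f"{utf8_bytes} utf-8 bytes, {utf16_units} utf-16 code units")
```

The CJK sample takes three bytes per character in utf-8 but only one 16-bit unit each in utf-16, which is the storage cost referred to above. The character outside the Basic Multilingual Plane needs two utf-16 code units (a surrogate pair), so code that assumes one unit per character breaks in much the same way as code that assumes one byte per character in utf-8.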